User Practice | Optimizing Deep Learning Model Inference Performance on Mobile CPUs with MegEngine
2022-01-26 23:50:22 【MegEngine】
The User Practice series collects MegEngine users' experience from putting the framework into practice. We hope it helps readers with similar use cases understand and use MegEngine better.
Author: Wang Lei | R&D Engineer, Megvii
As artificial intelligence technology develops and its applications keep expanding, mobile devices with limited computing power have become an important platform for model inference, and optimizing inference performance on them has become an important engineering problem. It is generally believed that running a model on the GPU gives a significant performance advantage over running it on the CPU. That is usually true. In engineering practice, however, we have found that for some smaller models, running on the GPU of a mobile device brings no performance gain and also introduces compatibility problems. In some application scenarios, therefore, we need to run on the CPU and do everything we can to improve the model's inference performance.
While optimizing the inference performance of a keypoint model with the MegEngine inference engine, we found two optimizations to be particularly effective: NCHW44 and Record. This article explains their principles and how to use them.
As is well known, increasing parallelism is an important way to speed up computation. On a CPU this means using SIMD (Single Instruction, Multiple Data) instructions, where a single instruction operates on several data elements at once. Take addition as an example. Without SIMD, an ordinary add instruction operates on one number at a time. In model inference that number is often 8 or 16 bits wide, and at most a 32-bit floating-point number, which wastes much of a modern 64-bit register. If several numbers are packed into one register and processed by a single instruction, the computation can be sped up several times over. On x86 CPUs, SIMD is implemented by instruction sets such as SSE and AVX; on ARM CPUs, by the NEON instruction set. CPUs also provide dedicated registers for SIMD instructions: 128, 256 or even 512 bits wide on x86, and 128 bits wide on ARM, which is enough to operate on four float32 values at once. So the more we can use SIMD during model inference, the better the inference performance.
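The lane arithmetic above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the helper names `lanes_per_register` and `instructions_needed` are made up for illustration, and the count is idealised (it ignores loads, stores and every other source of overhead).

```python
import math


def lanes_per_register(register_bits: int, element_bits: int) -> int:
    """Number of elements of a given width that fit in one register."""
    return register_bits // element_bits


def instructions_needed(num_elements: int, register_bits: int, element_bits: int) -> int:
    """Idealised count of add instructions needed to process num_elements values."""
    lanes = lanes_per_register(register_bits, element_bits)
    return math.ceil(num_elements / lanes)


# A 128-bit NEON register holds four float32 values.
print(lanes_per_register(128, 32))         # 4
# Adding 1024 float32 values one per instruction (a single 32-bit lane).
print(instructions_needed(1024, 32, 32))   # 1024
# The same work with four float32 lanes per instruction.
print(instructions_needed(1024, 128, 32))  # 256
```

In this idealised model the 128-bit case needs a quarter of the instructions, which is the upper bound on the speed-up that the register width alone can provide.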
Now consider what stands in the way of using SIMD in model inference. Tensors are usually stored in memory in NCHW layout (the rows and columns of each channel are stored contiguously, and the channels are stored one after another). Take the common convolution operation: kernel sizes vary, and with a 3x3 kernel each step takes 3 contiguous pixels from a row and multiplies them by the corresponding kernel values (then moves on to the other columns and channels). The corresponding SIMD instructions typically use 128-bit registers, so with float32 they need to process 4 values at once to be fully used, and those 4 values must be adjacent in memory. This way of computing therefore greatly limits the use of SIMD instructions.
As an improvement, in the NCHW44 layout (also known as NC4HW4) the data of 4 channels at the same spatial position (HW) are stored contiguously. They take part in the convolution together, and each SIMD instruction can load them into a register together, which improves computational efficiency. The figure below shows the NCHW44 data layout.
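The index arithmetic behind this layout can also be sketched in code. The sketch below assumes that the channel count C is a multiple of 4 and that the packed buffer has the logical shape (N, C/4, H, W, 4); the function names are hypothetical and this is not MegEngine's own conversion code.

```python
def nchw_offset(n, c, h, w, C, H, W):
    """Flat offset of element (n, c, h, w) in a contiguous NCHW buffer."""
    return ((n * C + c) * H + h) * W + w


def nchw44_offset(n, c, h, w, C, H, W):
    """Flat offset of the same element in an (N, C/4, H, W, 4) buffer."""
    assert C % 4 == 0, "this sketch assumes C is a multiple of 4"
    return (((n * (C // 4) + c // 4) * H + h) * W + w) * 4 + c % 4


def nchw_to_nchw44(data, N, C, H, W):
    """Repack a flat NCHW list into NCHW44 order."""
    out = [None] * len(data)
    for n in range(N):
        for c in range(C):
            for h in range(H):
                for w in range(W):
                    src = nchw_offset(n, c, h, w, C, H, W)
                    dst = nchw44_offset(n, c, h, w, C, H, W)
                    out[dst] = data[src]
    return out


# One image, 4 channels, 2x2 pixels; each value encodes "channel * 10 + pixel index".
N, C, H, W = 1, 4, 2, 2
nchw = [c * 10 + p for c in range(C) for p in range(H * W)]
print(nchw)                              # channel by channel: 0, 1, 2, 3, 10, 11, ...
print(nchw_to_nchw44(nchw, N, C, H, W))  # pixel by pixel: 0, 10, 20, 30, 1, 11, ...
```

After repacking, the four channel values of each pixel sit next to each other, so a single 128-bit load can fetch all four float32 values.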
MegEngine supports two ways of enabling the NCHW44 optimization:
1. Offline: dump (serialize) the model as an NCHW44 model. At inference time MegEngine detects the layout automatically and runs the corresponding operator implementations. The following two methods are used for dumping:
    megengine.jit.trace.dump
    megengine.core.tensor.megbrain_graph.optimize_for_inference
Both accept the keyword argument enable_nchw44; setting it to True produces an NCHW44 model.
Correspondingly, to test performance in advance with load_and_run, add the --enable-nchw44 option when running the script sdk/load-and-run/dump_with_testcase_mge.py; the generated model is an NCHW44 model that load_and_run can load and execute.
2. Online: leave NCHW44 unconfigured when dumping the model, and turn it on at run time through an option:
    serialization::GraphLoader::LoadConfig load_config;
    load_config.comp_graph = ComputingGraph::make();
    auto &&graph_opt = load_config.comp_graph->options();
    // Enable the NCHW44 layout optimization when the graph is loaded.
    graph_opt.graph_opt.enable_nchw44();
Correspondingly, to test performance in advance with load_and_run, add the command-line option --enable-nchw44 when running load_and_run.
Which method to choose depends on the situation. If the SDK or app we develop loads several models, some using NCHW44 and some not, the offline method is the better fit. If for some reason we cannot re-dump the model (for example, the original model file has been lost), only the online method is available.
In our engineering practice, this sped up inference of one model on current mainstream Android phones by about 20% to 30%.
When MegEngine runs inference, what it executes underneath is a static graph whose execution order is deterministic. Each operator in the graph is executed in two steps: preparing the kernel and actually executing it. In the kernel preparation stage, MegEngine uses information such as filter size, stride and shape to decide which algorithm to run, that is, it selects the function to execute, the kernel (a convolution, for instance, may have many different implementations). In the execution stage, these functions are actually called.
If the information that the selection is based on stays the same (in practice, mainly if the shape does not change), kernel preparation only needs to be done once, and the selected function objects are recorded in a list. On later runs the function objects are simply taken from the list in order and executed, which saves the kernel preparation time on every subsequent run. This is where the name record comes from.
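The record idea can be illustrated with a toy executor. Everything in this sketch is hypothetical (`select_kernel`, `RecordingExecutor` and the two operators "scale" and "shift" are invented for illustration) and it does not reflect MegEngine's internal code; it only shows kernels being selected on the first run and replayed from a list afterwards.

```python
def select_kernel(op_name, shape):
    """Stand-in for the kernel preparation step: choose an implementation for one operator."""
    select_kernel.calls += 1
    if op_name == "scale":
        return lambda xs: [2 * x for x in xs]
    if op_name == "shift":
        return lambda xs: [x + 1 for x in xs]
    raise ValueError(f"unknown operator: {op_name}")


select_kernel.calls = 0  # counts how often kernel preparation happens


class RecordingExecutor:
    """Runs a fixed operator sequence, recording the selected kernels on the first run."""

    def __init__(self, op_names):
        self.op_names = op_names
        self.recorded = None  # list of kernels, filled in by the first run

    def run(self, xs):
        if self.recorded is None:
            shape = (len(xs),)
            self.recorded = [select_kernel(name, shape) for name in self.op_names]
        # Replay: take the recorded function objects in order and call them.
        for kernel in self.recorded:
            xs = kernel(xs)
        return xs


executor = RecordingExecutor(["scale", "shift"])
print(executor.run([1, 2, 3]))  # [3, 5, 7]
print(executor.run([4, 5, 6]))  # [9, 11, 13]
print(select_kernel.calls)      # 2: kernels were selected only during the first run
```

The second call to `run` never touches `select_kernel`, which is the saving that record aims for.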
MegEngine currently has two record levels. record1 mainly speeds up execution and works as described above. record2 mainly saves memory: if the shape does not change, MegEngine can free some of the information stored on the graph (information that would otherwise be used for shape inference when the shape changes). For scenarios where the goal is better computing performance, record1 is usually the more suitable choice.
Note that the most important restriction of record is that the shape must not change. Some detection models resize the model according to the size of the input image, and record cannot be used in that case. Even for models whose input height, width and channel count are fixed, remember that the batch dimension (the N in NCHW) must not change either, which is easy to overlook. On the other hand, the shape may still be changed after the model is loaded and before the first run; as long as it does not change after the first run, record can be used.
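The timing rule for shape changes can be modelled in a few lines. This is a toy model under stated assumptions: the class `RecordedGraph`, its methods and the example shapes are all hypothetical, and how MegEngine itself reacts to a shape change after recording is not shown here.

```python
class RecordedGraph:
    """Toy model of a graph whose input shape is frozen by the first run."""

    def __init__(self, input_shape):
        self.input_shape = tuple(input_shape)
        self.has_run = False

    def set_input_shape(self, new_shape):
        new_shape = tuple(new_shape)
        if self.has_run and new_shape != self.input_shape:
            raise RuntimeError(
                f"shape changed from {self.input_shape} to {new_shape} after recording"
            )
        self.input_shape = new_shape

    def run(self):
        # The first run is where the kernel list would be recorded.
        self.has_run = True
        return self.input_shape


graph = RecordedGraph((1, 3, 112, 112))
graph.set_input_shape((2, 3, 112, 112))      # allowed: nothing has been recorded yet
graph.run()                                  # the first run records with N = 2
try:
    graph.set_input_shape((1, 3, 112, 112))  # only the batch size N differs
except RuntimeError as err:
    print("rejected:", err)
```

Changing only the batch size after the first run is rejected just like any other shape change, which is the case the text warns is easy to overlook.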
Besides the shape condition, there are a few other restrictions:
No operator may rely on dynamic memory allocation, because the recorded function objects also contain the input and output pointers, which change when memory is allocated dynamically;
The input and output pointers on the host side must not change;
Synchronization may only happen at the end of network execution; that is, it cannot be performed at an intermediate node while the network is executing;
The whole graph must not contain more than one compnode.
In typical usage these conditions are basically all satisfied.
To enable record, set it in the options:
    serialization::GraphLoader::LoadConfig load_config;
    load_config.comp_graph = ComputingGraph::make();
    auto &&graph_opt = load_config.comp_graph->options();
    graph_opt.comp_node_seq_record_level = 1;  // or 2 for record2
Correspondingly, to test performance in advance with load_and_run, add the command-line option --record-comp-seq or --record-comp-seq2 when running load_and_run.
In our engineering practice, this sped up inference of one model on current mainstream Android phones by about 10%.
This article has introduced MegEngine's two optimizations, NCHW44 and record, in terms of principle and usage. They are simply two methods that we found effective while optimizing the inference performance of one keypoint model. How effective an optimization is depends on the characteristics of the model, so for a specific model you can try MegEngine's other optimization options and pick the most suitable one. Optimization is also multifaceted: beyond model inference itself, optimizing pre-processing and post-processing, reducing data copies, and setting CPU affinity sensibly on Android devices are all worth trying.
Official website: MegEngine - Deep learning, simple development
You are welcome to join the MegEngine technical discussion QQ group: 1029741705