
User Practice | Optimizing Deep Learning Model Inference Performance on Mobile CPUs with MegEngine

2022-01-26 23:50:22 MegEngine

The User Practice series collects MegEngine users' experience from putting the framework into practice. We hope it helps readers with similar use cases better understand and use MegEngine.

Author: Wang Lei | R&D Engineer, Megvii Technology


With the development of artificial intelligence and the continued expansion of its application areas, mobile devices with limited computing power have become an important platform for model inference, and optimizing inference performance on them has become an important engineering problem. It is generally believed that running a model on the GPU has a large advantage over running it on the CPU and brings significant performance gains. That is usually true. In engineering practice, however, we have also found that for some smaller models, running on the GPU of a mobile device brings no performance gain and additionally introduces compatibility problems. In some application scenarios, therefore, we have to run on the CPU and do everything we can to improve the model's inference performance.

While optimizing the inference performance of a keypoint model with the MegEngine inference engine, we found two optimizations to be particularly effective: NCHW44 and record. This article explains their principles and how to use them.

NCHW44 Optimization


As is well known, increasing parallelism is an important way to speed up computation. On a CPU this means using SIMD (Single Instruction, Multiple Data) instructions: a single instruction operates on multiple data elements at once. Take addition as an example. Without SIMD, an ordinary add instruction operates on one number at a time. In model inference that number is often 8 or 16 bits wide, and at most a 32-bit floating-point number, which wastes much of a modern 64-bit register. If several numbers are packed into one register and processed by a single instruction, the computation speeds up several times over. On x86 CPUs, SIMD is implemented by instruction sets such as SSE and AVX; on ARM CPUs, it is the NEON instruction set. CPUs also provide dedicated registers for SIMD instructions. On x86 these registers are 128, 256, or even 512 bits wide; on ARM they are 128 bits wide, so 4 float32 values can be processed at once. Therefore, the more we can use SIMD in model inference, the better the inference performance.
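To make the lane arithmetic concrete: a 128-bit register holds 128 / 32 = 4 float32 values. The following is a minimal, portable sketch rather than MegEngine code. It emulates one 4-lane operation with a plain struct; Float4, add4, add_scalar and add_by_4 are hypothetical names introduced for illustration, and a real kernel would use NEON or SSE intrinsics instead.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a 128-bit SIMD register holding 4 float32 lanes.
struct Float4 {
    float lane[4];
};

// Emulates a single SIMD add: all 4 lanes are handled by one "instruction".
inline Float4 add4(const Float4& a, const Float4& b) {
    return Float4{{a.lane[0] + b.lane[0], a.lane[1] + b.lane[1],
                   a.lane[2] + b.lane[2], a.lane[3] + b.lane[3]}};
}

// Scalar reference: one addition per loop step.
std::vector<float> add_scalar(const std::vector<float>& a,
                              const std::vector<float>& b) {
    assert(a.size() == b.size());
    std::vector<float> out(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) out[i] = a[i] + b[i];
    return out;
}

// 4-wide version: roughly n / 4 steps over contiguous groups of 4 floats,
// followed by a scalar tail when n is not a multiple of 4.
std::vector<float> add_by_4(const std::vector<float>& a,
                            const std::vector<float>& b) {
    assert(a.size() == b.size());
    std::vector<float> out(a.size());
    std::size_t i = 0;
    for (; i + 4 <= a.size(); i += 4) {
        Float4 va{{a[i], a[i + 1], a[i + 2], a[i + 3]}};
        Float4 vb{{b[i], b[i + 1], b[i + 2], b[i + 3]}};
        Float4 vr = add4(va, vb);
        for (std::size_t k = 0; k < 4; ++k) out[i + k] = vr.lane[k];
    }
    for (; i < a.size(); ++i) out[i] = a[i] + b[i];  // leftover elements
    return out;
}
```

Both functions produce the same result; the point is only that the 4-wide loop needs the 4 values of each group to sit next to each other in memory.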

Now let's look at the problem with using SIMD in model inference. Tensors are usually stored in memory in NCHW layout: the rows and columns of each channel are stored contiguously, and the channels are stored one after another. Consider the common convolution operation. The kernel size varies; with a 3x3 kernel, for example, each step takes 3 consecutive pixels from one row and multiplies them by the corresponding kernel values (and then moves on to the other columns and channels). The registers used by the corresponding SIMD instructions are usually 128 bits wide, so with float32 they need to process 4 values at a time to be fully used, and those 4 values must be adjacent in memory. This way of computing therefore severely limits the use of SIMD instructions.

As an improvement, in the NCHW44 layout (also known as NC4HW4), the data of 4 channels at the same spatial position (H, W) is stored contiguously. These values take part in the convolution together, so each SIMD instruction can load them into a register together, which improves computational efficiency. The figure below shows the NCHW44 data layout.
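The index arithmetic behind this layout can be sketched as follows. This is an illustrative repacking routine, not MegEngine's implementation; the function names are hypothetical, and the sketch assumes the channel count is a multiple of 4 (no padding is handled). An NCHW44 tensor is treated as having shape [N, C/4, H, W, 4].

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Offset of element (n, c, h, w) in a plain NCHW buffer.
inline std::size_t nchw_offset(std::size_t n, std::size_t c, std::size_t h,
                               std::size_t w, std::size_t C, std::size_t H,
                               std::size_t W) {
    return ((n * C + c) * H + h) * W + w;
}

// Offset of the same element in NCHW44, i.e. shape [N, C/4, H, W, 4]:
// the 4 channels of one block are adjacent for every (h, w) position.
inline std::size_t nchw44_offset(std::size_t n, std::size_t c, std::size_t h,
                                 std::size_t w, std::size_t C, std::size_t H,
                                 std::size_t W) {
    return (((n * (C / 4) + c / 4) * H + h) * W + w) * 4 + c % 4;
}

// Repacks an NCHW buffer into NCHW44. Assumes C is a multiple of 4.
std::vector<float> nchw_to_nchw44(const std::vector<float>& src, std::size_t N,
                                  std::size_t C, std::size_t H, std::size_t W) {
    assert(C % 4 == 0);
    assert(src.size() == N * C * H * W);
    std::vector<float> dst(src.size());
    for (std::size_t n = 0; n < N; ++n)
        for (std::size_t c = 0; c < C; ++c)
            for (std::size_t h = 0; h < H; ++h)
                for (std::size_t w = 0; w < W; ++w)
                    dst[nchw44_offset(n, c, h, w, C, H, W)] =
                            src[nchw_offset(n, c, h, w, C, H, W)];
    return dst;
}
```

For a tensor with N=1, C=4, H=1, W=2, the NCHW order c0w0, c0w1, c1w0, c1w1, ... becomes c0w0, c1w0, c2w0, c3w0, c0w1, ..., so one 128-bit load picks up all 4 channels of a pixel.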


MegEngine supports two ways of using the NCHW44 optimization:

1. Offline: dump (serialize) the model as an NCHW44 model. At inference time, MegEngine automatically detects the layout and runs the corresponding operator implementations. Two methods can be used for dump.



Both support the keyword argument enable_nchw44. Setting it to True produces an NCHW44 model.

Correspondingly, to measure performance in advance with load_and_run, add the --enable-nchw44 argument when running the script under sdk/load-and-run/. The generated model is an NCHW44 model that load_and_run can load and execute.

2. Online: do not configure NCHW44 when dumping the model, and turn it on at runtime through an option:

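serialization::GraphLoader::LoadConfig load_config;
load_config.comp_graph = ComputingGraph::make();
auto &&graph_opt = load_config.comp_graph->options();
// Turn on the NCHW44 layout conversion (assumed call; check it against your MegEngine version).
graph_opt.graph_opt.enable_nchw44();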

Correspondingly, to measure performance in advance with load_and_run, add the command-line argument --enable-nchw44 when running load_and_run.

Choose between the two according to your situation. If the SDK or app under development may load several models, some using NCHW44 and some not, the offline approach is the better fit. If for some reason the model cannot be dumped again (for example, the original model file has been lost), the online approach is the only option.


In our engineering practice, the inference speed of one model on current mainstream Android phones improved by roughly 20%-30%.

Record Optimization


When MegEngine runs inference, what it executes underneath is a static graph whose execution sequence is deterministic. For each operator in the graph, execution consists of two steps: preparing the kernel and actually executing it. In the kernel preparation stage, MegEngine uses information such as filter size, stride and shape to decide which algorithm to run, that is, which function (the kernel) to execute; for convolution there may be many different implementations. In the execution stage, these functions are actually called.

If the information the choice is based on does not change (in practice this mainly means the shape does not change), kernel preparation only needs to happen once. The selected function objects are recorded in a list, and subsequent runs simply take the function objects from the list in order and execute them. This saves the kernel preparation time on every later run, and it is also where the name record comes from.
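The idea of preparing once and replaying a list of function objects can be sketched as below. This is a toy model under stated assumptions, not MegEngine's internals: OpDesc, Kernel and RecordedGraph are hypothetical names, each "operator" merely scales a buffer, and kernel selection is reduced to choosing between a 4-wide and a plain loop based on the length.

```cpp
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical operator description; a real engine would also look at
// filter size, stride, data type and so on.
struct OpDesc {
    std::size_t length;  // stands in for the tensor shape
    float scale;         // the operator multiplies every element by this
};

using Kernel = std::function<void()>;

class RecordedGraph {
public:
    RecordedGraph(std::vector<OpDesc> ops, float* data)
            : ops_(std::move(ops)), data_(data) {}

    // First call: prepare every kernel and record it. Later calls: replay only.
    void run() {
        if (!recorded_) {
            for (const OpDesc& op : ops_) kernels_.push_back(prepare(op));
            recorded_ = true;
        }
        for (const Kernel& kernel : kernels_) kernel();
    }

    int prepare_calls() const { return prepare_calls_; }

private:
    // "Kernel selection": pick an implementation from the shape and bind the
    // data pointer into the function object, which is why shapes and
    // pointers must stay fixed after recording.
    Kernel prepare(const OpDesc& op) {
        ++prepare_calls_;
        float* data = data_;
        const std::size_t n = op.length;
        const float s = op.scale;
        if (n % 4 == 0) {
            return [data, n, s] {
                for (std::size_t i = 0; i < n; i += 4)
                    for (std::size_t k = 0; k < 4; ++k) data[i + k] *= s;
            };
        }
        return [data, n, s] {
            for (std::size_t i = 0; i < n; ++i) data[i] *= s;
        };
    }

    std::vector<OpDesc> ops_;
    float* data_;
    std::vector<Kernel> kernels_;
    bool recorded_ = false;
    int prepare_calls_ = 0;
};
```

Calling run() twice on a graph with two operators invokes prepare() only twice in total; the second run goes straight to the recorded function objects.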

MegEngine currently has two levels of record. record1 mainly speeds up execution, using the principle described above. record2 mainly saves memory: if the shape does not change, MegEngine can release some of the information stored on the graph (information that would otherwise be used for shape inference when the shape changes). For scenarios where the goal is better computing performance, record1 is generally the more suitable choice.

Note that the most important restriction of record is that the shape must not change. Some detection models resize the model according to the size of the input image, in which case record cannot be used. Even for a model whose input height, width and channel count are fixed, the batch size (the N in NCHW) must not change either, which is easy to overlook. On the other hand, the shape can still be changed after the model is loaded and before the first run; as long as it does not change after the first run, record can be used.
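An application can enforce this rule on its own side before handing data to the engine. The following is a minimal sketch with hypothetical names (Shape, RecordShapeGuard); it is not part of MegEngine. The first run fixes the shape, and every later run must match it exactly, including the batch dimension.

```cpp
#include <array>
#include <cstddef>

using Shape = std::array<std::size_t, 4>;  // {N, C, H, W}

class RecordShapeGuard {
public:
    // Returns true if running with this input shape is compatible with record.
    // Before the first run any shape is accepted and remembered; afterwards
    // the shape must be identical, including the batch size N.
    bool check(const Shape& input) {
        if (!has_run_) {
            recorded_ = input;
            has_run_ = true;
            return true;
        }
        return input == recorded_;
    }

private:
    bool has_run_ = false;
    Shape recorded_{};
};
```

With such a guard, a call with {2, 3, 224, 224} after a first run with {1, 3, 224, 224} is rejected even though the height, width and channel count are unchanged.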

Besides the shape condition, there are some other restrictions:

  1. No operator may depend on dynamic memory allocation, because the recorded function objects also contain the input and output pointers, which change under dynamic allocation;

  2. The input and output pointers on the host side must not change;

  3. Synchronization may only happen at the end of network execution; it cannot be performed at an intermediate node while the network is running;

  4. The whole graph must not contain more than one compnode.

In typical usage these conditions are basically satisfied.
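One simple way to respect the restriction on host pointers is to allocate the input buffer once and copy each new frame into it, instead of pointing the engine at a different buffer on every run. The sketch below illustrates that pattern with a hypothetical FixedInputBuffer class; how the pointer is handed to MegEngine is outside its scope.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical wrapper that owns the host input memory for a recorded model.
class FixedInputBuffer {
public:
    explicit FixedInputBuffer(std::size_t count) : storage_(count) {}

    // Copies a new frame into the existing storage. The storage is never
    // resized, so the address returned by data() stays the same across runs.
    void update(const std::vector<float>& frame) {
        assert(frame.size() == storage_.size());  // the shape must not change
        std::copy(frame.begin(), frame.end(), storage_.begin());
    }

    float* data() { return storage_.data(); }
    const float* data() const { return storage_.data(); }
    std::size_t size() const { return storage_.size(); }

private:
    std::vector<float> storage_;
};
```

The extra copy costs some time per frame, so whether this trade-off is worthwhile depends on the input size and on how much record saves for the model in question.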


Enable it in the options:

serialization::GraphLoader::LoadConfig load_config;
load_config.comp_graph = ComputingGraph::make();
auto &&graph_opt = load_config.comp_graph->options();
graph_opt.comp_node_seq_record_level = 1;  // use 2 for record2

Correspondingly, to measure performance in advance with load_and_run, add the command-line argument --record-comp-seq or --record-comp-seq2 when running load_and_run.


In our engineering practice, the inference speed of one model on current mainstream Android phones improved by about 10%.


This article introduced two MegEngine optimizations, NCHW44 and record, in terms of both principle and usage. They are simply the two methods that proved effective when we optimized the inference performance of a keypoint model. How effective an optimization is depends on the characteristics of the model, so for a particular model you can try MegEngine's other optimization options and pick the most suitable ones. Optimization is of course multifaceted: beyond model inference itself, optimizing pre-processing and post-processing, reducing data copies, and setting CPU affinity sensibly on Android devices are all worth trying.

Appendix:

GitHub: MegEngine (Tianyuan)

Official website: MegEngine - Deep Learning, Simple Development

You are welcome to join the MegEngine technical discussion QQ group: 1029741705

