current position:Home>Compiled language backend principle notes

Compiled language backend principle notes

2022-09-23 09:27:16Book memories of Jiangnan

一、The compiler backend technology

1. 编译器的前端技术,The key is to let the compiler can read the program,No text of the structure of the code through the front end processing later,就变成了Token、AST和语义属性、Structured information such as symbol tables,基于这些信息,A simple script interpreter can be implemented.但很多情况下,Need to continue to compile a program into machine to be able to read the code,并高效运行.There are three problems:

(1)Must understand the principle of the computer running a program(run-time mechanism),Only in this way to know how to generate such program.

(2)To be able to use the front-end generatedAST和属性信息,The correct code is translated into the target.

(3)The program needs to be optimized as much as possible,Such as let program execution efficiency is higher,Less occupy a space, and so on.

Find out the three problems,Is the key to the smooth completion of the compiler back-end work.总的来说,The compiler backend to solve problem is:A computer is how to generate a program that can run,And still can make the program run correctly and efficiently on computer?Below is a model:

Basically is facing two hardware:

(1)CPU.It can accept machine instructions and data,并进行计算.It has register、Cache and arithmetic unit,Make full use of the register and cache will make the performance of the system greatly improve.

(2)内存.要Save the compiled code and data in memory,To design a set of mechanism,Let the program use this memory most efficiently.

可以看出,The compiler backend technology with very closely related to computer architecture,例如运行期In the mechanism of memory space how to divide up and organize;程序是如何启动、Jump and exit;How the execution of instructions and data is passed to theCPU;The whole process need how to operating system with,等等.Have some knowledge of the run-time mechanism after,You can proceed to the next step is generated in accordance with the run-time code mechanism.

2. The compiler back end the end result is生成目标代码.If the target is to run directly on a computer,就像CLanguage program so,Then this object code refers to the assembly code.And if the target isJava虚拟机,The target code isJVM的字节码.Write assembly and there are many different high-level languages,One of them is to careCPUSpecific hardware and memory this,For example, it is necessary to understand differentCPUThe difference of instruction set,还需要知道CPU是64位的还是32位的,有几个寄存器,What each register can be used for instruction and so on.But the problem caused by this is,Each language for each different hardware,The assembly code to generate different.

So, in order to reduce the back-end work,提高软件复用度,就需要引入中间代码(Intermediate Representation,IR)的机制,It is independent of specific hardware a code format.The front end of each language can be translated intoIR,然后再从IRTranslated into assembly code for different hardware architectures.如果有nA front-end language andmA backend architecture,Would have needed to dom*nA translator,现在只需要m+n个了,Greatly reduces the overall workload,如下图所示:

Even many language mainly completes the front line,The back-end can try to reuse existing libraries and tools,This is now one of the reasons for the new language faster and faster.像RustMake full use of theLLVM;GCCAll kinds of language such asC、C++、Object CIs also fully Shared the back-end technology such as.IR可以有多种格式,比如“x + y * z”Translated into a three-address code is the following,Every line of code is most involved three address,其中t1和t2是临时变量:

t1 := y * z

t2 := x + t1

JavaLanguage of the generated bytecode is also a kind ofIR,IRis an intermediate expression,Not necessarily is like assembly code lines instruction.所以ASTIn fact, it can also be regarded as aIR,Scripting language implemented in the front-end part,就是基于AST这个IR来运行的.每种IRThe purpose and usage is not the same:

(1)ASTMainly used for front-end work.

(2)Java的字节码,is designed to run on a virtual machine.

(3)LLVM的中间代码,Code is mainly used to do translation and compiling optimization.

3. 总的来说,Can translate languages into intermediate code,For each target architecture again,Through a program that will be translated into intermediate code corresponding assembly code is ok.生成正确的、Able to perform the code is simple,Can such a code execution efficiency is very low,因为Direct translation of the generated code is often not enough concise,Such as generates a lot of temp,Order quantity also more.Because the translator first care is correct,It is difficult to consider whether it is sufficient to optimize at the same time.另一方面,As a result of the limitation of the high-level language itself and the programmer programming habits,Can also lead to less than optimal code,Inability to fully utilize the performance of the computer.所以一定要The code to do optimization.

Optimization work is divided into“Machine independent optimization”和“machine-dependent optimization”两种.Independent of the optimization is based on the machineIR进行的,It can be based on the analysis of the code,In a more efficient code instead of the original code.比如下面这段代码中的foo()函数,There are several places you can optimize,Even the whole offoo()The function call also can omit,因为foo()的值一定是101,The optimization work at compile time to do:

int foo(){
    int a = 10*10;  //这里在编译时可以直接计算出100这个值
    int b = 20;     //这个变量没有用到,可以在代码中删除

    if (a>0){       //因为a一定大于0,所以判断条件和else语句都可以去掉
        return a+1; //这里可以在编译器就计算出是101
        return a-1;
int a = foo();      //It can be directly replaced by a=101;

machine-dependent optimization,Is dependent on the characteristics of the hardware.A lot of modern computer hardware design features,In order to provide more processing power,Such as parallel computing ability、Multi-level memory structure(The use of multiple levels of cache)等等.The compiler to be able to make full use of the performance of the hardware,比如:

(1)寄存器优化.For frequently accessed variables,The best in the register,And try to maximize the register,don't leave some of them empty,There are many algorithms to solve this problem,In the textbook, the coloring algorithm is generally mentioned.;

(2)充分利用高速缓存.Cache access speed can be dozens of times on one hundred times faster than the memory,所以要Make the most of the cache.比如,Data that a piece of code operates on,As far as possible in the memory together,这样CPU读入数据时,Together in the cache,No need to re-access memory over and over again.

(3)并行性.Modern computer has multiple kernel can parallel computing,编译器要As far as possible, make full use of the multiple kernel ability to calculate.

(4)流水线.CPUIn dealing with different instructions,The time period to wait is different,In the process of waiting for some instruction to finish actually can also perform other instructions.

(5)指令选择.有时候CPU完成一个功能,There are multiple commands to choose from.for a specific need,采用AInstructions may beBInstruction is one hundred times of high efficiency.比如X86架构的CPU提供SIMD功能,Is an instruction that can handle multiple data,Rather than the traditional instruction as an instruction can only handle a data.In the field of computational memory,SIMD也可以大大提升性能.

(6)其他优化.For example, for dedicatedAI芯片和GPU做优化,提供AI计算能力等等.

可以看出,Completes the relies on the optimization of the machine to the target machine's architecture has a clear understanding of,So some system-level software development also will be more comfortable.实际上,数据库系统、Big data system and so on are merged compilation techniques.The optimization of code is needed in the compiler very much.因此,Optimization work is also in the process of compiling the longest、Most embody a compiler capability of work.

编译器的后端,Is to translate high-level language into the target language that the computer can understand.It has a different focus than the front end.Front end code is correctly reflect the meaning of the static structure,Then end focus is to make the code running the dynamic structure of the,The idea of the scope is the front,And lifetime is the concept of backend.


4. 程序运行的过程中,主要是跟两个硬件(CPU和内存)以及一个软件(操作系统)打交道,如下图所示:

Most programs essentially only care aboutCPU和内存这两个硬件.对于CPU重点关注的是registers and cache,它们跟程序的执行机制和优化密切相关.寄存器是CPUInstruction when calculated temporary data storage place.CPU指令一般都会用到寄存器,Such as the addition(c=a+b)的过程是这样的:




寄存器的速度也很快,所以Can don't use memory registers.尽量充分利用寄存器,是编译器做优化的内容之一.而The cache can make up forCPU的处理速度和内存访问速度之间的差距,So when an instruction reads a piece of data in memory,It is not read-only data into the current instruction needed,而是把跟这个数据相邻的一组数据都读进高速缓存了.So to write the program,应尽量把某个操作所需的数据都放在内存中的连续区域中,不要零零散散地到处放,这样有利于充分利用高速缓存.这种优化思路,叫做数据的局部性.

程序在运行时,操作系统会给它分配一块虚拟的内存空间,让它在运行期可以使用.目前使用的都是64位的机器,可以用一个64位的长整型来表示内存地址,All addresses it can represent are called寻址空间.64位机器的寻址空间就有2的64次方那么大,达到TB级别.在存在操作系统的情况下,程序逻辑上可使用的内存一般大于实际的物理内存.程序在使用内存时,操作系统会把程序使用的逻辑地址映射到真实的物理内存地址,有的物理内存区域会映射进多个进程的地址空间,如下图所示:

对于不太常用的内存数据,操作系统会写到磁盘上,以便腾出更多可用的物理内存.Of course there are also no operating system,At this time of the program memory is used by physical memory,We must do the management of the memory.

5. C语言和Java虚拟机对内存的管理和使用策略就是不同的.尽管如此,大多数语言还是会采用一些通用的内存管理模式.以C语言为例,会把内存划分为代码区、静态数据区、栈和堆,如下图所示:

一般来讲,代码区is in the lowest address area,然后是静态数据区.而The stack is traditionally extending from the high to low addresses,栈的最顶部有一块区域,用来保存环境变量.

(1)代码区(也叫文本段)存放编译完成以后的机器码.The memory areas are read-only won't change,但也不绝对.现代语言的运行时已经越来越动态化,In addition to save the machine can also store the intermediate code,并且还可以在运行时把中间代码编译成机器码,写入代码区.

(2)静态数据区保存程序中全局的变量和常量.它的Address at compile time is determine,在生成的代码里直接使用这个地址就可以访问它们,它们的Survival from the program start until the end.它又可以细分为Data和BSS两个段.Data段中的变量是在编译期就初始化好的,Load directly from program into memory.BSS段中是那些没有声明初始化值的变量,都会被初始化成0.

(3)堆适合管理生存期较长的一些数据,这些数据在退出作用域以后也不会消失.Such as in a method creates and returns an object,并希望代表这个对象的数据在退出函数后仍然可以访问.

(4)Stack for your lifetime is shorter data,比如函数和方法里的本地变量.它们在进入某个作用域的时候申请内存,退出这个作用域的时候就可以释放掉.

Take a look at dealing with the program of the operating system.On the one hand, program can be compiled into don't need operating systems can run,Like some Internet applications run on bare equipment completely;另一方面,With the help of the operating system can provide convenience for the program,比如可以使用超过物理内存的存储空间,操作系统负责进行虚拟内存的管理.Program in the operating system needs to comply with the provisions of the include:程序文件的二进制格式约定,So that the operating system can properly loaded in the procedure,并为同一个程序的多个进程共享代码区.When using stack registers and also to abide by some conventions,Facilitate the operating system to switch between different process、When making a system call,做好上下文的保护.

6. Next, look at the process of running the program.首先,可运行的程序一般是由操作系统加载到内存的,并且定位到代码区里程序的入口开始执行,比如C语言的main函数的第一行代码.Every time they load a code order,Met jump statements to jump to another address.CPU里有一个指令寄存器,里面保存了下一条指令的地址,如下图所示:

Suppose to run such a program code compiled form:

int main(){
  int a = 1;

int foo(int c){
    int b = 2;
    return b+c;

int bar(){
    return foo(4) + 1;

The above code is activated first(Activate)了main()函数,main()函数又激活foo()函数,然后又激活bar()函数,bar()函数还会激活foo()函数,其中foo()函数被两次以不同的路径激活:

The process of calling a function at a time,叫做一次活动(Activation).每个活动都对应一个活动记录(Activation Record),这个活动记录里有这个函数运行所需要的信息,比如参数、返回值、本地变量等.Currently using the stack to manage memory,所以可以把活动记录等价于栈桢.栈桢是活动记录的实现方式,Free design activity records or the stack frame structure,下图是一个常见的设计:

(1)返回值:一般放在最顶上,这样它的地址是固定的.foo()函数返回以后,它的调用者可以到这里来取到返回值.In the actual situation will bePrefer to pass return values ​​through registers,比通过内存传递性能更高.

(2)参数:在调用foo函数时,把参数写到这个地址里.It can also be passed through registers,而不是内存.


(4)返回地址:foo函数执行完毕以后,继续执行哪条指令.同样,Registers can be used to store this information.


(6)寄存器信息:Function in the stack frame also often save register data.如果在foo函数里要使用某个寄存器,可能需要先把它的值保存下来,防止破坏了别的代码保存在这里的数据.This convention is calledCallee responsibility,也就是使用寄存器的人要保护好寄存器里原有的信息.某个函数如果使用了某个寄存器,但它又要调用别的函数,为了防止别的函数把自己放在寄存器中的数据覆盖掉,要自己保存在栈桢中.栈的结构如下图所示:

可以看到,栈桢就是一块确定的内存,变量就是这块内存里的地址.每个栈桢的长度是不一样的.用到的参数和本地变量多,栈桢就要长一点.但是栈桢的长度和结构是在编译期就能完全确定的,This makes it easy to calculate the offset of the address,获取栈桢里某个数据.In general, the design of the stack frame is very free,But to consider different language compiler module to be able to link together,So still have to abide by some common conventions,Or write their own the function of other people can't do the call.

7. 了解了栈桢的实现之后,Look at a larger scene,从全局的角度看看整个运行过程中都发生了什么,如下图所示:

代码区里存储了一些代码,main函数、bar函数和foo函数各自有一段连续的区域来存储代码,Using the above some assembly instructions to show the code,But the actual runtime is machine code here.假设执行到foo函数中的一段指令,来计算“b+c”的值并返回,这里用到了mov、add、jmp这三个指令,mov是把某个值从一个地方拷贝到另一个地方,addIs to add a value somewhere,jmp是改变代码执行的顺序,跳转到另一个地方去执行,如下所示:

mov b的地址 寄存器1
add c的地址 寄存器1
mov 寄存器1 foo的返回值地址
jmp 返回地址  //或ret指令

执行完这几个指令以后,foo的返回值位置就写入了6,并跳转到bar函数中执行foo之后的代码.这时foo的栈桢就没用了,新的栈顶是bar的栈桢的顶部.In theory, the operating system can thenfoo的栈桢所占的内存收回了,For instance can be mapped to another program addressing space,让另一个程序使用.But in this case will see,即使返回了barFunction still outside a memory address to access the stack,也就是返回值的地址.

所以,目前的调用约定都规定,The stack of the program will still have a small piece of memory(比如128K)是可以由程序访问的,For instance can be used to store the return value,这一小段内存操作系统并不会回收.堆的使用也类似,只不过是要手工进行申请和释放,比栈要多一些维护工作.


8. For statically compiled languages,比如C和Go语言,Compiler back end task is to generate assembly code,And then by the compiler to generate machine code,The generated files is called the target file,Finally, using the linker can generate an executable file or a library file,如下图所示:

就算像JavaScriptInterpreted Execution Language,Also want to use a similar mechanism to generate machine code at runtime,in order to increase the speed of execution.Java的字节码,Usually at run time byJITMechanism is compiled into machine code,而Assembly language is the basis for doing these jobs.Machine language is0101的二进制数据,Not suitable for human to read,Machine language and assembly language is more readable,Basically every instruction can be directly translated into a machine code.Compared with the high-level language of daily use,Assembly language grammar very simple,But it should follow the hardware(CPU和内存)打交道.

计算机的处理器有很多不同的架构,比如x86-64、ARM、Power等,每种处理器的指令集都不相同,那也就意味着汇编语言不同.下面用CExamples of language to generate the assembly code:

#include <stdio.h>
int main(int argc, char* argv[]){
    printf("Hello %s!\n", "Richard");
    return 0;

在macOS中输入下面的命令,其中的-SParameter is tells the compiler to compile the source code into the assembly code,而-O2parameters tell the compiler to do2级优化,The resulting assembly code will be shorter:

clang -S -O2 hello.c -o hello.s


gcc -S -O2 hello.c -o hello.s

The generated assembly code looks like the following:

 .section    __TEXT,__text,regular,pure_instructions
    .build_version macos, 10, 14    sdk_version 10, 14
    .globl  _main                   ## -- Begin function main
    .p2align    4, 0x90
_main:                                  ## @main
## %bb.0:
    pushq   %rbp
    .cfi_def_cfa_offset 16
    .cfi_offset %rbp, -16
    movq    %rsp, %rbp
    .cfi_def_cfa_register %rbp
    leaq    L_.str(%rip), %rdi
    leaq    L_.str.1(%rip), %rsi
    xorl    %eax, %eax
    callq   _printf
    xorl    %eax, %eax
    popq    %rbp
                                        ## -- End function 
    .section    __TEXT,__cstring,cstring_literals
L_.str:                                 ## @.str
    .asciz  "Hello %s!\n"

L_.str.1:                               ## @.str.1
    .asciz  "Richard"


If you lay the command,Will compile the assembly code into executable file(在macOS或Linux执行as命令,就是调用了GNU的汇编器):

as hello.s -o hello.o   //In assembler compiled into object files
gcc hello.o -o hello   //链接成可执行文件 
./hello                 //运行程序

以上面的代码为例,Let's take a look at the components of assembly language.This code has instruction、伪指令、Four elements of labels and comments,每个元素单独占一行.指令(instruction)是直接由CPUProcessing the command,例如:

pushq   %rbp
movq    %rsp, %rbp

其中,A word that starts with is助记符(mnemonic),后面跟着的是操作数(operand),Separate multiple operands with commas.The meaning of the second line of code is to put the data from here(源)Copy to there(目的).而伪指令以“.”开头,末尾没有冒号“:”,如下所示:

.section    __TEXT,__text,regular,pure_instructions
.globl  _main        
.asciz  "Hello %s!\n"

伪指令是是辅助性的,汇编器在生成目标文件时会用到这些信息,但伪指令不是真正的CPU指令,just for the assembler,每种汇编器的伪指令也不同.

Labels are preceded by colons“:”结尾,用于对伪指令生成的数据或指令做标记,例如L_.str:A label is a marker for a string.Other code can access label,For example, jump to the instruction that the label marked,如下所示:

L_.str:                                 ## @.str
    .asciz  "Hello %s!\n"

标签很有用,它可以代表一段代码或者常量的地址(Which place in the code area or static data area).But a start can't know the address of the specific value,Must generate the target file to calculate,所以标签会简化汇编代码的编写.The fourth element is a comment,以“#”号开头,这跟C语言中以//Indicates that the comment statement is the same.

9. Instructions are the main part of assembly code,In the code mnemonic“movq”、“xorl”中的“mov”和“xor”是指令,而“q”和“l”叫作后缀表示操作数的位数.后缀一共有b, w, l, q四种,分别代表8、16、32和64位,如下图所示:



movl $40, %eax

(2)The most common thing to see in the command is to寄存器的访问,GNUThe assembler rulesBe sure to register in order to%开头.

(3)直接内存访问.When in the code to seewhen operand is a number,它其实指的是内存地址,因为数字立即数必须以$开头.In addition, in the assembly code标签,也会被翻译成直接内存访问的地址,比如“callq  _printf”中的“_printf”是一个函数入口的地址.The compiler will calculate the program loaded in memory,每个字面量和过程的地址.

(4)间接内存访问.带有括号,比如%rbp)是指%rbpRegister points to the value of address,The stack pointer to the bottom of the stack of the values in the host address.间接内存访问的完整形式是:“偏移量(基址,索引值,字节数)”这样的格式,其地址是:基址+索引值*字节数+偏移量,例如:8(%rbp),是比%rbp寄存器的值加8;-8(%rbp),是比%rbp寄存器的值减8;(%rbp, %eax, 4)的值,等于%rbp + %eax*4,这个地址格式相当于访问Celements in an array of languages,数组元素是32位的整数,其索引值是%eax,而数组的起始位置是%rbp,其中字节数只能取1,2,4,8四个值.

To understand the format of the instruction after,Next, look at some commonly used commands:

(1)mov指令.格式为“mov 寄存器|内存|立即数, 寄存器|内存”.这个指令最常用到,用于在寄存器或内存之间传递数据,或者把立即数加载到内存或寄存器.movInstruction of the first parameter is the source,可以是寄存器、内存或立即数.第二个参数是目的地,可以是寄存器或内存.

(2)lea指令(load effective address).装载有效地址,格式为:“lea 源,目的”.The code examples such as the frontleaq指令,Is the address of the string is loaded into the%rdi寄存器,如下所示:

leaq    L_.str(%rip), %rdi


movl -4(%rbp), %eax    #把%rbp-4的值拷贝到%eax
addl -8(%rbp), %eax   #把%rbp-8地址的值加到%eax上
movl %eax, -12(%rbp)   #把%eax的值写到内存地址%rbp-12

这三行代码,分别是操作a、b、c三个变量的地址.Their addresses are%rbp的值减4、减8、减12,因此a、b、cEach of the three variables is4Bytes long, that is,32位,They are close to the store,并且是From high to low addresses extended,This is the characteristics of the stack.除了add以外,Instructions for other arithmetic operations are as follows:

Related to the stack operation as shown below:

The procedure call operation is as follows:

10. 在汇编代码中,Often have to use all sorts of to%the sign of the register at the beginning.x86-64架构的CPU里有很多寄存器,The most commonly used in code is16个64位的通用寄存器,分别是:


这些寄存器在历史上有各自的用途,比如rax中的“a”是Accumulator(累加器)的意思,这个寄存器是累加寄存器.但随着技术的发展,这些寄存器基本上都成为了通用的寄存器,不限于某种特定的用途.But in order to facilitate software to write,Here are some conventions,给这些寄存器划分了用途.针对x86-64架构有多个调用约定(Calling Convention),包括微软的x64调用约定(Windows)、System V AMD64 ABI(Unix和Linux)等,下面的内容属于后者:

(1)%rax除了其他用途外,Usually when the function returns,把返回值放在这里.


(3)%rdi,%rsi,%rdx,%rcx,%r8,%r9给函数传整型参数,依次对应第1参数到第6参数.超过6The parameters are placed in the stack frame.

(4)如果程序要使用%rbx,%rbp,%r12,%r13,%r14,%r15这几个寄存器,是由被调用者(Callee)负责保护的,也就是Write to the stack,在返回的时候要恢复这些寄存器中原来的内容.Other is the contents of a register by the调用者(Caller)负责保护,如果不想这些寄存器中的内容被破坏,那么要自己保护起来.

上面这些寄存器的名字都是64位的名字,对于Each register can also use only part of it,并且另起一个名字.比如对于%rax如果使用它的前32位,就叫做%eax,前16位叫%ax,前8位(0到7位)叫%al,8到15位叫%ah,如下图所示:

The use of other registers have such way,When in the assembly code to see the following names,Will know that in fact they may register in physics are the same:

除了通用寄存器以外,If possible to know more about the following registers and USES,Write assembly code is often associated with them:


(2)8个64位的MMX寄存器,用于MMX指令(即多媒体指令),这8个跟x87寄存器在物理上是相同的寄存器.To use when passing floating-point parametersmmx寄存器.



(5)flags(64位:rflags,32位:eflags)寄存器:每个位用来标识一个状态.Such as they will be used to compare and jump instruction,例如if语句翻译成的汇编代码,就会用它们来保存if条件的计算结果.

因此,Assembly code has to deal with registers everywhere,Correct and efficient use of registers,is one of the important tasks of the compiler backend.

11. The compiler backend can be based onASTThe custom beforeplayscripttranslate into correct assembly code,The assembly code and will be compiled into an executable program.先来看看如何从playscript生成汇编代码,For example to support the function call and pass、Integer addition(To make full use of registers to improve performance)、变量声明和初始化.具体来说,To be able to put the following sample program correctly generated assembly code:

int fun1(int x1, int x2, int x3, int x4, int x5, int x6, int x7, int x8){
    int c = 10; 
    return x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + c;

println("fun1:" + fun1(1,2,3,4,5,6,7,8));

Function is equivalent to theplayscriptCode hand-written assembly code is as follows,How involved in more than6pass parameters in the case of,and the change process of the stack frame:

# function-call2-craft.s 函数调用和参数传递
    # 文本段,纯代码
    .section    __TEXT,__text,regular,pure_instructions

    # 函数调用的序曲,设置栈指针
    pushq   %rbp           # 把调用者的栈帧底部地址保存起来   
    movq    %rsp, %rbp     # 把调用者的栈帧顶部地址,设置为本栈帧的底部

    movl    $10, -4(%rbp)  # 变量c赋值为10,也可以写成 movl $10, (%rsp)

    # 做加法
    movl    %edi, %eax     # 第一个参数放进%eax
    addl    %esi, %eax     # 加参数2
    addl    %edx, %eax     # 加参数3
    addl    %ecx, %eax     # 加参数4
    addl    %r8d, %eax     # 加参数5
    addl    %r9d, %eax     # 加参数6
    addl    16(%rbp), %eax  # 加参数7
    addl    24(%rbp), %eax  # 加参数8
    addl    -4(%rbp), %eax # 加上c的值

    # 函数调用的尾声,恢复栈指针为原来的值
    popq    %rbp           # 恢复调用者栈帧的底部数值
    retq                   # 返回

    .globl  _main          # .global伪指令让_main函数外部可见
_main:                                  ## @main
    # 函数调用的序曲,设置栈指针
    pushq   %rbp           # 把调用者的栈帧底部地址保存起来  
    movq    %rsp, %rbp     # 把调用者的栈帧顶部地址,设置为本栈帧的底部
    subq    $16, %rsp      # 这里是为了让栈帧16字节对齐,实际使用可以更少

    # 设置参数
    movl    $1, %edi     # 参数1
    movl    $2, %esi     # 参数2
    movl    $3, %edx     # 参数3
    movl    $4, %ecx     # 参数4
    movl    $5, %r8d     # 参数5
    movl    $6, %r9d     # 参数6
    movl    $7, (%rsp)   # 参数7
    movl    $8, 8(%rsp)  # 参数8

    callq   _fun1                # 调用函数

    # 为printf设置参数
    leaq    L_.str(%rip), %rdi   # 第一个参数是字符串的地址
    movl    %eax, %esi           # 第二个参数是前一个参数的返回值

    callq   _printf              # 调用函数

    # 设置返回值.这句也常用 xorl %esi, %esi 这样的指令,都是置为零
    movl    $0, %eax

    addq    $16, %rsp    # 缩小栈
    # 函数调用的尾声,恢复栈指针为原来的值
    popq    %rbp         # 恢复调用者栈帧的底部数值
    retq                 # 返回

    # 文本段,保存字符串字面量                                  
    .section    __TEXT,__cstring,cstring_literals
L_.str:                                 ## @.str
    .asciz  "fun1 :%d \n"

12. Then write program,从AST翻译成汇编代码,如AsmGen.javaImplemented in addition the translation process is as follows:

case PlayScriptParser.ADD:
    //For addition to apply for a temporary storage location,can be registers and stack
    address = allocForExpression(ctx);
    bodyAsm.append("\tmovl\t").append(left).append(", ").append(address).append("\n");  //Copy the left node to the storage space
    bodyAsm.append("\taddl\t").append(right).append(", ").append(address).append("\n");  //The nodes on the right and to

这段代码的含义是:通过allocForExpression()方法,For each addition to apply for a temporary space(可以是寄存器,or an address on the stack),Used to store the result of the addition operation.接着用movCommand the plus sign to the left of the values copied into the temporary space,再用addInstruction and the value on the right.生成汇编代码的过程,基本上就是基于AST拼接字符串,其中bodyAsm变量是一个StringBuffer对象,可以用StringBuffer的toString()Methods to obtain the final assembly code.按照上面的逻辑,针对“x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + c”这个表达式,The resulting assembly code is as follows:

 # 过程体
    movl    $10, -4(%rbp)
    movl    %edi, %eax       //x1
    addl    %esi, %eax       //+x2
    movl    %eax, %ebx    
    addl    %edx, %ebx       //+x3
    movl    %ebx, %r10d
    addl    %ecx, %r10d      //+x4
    movl    %r10d, %r11d 
    addl    %r8d, %r11d      //+x5
    movl    %r11d, %r12d
    addl    %r9d, %r12d      //+x6
    movl    %r12d, %r13d
    addl    16(%rbp), %r13d  //+x7
    movl    %r13d, %r14d
    addl    24(%rbp), %r14d  //+x8
    movl    %r14d, %r15d
    addl    -4(%rbp), %r15d  //+c,本地变量

You can see the above paragraphJavaThe problem of parsing code,Every time when performing a add operation is to take up a new register.比如x1+x2使用了%eax,再加x3时使用了%ebx,According to this speed register soon ran out,Efficiency obviously is not high.So must do code optimization.If only simply translate the code mechanically,Equivalent to generating a large number of temporary variables,Every temporary variables occupy a space,如下所示:

t1 := x1 + x2;
t2 := t1 + x3;
t3 := t2 + x4;

Code optimization can no longer use storage location(t1,t2,t3…)能够复用,To reduce temporary variables also lines of code,The optimized application for temporary storage method is as follows:

//The storage location of the multiplexed preamble
if (ctx.bop != null && ctx.expression().size() >= 2) {
    ExpressionContext left = ctx.expression(0);
    String leftAddress = tempVars.get(left);
    if (leftAddress!= null){
        tempVars.put(ctx, leftAddress); //The current node is also associated with this address
        return leftAddress;

这段代码的意思是:For each additive operation had to apply for a register,If a plus sign to the left of the nodes in a register,The direct reuse this register,Don't use the new.调整以后,The generated assembly code can also write the same,And only used%eax一个寄存器,Code number also cut in half,优化效果明显,如下所示:

# 过程体
    movl    $10, -4(%rbp)
    movl    %edi, %eax
    addl    %esi, %eax
    addl    %edx, %eax
    addl    %ecx, %eax
    addl    %r8d, %eax
    addl    %r9d, %eax
    addl    16(%rbp), %eax
    addl    24(%rbp), %eax
    addl    -4(%rbp), %eax

    # 返回值
    # The return value in the calculation of before,已经存入%eax

To fully optimize code how to use the register,Is a must do the compiler back-end work.Here took only a very rough way,不具备实用价值,A better optimization algorithm can be seen later.

13. Figuring out the logic additive operation code translation,再看看AsmGen.java中generate()和generateProcedure()方法,How the generation of assembly code complete logic.It will help understand the overall context and all the details,For example, how is the label of the function generated?,how the prelude and epilogue were added,How the address of a local variable is calculated,等等,如下所示:

public String generate() {
    StringBuffer sb = new StringBuffer();

    // 1.The head of the code
    sb.append("\t.section  __TEXT,__text,regular,pure_instructions\n");

    // 2.Generating function code
    for (Type type : at.types) {
        if (type instanceof Function) {
            Function function = (Function) type;
            FunctionDeclarationContext fdc = (FunctionDeclarationContext) function.ctx;
            visitFunctionDeclaration(fdc); // 遍历,代码生成到bodyAsm中了
            generateProcedure(, sb);

    // 3.The main program to generate_main函数
    visitProg((ProgContext) at.ast);
    generateProcedure("main", sb);

    // 4.文本字面量
    sb.append("\n# 字符串字面量\n");
    sb.append("\t.section  __TEXT,__cstring,cstring_literals\n");
    for(int i = 0; i< stringLiterals.size(); i++){
        sb.append("L.str." + i + ":\n");

    // 5.Global reset some temp
    return sb.toString();

generate()The method is the entry point of the entire translation program,It made several work:

(1)生成一个.section伪指令,Indicates that this is a code snippet that puts text.

(2)遍历AST中的所有函数,调用generateProcedure()The assembly code generated a method for each function.

(3)Then generate an entry for the main program.


Above the entire program structure and finally the structure of the generated assembly code is consistent with the.generateProcedure()method to convert a function to assembly code,如下所示:

private void generateProcedure(String name, StringBuffer sb) {
    // 1.函数标签
    sb.append("\n## 过程:").append(name).append("\n");
    sb.append("\t.globl _").append(name).append("\n");

    // 2.序曲
    sb.append("\n\t# 序曲\n");
    sb.append("\tmovq\t%rsp, %rbp\n");

    // 3.设置栈顶
    // 16字节对齐
    if ((rspOffset % 16) != 0) {
        rspOffset = (rspOffset / 16 + 1) * 16;
    sb.append("\n\t# 设置栈顶\n");
    sb.append("\tsubq\t$").append(rspOffset).append(", %rsp\n");

    // 4.Save to use the register values of

    // 5.函数体
    sb.append("\n\t# 过程体\n");

    // 6.Restore the protected register values

    // 7.恢复栈顶
    sb.append("\n\t# 恢复栈顶\n");
    sb.append("\taddq\t$").append(rspOffset).append(", %rsp\n");

    // 8.如果是main函数,设置返回值为0
    if (name.equals("main")) {
        sb.append("\n\t# 返回值\n");
        sb.append("\txorl\t%eax, %eax\n");

    // 9.尾声
    sb.append("\n\t# 尾声\n");

    // 10.重置临时变量
    rspOffset = 0;
    bodyAsm = new StringBuffer();

Its work includes:

(1)Generating function label、Prelude part code、设置栈顶指针、Protection register the original values, etc.

(2)Then the function body,such as local variable initialization、Do add operation, etc.

(3)Finally, a series of finishing touches,including restoring the value of protected registers、Restore the stack pointer,and the code for the epilogue.

最后可以通过-S参数运行,将asm.playfile to generate assembly code fileasm.s,Generating and run an executable file:

java play.PlayScript -S -o asm.s   //生成汇编代码
gcc asm.s -o asm                            //生成可执行文件
./asm                                       //运行可执行文件

14. 到目前为止,Have successfully compiledplayscript程序,并生成了可执行文件.In order to deepen the understanding of the generated executable file,再用 playscript生成目标文件,让C语言来调用.这样可以证明playscriptThe logic of generating assembly code is reliable,That can be usedplayscript代替CLanguage to write a common module.

在编程时,Often call some public libraries to realize some function,These libraries might be from another language,但仍然可以调用.这里也可以实现playscriptShared with the function of the other languages,In the example program implementation is simple,The assembly code generated by the fine-tuning the,使用“.global _fun1”伪指令让_fun1Process into a global,Such programs written in other languages can call this_fun1过程,Reuse of functionality,如下所示:

# convention-fun1.s Test the calling convention,_fun1Will be calling in external
    # 文本段,纯代码
    .section    __TEXT,__text,regular,pure_instructions
    .globl  _fun1          # .global伪指令让_fun1函数外部可见
    # 函数调用的序曲,设置栈指针
    pushq   %rbp           # 把调用者的栈帧底部地址保存起来   
    movq    %rsp, %rbp     # 把调用者的栈帧顶部地址,设置为本栈帧的底部

    movl    $10, -4(%rbp)  # 变量c赋值为10,也可以写成 movl $10, (%rsp)

    # 做加法
    movl    %edi, %eax     # 第一个参数放进%eax
    addl    %esi, %eax     # 加参数2
    addl    %edx, %eax     # 加参数3
    addl    %ecx, %eax     # 加参数4
    addl    %r8d, %eax     # 加参数5
    addl    %r9d, %eax     # 加参数6
    addl    16(%rbp), %eax  # 加参数7
    addl    24(%rbp), %eax  # 加参数8
    addl    -4(%rbp), %eax # 加上c的值

    # 函数调用的尾声,恢复栈指针为原来的值
    popq    %rbp           # 恢复调用者栈帧的底部数值
    retq                   # 返回

接下来再写一个CThe language function to callfun1(),其中的externThe keyword description has afun1()The function is implemented in another module:

 * convention-main.c Test the calling convention.Call an external functionfun1
#include <stdio.h>

//声明一个外部函数,will be found in other modules when linking
extern int fun1(int x1, int x2, int x3, int x4, int x5, int x6, int x7, int x8);

int main(int argc, char *argv[])
    printf("fun1: %d \n", fun1(1,2,3,4,5,6,7,8));
    return 0;

Then type the following two commands on the command line:

# 编译汇编程序
as convention-fun1.s -o convention-fun1.o

# 编译C程序
gcc convention-main.c convention-fun1.o -o convention

(1)第一个命令,把playscript The generated assembly code compiled into a binary object file.

(2)The second command in the buildC程序时,Also bring this binary file,then the compiler will find un1()函数的定义,并链接到一起.The generated executable file can run smoothly.

The linking process needs to be explained here,A high-level language and assembly language is actually easy to read.The binary file is friendly to computer,便于运行.The compiler can compile each assembly file to generate a binary object files,or a module,While the linker to these modules assembled into a whole.但在Cin the module generated by the language,调用fun1()Function when it has no way to knowfun1()Function of the accurate address,Because this address must be the entire file can be calculated after all assembled finished.So the assembler defers this task,Leave it to the linker to resolve,如下图所示:

编译一个程序,The end result is to generate binary can run.Actually generated assembly code later,You can think to complete the task of the compiler,At the back of the work is done by the compiler and linker.But the whole process can also be regarded as compilation process,To understand the structure of the binary,Also to fully understand the whole process of compiling the end.Of course an understanding of the binary file format,Is a big project to do compile management、The basis for binary code analysis, etc..

for each operating system,For executable format requirements are not the same.比如Linux下,目标文件、共享对象文件、Binary files are usedELF格式.实际上,这些Binary file format and loaded into memory is very similar to the program in format,This can be quickly read by the operating system,and load it into memory,加载速度越快,Also, the faster it is equivalent to the program start.Same as in-memory layout,ELF格式中,Separate code and data is also.这样做的好处是,Program code section can be Shared among multiple processes,Don't need to put more in the memory,Put a code and then mapped to each process area to go.The data section is each process is not the same,So to load a copy for each process.

四、中间代码:Compatible with different languages ​​and hardware

15. IRThe role is ableBased on it, it connects to the front end of different languages,Also can butt different hardware architecture,Also can do a lot of optimization.首先来看看IR的特征.IRmeans intermediate expression,It is in the middle of a high-level language and assembly language,This means that its characteristic is in between the two.与高级语言相比,IRRejected most high-level language syntax and semantic characteristics of,比如循环语句、if语句、作用域、面向对象等等,It's more like a high level of assembly language;compared to real assembly language,It won't have so many trivial、The details of the related to specific hardware.

IR有很多种类(AST也是一种IR),每种IRhave different characteristics and uses,Some compilers even to use several differentIR.Here's the backend partIR,The purpose is to facilitate the execution of various optimization algorithms,and facilitates generating assembly,这种IRCan be seen as a high level of assembly language,主要体现在:

(1)it can use registers,But there is no limit to the number of register;

(2)Control structures are also similar to assembly language,Such as the jump statement,Divided into multiple block,With the tag to identify the block, etc;

(3)Use opcodes equivalent to assembly instructions.The operation code can translate into a one-on-one assembly code,But sometimes an opcode corresponding multiple assembly instruction.

Let's take a look at a typicalIR:三地址代码(Three Address Code, TAC.Its advantage is that it is very concise,So suitable for discussing algorithm,如下所示:

x := y op z   //二元操作
x := op y     //一元操作

Each of the three address code up to three,Two of them are source addresses(As the first line of codey和z),One is the destination address(也就是x),Each code has a most operation(op),For example, the arithmetic of code below:

int a, b, c, d;
a = b + c * d;


t1 := c * d
a  := b + t1

t1is a newly created temporary variable.When the expression of the source code contains more than one operator,need to introduce temporary variables,And the original one code into more code.For example, the following conditional statement code,如下所示:

int a, b c;
if (a < b )
    c = b;
    c = a;  
c = c * 2;      


 t1 := a < b;
  IfZ t1 Goto L1;
  c := a;
  Goto L2;
  c := b;
  c := c * 2;  

IfZis to check whether the following operand is0,“Z”就是“Zero”的意思.Use the tags and hereGotostatement to jump to an instruction(Goto 相当于x86-64的汇编指令jmp).Examples of loop statement as shown below:

int a, b;
while (a < b){
  a = a + 1;
a = a + b;


  t1 := a < b;
  IfZ t1 Goto L2;
  a := a + 1;
  Goto L1;
  a := a + b;  

So the three address code rules are pretty simple,It is possible to pass relatively simple conversion rules,就能从AST生成TAC.Here three address code is mainly used to describe the optimization algorithm,Because it is more concise and easy to read,操作(指令)的类型很少,Writing also conform to the daily habits.But do not use it to generate assembly code here,Because it contains the details of the information or less,比如整数是16、32还是64位的?What is target machine's architecture and operating system?What is the layout of the generated binary, etc.

这里会用LLVM的IRto undertake the task of generating assembly,Because it has the ability to describe and target machine(CPU、操作系统)more specific information,Generate object code accurately,so that it can really be used in a production environment.在正式开始LLVM的介绍前,More about other kindsIR的格式:

(1)四元式.It is equivalent to the three address code in another way,格式是:(OP,arg1,arg2,result)所以,“a := b + c”就等价于(+,b,c,a).

(2)逆波兰表达式.it puts the operator after,所以也叫做后缀表达式.“b + c”对应的逆波兰表达式是“b c +”;而“a = b + c”对应的逆波兰表达式是“a b c + =”.Reverse polish expression is particularly suited to use stack for calculation.比如计算“b c +”,Pop the plus sign from the stack first,Know to do addition operation,And then pop two operands from the stack,Perform add operation.The calculation process with depth-first traversalAST是等价的.So using reverse polish expression,Can use a very simple way can realize the formula to calculate function,If write software with the function of formula can consider to use it.而且从ASTGenerating Reverse Polish expressions is also very easy.

16. Main three address code is a tool for learning algorithm,Or for implementing a simpler backend,To implement an industrial-grade backend,充分发挥硬件的性能,还要学习LLVM的IR.LLVM汇编码(LLVM Assembly)是LLVM的IR.Sometimes it is simply calledLLVM语言,So you can put onLLVMA program file written in assembly code is calledLLVM程序.后面会详细提到LLVM,Here is to understand its core that isIR.

首先,LLVMAssembly code is to use static single assignment code in the form of.Put some more restrictions on the three-address code,another important code,即Static single assignment code(Static Single Assignment,SSA),In the static single assignment code assigned to once with a variable can only be,例如“y = x1 + x2 + x3 + x4”The common three-address code is as follows:

y := x1 + x2;
y := y + x3;
y := y + x4;

其中yAssigned three times,如果写成SSA的形式,Can only be written like the following:

t1 := x1 + x2;
t2 := t1 + x3;
y  := t2 + x4;

Why bother to write it this way,Need to add moret1和t2临时变量?原因是SSAThe form reflects the precise“使用 - 定义”关系.Each variable is certainly will be defined only once,Can then be used multiple times,This characteristic makes based onSSAEasier to do data flow analysis,The data flow analysis is the basis of a lot of code optimization techniques,so compilers for almost all languages、The interpreter or USES the virtual machineSSA,Because to do code optimization.而LLVM的IR也采用SSA的形式,也是因为SSAConvenient to do code optimization.

其次,LLVM IRHave more than three address code details.such as the word length of an integer variable、Memory alignment and so on,所以使用LLVM IRCan more accurately translated into assembly code.例如下面这段C语言代码:

int fun1(int a, int b){
  int c = 10;
  return a + b + c;

对应的LLLMThe assembly code is as follows:

; ModuleID = 'fun1.c'
source_filename = "fun1.c"
target datalayout = "e-m:o-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-apple-macosx10.14.0"
; Function Attrs: noinline nounwind optnone ssp uwtable
define i32 @fun1(i32, i32) #0 {
  %3 = alloca i32, align 4        //为3A variable application space
  %4 = alloca i32, align 4     
  %5 = alloca i32, align 4
  store i32 %0, i32* %3, align 4  //参数1赋值给变量1
  store i32 %1, i32* %4, align 4  //参数2赋值给变量2
  store i32 10, i32* %5, align 4  //常量10赋值给变量3
  %6 = load i32, i32* %3, align 4 //
  %7 = load i32, i32* %4, align 4
  %8 = add nsw i32 %6, %7
  %9 = load i32, i32* %5, align 4
  %10 = add nsw i32 %8, %9
  ret i32 %10
attributes #0 = { noinline nounwind optnone ssp uwtable "correctly-rounded-divide-sqrt-fp-math"="false" "disable-tail-calls"="false" "less-precise-fpmad"="false" "no-frame-pointer-elim"="true" "no-frame-pointer-elim-non-leaf" "no-infs-fp-math"="false" "no-jump-tables"="false" "no-nans-fp-math"="false" "no-signed-zeros-fp-math"="false" "no-trapping-math"="false" "stack-protector-buffer-size"="8" "target-cpu"="penryn" "target-features"="+cx16,+fxsr,+mmx,+sahf,+sse,+sse2,+sse3,+sse4.1,+ssse3,+x87" "unsafe-fp-math"="false" "use-soft-float"="false" }

!llvm.module.flags = !{!0, !1, !2}
!llvm.ident = !{!3}

!0 = !{i32 2, !"SDK Version", [2 x i32] [i32 10, i32 14]}
!1 = !{i32 1, !"wchar_size", i32 4}
!2 = !{i32 7, !"PIC Level", i32 2}
!3 = !{!"Apple LLVM version 10.0.1 (clang-1001.0.46.4)"}

The code looks really more complicated than the three address code,But more than the assembler to streamline,比如LLVM IRThe number of instructions evenx86-64less than one tenth of the get acquainted with the elements:

(1)模块.LLVM程序是由模块构成的,这个文件就是一个模块.Modules can contain functions、Global variables and the symbol table entry.When linking, each module will be spliced ​​together,To form an executable file or a library file.在模块中,You can define the target data layout(target data layout),For example, the lowercase at the beginning of the above assembly code“e”是低字节序(Little Endian)的意思,For more than one byte of data,The low address low byte in memory,高位字节排放在内存的高地址端

target datalayout = "e-m:o-i64:64-f80:128-n8:16:32:64-S128"

The second line of the above assembly“target triple”Used to define the target host module,It including the architecture、厂商、The operating system three parts.

(2)函数.In the above assembly has adefineThe beginning of the function declaration,With the curly braces,这有点像C语言的写法,Than the assembly with the tag to represent a function more readable.Function declaration can bring a lot of modifier,Such as link type、调用约定等.如果不写,The default link type isexternal的,Is also can be exposed by other modules link.Calling conventions also there are plenty of options,缺省是“ccc”,也就是C语言的调用约定(C Calling Convention),而“swiftcc”则是 swift 语言的调用约定.This information is needed for generating assembly.The above function in the collectionfun1还带有“#0”的属性值,There are many properties defined in,These are also needed when generating assembly.

(3)标识符.分为全局(Glocal)和本地的(Local):全局标识符以@开头,including functions and global variables,The assembly of [email protected]就是;Local identifiers to%开头.Some identifiers have names such [email protected]或%a,Have a plenty of numerically can don't have a name,例如%1.

(4)操作码.alloca、store、load、add、retThese are the operation code.The meaning of them as shown in the figure below:

They are similar to the assembly seen earlier,But seems to be the body of the function of the code is a bit long,一个简单的“a+b+c”Translated into10多行代码,Also used to so many temp.But this is just an unoptimized format,Bring the optimization parameters after a little optimization,Will be cut to look like the following:

define i32 @fun1(i32, i32) local_unnamed_addr #0 {
  %3 = add i32 %0, 10
  %4 = add i32 %3, %1
  ret i32 %4

(5)类型系统.Assembly is a type of.如果用addInstruct it to think that the operation is an integer.而用fadd(或addss)指令,That operation is a floating point number.So there will be a type of unsafe risk,If the integer as a floating point number,The consequence is that the calculation result is completely wrong.LLVMAssembly with a type system.It can avoid unsafe data operation,and helps to optimize the algorithm.The type system includes primitive data types、函数类型和void类型.如下图所示:

Function types, including definition of return values and parameters,比如i32 (i32);voidType does not represent any value,There is no length.

(6)全局变量和常量.在LLVMCan declare a global variable in the collection.Global variables defined in the memory is allocated at compile time ok,而不是在运行时,For example, the following defines a global variableC:

@c = global i32 100, align 4

Also you can declare constants,Its value will not be modified at runtime,如下所示:

@c = constant i32 100, align 4

(7)元数据.In the above assembly also see to“!”At the beginning of some sentences,These are the metadata,It defines some additional information,To provide the optimizer and code generator using.

(8)基本块.In the function code is divided into basic block one by one,可以用标签(Label)To tag a basic block.下面这段代码有4个基本块,One of the first piece has a default name“entry”,Is as a basic block the entrance to the,This basic block does not have to label it:

define i32 @bb(i32) #0 {
  %2 = alloca i32, align 4
  %3 = alloca i32, align 4
  store i32 %0, i32* %3, align 4
  %4 = load i32, i32* %3, align 4
  %5 = icmp sgt i32 %4, 0
  br i1 %5, label %6, label %9

; <label>:6:                                      ; preds = %1
  %7 = load i32, i32* %3, align 4
  %8 = mul nsw i32 %7, 2
  store i32 %8, i32* %2, align 4
  br label %12

; <label>:9:                                      ; preds = %1
  %10 = load i32, i32* %3, align 4
  %11 = add nsw i32 %10, 3
  store i32 %11, i32* %2, align 4
  br label %12

; <label>:12:                                     ; preds = %9, %6
  %13 = load i32, i32* %2, align 4
  ret i32 %13

This code is actually equivalent to the followingC语言的代码:

int bb(int b){
    if (b > 0)
        return b * 2;
        return b + 3;

Each basic block is a set of instructions,Analyze the tags as9的基本块,To familiarize yourself with the basic block andLLVM指令的特点:

(1)第一行(%10 = load i32, i32* %3, align 4)的含义是:把3号变量(32位整型)从内存加载到寄存器,叫做10号变量,The memory alignment is4字节.这里延伸一下,When storing data in memory,有时会从2、4、8The address of integer multiple of bytes starts to store.Some assembly instructions require must be aligned address to collect the data from this.Other directives do not require,但If it is not aligned,比如从0x03Fetch the data address,Will spend more clock cycles.But the disadvantage of memory alignment is would waste memory space.The first line is the only entry for the entire basic block,From the other basic block jump up,Only to jump to the inlet line,cannot jump to other lines in the basic block.

(2)第二行(%11 = add nsw i32 %10, 3)的含义是:把10号变量(32位整型)加上3,保存到11号变量,其中nswis an addition calculation without sign wrapping(No Signed Wrap)的意思.

(3)第三行(store i32 %11, i32* %2, align 4)的含义是:把11号变量(32位整型)存入内存中的2号变量,内存对齐4字节.

(4)第四行(br label %12)的含义是:Jump to label as12的代码块.其中brinstruction is a terminating instruction.Put an end to the instruction or jump to another basic block,or return from the function(ret指令),The last line of the basic block must be a end instruction.

最后要强调,From the other basic block can not jump to the entry basic block,In the first basic function of.This provision is also conducive to data optimization..

五、The back-end technology reuse:LLVM

17. 在编译器后端,Do code optimization and the generated assembly code for each target platform,工作量是很大的,Need to use off-the-shelf tools.在前端部分,可以使用Antlr生成词法分析器和语法分析器.In the back-end part,也可以利用LLVM和GCCAfter the two side frame.LLVMIs an open source compiler infrastructure projects,Mainly focuses on the backend functionality of the compiler(代码生成、代码优化、JIT等).在AI领域的TensorFlow框架,In the back-end is withLLVM来编译,It the machine learningIR翻译成LLVM的IR,And then translated into supportCPU、GPU和TPU的程序.

与LLVMA backend compilation framework that plays a similar role isGCC(GNU Compiler Collection,GNU编译器套件).它支持了GNU LinuxIn many languages,例如C、C++、Objective-C、Fortran、Go和Java语言等.其实,It started as aC语言的编译器,Since the public back-end functionality also derived form the framework of,Supports multiple front-end languages ​​and back-end platforms.

接下来先来看看LLVMThe composition and characteristics of,LLVMAble to support multiple languages the front end of the、多种后端CPU架构.在LLVM内部,Using typed andSSA特点的IR进行各种分析、优化和转换,如下图所示:

LLVMThe project contains a lot of component:

(1)LLVM核心(core).It is the optimization and analysis tool in the picture above,Also includes for all kinds ofCPUFunction to generate object code;These libraries isLLVM IR,A well-defined intermediate language,The previous part has already had a preliminary understanding of it.

(2)Clang前端(是基于 LLVM 的 C、C++、Objective-C 编译器).



LLVMHave a good modular design and interface.Before the compiler backend technology hard to reuse,而LLVMTo define a good interface library,Convenient for the user to choose in what time,Which backend functions to reuse.Such as for code optimization,LLVMProvides many algorithm,Language designers can choose the appropriate algorithm,Or achieve their special algorithm,具有很好的灵活性.LLVM同时支持JIT(即时编译)和AOT(提前编译)两种模式.The language of the past are either interpreted,Run or compiled.Accustomed to using interpreted languages the programmer,It is difficult to used to have to wait for a compile time can see running effect,习惯在一个REPLInterface in the side to write script,See feedback in real time.LLVM既可以通过JITTechnical support interpretation and execution,Can fully compiled to perform again,This is attractive to language designers.

而且,LLVMSupport the optimization of the whole process in design,Including the compile time、链接时、装载时,甚至是运行时.Take runtime optimization as an example,基于LLVMIn the run time,Collect some performance data related to the code to compile optimization,Can be optimized in real time、Dynamic changes in memory machine;Can also collect the performance data,Then I do offline optimization,重新生成可执行文件,Then load the perform,in modern computing environment,Each functional characteristics are not the same,Really need to do different optimization for different scenarios.下图展现了这个过程:

18. 可以使用LLVM自带的命令行工具,Take steps experienceLLVM的功能:



(3)From the text formatIR生成二进制的字节码;

(4)把IRCompile into assembly code and executable.

从C语言代码生成IR代码比较简单,I have used one beforeC语言的示例代码,如下所示:

int fun1(int a, int b){
    int c = 10;
    return a+b+c;

Use front-end toolsClangit can be compiled intoIR代码,命令如下:

clang -emit-llvm -S fun1.c -o fun1.ll

其中,-emit-llvm参数告诉Clang生成LLVM的汇编码,也就是IR代码(如果不带这个参数,Is generated on the target machine assembly code),生成的LLVM IR如下:

; ModuleID = 'function-call1.c'
source_filename = "function-call1.c"
target datalayout = "e-m:o-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-apple-macosx10.14.0"

; Function Attrs: noinline nounwind optnone ssp uwtable
define i32 @fun1(i32, i32) #0 {
  %3 = alloca i32, align 4
  %4 = alloca i32, align 4
  %5 = alloca i32, align 4
  store i32 %0, i32* %3, align 4
  store i32 %1, i32* %4, align 4
  store i32 10, i32* %5, align 4
  %6 = load i32, i32* %3, align 4
  %7 = load i32, i32* %4, align 4
  %8 = add nsw i32 %6, %7
  %9 = load i32, i32* %5, align 4
  %10 = add nsw i32 %8, %9
  ret i32 %10

attributes #0 = { noinline nounwind optnone ssp uwtable "correctly-rounded-divide-sqrt-fp-math"="false" "disable-tail-calls"="false" "less-precise-fpmad"="false" "min-legal-vector-width"="0" "no-frame-pointer-elim"="true" "no-frame-pointer-elim-non-leaf" "no-infs-fp-math"="false" "no-jump-tables"="false" "no-nans-fp-math"="false" "no-signed-zeros-fp-math"="false" "no-trapping-math"="false" "stack-protector-buffer-size"="8" "target-cpu"="penryn" "target-features"="+cx16,+fxsr,+mmx,+sahf,+sse,+sse2,+sse3,+sse4.1,+ssse3,+x87" "unsafe-fp-math"="false" "use-soft-float"="false" }

!llvm.module.flags = !{!0, !1}
!llvm.ident = !{!2}

!0 = !{i32 1, !"wchar_size", i32 4}
!1 = !{i32 7, !"PIC Level", i32 2}
!2 = !{!"clang version 8.0.0 (tags/RELEASE_800/final)"}

Mentioned earlier can be generated by theIR做优化,Make the code shorter,As long as in the above command plus-O2参数就可以了(At this time to complete the function of the above steps for the second step):

clang -emit-llvm -S -O2 fun1.c -o fun1.ll

This time the body of the function of the core code is much shorter.It is the most important optimization action,From the original use of memory(allocaInstructions are allocated in the stack space,storeInstructions are written into the memory),Optimized to use only registers(%0、%1是参数,%3、%4也是寄存器),如下所示:

define i32 @fun1(i32, i32) #0 {
  %3 = add nsw i32 %0, %1
  %4 = add nsw i32 %3, 10
  ret i32 %4

还可以用optOrder to complete the above optimization,The concrete in the section about optimization algorithm to refine.另外,LLVM的IR有两种格式,In the example code is shown in its text format,文件名一般以.ll结尾;The second format is bytecode(bitcode)格式,文件名以.bc结尾.Why use two formats?Because the text format file for programmers to read,The bytecode format is a binary file,便于机器处理,Such as instant compilation and execution.After generating the bytecode format,Will take up much less space,So you can quickly loaded into memory,and convert to in-memory object format.while if loading a text file,It will take a parsing process,To become a memory of the format,效率比较慢.调用llvm-as命令,Can convert text to bytecode format:

llvm-as fun1.ll -o fun1.bc

也可以用clangDirectly generated bytecode,At this moment do not need to bring-SParameters but with-c参数,如下所示:

clang -emit-llvm -c fun1.c -o fun1.bc

因为.bc文件是二进制文件,cannot be viewed directly with a text editor,而要用hexdump命令查看(At this time to complete the function of the above steps in the third step):

hexdump -C fun1.bc

LLVM的一个优点,That is, you can compile and run bytecode on the fly,Don't have to compile the generated assembly code and an executable file to run(这一点有点像Java语言),这也让LLVM具有极高的灵活性,For instance at runtime according to collect performance information,Change the optimization strategy,Generate more efficient machine code.再进一步,Target platform to compile bytecode into the assembly code,使用的是llc命令,如下所示:

llc fun1.bc -o fun1.s

用clangCommand can also generate assembly code from the bytecode,要注意带上-S参数就行了,如下所示:

clang -S fun1.bc -o fun1.s

In this step has been the assembly code,Then you can further generate the target files and executable files.实际上,使用LLVMFrom the source code to generate an executable file has two possible paths:

(1)第一条路径,Is compile each source file into a bytecode file,Then compile it into an object file,最后链接成可执行文件.

(2)第二条路径,Is the first compiled bytecode file link together,To form a larger bytecode file,Then the optimization of the bytecode file for further,After generating the target file and executable file.

The second path more than the first path an optimization step,The first path to each module only optimized,To do the overall optimization.所以,Use the second route if possible,This results in more optimized code.Now complete the function of the above steps step 4,总结一下,The command line tools used include:clang前端、llvm-as、llc,其他命令还有opt(代码优化)、llvm-dis(Decompile the bytecode back toll文件)、llvm-link(链接)等.

六、生成IR:Realize the statically compiled languages

19. The above has been preliminarily understoodLLVM和它的IR,Also to be able to use the command line tools.However, it is still necessary to write a program to generateLLVM的 IR,这样才能复用LLVM的功能,To achieve a complete language.不过,If you want to like generated in front of the assembly language,Generated by string concatenationLLVM的IR,除了要了解LLVM IRMany of the details of the outside,The code must be more verbose and complex,Because the string concatenation is not structured way,So it is better to a well-defined data structure to representIR.

好在LLVMProject has been considered that,It provides a representativeLLVM IRA group of object model,As long as to generate these objects is equivalent to generateIR,This difficulty is much lower.而且LLVM还提供了一个工具类IRBuilder,It can be used to further enhance the creation ofLLVM IRThe efficiency of the object model of.

接下来,就先来了解LLVM IR的对象模型.LLVMIn internal usefulC++The implemented object model can fully representLLVM IR,When the byte code read into memory,LLVMwill build the model in memory.Only based on the object model can do further work,包括代码优化、Real-time compile and run、As well as the static compiler to generate the target file.所以说,The object model isLLVM运行时的核心,如下图所示:

IRThe header file in object modelinclude/llvm/IR目录下,Among the most important classes include:

(1)Module(模块).ModuleClass aggregates all the data in a module,it can contain multiple functions.可以通过Model::iteratorTo traverse all the functions in the module,It also includes a module's global variables.

(2)Function(函数).FunctionContains and function definitions(definition)或声明(declaration)About all the objects.Function definition contains the function body,While the function declaration only contains the function prototype,It is defined in other modules,In this module to use.可以通过getArgumentList()method to get a list of function arguments,Can iterate through all the basic block in the function body,These basic block will form aCFG(控制流图),如下所示:

//函数声明,没有函数体.This function is defined in another module,In this module to use
declare void @foo(i32)

//函数定义,Contains the body of the function
define i32 @fun3(i32 %a) {
  %calltmp1 = call void @foo(i32 %a)  //调用外部函数
  ret i32 10

(3)BasicBlock(基本块).BasicBlock封装了一系列的LLVM指令,可以借助bigin()/end()Model traversal of these instructions,还可以通过getTerminator()method to get the last instruction(The end instruction).You can also use several auxiliary method inCFG中导航,Such as to get a basic block the preamble of basic block.

(4)Instruction(指令).Instruction类代表了LLVM IR的原子操作(That is an instruction),可以通过getOpcode()to get the opcode it represents,它是一个llvm::Instruction枚举值,可以通过op_begin()和op_end()method pair to get the operands of this instruction.

(5)Value(值).ValueClass represents a value.在LLVM的内存IR中,If a class is fromValue继承的,Means that it defines a value,Can be used to other parties.函数、Basic block and instructions are inheritedValue.

(6)LLVMContext(上下文).这个类代表了LLVMA context in which compilation work is done,Contains a compilation of some global data,For example, the constants and types used by each module.

这些内容是LLVM IRThe main part of the object model,生成IR的过程,is dealing with these classes.

20. 接下来,With the program to generateLLVM的IR.Just mentioned eachLLVM IRClasses can be programmed to build.So for the followingfun1()函数生成IR,应该怎么办呢:

int fun1(int a, int b){
    return a+b;

(1)第一步,Can be to generate aLLVM模块,Also is the topIR对象.如下所示:

Module *mod = new Module("fun1.ll", TheModule);

(2)第二步,Continue to define functions in modulesfun1,因为Module are the main components function.But before defining the function,To define the prototype of the function first(或者叫函数的类型).The types of functions are described in the front-end section:If two functions return the same value,并且参数也相同,The type of these two functions are the same,So that you can do a function pointer type or function variable assignment.The function prototype of the sample code above is:返回值是32位整数,参数是两个32位整数.Have the function prototype later,You can use the function prototype defines a function.You can also set a name for each parameter,To facilitate the back reference this parameter,如下所示:

vector<Type *> argTypes(2, Type::getInt32Ty(TheContext));
FunctionType *fun1Type = FunctionType::get(Type::getInt32Ty(TheContext), //返回值是整数
      argTypes, //两个整型参数
      false);   //Not longer parameters

Function *fun = Function::Create(fun1Type, 
      Function::ExternalLinkage,   //链接类型
      "fun2",                      //函数名称
      TheModule.get());            //所在模块
string argNames[2] = {"a", "b"};
unsigned i = 0;
for (auto &arg : fun->args()){

这里需要注意,How variable types are used in code.The basis of all types are defined in advance,可以通过Type类的getXXXTy()方法获得.

(3)第三步,创建一个基本块.This function is only a basic block,You can have a name for it“entry”,You can also leave it unnamed.After creating the basic block,With an auxiliary classIRBuilderSet up an insertion point,After the sequence generated by the command will be inserted into the basic block of(IRBuilder是LLVM为了简化IRA helper class provided by the build process),如下所示:

BasicBlock *BB = BasicBlock::Create(TheContext,//上下文
               "",     //基本块名称
               fun);  //所在函数
Builder.SetInsertPoint(BB);   //Setup instruction insertion point

(4)第四步,生成"a+b"The expression of theIR,Insert into basic block.a和b都是函数fun的参数,Put it out, respectively, assigned toL和R(L和R是Value).然后用IRBuilder的CreateAdd()方法,生成一条add指令.This directive calculation results inaddtemp中,如下所示:

//The parameter variables toNamedValues里面备用
for (auto &Arg : fun->args())
    NamedValues[Arg.getName()] = &Arg;

Value *L = NamedValues["a"];
Value *R = NamedValues["b"];
Value *addtmp = Builder.CreateAdd(L, R);

(5)第五步,Take theaddtmp创建一个返回值,如下所示:


(6)最后一步,Check the correctness of this function.This is equivalent to doing a semantic check,比如,Basic block of the last statement must be a right of return instructions


The code for the above steps is formedcodegen_fun1()方法,Can call this method then print output generatedIR:

Function *fun1 = codegen_fun1();     //In the module to generateFunction对象
TheModule->print(errs(), nullptr);   //在终端输出IR


; ModuleID = 'llvmdemo'
source_filename = "llvmdemo"
define i32 @fun1(i32 %a, i32 %b) {
  %1 = add i32 %a, %b
  ret i32 %1

21. In order to familiar with moreAPI,The regeneration into one withif语句的IR,Then take a look at the functions include multiple basic block of.For example, generate for one of the following functionsIR:

int fun_ifstmt(int a)
  if (a > 2)
    return 2;
    return 3;

这样的一个函数,需要包含4个基本块:Entrance to the basic block、Then基本块、ElseBasic blocks and Merge基本块.控制流图(CFG)Is a separate first, then merge,像下面这样:

In the entrance basic block to calculate“a>2”的值,According to this value respectively to jump toThenBB和ElseBB.这里用到了IRBuilder的CreateICmpUGE()方法(UGE是”不大于等于“,也就是小于).The instruction of the return value is a1位的整型,也就是int1,如下所示:

Value * L = NamedValues["a"];
Value * R = ConstantInt::get(TheContext, APInt(32, 2, true));
Value * cond = Builder.CreateICmpUGE(L, R, "cmptmp");

Then create another3个基本块,并用IRBuilder的CreateCondBr()method to create a conditional jump instruction:当cond是1时跳转到ThenBB,0时跳转到ElseBB,如下所示:

BasicBlock *ThenBB =BasicBlock::Create(TheContext, "then", fun);
BasicBlock *ElseBB = BasicBlock::Create(TheContext, "else");
BasicBlock *MergeBB = BasicBlock::Create(TheContext, "ifcont");
Builder.CreateCondBr(cond, ThenBB, ElseBB);

If you are careful, you may find,在创建ThenBBSpecified when the function isfun,The other two basic block is not specified.This is because the nextThenBB生成指令,So first added to thefun中.Then add theElseBB和MergeBB到fun中,如下所示:

Value *ThenV = ConstantInt::get(TheContext, APInt(32, 2, true));

fun->getBasicBlockList().push_back(ElseBB);  //The basic block join the function
Value *ElseV = ConstantInt::get(TheContext, APInt(32, 3, true));

在ThenBB和ElseBBIn the code of these two basic blocks,Two values ​​were calculated separately:ThenV和ElseV.both of them may be the last return value,But the specific use which also look at the actual runtime,Flow control isThenBB还是ElseBB.这就需要用到phi指令,It completed according to control flow to select the appropriate value of task,如下所示:

//PHI节点:整型,Two candidate values
PHINode *PN = Builder.CreatePHI(Type::getInt32Ty(TheContext), 2); 
PN->addIncoming(ThenV, ThenBB);  //Preceding basic block isThenBB时,采用ThenV
PN->addIncoming(ElseV, ElseBB);  //Preceding basic block isElseBB时,采用ElseV


This code can see from above,if语句中phiInstruction is the key.Because when the program is controlled through a basic block too much,Each basic block may change a value,通过phiInstructions can know what's the runtime actually walk in the path,to get the correct value.最后生成的IR如下,其中的phi指令指出,If the preorder basic block isthen取值为2,是else时取值为3:

define i32 @fun_ifstmt(i32 %a) {
  %cmptmp = icmp uge i32 %a, 2
  br i1 %cmptmp, label %then, label %else

then:                                             ; preds = %0
  br label %ifcont

else:                                             ; preds = %0
  br label %ifcont

ifcont:                                           ; preds = %else, %then
  %1 = phi i32 [ 2, %then ], [ 3, %else ]
  ret i32 %1

Looping statements withif语句差不多,Because they are to involve more than one basic block,都要用到phi指令.

22. 在写程序时,A local variable is one of the essential elements,So put just a sample program changes,Use local variablesb保存ThenBB和ElseBB中计算的值,借此学习一下LLVM IRhow to support local variables.The changed sample program is as follows:

int fun_localvar(int a)
  int b = 0;
  if (a > 2)
     b = 2;
     b = 3;
  return b;

Now comes the challenge,在这段代码中bBe declared a,赋值了3次.前面知道LLVM IR采用的是SSA形式,Is assigned to once with each variable is only permitted to,So in the case of assignment for many times how to generateIR呢?其实,LLVMIt is stipulated that only a single assignment can be made to the register,And the memory of variables can be multiple assignment.对于“int b = 0;”,Generated with the following sentencesIR:

AllocaInst *b = Builder.CreateAlloca(Type::getInt32Ty(TheContext), nullptr, "b");
Value* initValue = ConstantInt::get(TheContext, APInt(32, 0, true));

Builder.CreateStore(initValue, b);

上面这段代码的含义是:首先用CreateAlloca()In the stack method applied to a piece of memory,用于保存一个32位的整型,接着用CreateStore()Method to generate astore指令,给b赋予初始值.A few words of generated aboveIR如下:

%b = alloca i32
 store i32 0, i32* %b

接着,可以在ThenBB和ElseBB中,Respectively to the memory ofb赋值,如下所示:

Value *ThenV = ConstantInt::get(TheContext, APInt(32, 2, true));
Builder.CreateStore(ThenV, b);

Value *ElseV = ConstantInt::get(TheContext, APInt(32, 3, true));
Builder.CreateStore(ElseV, b);





define i32 @fun_ifstmt.1(i32 %a) {
  %b = alloca i32
  store i32 0, i32* %b
  %cmptmp = icmp uge i32 %a, 2
  br i1 %cmptmp, label %then, label %else

then:                                             ; preds = %0
  store i32 2, i32* %b
  br label %ifcont

else:                                             ; preds = %0
  store i32 3, i32* %b
  br label %ifcont

ifcont:                                           ; preds = %else, %then
  ret i32* %b

当然,使用内存保存临时变量的性能比较低,but can be easily optimized by the algorithm,Take the above code from the memory-using version,Optimized using register version.

23. Now have been able to build in memoryLLVM的IR对象了,包括模块、函数、Basic blocks and various instructions.LLVMCan compile and execute this immediatelyIR模型.Next, create a new one with no parameters__main()函数作为入口,At the same time will take this example of extending the function call.function declared beforefun1,现在在__main()Function demonstrates how to call it,如下所示:

Function * codegen_main(){
    FunctionType *mainType = FunctionType::get(Type::getInt32Ty(TheContext), false);
    Function *main = Function::Create(mainType, Function::ExternalLinkage, "__main", TheModule.get());

    BasicBlock *BB = BasicBlock::Create(TheContext, "", main);

    int argValues[2] = {2, 3};
    std::vector<Value *> ArgsV;
    for (unsigned i = 0; i<2; ++i) {
        Value * value = ConstantInt::get(TheContext, APInt(32,argValues[i],true));
        if (!ArgsV.back())
            return nullptr;

    Function *callee = TheModule->getFunction("fun1");
    Value * rtn = Builder.CreateCall(callee, ArgsV, "calltmp");
    return main;

调用函数时,From first module known asfun1的函数,Ready to parameter values,然后通过IRBuilder的CreateCall()Method to generate a function call instruction.最后生成的IR如下:

define i32 @__main() {
  %calltmp = call i32 @fun1(i32 2, i32 3)
  ret i32 %calltmp3

接下来,The compiler invoke instant engine to run__main函数.使用这个JIT引擎,需要做几件事情:

(1)Initialize the associated with the target hardware platform Settings,如下所示:


(2)Add the created model toJIT引擎中,找到__main()函数的地址(整个过程跟CLanguage used in the function pointer to perform a function not much difference),如下所示:

auto H = TheJIT->addModule(std::move(TheModule));

auto main = TheJIT->findSymbol("__main");

int32_t (*FP)() = (int32_t (*)())(intptr_t)cantFail(main.getAddress());

int rtn = FP();

fprintf(stderr, "__main: %d\n", rtn);

(3)Programs can be executed successfully,并打印__main函数的返回值.

Now that you've demonstrated how to call a function,Revealed hereLLVMA surprising feature:可以在LLVM IRcall locally written functions,比如编写一个foo()函数,Used to print out some information:

void foo(int a){
    printf("in foo: %d\n",a);

然后就可以在__mainIn the direct call thisfoo函数,就像调用fun1函数一样,如下所示:

//Call an external functionfoo
vector<Type *> argTypes(1, Type::getInt32Ty(TheContext));
FunctionType *fooType = FunctionType::get(Type::getVoidTy(TheContext), argTypes, false);

Function *foo = Function::Create(fooType, Function::ExternalLinkage, "foo", TheModule.get());

std::vector<Value *> ArgsV2;
if (!ArgsV2.back())
    return nullptr;

Builder.CreateCall(foo, ArgsV2, "calltmp2");

注意,在这里只对fooFunction to do the statement,There is no definition of the function body,这时LLVMIn the external searchfoo的定义,It will find a useC++编写的foo函数,然后调用并执行;如果fooAnother target function in the file,它也可以找到.


24. Code optimization is one of the two largest work in the compiler backend(The other is code generation).代码优化的目标,Is to optimize the program for the use of computer resources.Usually the most concernedCPU资源,最大效率地利用CPUResources can improve the performance of the program.Code optimization sometimes there will be other targets,Such as the code size、内存占用大小、磁盘访问次数、Number of network communication, etc.From the object of code optimization,Most of the code optimization are inIR上做的,rather than at the previous stageASTAnd after a stage of assembly code,为什么呢?

其实,在ASTcan also do some optimization,Such as the compiler front end,will put some unnecessaryASTCut off(例如add->mul->pri->Int,Each parent node has only one child,can be directly reduced to aInt节点),But it's too abstract,Contains the hardware architecture of information too little,Difficult to perform many optimization algorithms. On the assembly code optimization makes algorithm associated with the machine,When change a target machine will rewrite optimization code.所以,IRIs the most suitable,It can try to be independent machine,At the same time exposed a lot of optimization opportunities.

Look from the optimization of the scope of the,分为本地优化、Global optimization and process optimization.Optimization is usually for a set of instructions,The most common and most important instruction set is basic block.The characteristics of basic block is:Each basic block can only be accessed from the entrance,exit from last instruction,Each instruction will be executed sequentially.Because of the characteristics of,It is more convenient when doing certain optimizations.For example, for the following basic block,can be safely3行的“y:=t+x”改成“y:= 3 * x”,因为tThe assignment must be iny的前面:

  t:=2 * x
  y:=t + x
  Goto BB2

Against the basic block optimization,叫做本地优化(Local Optimization).Then another problem is that you could change the second row“t:=2 * x”Also optimized to delete?It depends on whether other code will refer tot.Therefore, a wider analysis is required,Can decide whether to drop the second line optimization.Beyond the scope of basic block is analyzed,需要用到控制流图(Control Flow Graph,CFG).CFG是一种有向图,It embodies the basic block instruction flow relationship between.如果从BB1The last instruction is to jump toBB2,那么从BB1到BB2There is an edge.一个函数(或过程)if it contains multiple basic blocks,可以表达为一个CFG,如下图所示:

If by analyzingCFG,发现tAre not used in other places,delete the second line.This for a function、基于CFG的优化,叫做全局优化(Global Optimization).Than the broader global optimization optimization,叫做过程间优化(Inter-procedural Optimization),It can cross function boundaries,To optimize the relationship between the multiple functions,rather than optimizing for just one function.

But don't need to do code optimization thoroughly every time,Because do code optimization itself also need to consume the resources of the computer.所以,Need to weigh the benefits of code optimization and the optimization itself the cost of these two aspects,Then determine how much optimization to do.比如,In the browser to loadJavaScript时,JavaScriptThe engine will do optimization about grammar,But if the optimal consumption too long interface response will be slow,On the contrary, it affects the user's experience of using the page,所以JavaScriptEngine do optimization to grasp the right degree or adjust the timing.

Then know some common code optimization scenario:

(1)代数优化(Algebraic Optimazation).It is one of the most simple optimization,When the operator is algebraic operation,According to mathematical knowledge is optimized.比如“x:=x+0”这行代码,操作前后x没有任何变化,So the code can delete;又比如“x:=x*0”可以简化成“x:=0”;For some machine shift operation speed is faster than the multiplication,那么“x:=x*8”可以优化成“x:=x<<3”.

(2)常数折叠(Constant Folding).It refers to the operation of constant can be calculated at compile time,比如 “x:= 20 * 3”可以优化成“x:=60”.另外在if条件中,if the condition is a constant,Then you can certainly take a branch.比如:“If 2>0 Goto BB2”可以简化成“Goto BB2”就好了.

(3)Delete the inaccessible basic block.Some codes can never be activated,Such as in conditional compilation of scenario would write such a program:“if(DEBUG) {…}”.如果编译时DEBUG是一个常量false,Then this block of code does not need to be compiled..

(4)删除公共子表达式(Common Subexpression Elimination).下面这两行代码,x和 yThe form on the right is the same,If between these two lines of code,a和b的值没有发生变化(比如采用SSA形式),那么x和yValue must be the same:

x := a + b
y := a + b


x := a + b
y := x

(5)copy transmission(Copy Propagation)和常数传播(Constant Propagation).下面的代码中,The third line can be replaced with“z:= 2 * x”, 因为y的值就等于x,This is called a spread copy:

x := a + b
y := x
z := 2 * y

如果y := 10,常数10Also can be transmitted down,The last line to replacez:= 2 * 10,This is called a constant one more constant fold,就变成z:=20了.

(6)死代码删除(Ded code elimination).in the above copy spread,If there is no other place to usey变量了,then the second line is dead code,就可以删除掉.

One optimization can lead to another,Such as copy propagation cause y不再被使用,It can also be optimized for dead code removal.所以Generally, multiple optimizations are performed、多次扫描.

25. The above optimization scenarios,Can be used for local optimization、Global optimization and process optimization.Look at how to do the local optimization in here,We'll go on with the talk behind global optimization.Suppose the following code is a basic block(omit final termination instruction):

a := b
c := a + b
c := b
d := a + b
e := a + b

为了优化它们,Method is a“可用表达式(available expression)”的集合.An available expression means that there is a variable,Holds the value of an expression.Compute this set sequentially from top to bottom:

(1)Starting with an empty set.

(2)经过第一行代码后,Increase in the collection“a:=b”;

(3)After the second line of code,增加了“c:=a+b”.

(4)After the third line of code,由于变量c的定义变了,所以“c:=a+b”不再可用,而是换成了“c:=b”,如下图所示:

(5)能看到代码“e:=a+b”,和集合中的“d:=a+b”The right part of the equal sign is the same,So first you can remove common subexpression,优化成“e:=d”.变成下面这样:

(6)Then spread can make a copy,利用“a:=b”,The expression of multiplea都替换成b,如下图所示:

到目前为止,a都被替换成了b,对eThe calculation of also simplifies the,The optimized code becomes the following:

a := b
c := b + b
c := b
d := b + b
e := d

观察一下这段代码,It seems that there is still room for optimization,For example there are dead code,可以将其删除.Suppose in the basic block of the order,b和c仍然会被使用,But other variables will not be used anymore.那么上面这5Which line of code can be deleted??这时要做活跃性分析(Liveness Analysis).A variable is alive,Means that its value before the change will be other code to read(对于SSA格式的IR,Variables defined after won't change,So as long as see the back of the code have to use this variable is ok).Here, the activity of each variable will be analyzed,The death of variable delete.

This is to use a collection,But after this collection from forward、Reverse calculation of,如下图所示:

To start with the elements in the collection is{b, c},This is the initial valueb和cwill be used by the code behind,So they are alive.

(1)扫描过“e := d”后,因为用到了d所以d是活的,结果是{b, c, d}.

(2)再扫描“d := b + b”用到了b,But in the collection has ab了;这里给d赋值了,Has already met the code behind thed的要求,So can be removed from the collectiond了,结果是{b,c}.

(3)再扫描“c := b”,Remove from the collectionc结果是{b}.

(4)Continue scanning until the first line,The last collection is still{b}.

Based on the set right now,Can work as a dead code to delete.When assigning a value to a variable,It is followed by a collection of no this variable,It is not needed,就可以删掉了.Outlined in red three lines are dead code,都可以删掉:

After deleting, only two lines of code remain.注意由于“e := d”被删掉了,导致dAlso no longer need to,Turned out to be dead variable,如下图所示:

把变量d删掉以后,just one line of code left“c := b”了:

This far, and completed the whole optimization process,5Lines of code optimization is1行代码.Summarize the optimization process:

(1)首先To do a positive scanning,Perform available expression analysis,We build the expression of available collection,And then the whole set with reference to the common subexpression,Spread and do copy.

(2)接着To do a reverse scan,Activity analysis,Build a collection of live variables identify dead variable,According to delete it to death variable assignment code.

(3)The above optimization may need to be done more than once,才能得到最后的结果.

Do the optimization of the above is based on a sequential code,没有跳转,belong to a basic block,Belongs to the local optimization.Local optimization is in progress,Available expression and activity analysis can be considered by the following4个元素构成的:

(1)D(方向).whether to traverse forward or backward.

(2)V(值).Code of every place to compute a value.Available value of the expression and activity analysis is a collection,There are also some analyzed values ​​that are not sets.

(3)F(转换函数,对V进行转换).Such as when doing available expression analysis,遇到“c := b”时,The set of available expressions starts from{a := b, c := a + b}转换成了{a := b, c := b}.The conversion rules obeyed here are:因为变量c被重新赋值了,Then the variables from the collectioncThe original definition to remove,并把带有cRemove the expression,因为过去的c已经失效了,然后把变量cAdded a new definition.

(4)I(初始值,At the beginning of the algorithm isV的取值).Do available expression analysis of the initial value is an empty set.When doing activity analysis of the initial value is behind in the code will also visit the variables,Also is the live variables.

26. After this formalization,Can be in accordance with the unified the model to understand all kinds of local optimization algorithm.接下来熟悉一下LLVM的优化功能.Used in front of theClang命令带上O2parameters to generate optimizedIR:

clang -emit-llvm -S -O2 fun1.c -o fun1-O2.ll

实际上,LLVMThere is also a separate commandopt来做代码优化.By default its input and output are.bc文件,所以还要在.bc和.llConvert between two formats,如下所示:

clang -emit-llvm -S fun1.c -o fun1.ll  //生成LLVM IR
llc fun1.ll -o fun1.bc                 //编译成字节码
opt -O2 fun1.bc -o fun1-O2.bc          //做O2级的优化
llvm-dis fun1-O2.bc -o fun1-O2.ll      //Decompile bytecode into text format

One thing to note,Is the first line command producesfun1.ll文件中的“optone”这个属性去掉,Because of this it means don't code optimization.You can also simplify the above,给opt命令带上-SParameters directly on.ll文件进行优化:

opt -S -O2 fun1.ll -o fun1-O2.ll

-O2represents the second-level optimization,LLVMDefined in multiple optimization level,Basically, the greater the number of optimization, the more.Can not use general optimization level,It specifies the use of a particular optimization algorithm,比如mem2regAlgorithm will get access to memory optimization into access registers as far as possible,如下所示:

opt -S -mem2reg fun1.ll -o fun1-O2.ll

For constant folding,在调用API生成IR时LLVMThis optimization will be done by default.For example the following code is to return2+3的值,但生成IRWhen directly into a5,Because this optimization is simply don't need to do the analysis of the complex:

Function * codegen_const_folding(){
    FunctionType *funType = FunctionType::get(Type::getInt32Ty(TheContext), false);
    Function *fun = Function::Create(funType, Function::ExternalLinkage, "const_folding", TheModule.get());

    BasicBlock *BB = BasicBlock::Create(TheContext, "", fun);

    Value * tmp1 = ConstantInt::get(TheContext, APInt(32, 2, true));
    Value * tmp2 = ConstantInt::get(TheContext, APInt(32, 3, true));
    Value * tmp3 =  Builder.CreateAdd(tmp1, tmp2);

    return fun;


define i32 @const_folding() {
  ret i32 5

需要注意,Many optimization algorithms are based on register variables to do,So usually do first-mem2reg优化.在LLVMDo the optimization algorithm is very convenient,因为它采用的是SSA格式.具体来说,LLVM中定义了Value和User两个接口,它们体现了LLVM IR最强大的特性,The definition of static single assignment-使用链,这种定义-The usage relationship will be used in the optimization algorithm.前面已经讲过了Value类.If a class is fromValueInheritance means that it defines a value.另一个类是User类,Function and the instruction is alsoUser类的子类,That is to say, in the function and instruction,Values ​​defined elsewhere can be used,如下图所示:

These two classes is how to help the optimization algorithm of?在UserHave access to all it used inValue,For example, a add instruction(%c = add nsw i32 %a, %b)用到了a和b这两个变量.而在ValueYou can access all objects that use this value inUser,比如给cThe assignment of this instruction.所以可以遍历一个Value的所有UserTo replace it into anotherValue,This is the spread of copy.

接下来,See how to do it programmaticallyIR的优化.在LLVM内部,Optimize work through one by onePass(遍)来实现的,它支持三种类型的Pass:

(1)一种是分AnalyticalPass(Analysis Passes),Just doing some analysis results for sequence after operation.

(2)Some are doneCode conversion(Transform Passes),Such as do common subexpression delete.

(3)还有一类passIs a tool type,Such as to correctness verification module to do.

下面的代码创建了一个PassManager,and added two optimizationsPass:

// 创建一个PassManager
TheFPM = std::make_unique<legacy::FunctionPassManager>(TheModule.get());

// Peephole optimization and some calculation optimization

// Heavy expression associated


之后,Simply call againPassManager的run()method to optimize the code:



27. The removal of common subexpressions is mentioned above、Copy propagation such as local optimization can do,In fact this a few work can also be done in global optimization.Just a global optimization algorithm of,Not like in the local optimization only for a basic block,But the more complicated,Because want to cover multiple basic block.These basic blocks form aCFG,Code at runtime a variety of possible execution paths,This creates value under multipath calculation problem,Such as active variable set calculation.Of course there are some optimization can only in global optimization to do,Can't do it in local optimization,比如:

(1)代码移动(code motion)Able to code from a basic block to move to another basic block,For example, move from inside the loop to outside the loop,To reduce the unnecessary calculation.

(2)Some redundancy delete(Partial Redundancy Elimination),It can take a basic block are deleted.

总之,Global optimization than local optimization can do work more,Analysis algorithms are also more complex,因为CFGThere may be multiple execution path.不过,In the above mentioned local optimization algorithm ideas on,Solve the multipath situationVThe calculation problem of value.而这种基于CFGWith the method of optimization analysis framework,data flow analysis.Speak in front of the activity at the local optimization analysis,When situation is simple,Don't need to consider multipath problem.而在When do the global optimization,情况就要复杂一些:Code is not simply in a basic block order,and possibly via a control flow graph(CFG)中的多条路径.来看一个例子(由ifStatement which formed the two branch):

基于这个CFGGlobal activity analysis can do,Starting from the bottom of basic block,Calculates the set of active variables backwards and forwards(That is, from basic block5Backwards to basic block1计算).这里需要注意,for basic blocks1进行计算时,Its input is the basic block2The output of which is{a, b, c},和基本块3The output of which is{a, c},The result of the calculation is the union of these two sets{a, b, c}.That is basic block1After the order of the basic block,It is possible to use these three variables.Here's where it differs from local optimization,To be calculated based on multiple paths,如下图所示:

Based on the analysis,马上发现yVariables can be deleted(Because live variables set in front of it{x}不包括y,That is, it is not used by the code behind),and affects the set of active variables,如下图所示:

删掉yVariable will continue to optimize the later round,会发现d也可以删掉.如下图所示:

d删掉以后,2There is no code in the basic block,Can also be deleted,最后的CFG是下面这样:

So far found:Overall global optimization with local optimization is very similar,The only difference is the contents of the collection were calculated based on multiple branch(也就是V值).In the basic block1时,2和3Two branches met(meet),取了2和3的V值的并集,This is the basic characteristic of data flow analysis.但是,上面这个CFG还是比较简单的,Because it has no cycle belongs to the directed acyclic graph.The characteristics of this figure is:for each node in the graph,Will always find its sequence nodes before and after the sequence node,So I just need to be calculated in accordance with the order.But if you add a loop,就不那么简单了,Take a look at the picture below:

基本块4There are two sequence nodes after,分别是5和1,所以要计算4active variable,就需要知道5和1的输出是什么.5The output is easy to say,但1Haven't calculated,因为要计算1就要依赖2和3,thus indirectly dependent on4.这样一来,1和4Is the circular dependencies.Further exploration found,其实1、2、3、4All four nodes are circularly dependent.所以说,一旦在CFGIntroduced in the loop,Strict calculation before and after the order is no.

In front of calculationFirst和Follow集合时,Will meet the condition of the circular dependencies,But there is no such a carefully analysis.不过,The time is to use不动点法To break the deadlock of,Here is to use the fixed point method,具体操作是:For each basic blockVValues are assigned initial value,也就是空集合,如下图所示:

Then do multiple calculations for all nodes,Until all set stability.For the first time according to the5-4-3-2-1的顺序计算(Virtually any order will do),计算结果如下:

If the calculation ends now,In fact, the basic block can be2中的d变量删掉.But if again according to the5-4-3-2-1The order of the computation again,Will add some new elements into the collection(The drawing is red).This is because the calculation of basic block4时,基本块1的输出{b, c, d}也会变成4的输入.At this time, it is found that the basic block is entered2时,The set of live variables containsd的,所以d是不能删除的,如下图所示:

再仔细看,这个dYes is the basic block3需要的:它会跟1去要,1会跟4要,4跟2要.So once again,1、2、3、4Four nodes is depend on each other.再来看一下,For live variables set calculation,When two branches meet under the condition of,The end result took two branches and set.Mentioned before a local optimization analysis includes four elements:方向(D)、值(V)、转换函数(F)和初始值(I).When doing global optimization needs to add an element,Is the two branches to do an operation when they met,Calculating the value of their intersection,This operation can be used with uppercase Greek lettersΛ(lambda)表示.包含了D、V、F、I和Λ的分析框架,data flow analysis.

28. 那么Λ怎么计算呢?requires a mathematical tool,叫做“半格”(Semilattice).首先,A semilattice is a偏序集(Partially Ordered Set).Poset is set only some members will be able to compare the size,For example, when doing global liveness analysis,{a, b, c}和{a, c}To meet the new value is{a, b, c},Formally written{a, b, c}Λ{a, c} = {a, b, c}.这时候说{a, b, c}是可以跟{a, c}比较大小的.So which is bigger and which is smaller??如果 XΛY=X,就说X<=Y.所以{a, b, c}是比较小的,{a, c}是比较大的.

当然,{a, b, c}也可以跟{a, b}比较大小,But it can't follow{c, d}比较大小.So the contains a{ {a, b, c}、{a, c}、{a, b}、{c, d}…}Such a collection,叫做偏序集,Only part of them can compare the size between members.Which members can be compared??Is a semilattice diagram below,Can be connected by directional lines.Half grid can be drawn as graphics,Assume that the program isa, b, c三个变量,Then the half painted graphics is such:

Along the line in the figure above,The size of the two values can be,According to the direction of the arrow in order to reduce:{}>{a}>{a, b}> {a, b, c}.If there is no path between the two values,Then they just can't compare between the size of the,就像{a}和{b}Will not be able to compare the size.For this half,把{}(空集)叫做Top,Topgreater than all other values.而{a, b, c}叫做Bottom,It is the smallest value.When doing activity analysis,ΛThe operation is to calculate the maximum lower bound of two values(Greatest Lower Bound).is a value smaller than both original values,取最大的那个.{a}和{b}的最大下界是{a, b},{a, b, c}和{a, c}The greatest lower bound is{a, b, c},规则如下:

(1)If a partially ordered set,Any two elements have the greatest lower bound,Then this partially ordered set is called an intersecting half-lattice(Meet Semilattice).

(2)与此相对应的,If every element has a least upper bound of a set of(Least Upper Bound),And then the poset is called a semilattice(Join Semilattice).

(3)If a partially ordered set is both an intersecting semilattice,And half again,He said the poset is a,The above example of the poset is a(Lattice).

Why are the introduction of such a complex set of mathematical tools?Meet two branches will count them and set,不就可以了吗?事情没那么简单.因为Not all of the analysis,其VValues are a collection of,Even the collection,The operation when intersecting is not necessarily a union,And may be masked.Here by another case to analyze the collection a semilattice arithmetic:常数传播.

常数传播,Is if you know a variable value is a constant,Then use the expression that uses this variable,With the constant to replace.看看下面的例子,在基本块4中aThe value with a constant replacement?

答案是不能.Reach the basic block4的两条路径,一条a=3另一条a=4,Don't know when actual operation from which way,所以这个时候a的取值是不确定的,基本块4中的aCan't replace with constant.在这种情况下,VIs no longer a collection,而是aMay take the constant value of,但amay not be a constant,So to define a special value:Top(T).除了T之外,To introduce a andTThe corresponding special values:Bottom(含义是A statement is never perform).总结起来,Constant communicationV的取值可能是3个:


(2)Top:意思是aThe value is not a constant.

(3)Bottom:A statement will not be executed.

How are the values ​​sorted??最大的是Top,There is no comparison between each constant among,Bottom是最小的.接下来,Look at how to calculate the multipleVThe intersection of value.Then the calculation process to formalize the,In this analysis when after each statement,Vvalue may change,With the following two functions to represent the different parts of theV值:

(1)C(a, s, in).expressed in the statements之前a的取值,比如C(a, b:=a+2, in) = 3.

(2)C(a, s, out).expressed in the statements之后a的取值,比如C(a, a:=4, out) = 4.

如果sThe preorder hasi条可能的路径,Then multiple output and an input“C(a, si, out)和C(a, s, in)”的关系,A set of rules can be made:

(1)If there is an input path isTop,或者说C(a, si, out)是Top,那么结果C(a, s, in)就是Top.

(2)If there are two different constants in the input such as3和4,那么结果也是Top.

(3)If all of the input is the same constant orBottom,So the result is constant.For example if all pathsa的值都是3,So here you can safely thinka的值是3.那些BottomPath does not matter,Because the whole path does not perform.


上面的这4个规则,Is a set of the calculation of a semilattice rules.Here also can summarize its transformation rules also isF,Consider aStatement的in值和out值的关系,Namely after theStatement以后,VWhat value will there be changes:

(1)如果输入是Bottom,那么输出也是Bottom.Is this path will not pass.

(2)如果该Statement就是“a := 常数”,then the output is the constant.

(3)如果该Statement是aTo give a more complex expression rather than constant,那么输出就是Top.


This conversion functionF也搞清楚了.初始值I其实就是Top,Because the beginninga还没有赋值,So not is constant;方向D是向下的,这个时候D、V、F、I 和Λ这5Are you clear element,You can write the algorithm to achieve.


29. A formal compiler backend,Code generation part of the need to consider more closely to just go,其实主要有三点:

(1)指令的选择.A same function can be finished with different sequences of instructions or,The need to select more optimized solutions.

(2)寄存器分配.每款CPUThe registers are limited,To effectively use it.

(3)指令重排序.Calculate the execution sequence would affect the efficiency of the generated code.without affecting the results,To obtain higher efficiency through code reordering.

Next to the first question,Talk about why you need to select instruction,and how to choose an order.If the performance of the target code is not considered,May, in accordance with the very mechanical translation code.Such as make a template code translation,把形如“a := b + c”The code is translated into the following assembly code:

mov b, r0  //把b装入寄存器r0
add c, r0  //把c加到r0上
mov r0, a  //把r0存入a


a := b + c
d := a + e

Will be automatically translated into:

mov b, r0  
add c, r0  
mov r0, a  
mov a, r0  
add e, r0  
mov r0, d  

You can see from the above code,第4Line is redundant,因为r0的值就是aNo need to load it again.另外,If the back of the code does not usea(即a只是个临时变量),那么第3Line is redundant.The algorithm is correct no problem,But the amount of code is too big price is too high.So it is best for the optimization of the algorithm to generate more smarter code.This is one of the reasons to do instruction selection.

The second reason to do instruction selection is,Realize the same function can use a variety of instructions,特别是CISC指令集(Alternative choice a lot,But each has its applicable scenarios).对某个CPU来说,Accomplish the same tasks can adopt different instruction.比如实现“a := a + 1”Three codes can be generated,如下所示:

mov a, r0  
add $1, r0  
mov r0, a

You can also use one line of code directlyinc指令,And lowest overall cost to see what kind of way:

inc a


mov $0, r0   //Assignment for immediate number0
xor r0, r0   //异或操作
sub r0, r0   //With its own value to reduce

再比如,a * 7可以用a<<3 - a 实现:首先移位3位相当于乘8,Then subtract aa就相当于乘以7.Although use the two instructions,But may consume less total clock cycle.So what is the idea of ​​doing command selection??By far the most mature algorithms are based on tree cover method,Through an example to understand what is belowTree Covering Algorithm,a[i] = b这个表达式中,假设a和bare local variables on the stack,iIs on the registerri中.One can use this expressionAST表示,如下图所示:

May think the tree look likeASTBut not like,That's because there is amem节点(Meaning is deposited in the memory)、mov节点、栈指针 (fp).It can be classified as low(low-level)AST,是一种IR的表达方式,Sometimes referred to as structuredIR.这个ASTContains rich runtime details,相当于把LLVM的IRWith tree structure to represent the,Can be painted such a basic block instructions are the tree structure of.Based on the tree,Can be translated into the following assembly code:

load M[fp+a], r1 //Get the address of the beginning of the array,放入r1,fp是栈顶的指针,aIs the address of the offset
addi 4, r2       //把4加载到r2
mul ri, r2       //把ri的值乘到r2上,即i*4,i.e. the offset of the array element,每个元素4字节
add r2, r1       //把r2加到r1上,也就是算出a[i]的地址
load M[fp+b], r2 //把b的值加载到r2寄存器
store r2, M[r1]  //把r2写入地址为r1的内存

Here is a hypothetical assembly code,跟LLVM IR有点像,But more simplified、易读,如下图所示:

Use the method of tree cover can greatly reduce the amount of code,Surrounded by below in red line part is called a瓦片 (tiling),The contains operator tiles,Can be converted into an instruction.Each tile can cover multiple nodes,So the instructions generated less:

What are the tiles used for??Each machine instruction will be correspondingIR的一些模式(Pattern),Can be represented as a small tree,And the young trees can as tiles,如下图所示:

The algorithm can traverse theAST,Meet the above model can generate the corresponding instruction.以load指令为例,It has several mode:Add a constant to any node,This is equivalent to assembly language of indirect address access;或者memdirectly below is a constant,This is so direct address access.最后,Address values can also be calculated by the child nodes at a lower level, come out.

所以,从一棵AST生成代码的过程,Is to use these young trees to match a tree,And the whole process of tree cover,So it's called tree-covering algorithm.上面图中2、4、5、6、8、9These nodes generate assembly code in turn.要注意的是,There may be more than one coverage,For example, the following coverage,Compared to the previous result,它在8和9There is a difference between two tiles:

The last two sentences so that the generated assembly code is different,如下所示:

load M[fp+a], r1  //Get the address of the beginning of the array,放入r1,fp是栈顶的指针,aIs the address of the offset
addi 4, r2        //把4加载到r2
mul ri, r2        //把ri的值乘到r2上,即i*4,i.e. the offset of the array element,每个元素4字节
add r2, r1        //把r2加到r1上,也就是算出a[i]的地址
addi fp+b, r2     //把fp+b的值加载到r2寄存器
movm M[r2], M[r1] //put the address as r2to copy the value to the address asr1内存里

可以体会一下,The difference between the two covered way:

(1)for tiles8中的加法运算,A as the indirect address calculation,One is to treat as addition;

(2)Operations on the root node,A translation intostore,把寄存器中的b的值写入到内存.A translation intomovm指令,copy values ​​directly between memory.As for the two translation methods which is better,Compare the overall performance which is better.

到目前为止,Already intuitively understand why select the instruction,and the most commonly used tree covering methods.Tree covering algorithm, of course, there are a lot of,比如Maximal Munch算法、动态规划算法、tree grammar, etc,LLVMAlso has its own algorithm.简单说下Maximal Munch算法,Literal translation into Chinese is the meaning of every time a big bite as far as possible,也就是从树根开始,Every one can cover up the node tiles,This forms several subtrees.Use the same strategy for each subtree,This will result in the least amount of instructions generated.Note that the order of instructions is reversed,按照深度优先的策略,Is first leaves and roots.这个算法是Optimal的算法,It refers to the local,Adjacent tiles can't connect into price lower tile.In addition to covering algorithmOptimal的还有Optimum的,Optimumis the state of global optimization,That is, the overall cost of the code is the lowest.

30. Then continue to explore the aforementioned second problem:寄存器分配.The task of register optimization is to:Get the most out of your registers,But not more than the register total quantity restrictions.因为生成IRdoes not know the information of the target machine,Also don't know target machine has several registers can be used,所以在IRAn unlimited number of temporary variables can be used in,Each temporary variable represents a register.Now that to generate a target machine code,also know this information,That is about to take the originalIR改写一下,In order to use when register not overweight.

What is the principle of register optimization??Such as below on the left side of theIR中,a、d、fThe three temporary variables do not appear at the same time.假设a和dbecomes a dead variable after this code block,So these three variables can share the same register,as shown on the right:

实际上,The three lines of code is for“b + c + e + 10”The expression of translation,所以a和dAre converted toIRWhen introducing intermediate variable,用完就不用了.所以通过这个例子,可以直观地理解The principle of register sharing:If there are two tempa和b,Them in the process of the whole program execution is only one variable is most active,Then the two variables can share the same register.

It has already been mentioned how to do the activity in variable analysis,So I can analyze,In any program, active variable set.然后再看一下,Which variables have never appeared in the same collection line.看看下面这个图:

上图中,Any variables appear in the same curly braces cannot share register,Because they are at some point is active at the same time.那a到fWhich variables have never met?Draw a diagram to look for,Each temporary variables as a node in the picture below,If two variables exist at the same time,Draw a side.This form of diagram is calledRegister interference figure (Register Interference Graph, RIG).在这张图里,Anyone without attachment of the two variables can be assigned to the same register,例如a和b,b和d,a和d,b和e,a和e:

That for the program,How many registers are needed in total,怎么分配呢?比较常用的算法是Figure dyeing algorithm:As long as there is connection between two nodes,Nodes are dyed different colors.Finally need at least the number of color is required to register.Under the picture two dyeing solution,都是需要4 种颜色:

So how to use the algorithm to dyeing?How to use the algorithm whether register enough?Coloring algorithm is actually,如果想知道kenough registers,Just find a less thankedge nodes,Remove it from the picture.Then find the next one less thankedge nodes,再去掉.If the whole picture is deleted at the end,Then this picture must be usablek种颜色来染色.如下图所示:

Because if a figure(蓝色椭圆)是能用kDye colors,Then add a node,it has fewer edges thankAn example isn,Then this bigger picture(红色椭圆)还是可以用 k 种颜色染色.Because the number of edges added to the node is less thank个,So be sure to find a color,With the point ofnNeighbors are not the same.所以,Remove the just one by one the order of the node, in turn,,Add nodes one by one to the graph,Add one each,It find a neighbor use color to dyeing line.

但是,If need to register more than the number of actual register,That is about to use the stack.这个问题就是寄存器溢出(Register Spilling),Overflow into the stack,像本地变量、参数、The return value and so on all use register,If the register is not enough, put it on the stack.In addition whether in register or stack in,Are part of the active record,所以活动记录This concept is broader than stack frame.依然是上面的例子,如果只有3个寄存器,Then to calculate the3enough registers,先把a和b从图中去掉,如下所示:

这时发现,剩下的4个节点,每个节点都有3个邻居,所以3certainly not enough registers,must overflow.可以选择让f保存在栈里,把f去掉以后,剩下的c,d,e可以用3Color dyeing success.fAlthough it is saved to the stack,但Every time you use it toloadTo a temporary variable,Is the register.每次保存fAlso want to use a temporary variable written to the memory.So we need to modify the original code,Each usefAdd a placeload或save指令,以便在使用f时把f放到寄存器,After use to write back to the memory.修改后的CFG如下:

Because the original4个地方用到了f,所以引入了f1到f4Four temp.In this way, the total temporary variables have become more,从6个到了9个.But although temporary variables more,But this a few temporary variable lifetime is very short,Pictured withfThe set of active variables is much smaller than before.所以,即使有9个临时变量,Can also be dyed in three colors,如下图所示:

最后,When choosing which variable overflow,It is best to choose the variable that is used the least number of times.You'd better don't circulate in the application of variable overflow,because they are used every time through the loop,Or in the register higher performance.

31. Understand the instruction selection and register allocation in front of the,Here to continue to explore the target code generated by the third dimension to consider:指令重排序(Instruction Scheduling).人们通常会把CPU看做一个整体,把CPUThe process of executing instructions to imagine this ticket station process,Change the order of the different passengers will not speed up the checking in.So naturally think change order will not change the total time.但当进入CPU内部,会看到CPUIs composed of multiple features.下图是Ice Lake微架构的CPU内部构成:

在这个结构中,An instruction is executed to use multiple features in turn,分成多个阶段,Although each instruction is executed sequentially,But after the work of each component is completed,Can be in the service of the next instruction,So as to achieve the effect of the parallel execution.This structure is called pipeline(pipeline)结构.比如典型的RISCInstruction in execution will be divided into before and after5个阶段:


(2)ID(或RF):Instruction decode and obtain the value of the register;


(4)ME(或MEM):内存访问(If the instruction does not involve memory access,This phase can be omitted);


对于CISC指令,CPUAssembly line will be according to the instructions of different stages are divided into more.In the stage to carry out the instructions,Different instruction will also be made of different units in charge of,These units can be called执行单元,比如Intel的Ice Lake架构CPUHave the following execution units:

Other execution units are:BM、Vec ALU、Vec SHFT、Vec Add、Vec Mul、Shuffle等.因为CPUInside there is a multiple functional units,所以在同一时刻,Different functional units can actually serve different instruction,如下图所示;

这样Multiple instructions are essentially executed in parallel,It reduces the total execution time,The parallel is called instruction-level parallel.If there is no this kind of parallel structure,Or because of the dependencies between instructions,无法并行,Then the lifecycle will greatly extended,如下图所示:

来看一个实际的例子.假设load和store指令需要3A clock cycle to read and write data,add指令需要1个时钟周期,mul指令需要2个时钟周期.Red is the serial number of the original instruction sequence,The number of green is at the beginning of each instruction clock cycle,Each instruction of the clock cycle accumulated it can calculate.The last one instruction clock cycles is20,Which command to run3个时钟周期,所以在第22Execute all instructions in clock cycles.On the right is the reorder after instruction,一共花了13个时钟周期.There is still a lot of optimization:

Look at the first two instructions on the left,These two instructions mean:Load data to register first,And then do an addition.But loading need3个时钟周期,所以addImpossible to execute instructions can only wait for.The first three items in the right column areload指令,There is no data dependencies between them,can be started one per clock cycle,to the fourth clock cycle,Each instruction data is loading,So the addition operation can be performed.Can make of the right of the content as below,能看到Many instructions on the clock cycle overlap,This is the characteristics of the instruction level parallelism

32. Of course not all orders can be parallel,最后的3Instructions are executed sequentially,There are several reasons for the inability to parallelize

(1)Data dependency constraint.If an instruction to use the results of the previous instruction,It must be behind it,比如下面两条指令add和mul,对于第二条指令来说,Except for the fetch stage(IF)can be paralleled with the first instruction,Other stage requires the result of an instruction to writer1,The second command can only be usedr1The value of the continue to run:

add r2, r1
mul r3, r1

(2)The feature constraint.If only a multiplication calculator,then only one multiplication operation can be performed at a time,如下图所示:

(3)Instruction flow constraints.Instruction flow components a flown条指令.

(4)寄存器约束.寄存器数量有限,Instruction register can not be used in parallel overweight.

The latter three can also be combined into a class,Known as the resource data dependency constraints,If you have because of using the same storage location,Lead to can't parallel,Can be eliminated by renaming the variable,This kind of constraint is calledPseudo constraints.而先写再写,以及先读后写Are the two kinds of pseudo constraint presentation:

(1)先写再写:如果指令AWrite a register or memory location,BAlso write the same place,就不能改变A和B的执行顺序,But you can modify the programA和BWrite a different position.

(2)先读后写:如果A必须在BRead a location before writing to a location,So can't changeA和B的执行顺序.Unless to rename them to use a different position.

Instruction in algorithms reorder point,Is to find out data dependencies between code.The figure below shows the example data dependence between all the various code,Can be called data dependency graph(dependence graph).The edge of it represents the value flow,比如arow loads a data tor1,b行利用r1来做计算,所以b行依赖a行,This figure can also be called优先图(precedence graph),因为a比b优先,b比d优先:

Can give each node plus two attributes in this picture,Order to make use of these two attributes can be sorted:

(1)操作类型,Because it involved what it needs the function of the unit.

(2)Delay property,Is the clock cycles needed for each instruction.

图中的a、c、e、g是叶子,They do not depend on any other nodes,So as far as possible at the top.b、d、f、hMust be present in the respective rely on node behind.而根节点ialways be last.According to the time delay property calculated each node of the cumulative delay(Each node of the accumulative total delay of time delay is equal to the parent node is add this node delay),其中a-b-d-f-h-iPath is the critical path,Code execution time is at least this path is the sum of the flower clock cycle,如下图所示:

因为a在关键路径上,So first consideraNodes are ranked first1行.The rest of the tree,c-d-f-h-iBecomes a critical path,因为cThe cumulative delay biggest,cNodes can be ranked the first2行,如下图所示:

b和eThe cumulative delay is the longest,但由于b必须在a执行完毕后,才会开始执行,So the best wait for enougha的3个时钟周期,Otherwise will still waiting,所以先不考虑b,而是把e放到第3行,如下图所示:

continue in this way,最后可以获得a-c-e-b-d-g-f-h-i的指令序列.However, this code can continue to be optimized:Also is to find and eliminate the false constraint.c和e都向r2value is written in,而d使用的是c写入的值.If you modify the variable name,比如让e使用r3寄存器,就可以去掉e跟d,以及e与cPseudo constraints between,让eCan be inc和d之前.In the same way also can letg使用r4寄存器,使得g可以排在e和f的前面.当然在这个示例中,This change does not reduce the total time consumption,Because there is no change on the critical path dependence,它们都使用了r1寄存器.But in other cases,It is possible to bring greater optimization.如下图所示:

It was actually used just nowList Scheduling算法,大致分为 4 步:

(1)Rename variables to remove spurious constraints(可选步骤).


(3)For each line of code to calculate the priority value(计算方法有很多,Such as example, based on the longest time delay method).

(4)Iterate over processing code and sort.

However, the algorithm needs to take into account the complexity,About the instruction selection、Register allocation and instruction reordering algorithm,其难度(时间复杂度)都是“NP-完全”的,Mean this kind of problemCan't find one with scale(代码行数)Algorithms that increase the size of the calculation at a slower rate(多项式时间算法),从而找到最优解.反之,Computation may as lines of code exponentially rising.因此,Some difficult algorithms in the compilation principle,are in the code generation loop.

33. 到目前为止,To understand the target code generation three factor:指令选择、Register allocation and instruction reordering.Now look at the target code generation,在LLVM中是如何实现的.LLVMBackend multiple processing steps are needed to generate the target code,如下图所示:

The red part in the figure is an important step,It itself contains severalPass,So also called superPass.Figure in the blue boxPass,is used to do some additional optimizations.接下来讲LLVMThe key step in the generated object code:

(1)指令选择.LLVMThe instruction selection algorithm is based onDAG(有向无环图)The tree pattern matching,With the former one speak based onASTThe algorithm has some difference between,But the general idea is the same.这个算法是Near-Optimal(接近Optimal)的,Be able to complete the instructions in the linear time selection,And it focuses on the size of the generated code,Required size is small enough.DAGIs a fusion of the common subexpressionAST,也是一种结构化的IR.The following two lines of code correspond toAST和DAG如下图所示,能看到DAG把a=5This tree KeZi to merge:

a = 5
b = (2 + a)+ (a * 3)

LLVM把内存中的IR模型,Is turned into a reflection characteristics of a target platformSelectionDAG,For instruction selection.Each basic block is converted into aDAG,DAGThe nodes usually represent the instruction,Side to represent the data flow between the instructions.在这个阶段之后,LLVM会把DAG中的LLVM IR节点,Convert all nodes to the target machine,Instructions representing the target machine,而不是LLVM的指令.

(2)指令排序(Register allocation before).Based on the previous step of the process the results,Order of the instructions to.但因为DAGDo not reflect no dependencies between nodes sorting,所以LLVM要先把DAGConvert to a three-address mode,这个格式叫做MachineInstr.This stage will take instruction sequence,And try to use the ability of instruction-level parallelism.

(3)寄存器分配.Do next register allocation.LLVM的IRSupports unlimited registers,On this link to be assigned to the actual register,Distribution fit will spill over into memory.

(4)指令排序(Register allocation after).After allocating registers,LLVMwill do the ordering again.Because register allocation can specify certain register,Register and visit different clock cycles may be different.Overflow into memory for the variables in the,Also added some instruction to transfer data between memory and register.利用这些信息,LLVMWe can arrange further optimize the directive.

(5)代码输出.After doing all the above work,Can output target code.LLVM在这一步把MachineInstr格式转换为MCInst格式,Because the latter is more advantageous to the compiler and linker output assembly code or binary object code.

copyright notice
author[Book memories of Jiangnan],Please bring the original link to reprint, thank you.

Random recommended