Traditional compilation proceeds one file at a time. The compiler optimizes and generates code for each file in isolation and then the final executable is created by linking all the individual files together.
This model of compilation has the advantage that all the files can be compiled concurrently, which greatly reduces build time. This is particularly useful with current multiprocessor machines and applications consisting of hundreds and even thousands of files. However, this model presents a fairly significant barrier for optimization.
Consider this program:
foo.c:
foo()
{
for (;;) {
...
x += g (f (i, j), f (j, i));
...
}
}
bar.c:
float f(float i, float j)
{
return i * (i - j);
}
float g(float x, float y)
{
return x - y;
}
From an optimization perspective, inlining f and g inside foo is likely going to provide significant performance improvements. However, the compiler never sees both files at the same time. When it is compiling foo.c it does not have access to the functions in bar.c and vice-versa. Therefore, inlining will never take place.
One could get around this problem by moving the bodies of f and g to a common header file and declaring them inline. But that is not always desirable or even possible. So, several optimizing compilers introduce a new feature called Link-Time Optimization (LTO) that allows these kinds of cross-file manipulations by the compiler.
The LTO model essentially splits code generation in two major phases
1. Generation of Intermediate Representation (IR). The original source code is parsed as usual and an intermediate representation (IR) for the program is generated. The IR is a succinct representation of the original program, symbols and types. This contains all the information that the compiler needs to generate final code, except that instead of using it to generate final code for each file, the compiler saves it to an intermediate file for later processing.
2. Once the IR has been emitted for all the source files, the intermediate files generated in the previous step are loaded in memory and the whole set is analyzed and optimized at once.
To visualize this process, imagine that you simply concatenated all the source files together and compiled the result. Since the compiler has visibility over every function in every compilation unit, decisions that would normally use conservative estimates can instead be based on data-flow information crossing file boundaries. Additionally, the compiler is able to perform cross-language optimizations that are not possible when the compilation scope is restricted to individual files.
The LTO model of compilation is useful but it has severe scalability issues. A basic implementation has the potential to incur massive memory consumption during compilation. Since every function body in every file may be needed in memory, only relatively small programs will be able to be compiled in whole program mode.
At Google, we deal with several massively large applications, so we are working on a scalable alternative to traditional LTO called WHOPR (WHOle Program optimizeR), which introduces parallelism and distribution to be able to handle arbitrarily large programs. The basic observation is that to do many whole program optimizations, the compiler rarely needs to have all the functions loaded in memory, and final code generation can be parallelized by partitioning the program into independent sets.
WHOPR then proceeds in 3 phases
1. Local Generation (LGEN). This is the same as traditional LTO. Every source file is parsed and its IR saved to disk. This phase is trivially parallelizable using make -j or distcc or any other similar technique.
No comments:
Post a Comment