
Before we get started, I want to address another issue that came up in the Slashdot forums. I'm examining vectorization, but not everybody was clear on what that means. The term “vector” has its root in mathematics, but that's not the kind of vectorization we're talking about here. We're not talking about the vectorization performed by such products as Mathematica. In the world of mathematics, vectors are sets of scalar numbers. You can multiply them, for example, and perform dot products; that’s all part of Linear Algebra. But in the world of computer processor architecture, vectorization has a different meaning: it’s when you pack multiple values into a single hardware register, and perform arithmetic and other operations on all numbers in that register simultaneously with a single assembly instruction (the relevant term is “SIMD,” which stands for Single Instructions, Multiple Data). So while products such as Mathematica can perform vector operations, those aren’t the kind of vectors we're talking about here. You can use hardware vectorization to perform mathematical vector operations; in fact, that's an excellent use of them. But it's still a separate topic. Mathematica and other mathematics products have leveraged vector processing for years, but that doesn't automatically mean they use SIMD operations at the assembly level.Ground Rules
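To make the distinction concrete, here's a small example of my own (not taken from the original article) using GCC's SSE2 intrinsics. Two doubles are packed into one 128-bit register, and a single packed-add instruction adds both of them at once:

    #include <emmintrin.h>   // SSE2 intrinsics
    #include <cstdio>

    int main()
    {
        // 16-byte alignment so the values can be loaded as a packed pair.
        double a[2] __attribute__((aligned(16))) = {1.0, 2.0};
        double b[2] __attribute__((aligned(16))) = {10.0, 20.0};
        double r[2] __attribute__((aligned(16)));

        __m128d va = _mm_load_pd(a);      // pack a[0] and a[1] into one register
        __m128d vb = _mm_load_pd(b);      // pack b[0] and b[1] into another
        __m128d vr = _mm_add_pd(va, vb);  // one instruction adds both lanes at once
        _mm_store_pd(r, vr);

        std::printf("%f %f\n", r[0], r[1]);   // prints 11.000000 22.000000
        return 0;
    }

That's hardware vectorization in miniature: two additions for the price of one instruction.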
Ground Rules
As usual, let's lay some ground rules for what I'm testing. For the scope of this article, I'm not interested in code that compiled successfully under 4.7 but doesn't compile under 4.8. People have reported plenty of those problems, and they're being tracked in the GCC bug database. (Many of the reports are of internal compiler errors, as opposed to the compiler issuing an error message about the code it's compiling.) Remember, like any software, the GCC compiler suite is itself a large body of code created by a great many developers, and thus it has bugs; the developers are working on them.
Instead, I'm looking at the actual generated assembly code. Suppose you have code that compiles and runs just fine; you upgrade the compiler, and the code still compiles without any changes, but when your software runs you start to see bugs and problems that didn't exist before. That can be a major headache. Your code worked fine before, so should you try to fix it to work with the new compiler? Before you can do that, you need to know what's actually going on under the hood, and the problems there can take several forms: the assembly code generated by the compiler might be different; the runtime libraries, which are also a new version, might have a bug in them; or there's an enhancement of which you were unaware.
In any case, you're faced with a decision: do you revert to the older compiler, or do you modify your code to work with the new one? Reverting is problematic, because then you have to wonder how long you'll be stuck with the old compiler. (I worked at a large company in 2006 that used an old compiler on a decade-old project, because the project's code didn't work with newer compilers. The compiler predated the 1998 ANSI standard, which caused a lot of code to break. And to get that compiler to work, the C-level tech engineers insisted we install a very old version of an IDE long since abandoned by its manufacturer. That made for an incredibly frustrating situation, as you can imagine.) On the other hand, if you adjust your code to work with the new compiler, what happens when the next version comes out? Will any of these possible problems go away? Or are the “problems” actually enhancements? Many of us lived through this nightmare in 1998 and 1999, when the compiler vendors started upgrading their products to comply with the 1998 standard. Adjusting for compiler changes can cause headaches, but some headaches are avoidable if you work with the changes rather than against them.
Changes to Optimization
Before we get into the assembly code, let's consider a factor that can influence how it's generated: optimization. The 4.8 release of GCC includes a new optimization level, specified on the command line with -Og. The idea is to support better debugging and fast compilation while still providing (as the release notes say) “a reasonable level of runtime performance.” (This also addresses an issue Slashdot readers seem to disagree on: whether compilation time matters at all. Developers working on a large code base with a large team, pulling in changes throughout the day, don't want to wait ages for the project to compile just to test their code. In cases like that, the time it takes to compile most certainly matters.) An optimizing compiler produces assembly code that is different from the non-optimized output, so if your code works with no optimization, does it still work once it's optimized? That's for you to decide, via testing; if your code no longer works correctly for some reason, you can lower the optimization level. In today's test I'll be looking at the optimizations as well, both to check the optimizations themselves and because turning them on is how you enable autovectorization.
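For reference, here's a rough sketch of the kind of command lines involved (test6.cpp is my own placeholder file name, and note that -Og is only recognized by the 4.8 compiler):

    g++ -O0 -S test6.cpp -o test6_O0.s   # no optimization
    g++ -Og -S test6.cpp -o test6_Og.s   # new in 4.8: optimize, but stay debugger-friendly
    g++ -O3 -S test6.cpp -o test6_O3.s   # -O3 turns on the autovectorizer (-ftree-vectorize)

The -S switch tells g++ to stop after generating assembly, which is exactly what we want for comparing the two compilers' output.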
The Test
To perform this test, I started with two clean Linux installations, specifically Ubuntu 13.10 server. On one I installed version 4.8.1 of gcc and g++; on the second, I installed version 4.7.3. Here's the code I was dealing with in the previous article. It's a slightly modified version of the original (found at http://locklessinc.com/articles/vectorize/):

    void test6(double * __restrict__ a, double * __restrict__ b)
    {
        size_t i, j;
        double *x = (double *)__builtin_assume_aligned(a, 16);
        double *y = (double *)__builtin_assume_aligned(b, 16);
        for (j = 0; j < SIZE; j++)
        {
            for (i = 0; i < SIZE; i++)
            {
                x[i + j * SIZE] += y[i + j * SIZE];
            }
        }
    }

(The modification is to make it work with the C++ compiler; the original used C, rather than C++.) When I turn the optimization up to level 3, I get a fairly lengthy amount of assembly code, whether I use the 4.7 or 4.8 compiler, and on the surface that code appears quite different between the two. Due to space (and my unwillingness to bore readers any more than necessary), I won't reproduce the resulting assembly code here, but I will say the 4.8 version, for just the loops, is about ten lines shorter. In both cases the code is vectorized. The vectorized portion, which corresponds to this single line of C++ code:

    x[i + j * SIZE] += y[i + j * SIZE];

is almost the same in both, except for a minor difference in how the data is moved in and out of the registers. (The 4.7 version uses two registers; the 4.8 version uses only one.) The rest of the difference centers on how the loop is optimized. Now remember: the code runs in both cases; it doesn't have a bug. What we're dealing with here, then, is simply the developers revising the assembly-code generation and optimization algorithms. Nevertheless, the generated code is different. When I turned off optimizations, I ended up with code that was almost identical, except for two lines of assembler where 4.8 used a slightly tighter method of comparing whether one number is less than another. In other words, the two were virtually the same. What does this mean for us? The GCC developers are continuously updating their code, including the optimizations. What we got in 4.8 is considerably different where the loops are concerned, and in this case, it works.
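One practical note: __builtin_assume_aligned is a promise to the compiler, not a request, so any driver that calls test6 has to supply memory that really is 16-byte aligned. Here's a minimal sketch of my own (the SIZE value of 512 and the separate prototype are my assumptions, not details from the article) that allocates aligned arrays with posix_memalign and sanity-checks the result:

    #include <cstdio>
    #include <cstdlib>

    #define SIZE 512   // assumed value; the real test defines SIZE elsewhere

    void test6(double * __restrict__ a, double * __restrict__ b);  // defined as shown above

    int main()
    {
        void *pa = NULL;
        void *pb = NULL;
        // posix_memalign guarantees the 16-byte alignment that
        // __builtin_assume_aligned promises to the compiler.
        if (posix_memalign(&pa, 16, SIZE * SIZE * sizeof(double)) != 0 ||
            posix_memalign(&pb, 16, SIZE * SIZE * sizeof(double)) != 0)
            return 1;

        double *a = static_cast<double *>(pa);
        double *b = static_cast<double *>(pb);
        for (long i = 0; i < (long)SIZE * SIZE; i++) {
            a[i] = 1.0;
            b[i] = 2.0;
        }

        test6(a, b);
        std::printf("a[0] = %f\n", a[0]);   // expect 3.000000

        std::free(pa);
        std::free(pb);
        return 0;
    }

If you hand test6 pointers that aren't actually aligned, the aligned loads the vectorizer emits can fault or silently misbehave, which is exactly the kind of "worked before, breaks now" surprise this article is about.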