- Multicore programming
- Vectorized programming
Prior Tests From Other People
We don't want to reinvent the wheel here, so I decided to start where some earlier tests left off. Here's an excellent test of how g++ can create vectorized code, when it doesn't, and how to fuss with it to get it to do so. In this article, the author digs into the generated assembly code to see whether vectorization is occurring. One important take-home message from the piece is that, even if you think you're going to get vectorized code, you still might not. There are two things you can do: First, the compilers can actually tell you whether a function is vectorized (it's an option, and you'll want to turn it on). Second, you still might want to look at the assembly code, not only to make sure you're getting vectorized code, but also to check that the code you're getting is actually better than the serial version. So for our tests, I'm going to take several of the author's functions, try them out on the Intel compiler, and then compare the resulting code, as well as what we have to do to get vectorized code.

Initial Tests on Intel Compiler
For the first test, I went with the first test presented in the article, compiled with the Intel compiler:

#include <stdlib.h>
#include <math.h>

#define SIZE (1L << 16)

void test1(double *a, double *b) {
    int i;
    for (i = 0; i < SIZE; i++) {
        a[i] += b[i];
    }
}

int main() {
    double *a = new double[SIZE];
    double *b = new double[SIZE];
    test1(a, b);
}

After turning on the diagnostics to actually show me what's vectorized,
icc -S -vec-report2 test1.cpp

I can see that the loop was indeed vectorized with this message:
LOOP WAS VECTORIZED

(The -S tells the compiler to generate a file containing the assembly code.) But how good is the vectorization? Thanks to that -S option, we can look at the resulting assembler code. For space reasons, I won't list the whole thing here, just the code inside the loop:
movsd (%rsi,%rdx,8), %xmm0
movsd 16(%rsi,%rdx,8), %xmm1
movsd 32(%rsi,%rdx,8), %xmm2
movsd 48(%rsi,%rdx,8), %xmm3
movhpd 8(%rsi,%rdx,8), %xmm0
movhpd 24(%rsi,%rdx,8), %xmm1
movhpd 40(%rsi,%rdx,8), %xmm2
movhpd 56(%rsi,%rdx,8), %xmm3
addpd (%rdi,%rdx,8), %xmm0
addpd 16(%rdi,%rdx,8), %xmm1
addpd 32(%rdi,%rdx,8), %xmm2
addpd 48(%rdi,%rdx,8), %xmm3
movaps %xmm0, (%rdi,%rdx,8)
movaps %xmm1, 16(%rdi,%rdx,8)
movaps %xmm2, 32(%rdi,%rdx,8)
movaps %xmm3, 48(%rdi,%rdx,8)

We're working with double-precision numbers here, which are 64 bits. But how big are our registers? Well, we didn't specify a SIMD architecture; by default, we get 128-bit registers, each holding two doubles. The first eight lines of this code are juggling our numbers around, and then there are four actual addition operations. Each addition is performed with the opcode ADDPD. But there are four of these additions, one each for registers xmm0, xmm1, xmm2, and xmm3. That's because something separate from vectorization is also happening here, something called loop unrolling. I'll discuss that shortly; it's sufficient for now to say it's an optimization that is independent of the vectorization. Let's see if we can tweak the compiler a bit. I want to target two different generations of SIMD: SSE4.2 and the newer AVX. We can do so using the -x option. Without changing the C++ code, adding the option -xSSE4.2 gives us this assembler code:
movups (%rsi,%rdx,8), %xmm0
movups 16(%rsi,%rdx,8), %xmm1
movups 32(%rsi,%rdx,8), %xmm2
movups 48(%rsi,%rdx,8), %xmm3
addpd (%rdi,%rdx,8), %xmm0
addpd 16(%rdi,%rdx,8), %xmm1
addpd 32(%rdi,%rdx,8), %xmm2
addpd 48(%rdi,%rdx,8), %xmm3
movups %xmm0, (%rdi,%rdx,8)
movups %xmm1, 16(%rdi,%rdx,8)
movups %xmm2, 32(%rdi,%rdx,8)
movups %xmm3, 48(%rdi,%rdx,8)

And with the option -xAVX we get this:
vmovupd (%rsi,%rax,8), %xmm0
vmovupd 32(%rsi,%rax,8), %xmm3
vmovupd 64(%rsi,%rax,8), %xmm6
vmovupd 96(%rsi,%rax,8), %xmm9
vinsertf128 $1, 48(%rsi,%rax,8), %ymm3, %ymm4
vinsertf128 $1, 16(%rsi,%rax,8), %ymm0, %ymm1
vinsertf128 $1, 80(%rsi,%rax,8), %ymm6, %ymm7
vinsertf128 $1, 112(%rsi,%rax,8), %ymm9, %ymm10
vaddpd (%rdi,%rax,8), %ymm1, %ymm2
vaddpd 32(%rdi,%rax,8), %ymm4, %ymm5
vaddpd 64(%rdi,%rax,8), %ymm7, %ymm8
vaddpd 96(%rdi,%rax,8), %ymm10, %ymm11
vmovupd %ymm2, (%rdi,%rax,8)
vmovupd %ymm5, 32(%rdi,%rax,8)
vmovupd %ymm8, 64(%rdi,%rax,8)
vmovupd %ymm11, 96(%rdi,%rax,8)

Without spending too much time explaining all this, the AVX instructions operate on registers twice the size (256-bit YMM instead of 128-bit XMM). The opcodes start with a "v" to signify AVX. But in these cases, we're dealing with unaligned data. We see that in the letter "u" in the move operations; we can get better performance by aligning the data. The author of the LockLess article found the same to be true with g++. In fact, we can use the same "restrict" keyword, provided we add the -restrict compiler option. Also, we can turn up our diagnostics to report whether our data is aligned. By using -vec-report6, we can see status reports such as:
reference a has aligned access

And if we forget to align one of our variables, we'll see in the report:
reference b has unaligned access

To align the data for Intel, we can use code similar to what the LockLess article used, except with a different intrinsic:
void test1(double * restrict a, double * restrict b) {
    int i;
    __assume_aligned(a, 64);
    __assume_aligned(b, 64);
    for (i = 0; i < SIZE; i++) {
        a[i] += b[i];
    }
}

And now inspecting the assembler, we see that we're using MOVAPS instead of MOVUPS. The "A" means aligned:

movaps (%rdi,%rax,8), %xmm0
movaps 16(%rdi,%rax,8), %xmm1
movaps 32(%rdi,%rax,8), %xmm2
movaps 48(%rdi,%rax,8), %xmm3
addpd (%rsi,%rax,8), %xmm0
addpd 16(%rsi,%rax,8), %xmm1
addpd 32(%rsi,%rax,8), %xmm2
addpd 48(%rsi,%rax,8), %xmm3
movaps %xmm0, (%rdi,%rax,8)
movaps %xmm1, 16(%rdi,%rax,8)
movaps %xmm2, 32(%rdi,%rax,8)
movaps %xmm3, 48(%rdi,%rax,8)

But then I encountered a surprise. If any readers can provide an explanation here, I welcome it (and I might put in a call to Intel on this one): if I keep the code as-is, with the alignment all set, and then target AVX, I get the opcodes that start with "V", but they have a "U" for unaligned instead of an "A" for aligned:
vmovupd (%rdi,%rax,8), %ymm0
vmovupd 32(%rdi,%rax,8), %ymm2
vmovupd 64(%rdi,%rax,8), %ymm4
vmovupd 96(%rdi,%rax,8), %ymm6
vaddpd (%rsi,%rax,8), %ymm0, %ymm1
vaddpd 32(%rsi,%rax,8), %ymm2, %ymm3
vaddpd 64(%rsi,%rax,8), %ymm4, %ymm5
vaddpd 96(%rsi,%rax,8), %ymm6, %ymm7
vmovupd %ymm1, (%rdi,%rax,8)
vmovupd %ymm3, 32(%rdi,%rax,8)
vmovupd %ymm5, 64(%rdi,%rax,8)
vmovupd %ymm7, 96(%rdi,%rax,8)

...even though -vec-report6 stated that the data was aligned. The only reference I can find is a discussion forum on intel.com where an Intel employee states, "The compiler never uses VMOVAPD even though it would be valid." From the comments on my previous articles, it's clear some of you really know your stuff, so if you know what's up, feel free to chime in.
Loop Unrolling
When we turned up the diagnostic level to 6, we also saw this message:

unroll factor set to 4

This refers to the fact that, in addition to vectorizing our code, the compiler is also using four separate registers to do four sets of additions, which is why there are four ADDPD calls in our code. This is called loop unrolling, and it's a completely separate issue from the vectorization. I was surprised to see that the Intel compiler did this by default without my asking for it. We can turn the option off with -loopunroll0. Then we see three lines of vectorized addition, identical to those in the LockLess report:
movaps (%rdi,%rax,8), %xmm0
addpd (%rsi,%rax,8), %xmm0
movaps %xmm0, (%rdi,%rax,8)

Because this loop unrolling is a separate issue from vectorization, I won't use it to claim that Intel's generated vectorized code is somehow "better" because of it. The g++ compiler also supports unrolling under certain conditions.
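To make concrete what the unrolled output is doing, here's a hand-written sketch of the same pattern using SSE2 intrinsics: two doubles per 128-bit register, four independent registers per trip, so eight elements per iteration. This is only an illustration of the generated code's shape (with a length parameter added for flexibility), not the compiler's actual output; it assumes an x86-64 target and uses unaligned loads for simplicity:

```cpp
#include <emmintrin.h>  // SSE2 intrinsics

// Hand-unrolled vector add mirroring the compiler's output: four __m128d
// registers, each holding two doubles, so 8 elements per loop iteration.
// n is assumed to be a multiple of 8.
void add_unrolled(double *a, const double *b, long n) {
    for (long i = 0; i < n; i += 8) {
        __m128d a0 = _mm_loadu_pd(a + i);
        __m128d a1 = _mm_loadu_pd(a + i + 2);
        __m128d a2 = _mm_loadu_pd(a + i + 4);
        __m128d a3 = _mm_loadu_pd(a + i + 6);
        a0 = _mm_add_pd(a0, _mm_loadu_pd(b + i));      // ADDPD on xmm0
        a1 = _mm_add_pd(a1, _mm_loadu_pd(b + i + 2));  // ADDPD on xmm1
        a2 = _mm_add_pd(a2, _mm_loadu_pd(b + i + 4));  // ADDPD on xmm2
        a3 = _mm_add_pd(a3, _mm_loadu_pd(b + i + 6));  // ADDPD on xmm3
        _mm_storeu_pd(a + i,     a0);
        _mm_storeu_pd(a + i + 2, a1);
        _mm_storeu_pd(a + i + 4, a2);
        _mm_storeu_pd(a + i + 6, a3);
    }
}
```

The four accumulators are independent, which is the point of the unrolling: the additions can proceed in parallel in the pipeline rather than waiting on one another.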
Embedded Loops
The LockLess article tackles the issue of embedded loops, which gave the g++ compiler some trouble. Here's some code similar to theirs:

void test5(double * restrict a, double * restrict b)
{
    int i, j;
    __assume_aligned(a, 64);
    __assume_aligned(b, 64);
    for (j = 0; j < SIZE; j++) {
        for (i = 0; i < SIZE; i++)
        {
            a[i + j * SIZE] += b[i + j * SIZE];
        }
    }
}

I get a diagnostic report that the loop can't be vectorized. However, the author of the LockLess piece didn't try to get just the inner loop to vectorize. With the Intel compiler, we can do so with a pragma:

void test1(double * restrict a, double * restrict b)
{
    int i, j;
    __assume_aligned(a, 64);
    __assume_aligned(b, 64);
    for (j = 0; j < SIZE; j++)
    {
        #pragma simd
        for (i = 0; i < SIZE; i++)
        {
            a[i + j * SIZE] += b[i + j * SIZE];
        }
    }
}

This works; the inner loop gets vectorized. The author instead combined the loops into a single loop, like so:

for (i = 0; i < SIZE * SIZE; i++) {
    x[i] += y[i];
}

which in the end is the same as any other loop, just with a bigger upper limit. He then declares: "So, a rule of thumb might be; two loops bad, one loop good." That's a common problem in parallel programming, and there are different ways you can tackle it. According to Intel, the default is to try to vectorize the innermost loop, although in my test I didn't end up with a vectorized inner loop: the compiler tried, but decided there was a loop dependency and declined. Only after adding the pragma did it vectorize. But what about g++? Did the LockLess author miss something? In that article the author was using g++ 4.7; I'm using 4.8.1. And it turns out I can force the inner loop to vectorize if I add optimization level 3, like so:
g++ -S -O3 gcc5.cpp

(The -S gives me the assembly output in a file.) I can see the vectorized statements in the assembly output:
movhpd 8(%rcx,%rax), %xmm0
addpd (%r8,%rax), %xmm0
movapd %xmm0, (%r8,%rax)

And I can also see it in a report if I turn on the diagnostics like so:
g++ -S -O3 -ftree-vectorizer-verbose=1 gcc5.cpp

which gives me:
Vectorizing loop at gcc5.cpp:16

(Line 16 is the first line of the inner loop.) So did the 4.7 not allow it? If you're writing code that uses advanced techniques like vectorization, you generally want to make sure you're running the latest version of the compiler. However, it turns out 4.7 can vectorize the inner loop. I installed the 4.7 compiler and compiled the above with the same options:
g++ -S -O3 -ftree-vectorizer-verbose=1 gcc5.cpp

and indeed the inner loop did vectorize:
16: LOOP VECTORIZED.
gcc5.cpp:7: note: vectorized 1 loops in function.

So it appears the inner loop can get vectorized with just a tiny bit of coaxing. (Incidentally, the assembly code generated by the 4.7 and 4.8.1 compilers differs in ways aside from the vectorization, which is interesting. I don't have space to include the entire listings here. Perhaps you might want to explore that yourself; or, if there's interest, I can do an analysis in a future article to find out why, and whether the differences could affect your development projects.)
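The "two loops bad, one loop good" rule is easy to sanity-check: for a contiguous SIZE x SIZE array, the nested walk over i + j*SIZE visits exactly the same elements, in the same order, as a single loop from 0 to SIZE*SIZE, so the collapse changes nothing about the computation. A small self-contained sketch of that equivalence (my own illustration, with a reduced N so it runs quickly):

```cpp
#include <vector>

// Illustration of the LockLess author's loop collapse: the nested
// traversal and the single collapsed loop perform the identical update.
// (Small N purely for demonstration; the article uses 1L << 16.)
const long N = 64;

// Nested form: outer loop over rows j, inner loop over columns i.
void add_nested(std::vector<double> &a, const std::vector<double> &b) {
    for (long j = 0; j < N; j++)
        for (long i = 0; i < N; i++)
            a[i + j * N] += b[i + j * N];
}

// Collapsed form: one loop with a bigger upper limit.
void add_collapsed(std::vector<double> &a, const std::vector<double> &b) {
    for (long i = 0; i < N * N; i++)
        a[i] += b[i];
}
```

Running both on identical inputs produces identical results; the collapsed form simply hands the compiler one long trip count with no inner-loop boundary to reason about.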
More Operations
A simple add operation isn't all that useful. The processor's vectorization support includes a large set of operations from which you can construct more sophisticated computations. The LockLess article describes an operation that looks like this:

for (i = 0; i < SIZE; i++) {
    x[i] = ((y[i] > z[i]) ? y[i] : z[i]);
}

Compiling this with the Intel compiler results in the following vectorized assembly:

movaps X2(,%rax,8), %xmm0
movaps 16+X2(,%rax,8), %xmm1
movaps 32+X2(,%rax,8), %xmm2
movaps 48+X2(,%rax,8), %xmm3
maxpd X3(,%rax,8), %xmm0
maxpd 16+X3(,%rax,8), %xmm1
maxpd 32+X3(,%rax,8), %xmm2
maxpd 48+X3(,%rax,8), %xmm3
movaps %xmm0, X(,%rax,8)
movaps %xmm1, 16+X(,%rax,8)
movaps %xmm2, 32+X(,%rax,8)
movaps %xmm3, 48+X(,%rax,8)

Aside from the unrolled loops, this code is basically the same as what g++ produced. The next step becomes a bit more complex. It looks like a small change, but it's significant: the author of the LockLess article changed the algorithm to include the left item, x[i], inside the operation itself, like so:
x[i] = ((y[i] > z[i]) ? x[i] : z[i]);
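Why does this small change matter? The first form maps directly onto a single MAXPD, while this form has no dedicated opcode: the compiler has to build a comparison mask and then select between x and z lane by lane. A sketch of both in SSE2 intrinsics (my own illustration of the idea, not the compiler's literal output):

```cpp
#include <emmintrin.h>  // SSE2 intrinsics

// First form: x[i] = (y[i] > z[i]) ? y[i] : z[i]  -- simply MAXPD.
__m128d select_max(__m128d y, __m128d z) {
    return _mm_max_pd(y, z);
}

// Changed form: x[i] = (y[i] > z[i]) ? x[i] : z[i] -- no single opcode,
// so build an all-ones/all-zeros mask with CMPPD, then blend x and z.
__m128d select_masked(__m128d x, __m128d y, __m128d z) {
    __m128d mask = _mm_cmpgt_pd(y, z);        // all-ones lane where y > z
    return _mm_or_pd(_mm_and_pd(mask, x),     // take x where y > z
                     _mm_andnot_pd(mask, z)); // take z elsewhere
}
```

On SSE4.1 and later the and/andnot/or triple can collapse into a single BLENDVPD, but the masked-select idea is the same.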
With this one, we're seeing something different from the g++ experiments. The LockLess report states that this didn't vectorize with g++, but in my tests the Intel compiler handled it just fine. The code is a bit long due to the complexity and the loop unrolling, but it has several operations that use the packed notation, and the diagnostics stated that vectorization took place. As a final test, the LockLess author performed this operation:

double test21(double * restrict a) {
    size_t i;
    double *x = __builtin_assume_aligned(a, 16);
    double y = 0;
    for (i = 0; i < SIZE; i++)
    {
        y += x[i];
    }
    return y;
}

When I see this code, an alarm goes off in my mind. It's actually fine, and will indeed vectorize. But if we try to go another step and add multicore to it, we're going to have a problem with the y variable and some race conditions. These race conditions will require what's called a reducer. For now, since we're not trying to implement a multicore solution and are only using vectorization on a single core, we'll be fine; and indeed, this code vectorized just fine as well. What's interesting is that not only did it vectorize, but I saw the same elegant results as with the g++ compiler in the LockLess report: the compiler ultimately did what's called a horizontal operation, using an UNPCKHPD opcode.
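To show what that horizontal operation is doing, here's a sketch of the reduction pattern in SSE2 intrinsics: the loop keeps two running partial sums in the lanes of one XMM register, and UNPCKHPD is used at the end to bring the high lane down so the two partial sums can be combined. Again, this is an illustration of the technique, not the compiler's exact output; n is assumed to be a multiple of 2:

```cpp
#include <emmintrin.h>  // SSE2 intrinsics

// Vectorized sum: two partial sums accumulate in the lanes of one XMM
// register; a horizontal step combines them at the end.
double sum_sse(const double *x, long n) {
    __m128d acc = _mm_setzero_pd();
    for (long i = 0; i < n; i += 2)
        acc = _mm_add_pd(acc, _mm_loadu_pd(x + i));   // lane-wise ADDPD
    // Horizontal step: UNPCKHPD copies the high lane into the low lane...
    __m128d hi = _mm_unpackhi_pd(acc, acc);
    // ...so one scalar add (ADDSD) yields low + high.
    return _mm_cvtsd_f64(_mm_add_sd(acc, hi));
}
```

The reason the compiler can do this at all is that it treats floating-point addition as reassociable here; the horizontal combine happens once, outside the loop, so the cost is negligible.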