
The C++ Compilers
We're going to compare three compilers: the Intel C++ compiler, the GNU C++ compiler (g++), and the LLVM clang compiler.

Because processors keep gaining more cores, we're seeing a lot more parallel processing in C++ programs. With the right tools, we can develop a single application that spins threads out across multiple cores, whether it's a quad-core desktop machine, a 60-core Xeon Phi co-processor, or a 1024-core Nvidia graphics processor. So I'm also going to test how the executables built by these compilers perform using two different parallel programming tools: Threading Building Blocks and Cilk Plus. (Two quick caveats: To test Cilk Plus on g++ and LLVM, I had to obtain special branches of both compilers. Also, Threading Building Blocks hasn't yet been ported to clang on Linux, so clang wasn't included in that part of the test. It has been ported on the Mac, but I didn't have access to a Mac for these tests.)

For testing the compilers themselves, I decided to see how they perform on an enormous amount of code: over 600 templates and six thousand different template instantiations. Most C++ programmers rely heavily on templates today. The standard C++ library is made up of templates, and if you're a C++ programmer, you're most likely using templates. So this would be an interesting test as well.

The Hardware
The hardware is a dual-processor system, where each processor is an eight-core Intel Xeon E5-2670 running at 2.60 GHz. Each core has two hyperthreads, so the system has 16 physical cores and 32 virtual cores.

The Tests
To get the tests started, I compiled Threading Building Blocks with the Intel compiler and with g++. (For g++, I used the regular build, not the Cilk Plus branch. Initially I tried the Cilk Plus branch, but it ran extremely slowly; it's still under development, so it didn't seem fair to include it in this particular test once I saw that slowness.) I used identical build setups, including optimization levels. Note that I only compiled the Threading Building Blocks core libraries, not the samples and tests that accompany them. (The makefile runs the tests automatically after they're built.) These results are only preliminary; later in the testing I gathered more detailed numbers on how these compilers perform, but this will get us started.

To time the builds I used the built-in time command, which is part of the bash shell. (There's a separate /usr/bin/time command that I'll use for later tests.) That command reports three numbers: the real time, the user time, and the system time. The real time is the total elapsed time from when the command started to when it finished, but that includes any time the process spent waiting while the operating system did other, unrelated work. The user time is a more accurate accounting of how long the process was actually running on the CPU, and the sys time is how long the operating system spent doing work the process asked it to do. I'll list all three numbers, but the most important one here is the user time. Also note that even though this machine is multicore, I intentionally did a single-core build by leaving off the -j option.

Time to build Threading Building Blocks with the Intel compiler:

    real 0m28.946s
    user 0m21.477s
    sys  0m5.204s

Time to build Threading Building Blocks with g++:

    real 0m22.941s
    user 0m19.701s
    sys  0m2.332s

(The Cilk Plus version of g++ took about one and a half times as long as the regular version, which tells me the compiler itself probably hasn't been optimized yet.)

The g++ compiler ran faster. But remember, this is building the libraries, which is more than just compiling. Specifically, it builds three libraries, each in both debug and release form, for a total of six libraries; building each one means compiling multiple C++ files and then linking them. If we wanted to, we could add measurements right inside the make configuration file by putting a time command in front of the compiler calls, like so:

    CPLUS = /usr/bin/time -f " TIME %E" g++

This uses the standalone time command, not the bash built-in. Then do the same with the build configuration for the Intel compiler. In doing so, I was able to compare the compile-only times side by side. The results are too long to print here, but in general the gcc times were faster, between 75 percent and 85 percent of the Intel times. This gives us a general idea of what we have. Now let's try some more specific tests.
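(As an aside, for the Intel side of that instrumentation, the corresponding override would look something like the following. This is a sketch, assuming the Intel build configuration uses the same CPLUS variable with icpc as the compiler driver; grepping the build output for " TIME" then gives a per-invocation compile time for each compiler.)

    # Hypothetical override for the Intel build configuration
    CPLUS = /usr/bin/time -f " TIME %E" icpc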
Compiling a Gigantic C++ File
I won't post the code here because it's very long, but I'll describe it. The code I'm testing contains no #include directives and uses only standard C++. It starts with one template class, followed by 6084 small classes derived from various instantiations of the template classes. (So these 6084 classes are technically not templates themselves.) Then I create 6084 instantiations of the original template class, one for each of the 6084 derived classes. The end result is 6084 different template instantiations; the sketch after this paragraph shows the general shape of the code. Now, obviously in real life we wouldn't write code like this (at least I hope you don't), but the goal here was to make something complex that would simply take a good while to compile, without regard to linking.

For each compiler I did two compile-only runs: one with no optimization and one with maximum optimization. Then I repeated the tests but did a full link afterwards, so that I could measure the final file size; I recorded the times for the compile-and-link runs as well.
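Here's a tiny sketch of that structure. The names are made up for illustration, and where the sketch shows two derived classes and two instantiations, the real test file has 6084 of each:

    // One original template class (the real test file has over 600 templates).
    template <typename T>
    class Worker {
    public:
        T data;
        T twice() { return data + data; }
    };

    // Small classes derived from various instantiations of the templates.
    class Item0001 : public Worker<int>    { };
    class Item0002 : public Worker<double> { };
    // ... 6084 of these in the real test file ...

    // One instantiation of the original template per derived class,
    // giving 6084 distinct template instantiations for the compiler to generate.
    int main() {
        Worker<Item0001> w0001;
        Worker<Item0002> w0002;
        // ... 6084 of these ...
        return 0;
    }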
Compile only, release build (no debug symbols); the bottom row shows the size of the .o object file:

Intel, no optimization | G++, no optimization | clang, no optimization | Intel, full optimization | G++, full optimization | clang, full optimization |
real 0m3.136s | real 0m4.840s | real 0m1.566s | real 0m6.074s | real 0m2.974s | real 0m1.752s |
user 0m2.952s | user 0m4.400s | user 0m1.496s | user 0m5.868s | user 0m2.632s | user 0m1.652s |
sys 0m0.132s | sys 0m0.436s | sys 0m0.064s | sys 0m0.184s | sys 0m0.344s | sys 0m0.096s |
.o 2276680 bytes | .o 2306776 bytes | .o 2404312 bytes | .o 1448 bytes | .o 1312 bytes | .o 1912 bytes |
Compile and link, release build; the bottom row shows the size of the final executable in bytes:

Intel, no optimization | G++, no optimization | clang, no optimization | Intel, full optimization | G++, full optimization | clang, full optimization |
real 0m3.337s | real 0m4.987s | real 0m1.759s | real 0m6.144s | real 0m2.959s | real 0m1.627s |
user 0m3.096s | user 0m4.592s | user 0m1.664s | user 0m5.908s | user 0m2.576s | user 0m1.532s |
sys 0m0.196s | sys 0m0.384s | sys 0m0.092s | sys 0m0.196s | sys 0m0.376s | sys 0m0.084s |
864558 | 748693 | 871573 | 20331 | 8329 | 8329 |
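The exact command lines aren't shown above; the compile-only runs were along these lines (a sketch, with a hypothetical source file name of templates.cpp, and assuming -O0 for "no optimization" and -O3 for "full optimization"):

    time icpc    -c -O0 templates.cpp
    time icpc    -c -O3 templates.cpp
    time g++     -c -O0 templates.cpp
    time g++     -c -O3 templates.cpp
    time clang++ -c -O0 templates.cpp
    time clang++ -c -O3 templates.cpp

For the compile-and-link runs, the -c flag is dropped so each compiler produces a full executable instead of a .o file.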
Parallel Code Tests: Threading Building Blocks
To test Threading Building Blocks, I used some very simple TBB code. (I'm not initializing the arrays, and that's okay; I mainly wanted to see how long it took to run through the loops, and for this test I didn't care about the actual values in the array elements.)

    #include <iostream>
    #include <tbb/tbb.h>
    #include <tbb/parallel_for.h>
    #include <cstdlib>

    using namespace std;
    using namespace tbb;

    long len = 0;
    float *set1;
    float *set2;
    float *set3;

    class GrainTest {
    public:
        void operator()( const blocked_range<size_t>& r ) const {
            for (long i=r.begin(); i!=r.end(); ++i ) {
                set3[i] = (set1[i] * set2[i]) / 2.0 + set2[i];
            }
        }
    };
    int main(int argc, char* argv[]) {
        cout << atol(argv[1]) << endl;
        len = atol(argv[1]);

        set1 = new float[len];
        set2 = new float[len];
        set3 = new float[len];

        parallel_for(blocked_range<size_t>(0, len, 100), GrainTest() );

        return 0;
    }
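The build lines aren't shown in the text; a minimal sketch of how this could be compiled and run (assuming the TBB headers and libraries are installed where each compiler can find them, a hypothetical file name of tbbtest.cpp, and -O2 as a representative optimization level) would be:

    icpc -O2 tbbtest.cpp -ltbb -o tbbtest_icc
    g++  -O2 tbbtest.cpp -ltbb -o tbbtest_gcc
    time ./tbbtest_icc 10000000000
    time ./tbbtest_gcc 10000000000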
I passed a parameter of 10 billion. The Linux top utility showed me that in both cases (Intel and g++) I maxed out all 32 virtual cores, which was good. Here are the numbers:

Intel | g++ |
real 0m10.983s | real 0m10.510s |
user 0m54.519s | user 0m53.147s |
sys 3m39.858s | sys 3m30.489s |
Parallel Code: Cilk Plus Extensions
For this test I'm going to use the Cilk Plus extensions. These are available in the Intel compiler, and as special branches of the g++ and clang compilers. Although Cilk Plus is not part of the C++ standard, it's a useful approach for writing parallel code. (There are other approaches as well, which we can look at in a future article if there's interest; OpenMP, for example, is another common way to write parallel code.) Note that I originally planned to use the Cilk Plus array notation, but I wasn't able to, because that feature isn't complete in the Cilk Plus branches of either gcc or clang. So instead I'm using loops. First, I'm running the test with a multicore loop; then I'm running it with vectorized multicore loops. Here's the basic code:

    #include <iostream>
    #include <cilk/cilk.h>
    #include <cstdlib>

    using namespace std;

    int main(int argc, char* argv[]) {
        unsigned long size = atol(argv[1]);

        long* a = new long[size+1];
        long* b = new long[size+1];
        long* c = new long[size+1];

        cilk_for (long i=0; i<size; i++) {
            a[i] = i % 10;
            b[i] = i % 9;
            c[i] = a[i] * b[i] + i;
        }
    }
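Again, the exact build lines aren't shown here. The Intel compiler supports Cilk Plus out of the box, while the Cilk Plus branches of g++ and clang typically need the -fcilkplus flag and the Cilk runtime library. A sketch, with a hypothetical file name of cilktest.cpp (the exact options may differ between branch snapshots):

    icpc    -O2 cilktest.cpp -o cilktest_icc
    g++     -O2 -fcilkplus cilktest.cpp -lcilkrts -o cilktest_gcc
    clang++ -O2 -fcilkplus cilktest.cpp -lcilkrts -o cilktest_clang
    /usr/bin/time ./cilktest_icc 2000000000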
I'm testing these with an array size of two billion. Here are the results for the non-vectorized multicore code. (For the Cilk Plus tests I'm using the /usr/bin/time command, not the bash built-in.)

Intel | G++ | Clang |
User 14.81 | User 12.47 | User 19.21 |
System 205.92 | System 211.39 | System 195.16 |
Elapsed 0:09.98 | Elapsed 0:11.28 | Elapsed 0:10.96 |
For the second version, the work is split into two loops; the second loop will be vectorized automatically by all three compilers:

    #include <iostream>
    #include <cilk/cilk.h>
    #include <cstdlib>

    using namespace std;

    int main(int argc, char* argv[]) {
        unsigned long size = atol(argv[1]);

        long* a = new long[size+1];
        long* b = new long[size+1];
        long* c = new long[size+1];

        cilk_for (long i=0; i<size; i++) {
            a[i] = i % 10;
            b[i] = i % 9;
        }

        cilk_for (long i=0; i<size; i++) {
            c[i] = a[i] * b[i] + 1;
        }
    }
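If you want to confirm that the second loop really is being vectorized, each compiler can emit a vectorization report. The option names vary by compiler and version, so treat these as representative rather than exact (same hypothetical file name as above, with the two-loop version saved as cilktest2.cpp):

    icpc    -O2 -vec-report=2 cilktest2.cpp
    g++     -O2 -fcilkplus -ftree-vectorizer-verbose=2 cilktest2.cpp -lcilkrts
    clang++ -O2 -fcilkplus -Rpass=loop-vectorize cilktest2.cpp -lcilkrts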
Here are the results, with the same loop size as the previous test:

Intel | G++ | Clang |
User 14.52 | User 9.42 | User 19.21 |
System 185.91 | System 176.18 | System 195.16 |
Elapsed 0:09.42 | Elapsed 0:08.93 | Elapsed 0:10.96 |