Benchmarking is a tricky business. A valid benchmark tries to remove all extraneous variables in order to get an accurate measurement, and that process is often problematic: sometimes it's nearly impossible to remove all outside influences, and often the act of taking the measurement can skew the results. This isn't unique to computer science; it happens in science in general. Physicists can measure a particle that's in a vacuum, but they still have to contend with the so-called “observer effect,” where the simple process of observing a particle can have an impact on the state of the particle.

Measuring computer processes has roughly analogous shortcomings, especially with today's operating systems: I can test how long a process takes, and I can run the tests multiple times and do a statistical analysis, but I can't easily turn off all the outside influences. Even if I shut down as many additional processes as possible, my test isn't the only code running on the computer. I can try to isolate the process to a single core, which certainly helps, but the operating system still has power over that core.

But is that such a problem? I maintain that, in order to truly benchmark something, we need to do so in a realistic setting. While it's certainly good to know a database lookup can be done in half a millisecond, we also need to know how long it will take when the database system is bombarded with requests. We want to know the best case, the worst case, the typical case, and rare-but-occasional cases, such as spikes in usage on a Web server.

When it comes to compilers (our current focus), we need to look at the specific language we're compiling and how it gets compiled, and we want to keep things realistic. In benchmarking C++ compilers, we need to factor in, for example, how the compiler deals with templates. Templates are instantiated at compile time, and loads of templates can have an impact on how long it takes to compile.
But we also need to consider the real world: exactly how many templates are we compiling in a typical production application? If our code uses a couple dozen templates, is it important if compiling ten million templates on one compiler takes a bit longer than on another compiler? We could argue that if a developer is working at a computer and compiling hundreds of times a day, then yes, it does matter. As a software developer myself, I know how being forced to wait a few extra seconds on every compilation can really add up over the course of the day.

Another aspect of benchmarking a compiler is not how long the compiler takes to compile, but the quality of the resulting code. That gets tricky as well, because there are so many compiler options that can skew the results. One compiler might have switches to generate vectorized code turned on by default, while another compiler doesn't. The resulting code from the first compiler would be vectorized, and may run faster than the code from the second, all while the person who performed the benchmark, unaware of vectorization in the first place, may come to the possibly incorrect conclusion that the second compiler “generates slower code.” And there are different levels of vectorization, each with its own switch for different architectures, such as older SIMD architectures with 128-bit registers or newer ones with 512-bit registers.

It's important, then, to not only perform the benchmark, but to specifically note what you're benchmarking, and under what conditions. “Your mileage may vary,” as they say. So let's begin.

The C++ Compilers

We're going to compare three compilers: the Intel C++ compiler, the GNU C++ compiler (g++), and the LLVM clang compiler.

Because processors keep gaining cores, we're seeing a lot more parallel processing in C++ programs. With the right tools, we can develop a single application that spins out threads across multiple cores, whether it's a quad-core desktop machine, a 60-core Xeon Phi co-processor, or a 1,024-core Nvidia graphics processor. So I'm also going to test how the executables built by these compilers perform using two different parallel programming tools: Threading Building Blocks and Cilk Plus. (Two quick caveats, however: to test Cilk Plus on both g++ and LLVM, I had to obtain special branches of both compilers. Also, Threading Building Blocks hasn't yet been ported to clang on Linux, so clang wasn't included in that part of the test. It has been ported on the Mac, but I didn't have access to a Mac for these tests.)

For testing the compilers themselves, I decided to see how they perform on an enormous amount of code consisting of over 600 templates and six thousand different template instantiations. Most C++ programmers rely heavily on templates today. The C++ standard library is largely made up of templates, so if you're a C++ programmer, you're most likely using templates. So this would be an interesting test as well.

The Hardware

The hardware is a dual-processor system, where each processor is an eight-core Intel Xeon E5-2670 running at 2.60 GHz. Each core has two hyperthreads. That means the system has 16 cores, and 32 virtual cores.
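For readers who want to confirm the equivalent numbers on their own machine, here is a minimal sketch using the standard library. The helper names `logical_cpus` and `physical_cores` are hypothetical, and the two-threads-per-core figure is an assumption that only holds where hyperthreading is enabled, as on the system described above.

```cpp
#include <cassert>
#include <thread>

// Number of hardware threads the runtime reports. The standard allows a
// return value of 0 ("unknown"), so fall back to 1 in that case.
inline unsigned logical_cpus() {
    const unsigned n = std::thread::hardware_concurrency();
    return n == 0 ? 1u : n;
}

// Physical cores implied by a logical CPU count, given how many hardware
// threads share each core (assumed to be 2 when hyperthreading is on).
inline unsigned physical_cores(unsigned logical, unsigned threads_per_core) {
    assert(threads_per_core > 0);
    return logical / threads_per_core;
}
```

With the figures from this test system, `physical_cores(32, 2)` gives 16, matching the two eight-core processors.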

The Tests

To get the tests started, I compiled Threading Building Blocks with the Intel compiler and the g++ compiler. (For g++, I used the regular build, not the Cilk Plus branch. Initially I used the Cilk Plus one, but it ran extremely slowly. They're still building it, so I figured it wouldn't be fair to include it in this particular test when I saw that slowness.) I used identical build setups, including optimization levels.

Using the built-in time command (which is part of the bash shell; there's a separate /usr/bin/time command I'll use for later tests), I saw the following results. Note that I only compiled the Threading Building Blocks core libraries, not the samples and tests that accompany them. (The make file has the tests run automatically after they're built.) These results are only preliminary; later in the testing, I received more detailed numbers on how these compilers perform. But this will get us started.

Time to build Threading Building Blocks with Intel compiler:

real    0m28.946s
user    0m21.477s
sys     0m5.204s

Time to build Threading Building Blocks with g++:

real    0m22.941s
user    0m19.701s
sys     0m2.332s

(The Cilk Plus version of g++ took about one and a half times as long as the regular version, which tells me the compiler itself probably wasn't optimized.) Note that even though this machine is multicore, I intentionally did a single-core build by leaving off the -j option.

The bash time command gives us three numbers: the real time, the user time, and the system time. The real time is the total time elapsed from the moment the command started to the moment it ended, which includes any time the test had to sit and wait for the operating system to do other, unrelated things. The user time is a more accurate accounting of how long the process was actually running on the CPU, and the sys time is how long the operating system spent performing work the process asked it to perform.
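The same three numbers can be collected from inside a program, which may help make the distinction concrete. The following is a minimal sketch for Linux and other POSIX systems, assuming `getrusage` is available; the `Timing` struct, `busy_work`, and `time_busy_work` are hypothetical names introduced only for illustration, and this is not how the bash time command is implemented.

```cpp
#include <sys/resource.h>  // getrusage, struct rusage (POSIX)
#include <sys/time.h>      // struct timeval
#include <cassert>
#include <chrono>
#include <cstdint>

struct Timing { double real_s, user_s, sys_s; };

inline double to_seconds(const timeval& tv) {
    return tv.tv_sec + tv.tv_usec / 1e6;
}

// Hypothetical workload: a loop that stays in user space. The volatile
// accumulator keeps the compiler from optimizing the loop away.
inline std::uint64_t busy_work(std::uint64_t n) {
    volatile std::uint64_t sum = 0;
    for (std::uint64_t i = 0; i < n; ++i) sum = sum + i;
    return sum;
}

// Times one call to busy_work(n): wall-clock ("real") time from a steady
// clock, user and system CPU time from getrusage for this process.
inline Timing time_busy_work(std::uint64_t n) {
    assert(n > 0);
    rusage before{}, after{};
    getrusage(RUSAGE_SELF, &before);
    const auto start = std::chrono::steady_clock::now();
    busy_work(n);
    const auto stop = std::chrono::steady_clock::now();
    getrusage(RUSAGE_SELF, &after);
    return { std::chrono::duration<double>(stop - start).count(),
             to_seconds(after.ru_utime) - to_seconds(before.ru_utime),
             to_seconds(after.ru_stime) - to_seconds(before.ru_stime) };
}
```

For a single-threaded, compute-bound call like this one, the user time would be expected to sit close to the real time with very little system time; a large gap between real and user time usually means the process was waiting on something else.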
I'll list all three numbers, but the most important one here is the user time. The g++ compiler ran faster. But remember, this is building the libraries, which is more than just compiling. Specifically, it's building three libraries with both the debug and release versions of each, for a total of six libraries. To build each library, multiple C++ files are compiled and then linked. If we wanted to, we could add measurements right inside the make configuration file, adding a time command before the compiler calls, like so:

CPLUS = /usr/bin/time -f "  TIME %E" g++

This would use the standalone /usr/bin/time command, not the bash one. Then do the same with the build configuration for the Intel compiler. In doing so, I was able to compare the compile-only times side by side. The results are too long to print here, but in general g++ was faster, taking between 75 and 85 percent of the Intel time. This gives us a general idea of what we have. Now let's try some more specific tests.

Compiling a Gigantic C++ File

I won't post the code here because it's very long, but I'll describe it. The code I'm testing contains no #include directives, and makes use of only standard C++ code. It starts with one template class, followed by 6084 small classes derived from various template instantiations. (So these 6084 classes are technically not templates themselves.) Then I create 6084 instantiations of the original template class, using each of the 6084 classes. The end result is 6084 different template instantiations. Now, obviously in real life we wouldn't write code like that (at least I hope you don't). But the goal here was to make something complex that would simply take a good while to compile without regard to linking.

For each compiler, then, I first did two compile-only runs: one with no optimization and one with maximum optimization. Then I repeated the tests, but did a full link afterwards so that I could measure the final file size. I included the times for the compile and link as well.

Compile only, release build (no debug symbols); bottom row shows .o object file size:

        Intel, no optimization    G++, no optimization      clang, no optimization    Intel, full optimization  G++, full optimization    clang, full optimization
real    0m3.136s                  0m4.840s                  0m1.566s                  0m6.074s                  0m2.974s                  0m1.752s
user    0m2.952s                  0m4.400s                  0m1.496s                  0m5.868s                  0m2.632s                  0m1.652s
sys     0m0.132s                  0m0.436s                  0m0.064s                  0m0.184s                  0m0.344s                  0m0.096s
.o      2276680 bytes             2306776 bytes             2404312 bytes             1448 bytes                1312 bytes                1912 bytes

Compile and link, release build; bottom row shows executable/binary file size:

        Intel, no optimization    G++, no optimization      clang, no optimization    Intel, full optimization  G++, full optimization    clang, full optimization
real    0m3.337s                  0m4.987s                  0m1.759s                  0m6.144s                  0m2.959s                  0m1.627s
user    0m3.096s                  0m4.592s                  0m1.664s                  0m5.908s                  0m2.576s                  0m1.532s
sys     0m0.196s                  0m0.384s                  0m0.092s                  0m0.196s                  0m0.376s                  0m0.084s
size    864558 bytes              748693 bytes              871573 bytes              20331 bytes               8329 bytes                8329 bytes

The clang compiler outran the other two by quite a distance, typically taking about half the time of the others. The Intel compiler was faster than g++ when no optimization was allowed, but took considerably longer when optimization was turned on. That suggests we could do a future set of tests to actually measure the optimization, including looking at the generated assembly code, and so on.
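Since the test file itself isn't reproduced, here is a scaled-down sketch of the shape described above: one template class, a few small non-template classes derived from instantiations of it, and then one instantiation of the template per small class. Every name here (`Box`, `Small1`, `instantiate_all`) is hypothetical, three classes stand in for 6084, and the actual test file may differ in its details.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical stand-in for the one template class at the top of the file.
template <typename T>
class Box {
public:
    Box() : value_() {}                      // value-initialize the payload
    const T& get() const { return value_; }
private:
    T value_;
};

// Small non-template classes, each derived from some instantiation of Box.
class Small1 : public Box<int> {};
class Small2 : public Box<Small1> {};
class Small3 : public Box<Small2> {};

static_assert(std::is_base_of<Box<Small1>, Small2>::value,
              "Small2 derives from an instantiation but is not a template");

// One instantiation of the original template per small class, mirroring
// the "6084 instantiations" step at a scale of three.
inline int instantiate_all() {
    Box<Small1> b1;
    Box<Small2> b2;
    Box<Small3> b3;
    // Walk down to the innermost int of each object; all are zero because
    // every level value-initializes its payload.
    const int v1 = b1.get().get();
    const int v2 = b2.get().get().get();
    const int v3 = b3.get().get().get().get();
    assert(v1 == 0 && v2 == 0 && v3 == 0);
    return v1 + v2 + v3;
}
```

Each new class forces the compiler to instantiate another distinct specialization of `Box`, which is the kind of work the compile-time numbers above are measuring. Code this simple is also easy for an optimizer to discard, which may be why the fully optimized object files are so small.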

Parallel Code Tests: Threading Building Blocks

To test Threading Building Blocks, I used some very simple TBB code:

#include <iostream>
#include <tbb/tbb.h>
#include <tbb/parallel_for.h>
#include <cstdlib>

using namespace std;
using namespace tbb;

long len = 0;
float *set1;
float *set2;
float *set3;

class GrainTest {
public:
    void operator()( const blocked_range<size_t>& r ) const {
        for (long i=r.begin(); i!=r.end(); ++i ) {
            set3[i] = (set1[i] * set2[i]) / 2.0 + set2[i];
        }
    }
};

int main(int argc, char* argv[]) {
    cout << atol(argv[1]) << endl;
    len = atol(argv[1]);

    set1 = new float[len];
    set2 = new float[len];
    set3 = new float[len];

    parallel_for(blocked_range<size_t>(0,len, 100), GrainTest() );

    return 0;
}

(I'm not initializing the arrays, and that's okay; I mainly wanted to see how long it took to run through the loops, and I didn't care with this test about the actual values in the array elements.) I passed a parameter of 10 billion. The Linux top utility showed me that in both cases (Intel and g++), I maxed out all 32 virtual cores, which was good. Here are the numbers:

        Intel         g++
real    0m10.983s     0m10.510s
user    0m54.519s     0m53.147s
sys     3m39.858s     3m30.489s

Again, the code built by g++ finished in less time, although not by much. To be sure, I ran this multiple times, and overall, the g++ build ran slightly faster, by about 5 percent.
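When comparing repeated runs like this, it can help to reduce each set of timings to a few summary numbers before quoting a percentage. The following is a minimal sketch of that arithmetic; `Summary`, `summarize`, and `percent_faster` are hypothetical helpers, and the individual run times from these tests were not published, so none are reproduced here.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

struct Summary { double mean, stddev, min; };

// Mean, sample standard deviation, and fastest time for a set of runs.
inline Summary summarize(const std::vector<double>& runs) {
    assert(!runs.empty());
    double sum = 0.0, lo = runs.front();
    for (double r : runs) {
        sum += r;
        if (r < lo) lo = r;
    }
    const double mean = sum / runs.size();
    double sq = 0.0;
    for (double r : runs) sq += (r - mean) * (r - mean);
    // Sample standard deviation (n - 1); zero when there is only one run.
    const double sd =
        runs.size() > 1 ? std::sqrt(sq / (runs.size() - 1)) : 0.0;
    return { mean, sd, lo };
}

// Percentage by which `candidate` is faster than `baseline`
// (positive means the candidate took less time).
inline double percent_faster(double baseline, double candidate) {
    assert(baseline > 0.0);
    return 100.0 * (baseline - candidate) / baseline;
}
```

Applied to the single pair of real times in the table, `percent_faster(10.983, 10.510)` comes out at roughly 4.3 percent, which is in line with the "about 5 percent" seen across the repeated runs. If the standard deviation across runs is comparable to the gap between the two means, the difference is probably not meaningful.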

Parallel Code: Cilk Plus Extensions

For this test I'm going to use the Cilk Plus extensions. These are available in the Intel compiler, and as special branches of the g++ and clang compilers. Although Cilk Plus is not part of the C++ standard, it's a useful approach for writing parallel code. (There are other approaches as well, which we can study in a future article if there's interest. For example, OpenMP is another common approach for writing parallel code.)

Note that originally I planned to make use of the Cilk Plus array notation, but I was unable to, as the feature isn't complete in the Cilk Plus extensions of either g++ or clang. So instead I'm using loops. First, I'm running the tests with multicore loops; then I'm running the tests with vectorized multicore loops. Here's the basic code:

#include <iostream>
#include <cilk/cilk.h>
#include <cstdlib>

using namespace std;

int main(int argc, char* argv[]) {
    unsigned long size = atol(argv[1]);
    long* a = new long[size+1];
    long* b = new long[size+1];
    long* c = new long[size+1];

    cilk_for (long i=0; i<size; i++) {
        a[i] = i % 10;
        b[i] = i % 9;
        c[i] = a[i] * b[i] + i;
    }
}

I'm testing these with an array size of two billion. Here are the results for the non-vectorized multicore code. (I'm using the /usr/bin/time command, not the bash version of time.)

           Intel      G++        Clang
User       14.81      12.47      19.21
System     205.92     211.39     195.16
Elapsed    0:09.98    0:11.28    0:10.96

Notice the system time is higher than the elapsed time. That's because we're dealing with multiple cores: CPU time is added up across every core that's busy, while elapsed time is wall-clock time. This time, the Intel build finished fastest.

For this final test, I'm going to use vectorized loops. I can't do a direct comparison to the preceding non-vectorized code, because the single loop I have in that code doesn't vectorize. But I can split it into two loops, in which case the second will vectorize. Then I can compare the vectorized code between the compilers. Here's the modified code:

#include <iostream>
#include <cilk/cilk.h>
#include <cstdlib>

using namespace std;

int main(int argc, char* argv[]) {
    unsigned long size = atol(argv[1]);
    long* a = new long[size+1];
    long* b = new long[size+1];
    long* c = new long[size+1];

    cilk_for (long i=0; i<size; i++) {
        a[i] = i % 10;
        b[i] = i % 9;
    }

    cilk_for (long i=0; i<size; i++) {
        c[i] = a[i] * b[i] + 1;
    }
}

The second loop will be vectorized automatically by all three compilers. Here are the results, with the same loop size as the previous test:

           Intel      G++        Clang
User       14.52      9.42       19.21
System     185.91     176.18     195.16
Elapsed    0:09.42    0:08.93    0:10.96

Notice that clang didn't do so well. But that's because it's still an early release; as of this writing, the official page for the Cilk Plus extensions for clang even states: “SIMD-enabled functions and #pragma simd loops are currently not well-vectorized by LLVM.”
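The earlier observation that system time exceeds elapsed time in these Cilk Plus runs can be checked with a little arithmetic: dividing total CPU time (user plus system) by elapsed time gives a rough figure for how many cores were busy on average. This is only a sketch; `elapsed_to_seconds` and `average_busy_cores` are hypothetical helper names, and the parser assumes the minutes:seconds form shown in the tables.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Converts an elapsed value in the "m:ss.ss" form printed by /usr/bin/time
// (for example "0:09.98") into seconds.
inline double elapsed_to_seconds(const std::string& text) {
    int minutes = 0;
    double seconds = 0.0;
    const int fields = std::sscanf(text.c_str(), "%d:%lf", &minutes, &seconds);
    assert(fields == 2);
    (void)fields;  // silence unused-variable warnings when NDEBUG is defined
    return 60.0 * minutes + seconds;
}

// Rough estimate of how many cores were busy on average:
// (user + system CPU seconds) / elapsed wall-clock seconds.
inline double average_busy_cores(double user_s, double sys_s, double elapsed_s) {
    assert(elapsed_s > 0.0);
    return (user_s + sys_s) / elapsed_s;
}
```

For the Intel column of the non-vectorized table, `average_busy_cores(14.81, 205.92, elapsed_to_seconds("0:09.98"))` works out to roughly 22, meaning the run kept about 22 of the 32 virtual cores occupied on average.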

Conclusion

It's interesting that the code built with the g++ compiler performed the best in most cases, although the clang compiler proved to be the fastest in terms of compilation time. But I wasn't able to test much regarding parallel processing with clang, since its Cilk Plus extensions aren't quite ready, and the Threading Building Blocks team hasn't ported the library to it yet.

Now finally, let's be absolutely clear: I'm not claiming any one compiler is “better” than the others. I'm making claims about performance in these particular tests. One further test I'd like to see is OpenMP testing. Also, does a longer optimization pass always translate into better performance? The Intel compiler took much longer to optimize. Does that mean it did a better job? We can't know until we go inside and find out. Care to help?