Within the next 10 to 15 years, we may reach the end of Moore's Law, the observation that the number of transistors on a CPU doubles roughly every two years, with an accompanying bump in processor speed. In theory, transistors will eventually become too small to maintain the trend; even now, manufacturers and developers seem to agree that the way forward involves adding more cores to processors and using various forms of parallelization (such as vectorization, which performs multiple data operations with a single instruction). Writing code that utilizes more cores is harder in older languages, such as C++, which were created back when most processors had just one core. However, various methods have been devised; in this article, we'll look at two of them, with sample code.

The Five Reasons

Here are the key reasons for writing code that uses more cores:
  • Speed up processing
  • Keep processes interactive
  • Overlap data and I/O processing
  • Use fewer resources
  • Simplify sharing
Processing can be sped up if the work splits into multiple discrete computations and the process runs on a CPU with multiple cores, translating directly into reduced execution time. If the application is GUI-based and performs intensive computation, the GUI will freeze unless the heavy lifting takes place in a background thread. The same is true for large I/O operations, whether disk- or network-based. Spreading the processing across multiple cores can also reduce resource usage compared to doing everything in one thread. Finally, creating and managing threads is a challenge if you write the multithreading code yourself (perhaps using POSIX or Windows threads): there are a lot of edge cases that may only reveal themselves in production due to timing, and they may never appear in a debugger.

Threading Without Threads

So we want to use threads but would prefer to avoid the possible bugs. A few libraries provide this functionality; the most popular are the dual-licensed Intel TBB (Threading Building Blocks) and OpenMP, an open standard implemented by many compilers. Both support C++; OpenMP also supports C and Fortran, while TBB is a C++-only library.

Intel TBB

Consisting of data structures and algorithms, Intel TBB uses templates to balance the workload across multiple cores. It can dynamically reschedule work between cores once a core finishes a task and becomes free. Unlike OpenMP, it works with any C++ compiler and doesn't require specific compiler support; the Intel TBB shared library has to be linked in, and its features are then accessed via the tbb namespace. TBB has parallel versions of for, do, while, sort and other constructs. It also provides concurrent containers for queues, vectors and hash maps, scalable memory allocation and atomic operations — low-level operations that modify memory locations in a way that is guaranteed to be unaffected by other threads. The example below uses tbb::parallel_for to map a large number of values, obtained by calling the function getValue(), into the outputs array; these are then reduced to a single value, output: [cpp]
// Hold our outputs
std::vector<double> outputs(kMaxValues);

// Use a parallel for loop with a C++11 lambda to perform the mapping into outputs
tbb::parallel_for(size_t(0), kMaxValues,
    [&outputs](size_t i) { outputs[i] = getValue(i); });

// Do the reduction serially
double output = 0.0;
for (double v : outputs) { output += v; }
[/cpp]


OpenMP

OpenMP works by having a master thread fork worker threads wherever parallelized code needs to run. In contrast to Intel TBB, compilers need to explicitly implement the OpenMP directives; the most up-to-date version of the standard is OpenMP 4.0, though Microsoft Visual C++ 2013 only supports OpenMP 2.0. Adding OpenMP to your program usually requires a command-line switch to enable it in the compiler, plus pragmas added around the sections of source code that can be parallelized. You'll also need to include omp.h, and there are a couple of functions you may want to call: omp_set_num_threads(), which sets the number of threads available, and omp_set_dynamic(), which allows dynamic adjustment of the thread count at runtime (pass 0 to disable this): [cpp]
#define THREADS 8

#ifdef _OPENMP
omp_set_num_threads(THREADS);
omp_set_dynamic(THREADS); // non-zero enables runtime adjustment
#endif
[/cpp] In the code below, I've used #pragma omp parallel for to tell the compiler to run the loop in parallel (if possible). Note the reduction(+: cost) clause: it gives each thread a private copy of cost and adds the copies together at the end; without it, the cost += dist line would be a data race. This is how you mark up your program: [cpp]
vector<node>* routes[num_cities - 1];

#pragma omp parallel for reduction(+: cost)
for (int i = 0; i < num_cities - 1; ++i) {
    vector<node>* search_nodes = new vector<node>();
    routes[i] = search_nodes;
    int dist = find_route(map, city_locs, i, *search_nodes);
    cost += dist;
}
[/cpp] For areas with critical sections, you'll use #pragma omp critical, and so on. OpenMP is very good at sharing data between the parallel and non-parallel parts of a program: you can specify whether data is private to a thread or shared, and how it's initialized (e.g., copied in from the main thread). (Note: One criticism of OpenMP is that it can lead to false sharing, where two or more cores repeatedly write to data that happens to sit on the same cache line, forcing that line to bounce between cores and hurting performance.)


If you want to speed up your C++ software with threading, but are concerned about the complexity and risks of doing it by hand, consider trying Intel TBB or OpenMP. Both are free, though you should check the licenses.