Big Data is the generic term for massive amounts of data generated mainly (but not exclusively) by human activities. The growth in structured and unstructured data has accelerated year after year, currently standing (according to IBM
) at 2.5 quintillion bytes per day. For those keeping count at home, that's roughly 912,500 petabytes per year. Processing even a tiny fraction of that data presents a major challenge for even the most resource-flush company or developer, but it can be done, via the methods listed below. The first two are through hardware; the rest involve the Intel Data Analytics Acceleration Library (DAAL).
While processors with interrupts
have been around since the 1970s, interrupt-driven multitasking isn't true parallelism: a single CPU still executes only one task at a time. For true parallelism, you need multi-core processors. With more than one core, you can carry out multiple operations simultaneously; you might, for example, sort an array by partitioning it into two halves. If each core sorts one half, the job is done more quickly. (Take care, though, to keep the cores from writing to the same cache line; that kind of false sharing kills performance.) If you're a lone developer or business analyst attempting to churn through data, you need to keep your processors busy. My six-core i7-5930K only reaches about eight percent total CPU utilization when a single core is running flat-out. With hyper-threading effectively doubling the six cores to 12, it takes something like multi-threading (or the easier-to-program task-based approach using std::async and futures in C++11) to crank up the power. There's a slight overhead in distributing the work between processors, but throughputs of 10x are achievable. If you want to see this in action, try the Poker Evaluation Code
that evaluates a million poker hands using single- or multi-tasking (31 seconds single-tasking versus 3.5 seconds multi-tasking). The evaluation function was so fast that I had to add a one-millisecond delay; without it, the scheduling overhead made the multi-tasking version slightly slower than the single-tasking one. Although many C++ compilers can auto-parallelize, turning sequential code into parallel code, they can prove quite conservative, so you may need to use something like OpenMP to exploit your cores to the fullest.
The second hardware technique, vectorization, can also boost throughput, though in my experience not quite as dramatically as parallelization. Every Intel processor since the Pentium III has had SIMD (Single Instruction Multiple Data) capabilities; you'll see these technologies under names such as SSE, SSE2, SSE3, SSSE3, SSE4, AVX, AVX2 and AVX-512 (my i7-5930K supports SSE4.2 and AVX 2.0). The processor is equipped with extra-wide registers: 128 bits in SSE, 256 bits in AVX/AVX2, and up to 512 bits in AVX-512. These registers are loaded with numeric data such as ints, floats or doubles, but always the same type: no mixing ints with floats. With one instruction, the processor can add, subtract or multiply all the elements in the register in one go. For example, a 256-bit register can hold eight 4-byte floats. Many compilers, including GCC and Clang
, can perform auto vectorization for you, but they are pretty conservative; as with parallelization, you may need to use OpenMP
. Unless crunching numbers is the main function of your program, you may find the gains are nowhere near as dramatic as with parallelization. The time spent doing computations may be only a small part of the overall execution time, so doing the calculations faster won't speed the program up that much.
The workflow stages of Intel DAAL, a library for both C++ and Java, are as follows: data is prepared and converted before being analysed, modelled, validated, and finally used to make predictions. Intel DAAL is available as part of the commercial Intel Parallel Studio, but it's also available for free
via the community edition that’s both zero-cost and royalty-free. (You don’t get any support or maintenance, of course.) Big Data may be structured or unstructured, in databases via ODBC
, or on disk; it's very likely to be incomplete at any given time. You need to transform it into a form suitable for processing, but even after that transformation it will likely still be too big to fit in memory, so it may need to be streamed, compressed, or both. Classes
are provided for serialization/deserialization and compression/decompression. If data is missing or incomplete, you can still use data-model classes to mimic actual data. Intel DAAL provides classes that let you acquire data, as well as filter and normalize it, via data-source interfaces. The data is then converted to a numeric representation held in numeric tables, and streamed from those tables to the processing algorithms. After processing, the resulting data is stored in a model or in numeric tables. Interfaces are provided for custom or unsupported data types, though you'll have to create the implementation classes to handle them. There's a comprehensive library of algorithms for statistical analysis, including machine learning as well as training and prediction classifiers. Not all algorithms support streaming or distributed processing, but all support batch processing, which requires the entire data set to be available at once. Intel DAAL's services component lets your program manage memory and control the number of threads, among other functions. (You can also determine versioning information for the library.)
There are also free versions of the Intel Math Kernel Library (Intel MKL), Intel Integrated Performance Primitives (Intel IPP) and Intel Threading Building Blocks (Intel TBB), though those are for C++ only, not Java. To use the various algorithms, you'll need some familiarity with statistical and machine-learning methods.