In any article or blog post about Big Data, a mention of Hadoop is rarely far behind. Apache Hadoop has long been the big elephant in the room, and the release of Hadoop 2.0 in 2013 made the environment easier to use and more stable. But even with the addition of Impala for querying stored data in real time, Hadoop is still a batch-based system that processes data in, well, batch mode.

Big Data processing is said to have three main characteristics, a.k.a. the 3Vs: volume, velocity and variety. Many data scientists and Big Data engineers will argue that Apache Hadoop processes data fast enough to meet the velocity criterion. Perhaps so. However, coming from a real-time and high-performance computing background, I find this argument about Hadoop and velocity reminiscent of engineers and computer scientists in the ‘90s who would ask, “Why do you need 64 bits when 32 bits work perfectly fine?”

The need to process Big Data really, really fast -- at higher velocities than Hadoop can manage -- has left the door open for a myriad of proprietary and open source solutions. It has also led the Apache Software Foundation -- the “grey beards” sponsoring Hadoop and other projects -- to fast-track viable solutions from incubator status to top-level projects.

Enter Apache Spark, a “lightning-fast cluster computing” solution for Big Data processing. For those of you who run Apache Hadoop 2.0: how would you like to run programs up to 100 times faster than Hadoop MapReduce in memory, or 10 times faster on disk? How would you like to write applications quickly in Java, Python or Scala, and build parallel applications that take full advantage of your distributed environment? How would you like to combine SQL, streaming and complex analytics in the same application? And finally, how would you like to keep accessing and processing all the data in your current Hadoop environment? This is what Apache Spark can do.

So what is Apache Spark? Written in Scala, Spark is an open source distributed computing framework for advanced analytics that can leverage much of the Hadoop storage environment (such as HDFS). It began as a research project at UC Berkeley's AMPLab. In June 2013, Spark entered incubation at the Apache Software Foundation, and in February 2014 it was promoted to a top-level project. As of this writing, Apache Spark 0.9.0 is available as open source.
Why Spark instead of Hadoop?
From the beginning, Spark was designed to support in-memory processing, so iterative algorithms can be developed without writing out a result set after each pass through the data. Keeping everything in memory is a classic high-performance computing technique applied here to advanced analytics, and it lets Spark run at processing speeds reported to be up to 100 times faster than comparable algorithms using MapReduce.

Spark also has an integrated framework for advanced analytics, including the machine learning library MLlib, the graph engine GraphX, the streaming analytics engine Spark Streaming and the interactive real-time query tool Shark. This single platform gives users consistent results across different types of analysis.

Spark's core abstraction is the Resilient Distributed Dataset (RDD): a read-only, partitioned collection of records created through “deterministic operations on stable data.” RDDs include “information about data lineage together with instructions for data transformation and instructions for persistence.” In practice, an RDD is a distributed collection of objects that can be cached in memory across multiple cluster nodes, and it is designed with failure in mind: if a node is lost, the affected partitions are automatically reconstructed from their lineage. For more information on RDDs, please see this white paper from UC Berkeley's AMPLab. (A minimal RDD sketch appears at the end of this section.)

Spark works with any file stored in HDFS or in any other storage system supported by Hadoop, and it supports text files, SequenceFiles and any other Hadoop InputFormat.

Spark Streaming provides an abstraction called “discretized streams,” or DStreams. A DStream is a continuous sequence of RDDs representing a stream of data, created either from live ingested data or by transforming other DStreams. Spark receives data, divides it into short time-burst batches, replicates the batches for fault tolerance, and persists them in memory, where they are available for computation. Each batch becomes an RDD of its own, which is then processed with the usual set of Spark operations. (A streaming sketch also appears at the end of this section.)

Spark offers programming interfaces for Scala, Java, Python and now R. One of the really cool things about Spark is that you can download and run it on your laptop (Note: Don’t do this on a Chromebook). For those new to Spark, this is an excellent opportunity to get familiar with this new technology and get ahead of the game. For those interested in learning more about Apache Spark, here’s a quick tutorial.

Who supports Spark commercially? Cloudera, Hortonworks and Databricks all appear to. For those interested in seeing how businesses are using Spark, the recent O’Reilly Strata 2014 Conference in Santa Clara featured a great presentation by one of the core Spark team members. You can see the YouTube video here.
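To make the RDD discussion above concrete, here is a minimal Scala sketch of the pattern described: build an RDD from a file, apply lazy transformations, and cache the result in memory so repeated passes avoid re-reading from disk. The master URL, HDFS path and the "ERROR"/"timeout" filters are placeholders of my own, not anything prescribed by Spark.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

object RDDSketch {
  def main(args: Array[String]) {
    // Connect to a local Spark instance; swap in your cluster's master URL.
    val sc = new SparkContext("local[2]", "RDD Sketch")

    // Build an RDD from a text file in HDFS (the path is hypothetical).
    val lines = sc.textFile("hdfs:///data/server.log")

    // Transformations are lazy; nothing runs until an action is called.
    val errors = lines.filter(line => line.contains("ERROR"))

    // cache() keeps the RDD in cluster memory, so iterative passes
    // over the same data avoid going back to disk.
    errors.cache()

    // Both actions below reuse the cached partitions instead of rescanning HDFS.
    println("Total errors: " + errors.count())
    println("Timeouts: " + errors.filter(_.contains("timeout")).count())

    sc.stop()
  }
}
```

Note that because the errors RDD carries its lineage (textFile, then filter), Spark can rebuild any lost partition by replaying those steps on a surviving node, which is the fault-tolerance behavior described above.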
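And here is a comparable sketch of the DStream model: a word count over a TCP socket, where every second the stream's contents are packaged into a new RDD and the same transformations run over each batch. The host, port and one-second batch interval are assumptions for illustration.

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._

object DStreamSketch {
  def main(args: Array[String]) {
    // A one-second batch interval: incoming data is chopped into
    // one-second RDDs -- the "discretized stream".
    val ssc = new StreamingContext("local[2]", "DStream Sketch", Seconds(1))

    // Listen on a TCP socket (host and port are placeholders).
    val lines = ssc.socketTextStream("localhost", 9999)

    // Each DStream operation is applied to the underlying RDDs batch by batch.
    val counts = lines.flatMap(_.split(" "))
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)

    // Print the first few counts of each batch to the console.
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```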
Conclusion
I’ve had an opportunity to use Apache Spark in real-world analytics since Sept. 2013. I found GraphX to be a bit buggy, but then I believe it’s still in beta. I used Scala, Python and R to develop applications on Spark and found it to be blazingly fast. Was this on a laptop? No. It was on a high-performance computing cluster of six nodes with lots and lots of memory that also had an existing Hadoop infrastructure.

Would I recommend Spark to others? Yes. Spark fixes many “oversights” that I see in Hadoop. It’s fast, and it gets the jobs I need done very quickly. Would I dump my Hadoop and/or Storm environments? No, not yet. Spark is very new, and companies have only recently begun to adopt and support it. If that support continues, Spark could be the next evolutionary change in Big Data processing environments.