Here's a little secret about how Apache Hadoop can help in processing Big Data: It does it in a batch processing mode.
At present, Hadoop can't process data in real-time, or even near real-time. What does this mean? For data that is on a file system or in some kind of storage container (like a data base), it is a matter of ingesting this data into Hadoop, then off we go doing whatever processing is necessary to produce analytic results. For data that is streaming or in real time (like from sensors, streamed video, etc.), this is a bit more difficult and again involves capturing the data to a database or file system and then ingesting it into Hadoop as normal. This was not very elegant and certainly not a very efficient way to process real-time data.
In 2010, a number of Google
researchers developed and released the paper Dremel: Interactive Analysis of WebScale Datasets
, where they describe, design and architect a scalable, interactive ad-hoc query system for doing analysis. This is basically a real-time architecture for doing MapReduce computing. The Dremel paper led to the creation of an IaaS offering as a RESTful Web service called Google BigQuery (BQ). Today, BQ is available to developers programmatically and by subscription from Google. To use data with BQ, the data first must be uploaded to Google Storage and then imported using the BQ HTTP API.
The Dremel paper also influenced the Apache Incubator project Drill. Drill is currently in the design phase with an active team. The project’s objective is to be an open source software framework that supports data-intensive distributed applications for interactive analysis of large-scale datasets. Part of the design goals is to scale to over 10,000 servers, process petabytes of data and trillions of records a second, and provide the flexibility needed to support a range of query languages, data formats and data sources. It also must be able to process structured, unstructured and semi-structured data in real-time. As the Apache Drill Project evolves, it is expected that more feature rich capabilities will be added to support the real-time ability to process Big Data.
There are currently two projects that can be used to process Big Data in real-time. The first is Storm and the newest is Impala.
Storm is an open source fault-tolerant distributed real-time computation system that originally came from Twitter. Storm provides a set of primitives for undertaking distributed real-time computation and is used with a queuing system like Kestrel, Kafka or JMS. A NoSQL database provides a similar capability as Hadoop except it's designed to operate in real-time as opposed to batch. That means that Storm is able to process unbounded streaming data in real-time.
The newest open source project for processing in real-time – and just made available in beta this past October -- is Cloudera’s Impala. With Impala, you can do real-time data queries in Apache Hadoop, and it doesn't matter whether the data is stored in HDFS or Apache HBase. Impala directly accesses the data through a specialized distributed query engine similar to those found in commercial parallel RDBMSs.
So are these real-time systems replacements for MapReduce? Absolutely not. There will continue be many viable use cases for Hadoop/MapReduce. Apache Drill, Storm and Impala complement current approaches and cases where users need to interact and process very large data sets across multiple sources, but where focused result sets are needed very quickly.