Beyond Hadoop: Open-Source Options for Big Data Analysis
It’s virtually impossible to read an article about big data without running into Hadoop and the NoSQL class of databases. With a potential $34 billion to be spent on big data initiatives in 2013, according to Gartner, the market is widening to include other platforms and technologies that can tackle the number crunching and analytics that big data demands.

Storm and Kafka are two examples. Storm, which was developed at Twitter, is a distributed real-time computation system that does for real-time processing what Hadoop did for batch processing. Kafka is a messaging system developed at LinkedIn that serves as the foundation for its activity stream and for the data processing pipeline behind it. Used together, the two support stream processing that scales linearly, with every message processed in real time at rates of tens of thousands of messages per second.

R is an open source statistical programming language that has been around since 1997 and is used by two million analysts. It’s quickly becoming the new standard for statistics. Industry watchers have noted that R has made headway in the SAS and SPSS markets and has been adopted by statisticians, data scientists and analysts alike.

Gremlin and Giraph are built for graph analysis and are sometimes used in tandem with graph databases such as Neo4j or InfiniteGraph, or sometimes with Hadoop itself. Think of them as the open source alternatives to Google’s Pregel.

If you’d rather extend the functionality of Hadoop and make it accessible to more people, you can check out Platfora, a new application that runs on top of Hadoop and makes it easier for non-analysts to drill into the data. Platfora creates a layer over Hadoop and its data and helps users visualize what they are exploring through what the program calls “lenses.” A Web-based interface for data analysis is just what some enterprises are looking for.