Apache Hadoop

More about Apache Hadoop

Apache Hadoop is an open-source software framework for distributed storage and distributed processing of Big Data on clusters of commodity hardware. Its Hadoop Distributed File System splits files into large blocks and distributes the blocks amongst the nodes in the cluster. For processing the data, the Hadoop Map/Reduce ships code to the nodes that have the required data, and the nodes then process the data in parallel. This approach takes advantage of data locality, in contrast to conventional HPC architecture which usually relies on a parallel file system. Since 2012, the term "Hadoop" often refers not to just the base Hadoop package but rather to the Hadoop Ecosystem, which includes all of the additional software packages that can be installed on top of or alongside Hadoop, such as Apache Hive, Apache Pig and Apache Spark. The base Apache Hadoop framework is composed of the following modules: Hadoop Common contains libraries and utilities needed by other Hadoop modules. Hadoop Distributed File System a distributed file-system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster.

Apache Hadoop Popularity Over Time

Related Courses udemy

Description content from Freebase