Introduction to Hadoop

Nov 022013

Apache Hadoop is an open source software project based on JAVA. Basically it is a framework that is used to run applications on large clustered hardware (servers). It is designed to scale up from a single server to thousands of machines, with a very high degree of fault tolerance. Rather than relying on high-end hardware, the reliability of these clusters comes from the software’s ability to detect and handle failures of its own.

Credit for creating Hadoop goes to Doug Cutting and Michael J. Cafarella. Doug a Yahoo employee found it apt to rename it after his son’s toy elephant “Hadoop”. Originally it was developed to support distribution for the Nutch search engine project to sort out large amount of indexes.

In a layman’s term Hadoop is a way in which applications can handle large amount of data using large amount of servers. First Google created Map-reduce to work on large data indexing and then Yahoo! created Hadoop to implement the Map Reduce Function for its own use.

Map Reduce: The Task Tracker- Framework that understands and assigns work to the nodes in a cluster. Application has small divisions of work, and each work can be assigned on different nodes in a cluster. It is designed in such a way that any failure can automatically be taken care by the framework itself.

HDFS- Hadoop Distributed File System. It is a large scale file system that spans all the nodes in a Hadoop cluster for data storage. It links together the file systems on many local nodes to make them into one big file system. HDFS assumes nodes will fail, so it achieves reliability by replicating data across multiple nodes.

Big Data being the talk of the modern IT world, Hadoop shows the path to utilize the big data. It makes the analytics much easier considering the terabytes of Data. Hadoop framework already has some big users to boast of like IBM, Google, Yahoo!, Facebook, Amazon, Foursquare, EBay etc. for large applications. Infact Facebook claims to have the largest Hadoop Cluster of 21PB. Commercial purpose of Hadoop includes Data Analytics, Web Crawling, Text processing and image processing.

Most of the world’s data is unused, and most businesses don’t even attempt to use this data to their advantage. Imagine if you could afford to keep all the data generated by your business and if you had a way to analyze that data. Hadoop will bring this power to an enterprise.

Other Hadoop-related projects at Apache include:

Ambari™: A web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters which includes support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig and Sqoop. Ambari also provides a dashboard for viewing cluster health such as heatmaps and ability to view MapReduce, Pig and Hive applications visually along with features to diagnose their performance characteristics in a user-friendly manner.
Avro™: A data serialization system.
Cassandra™: A scalable multi-master database with no single points of failure.
Chukwa™: A data collection system for managing large distributed systems.
HBase™: A scalable, distributed database that supports structured data storage for large tables.
Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying.
Mahout™: A Scalable machine learning and data mining library.
Pig™: A high-level data-flow language and execution framework for parallel computation.
ZooKeeper™: A high-performance coordination service for distributed applications.

Aptude Inc. has more information on their site about Hadoop & Big Data

Article Source: http://EzineArticles.com/7516188

Linuxaria

Introduction to Hadoop

Popular Posts:

Leave a Reply Cancel reply