Back in December 2011, data-intensive Linux users rejoiced as Apache Hadoop reached its 1.0.0 milestone. Setting a benchmark for distributed computing software, this wonderful little program is now into release 1.0.3 but what is Hadoop and how can you benefit from using it?
Designed with ‘web-scale’ operations in mind, Hadoop can handle massive amounts of information, allowing you to quickly and efficiently process volumes of data that other systems simply cannot handle. But that’s just the beginning. Hadoop also allows you to network this process – it can distribute large amounts of work across a cluster of machines, allowing you to handle workloads that a single processor simply cannot manage.
In fact, Hadoop is used by a number of major players in the digital industry – most notably, Facebook, Yahoo! and even Twitter (using it to store and process tweets). It seems that Hadoop is making a name for itself amongst large-scale web services, making it the perfect framework for data-heavy systems such as social networks, search engines and even home insurance databases. But what is distributed computing and why should you be taking advantage of all it has to offer?
Distributed Computing – The Way Forward?
Moore’s law states that the number of transistors that can be used in a single processor will double every two years. With developments in the computer hardware industry simply refusing to slow down, it seems inevitable that single processors will eventually be able to handle much larger volumes of data. The invention of multi-threading and multi-core CPUs has been a significant leap towards this goal, but the problem is that other hardware areas just aren’t catching up.
Take the humble hard disk drive, for instance. If you assume a consistent upper speed of 100MB/sec using four independent I/O channels, 4TB of data would take around 10,000 seconds to read – that’s a loading time of almost 3 hours. Even using solid-state drives (running at speeds of 500MB/sec) you would still expect a loading time of over half an hour. Using distributed computing and a cluster of 100 machines, the loading time plummets to around 3 minutes. Obviously 4TB is a small sample when you think of the sheer volume of data handled on a daily basis by companies such as Facebook and Twitter – in these cases, distributed computing is simply the only way to handle such tasks.
Why Use Hadoop?
Hadoop is a large-scale distributed computing infrastructure and while it’s a useful tool in processing data on a single machine, its power lies in the ability to scale up, using hundreds or even thousands of computers to perform a single task. What’s more, Hadoop is and always has been a free, open-source project. This means that, not only do you get the power of distributed computing without software costs, but it’s also constantly being tested, updated and changed to meet the demands of its users. Since its creation, Hadoop has been designed with massive amounts of data in mind. To put it simply, hundreds of gigabytes of data would be a light load where Hadoop is concerned – depending on the hardware infrastructure, it is more than capable of handling terabytes and even petabytes (1024 terabytes) of data. So if you’re finding yourself in the position where you have just too much data to handle, Hadoop could be the perfect solution.
Hadoop Distributed File System (HDFS)
Naturally, to handle such large volumes of data, Hadoop cannot rely on existing methods of file storage. Instead, it uses its own distributed file system which has been specifically designed to provide high-throughput to access the information. All files in a Hadoop system (when using it as a distributed computing framework) are stored in a redundant fashion across several different machines. While this may seem odd, it’s one of the most important features of Hadoop. When using multiple machines, there’s a much higher risk of something going wrong – most commonly, data corruption. By storing a file multiple times, the HDFS ensures the integrity of the data, allowing it to be sourced from different machines should a single copy become unavailable.
With the sheer flexibility and open-source nature of Hadoop, it’s the ideal framework to build advanced web-databases as well as processes to handle such data. Then there’s the myriad of other Apache projects which are designed to work hand-in-hand with Hadoop – from Avro, a data serialization system, to Zookeeper, a co-ordination service for distributed applications. Apache Hadoop makes all the power of distributed computing (in every configuration you can imagine) available to the masses.