004 The History of Hadoop – For the Love of Data

Let me set the stage for you…

It’s 2003: Chicago just won the Oscar for Best Picture and Grand Theft Auto: Vice City is the top-selling video game. Apple iPods still have scroll wheels and iTunes just started selling music for the first time. From a tech standpoint, Windows XP is all the rage as the latest Windows OS, and folks with a lot of money to spend are buying PCs with a Pentium 4 3.0 GHz processor, 512MB of RAM (or maybe up to 2GB max), and an 80GB hard drive. Oracle just released version 10g and Microsoft proponents are still using SQL Server 2000. Internet Explorer 6 dominates the browser wars with about 85% market share, and two-thirds of the US still connects to the internet with a modem.

(Stats from various Google searches, CNET desktop reviews, and http://www.internetworldstats.com/articles/art030.htm)

In the years leading up to Hadoop’s inception, Doug Cutting, the first node in the Hadoop cluster, had been working on Lucene, a full-text search library, and then began indexing web pages with University of Washington graduate student Mike Cafarella. The project was called Apache Nutch, and it was a sub-project of Lucene. They made good progress getting Nutch to work on a single machine, but they reached the processing limits of that one machine and began manually clustering four machines together. The duo started to spend the majority of their time figuring out how to scale the infrastructure layer for better indexing. In October 2003, Google released their Google File System paper. The paper did not describe exactly how Google implemented their solution, but it was an excellent blueprint for what Cutting and Cafarella wanted to build. They spent most of the next year (2004) working on their own implementation and labeled it the Nutch Distributed File System (NDFS). In this implementation, they made a key decision: replicate each chunk of data on multiple nodes, typically three, for redundancy.
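That replication idea is easy to picture. Below is a toy Python sketch of the concept only; it is not NDFS’s (or HDFS’s) actual placement logic, and the function name and three-replica default are just illustrative:

    import random

    def place_replicas(block_id, nodes, replication=3):
        """Pick `replication` distinct nodes to hold a copy of this block.

        A toy illustration of the redundancy decision: losing one or even
        two of the chosen machines still leaves a readable copy.
        """
        if len(nodes) < replication:
            raise ValueError("not enough nodes to meet the replication factor")
        return random.sample(nodes, replication)

    # Example: four manually clustered machines, like Cutting and Cafarella's setup
    cluster = ["node1", "node2", "node3", "node4"]
    print(place_replicas("block-0001", cluster))  # e.g. ['node3', 'node1', 'node4']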

After solving for infrastructure redundancy, the team set their sights on the computational side, taking advantage of the now-stable fabric of nodes. Google again provided a spark of inspiration with their MapReduce research paper. The approach provided parallelization, distribution, and fault tolerance; together these let a job churn through its tasks quickly, regardless of hardware failures that might occur along the way.
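To make the programming model concrete, here is a toy, single-machine Python sketch of the map and reduce phases using the canonical word-count example. It only illustrates the model, not Hadoop’s actual API, and it leaves out the distribution and fault-tolerance machinery the real framework supplies:

    from collections import defaultdict

    def map_phase(document):
        """Map: emit a (word, 1) pair for every word in the input."""
        for word in document.split():
            yield (word.lower(), 1)

    def reduce_phase(pairs):
        """Reduce: sum the counts for each word (the shuffle/group step
        is folded in here for simplicity)."""
        counts = defaultdict(int)
        for word, count in pairs:
            counts[word] += count
        return dict(counts)

    documents = ["the quick brown fox", "the lazy dog", "the fox"]
    all_pairs = [pair for doc in documents for pair in map_phase(doc)]
    print(reduce_phase(all_pairs))
    # {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}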

In 2006, Cutting went to work for Yahoo, and the storage and compute components of Nutch were split out into a Lucene sub-project called Hadoop. The name originated from a toy yellow elephant that belonged to Cutting’s son. In April of that year Hadoop 0.1.0 was released, and it sorted almost 2TB of data in 48 hours. By April of 2007 Yahoo was running two Hadoop clusters of 1,000 machines, and other companies like Facebook and LinkedIn started to use the tool.

By 2008, Hadoop hit critical mass along several fronts. Yahoo transitioned the search index that drove their website over to Hadoop and contributed Pig to the Apache Software Foundation. Facebook contributed Hive, putting a SQL-like interface atop Hadoop. The project also sprouted commercial legs when Cloudera was founded; Cutting joined its ranks the following year.

In 2011 Hortonworks spun off from Yahoo, and Apache Hadoop 1.0 became generally available at the very end of that year. The following year Yahoo’s Hadoop cluster reached 42,000 nodes, and Hadoop contributors began to replace classic MapReduce with YARN, which split resource management and job scheduling out of MapReduce into its own layer. In 2013, Yahoo put YARN into production and Hadoop 2.2 debuted.

Fast forward to today and a vast ecosystem exists around Hadoop, packaged up in several prepackaged distributions. The most popular of these are Cloudera, Hortonworks, and MapR. Below is a snapshot of Hortonworks’ and Cloudera’s packaged components:

Hortonworks:
[Image: Hortonworks packaged components]

Cloudera:
[Image: Cloudera packaged components]
Sources: