Data Lakes: Hadoop – The makings of the Beast
1997 was the year of the consumable digital revolution: the year when the cost of computation and storage dropped drastically, driving a shift from paper-based to digital storage. The very next year, the problem of Big Data emerged. As the digitization of documents far surpassed estimates, Hadoop was the step forward towards low-cost storage. It slowly became synonymous and interchangeable with the term big data. With the explosion of ecommerce, social chatter and connected things, data has exploded into new realms. It’s not just about volume anymore.
In part 1 of this blog, I set the premise that the market is already moving from PPTware to dashboards and robust machine learning platforms to make the most of the “new oil”.
Today, we are constantly inundated with terms like Data Lakes and Data Reservoirs. What do these really mean? Why should we care about these buzzwords? How do they improve our daily lives?
I have spoken with a number of people over the years and have come to realize that, for the most part, they are enamoured with the term without realizing the value or the complexity behind it. Even when they do realize it, the variety of software components and the velocity with which they change can seem simply incomprehensible.
The big question here is: how do we quantify Big Data? The key shift is that it is no longer about the volume of data you collect, but about the insight you gain through analysis. Data, when used for a purpose beyond its original intent, can generate latent value. Making the most of this latent value will require practitioners to consider the 4 V’s in tandem: Volume, Variety, Velocity, and Veracity.
Translating this into reality will require a system that is:
- Low cost
- Capable of handling the volume load
- Not constrained by the variety (structured, unstructured or semi-structured formats)
- Capable of handling the velocity (streaming) and
- Endowed with tools to perform the required data discovery, through light or dark data (veracity)
Hadoop, now a household term, had its beginnings in web search. Rather than keeping it proprietary, its developers, backed by Yahoo, made the life-altering decision to build it as open source. It grew out of another open-source project called Nutch, a web search engine whose distributed storage and processing code was split off to become Hadoop.
Over the last decade, Hadoop, with the Apache Software Foundation as its surrogate mother and with active collaboration among thousands of open-source contributors, has evolved into the beast that it is.
Hadoop is made up of the following core components:
HDFS (Hadoop Distributed File System) – Provides a single logical file system whose storage is spread over a number of different physical machines, and replicates data blocks across them to ensure enough redundancy for high availability.
MapReduce – The model for distributed computing on the stored data using Mappers and Reducers. Mappers work on chunks of the input, optionally transforming it, and emit key-value tuples. Reducers take the tuples produced by different Mappers, grouped by key, and combine them into the final result.
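YARN – The resource manager that controls the availability of hardware resources and handles scheduling and job management through two distinct components, namely the ResourceManager and the NodeManager. Apache Mesos is a separate, general-purpose cluster manager that can play a similar role for some frameworks, but it is not a component of Hadoop.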
Common – The common set of libraries and utilities that support the other Hadoop components.
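To make the Mapper and Reducer roles described above a little more concrete, here is a minimal, single-process sketch of the classic word count in plain Python. It is only an illustration of the idea, not Hadoop's actual API: the function names (mapper, shuffle, reducer, word_count) are hypothetical, and the grouping step that a real cluster performs across machines between the two phases is simulated here with an in-memory dictionary.

```python
from collections import defaultdict


def mapper(line):
    """Map step: emit a (word, 1) tuple for every word in one line of input."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group emitted values by key (done by the framework on a real cluster)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reducer(key, values):
    """Reduce step: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the map, shuffle and reduce steps in sequence on a list of lines."""
    pairs = (pair for line in lines for pair in mapper(line))
    return dict(reducer(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    # Each string stands in for a chunk of input handled by a different mapper.
    chunks = ["big data is big", "data lakes hold data"]
    print(word_count(chunks))
```

On a real cluster, each chunk would live on a different machine and many mappers and reducers would run in parallel. The sketch only shows how the work is divided between the two roles.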
While the above form the foundation, what really drives data processing and analysis are frameworks such as Pig, Hive and Spark, along with other widely used utilities for cluster, metadata and security management. Now that you know what the beast is made of (at its core), we will cover the dressings in the next parts of this series. Au Revoir!