Data Lakes: Understanding Hadoop

Data Analytics

Date : 04/21/2022

Hadoop, which began as infrastructure for web search, became a key big data tool after it was released as open source with Yahoo's backing.

Data Lakes: Hadoop

1997 was the year of the consumable digital revolution: the cost of computation and storage dropped drastically, which triggered a shift from paper-based to digital storage. The very next year, the problem of Big Data emerged. As the digitization of documents far surpassed the estimates, Hadoop later became the step forward towards low-cost storage, and over time its name became almost interchangeable with the term big data. With the explosion of ecommerce, social chatter and connected things, data has spilled into new realms. It’s not just about the volume anymore.

In part 1 of this blog, I set out the premise that the market is already moving from PPTware to dashboards and robust machine learning platforms to make the most of the “new oil”.

Today, we are constantly inundated with terms like Data Lakes and Data Reservoirs. What do these really mean? Why should we care about these buzzwords? How do they improve our daily lives?

Over the years, I have spoken with a number of people and have come to realize that, for the most part, they are enamoured with the term without realizing the value or the complexity behind it. Even when they do, the variety of software components and the velocity with which they change can be simply incomprehensible.

The big question here is: how do we quantify Big Data? The key shift is that it is no longer about the volume of data you collect, but about the insight you gain through analysis. Data used for a purpose beyond its original intent can generate latent value. Making the most of this latent value requires practitioners to consider the 4 V’s in tandem: Volume, Variety, Velocity, and Veracity.

Translating this into reality will require a system that is:

Low cost
Capable of handling the volume load
Not constrained by the variety (structured, unstructured or semi-structured formats)
Capable of handling the velocity (streaming) and
Endowed with tools to perform the required data discovery, through light or dark data (veracity)

Hadoop, now a household term, had its beginnings in web search. It grew out of Nutch, an open-source web crawler project whose distributed storage and processing code was split out into a project of its own. Rather than making it proprietary, the developers at Yahoo made a life-altering decision to build it out as open source.

Over the last decade, with the Apache Software Foundation as its surrogate mother and active collaboration among thousands of open-source contributors, Hadoop has evolved into the beast that it is today.

Hadoop is endowed with the following components:

HDFS (Hadoop Distributed File System) – Provides a single logical file system whose storage is spread over a number of different physical machines, and replicates data blocks across them to ensure enough redundancy for high availability.
MapReduce – The model for distributed computing on the stored data using Mappers and Reducers. Mappers work on slices of the input, can apply transformations, and emit key-value tuples; Reducers take the tuples produced by different Mappers and combine them per key.
YARN – The resource manager that controls the availability of cluster resources and handles scheduling and job management through two distinct components, namely the ResourceManager and the NodeManager. (Apache Mesos is a separate, general-purpose cluster manager that is sometimes used in a similar role, but it is not part of Hadoop.)
Hadoop Common – The common set of libraries and utilities that support the other Hadoop components.
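
To make the Mapper and Reducer roles a little more concrete, here is a minimal sketch of the classic word count in plain Python. It runs in a single process and does not use Hadoop's actual API; the function names (mapper, shuffle, reducer, word_count) and the sample input are my own illustrative assumptions. On a real cluster, the framework would run many mappers and reducers in parallel and perform the grouping step between them.

```python
from collections import defaultdict
from typing import Iterable, Iterator


def mapper(line: str) -> Iterator[tuple[str, int]]:
    """Map step: emit a (word, 1) tuple for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[tuple[str, int]]) -> dict[str, list[int]]:
    """Shuffle step: group all emitted values by their key."""
    grouped: dict[str, list[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reducer(key: str, values: list[int]) -> tuple[str, int]:
    """Reduce step: combine all values for one key into a single total."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> dict[str, int]:
    """Chain map, shuffle and reduce over an in-memory list of lines."""
    mapped = (pair for line in lines for pair in mapper(line))
    grouped = shuffle(mapped)
    return dict(reducer(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical sample input, standing in for files stored on HDFS.
    sample = ["big data is big", "data lakes hold data"]
    print(word_count(sample))
```

The point of the split is that each mapper only needs to see its own slice of the input, and each reducer only needs the values for its own keys, which is what lets the work be spread across many machines.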

While the above form the foundation, what really drives data processing and analysis are frameworks such as Pig, Hive and Spark, along with other widely used utilities for cluster, metadata and security management. Now that you know what the beast is made of (at its core), we will cover the dressings in the next parts of this series. Au revoir!

Topic Tags

Data Lakes

Hadoop

Big Data Analytics

Cloud

Data Engineering
