Data Lakes: Hadoop – The makings of the Beast

Ganesh Moorthy
Director – Engineering, Tredence

1997 was the year of the consumable digital revolution – the year the cost of computation and storage dropped drastically, driving the shift from paper-based to digital storage. The very next year, the problem of Big Data emerged. As the digitization of documents far surpassed estimates, Hadoop became the step forward towards low-cost storage, and it slowly grew synonymous and interchangeable with the term Big Data. With the explosion of e-commerce, social chatter and connected things, data has since expanded into new realms. It’s not just about the volume anymore.

In part 1 of this blog, I set the premise that the market is already moving from PPTware to dashboards and robust machine learning platforms to make the most of the “new oil”.

Today, we are constantly inundated with terms like Data Lakes and Data Reservoirs. What do these really mean? Why should we care about these buzzwords? How do they improve our daily lives?

I have spoken with a number of people over the years and have come to realize that, for the most part, they are enamoured with the term without realizing the value or the complexity behind it. Even when they do, the variety of software components and the velocity with which they change are simply incomprehensible.

The big question here is: how do we quantify Big Data? One aspect to pivot on is that it is no longer the volume of data you collect that matters, but the insight you generate through analysis. Data used for purposes beyond its original intent can generate latent value. Making the most of this latent value requires practitioners to envision the 4V’s in tandem – Volume, Variety, Velocity, and Veracity.

Translating this into reality will require a system that is:

  • Low cost
  • Capable of handling the volume load
  • Not constrained by the variety (structured, unstructured or semi-structured formats)
  • Capable of handling the velocity (streaming) and
  • Endowed with tools to perform the required data discovery across both light and dark data (veracity)

Hadoop — now a household term — had its beginnings in web search. Rather than keeping it proprietary, the developers at Yahoo made a life-altering decision to release it as open source, drawing their inspiration from another open-source project called Nutch, which had a component with the same name.

Over the last decade, Hadoop – with the Apache Software Foundation as its surrogate mother and with active collaboration among thousands of open-source contributors – has evolved into the beast that it is.

Hadoop is endowed with the following components –

  • HDFS (Hadoop Distributed File System) — Provides centralized storage spread across a number of physical systems and ensures enough redundancy of data for high availability.

  • MapReduce — The distributed computing model applied to the stored data, built around Mappers and Reducers. Mappers work on the data, transforming it into key-value tuples, while Reducers take the tuples from different Mappers and combine them (see the sketch after this list).

  • YARN / Mesos — The resource managers that control the availability of hardware and software resources and handle scheduling and job management; YARN does this with two distinct components, the ResourceManager and the NodeManager.

  • Commons — The common set of libraries and utilities that support the other Hadoop components.
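
To make the Mapper/Reducer split concrete, below is a minimal word-count sketch in the Hadoop Streaming style. It is illustrative only – the script name and invocation are assumptions, and a real job would be packaged for the Hadoop Streaming jar (or written natively in Java against the MapReduce API).

```python
# wordcount.py -- a minimal, illustrative word-count in the Hadoop Streaming style.
# In a real deployment the map and reduce phases run as separate scripts submitted
# via the hadoop-streaming jar; here they share one file purely for readability.
import sys

def mapper(stream=sys.stdin):
    """Map phase: emit a <word, 1> tuple for every word in the input."""
    for line in stream:
        for word in line.strip().split():
            print(f"{word}\t1")

def reducer(stream=sys.stdin):
    """Reduce phase: Hadoop delivers tuples sorted by key, so equal words arrive together."""
    current_word, count = None, 0
    for line in stream:
        word, value = line.rstrip("\n").split("\t", 1)
        if word == current_word:
            count += int(value)
        else:
            if current_word is not None:
                print(f"{current_word}\t{count}")
            current_word, count = word, int(value)
    if current_word is not None:
        print(f"{current_word}\t{count}")

if __name__ == "__main__":
    # `python wordcount.py map` or `python wordcount.py reduce`
    mapper() if sys.argv[1] == "map" else reducer()
```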

While the above forms the foundation, what really drives data processing and analysis are frameworks such as Pig, Hive and Spark, along with other widely used utilities for cluster, metadata and security management. Now that you know what the beast is made of (at its core), we will cover the dressings in the next parts of this series. Au Revoir!

From the norm to unconventional analytics: Beyond owning, to seeking data

Shashank Dubey
Co-founder and Head of Analytics, Tredence

The scale of big data, the data deluge, the 4Vs of data, and all that’s in between… We’ve all heard so many adjectives attached to “data”. And the reports and literature on it have taken the vocabulary and interpretation of data to a whole new level. As a result, the marketplace is split into exaggerators, implementers, and disruptors. Which one are you?

Picture this! A telecom giant decides to invest in opening 200 physical stores in 2017. How do they go about solving this problem? How do they decide on the optimal locations? Which neighbourhoods will garner maximum footfall and conversion?

And then there is a leading CPG player trying to figure out where to deploy its ice cream trikes. Mind you, we are talking about impulse purchases of perishable goods. How do they decide how many trikes to deploy, and where? Which flavours will work best in each region?

In both examples, if the enterprises were to make decisions based only on the data already available to them (read: owned data), they would make the same mistakes day in and day out – using past data to make present decisions and future investments. The effect stares you in the face: your view of true market potential remains skewed, your understanding of customer sentiment is obsolete, and your ROI seldom goes beyond baseline estimates. That leaves you vulnerable to competition. Calculated risks become too calculated to change the game.

Disruption in current times requires enterprises to undergo a paradigm shift: from owning data to seeking it. This transition requires a conscious set-up:

Power of unconstrained thinking

As adults, we are usually too constrained by what we know. We have our jitters when it comes to stepping out of our comfort zones – preventing us from venturing into the wild. The real learning though – in life, analytics or any other field for that matter – happens in the wild. To capitalize on this avenue, individuals and enterprises need to cultivate an almost child-like, inhibition-free culture of ‘unconstrained thinking’.

Each time you are confronted with an unconventional business problem, pause and ask yourself: if I had unconstrained access to all the data in the world, how would my solution design change? What data (imagined or real) would I require to execute the new design?

Power of approximate reality

There is a lot we don’t know and will never know with 100% accuracy. However, this has never stopped the doers from disrupting the world. Unconstrained thinking needs to meet approximate reality to bear tangible outcomes.

The question to ask here is: what are the nearest available approximations of all the data streams I dreamt of in my unconstrained ideation?

You will be amazed at the outcome – for example, using Yelp to gauge the hyperlocal affluence of a catchment population (resident as well as moving), or estimating footfall in your competitors’ stores by analysing data captured from several thousand feet in the air.

This is the power of combining unconstrained thinking and approximate reality. The possibilities are limitless.

Filter to differentiate signal from noise – Data Triangulation

Remember, you are no longer only as smart as the data you own, but as smart as the data you earn and seek. At a time when data is abundant and streaming, the bigger decision to make while seeking data is identifying the “data of relevance”. The ability to filter signal from noise is critical here. In the absence of on-ground validation, triangulation is the way to go.

The Data ‘purists’ among us would debate this approach of triangulation. But welcome to the world of data you don’t own. Here, some conventions will need to be broken and mindsets need to be shifted. We at Tredence have found data triangulation to be one of the most reliable ways to validate the veracity of your unfamiliar and un-vouched data sources.
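
As a minimal sketch of what triangulation can look like in code – the proxy sources, figures and tolerance below are entirely hypothetical, not an account of Tredence’s actual method – one can estimate the same quantity from independent sources, take a consensus figure, and flag the sources that disagree with it:

```python
# Illustrative triangulation: cross-check one quantity (e.g. weekly store footfall)
# estimated from several independent proxy sources and flag the ones that disagree.
def triangulate(estimates, tolerance=0.25):
    """estimates: dict of source_name -> estimate of the same quantity."""
    values = sorted(estimates.values())
    consensus = values[len(values) // 2]          # median as the consensus figure
    outliers = {src: v for src, v in estimates.items()
                if abs(v - consensus) > tolerance * consensus}
    return consensus, outliers                    # many outliers => low-veracity sources

consensus, outliers = triangulate({
    "satellite_parking_counts": 1200,             # hypothetical proxy sources
    "yelp_checkin_index": 1050,
    "mobile_location_panel": 2400,
})
print(consensus, outliers)                        # 1200 {'mobile_location_panel': 2400}
```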

Ability to tame the wild data

Unfortunately, old wine in a new bottle will not taste too good. When you explore data in the wild – beyond the enterprise firewall – conventional wisdom and experience will not suffice. Your data science teams need to be endowed with unique capabilities and the technological know-how to harness data from unconventional sources. In the two examples mentioned above – the telecom giant and the CPG player – our data science team capitalized on freely available hyperlocal data residing in Google Maps, Yelp, and satellite imagery to conjure up a strong location-optimization solution.

Having worked with multiple clients across industries, we have come to realize the power of this approach – combining owned and sought data – with no compromise on data integrity, security, and governance. After all, game changers and disruptors are seldom followers; they pave their own path and choose to find the needle in the haystack as well!

Does your organization disrupt through the approach we just mentioned? Share your experience with us.

Making the Most of Change (Management)

Sulabh Dhall
Associate Director

“The illiterate of the 21st century will not be those who cannot read and write, but those who cannot learn, unlearn, and relearn.”

– Alvin Toffler

“Times have changed.” We’ve heard this statement ever so often. Generations have used it to exclaim “things are so complicated (or simple) these days,” or to express disdain – “oh, so they think they’re the cool generation.” Whichever way you exclaim it, change has truly been the “constant”.

This change is bolstered by a tech-enabled world where the speed at which machines learn is accelerating towards the speed of light.

Let me set this in context with an example from the book of Sales. Unlike in the past, today’s sales reps are not gauged by the amount of sweat trickling down their foreheads. While they continue to be evaluated on business development and lead conversions, it is no longer all manual and laborious. Technological advancements have made the process of identifying, prioritizing, scheduling, conversing and converting agile and real-time.

But just knowing about change, gathering data and appreciating technology will not suffice. The three need to be blended seamlessly to yield transformation. Applied to a deeper organizational context, “Change” needs to be interpreted – its pace needs to be matched or, even better, its effect needs to be contextualized for differentiation.

Change management in this sense is the systematization of the entire process, right from the acceptance of change to its adoption and to taking advantage of it to thrive in volatile times.

But what would it take for complex enterprises that swear by legacy systems to turbocharge into Change Management mode?

To answer this, I will humanize enterprise change management with the Prosci-developed ADKAR Model.

Awareness (getting into the race) – Where can I set up the next retail store, what is the optimal planogram, how do I determine the right marketing mix, what is my competition doing differently, how do I improve customer experience, how do I ensure sales force effectiveness – the questions are ample. By the time you realize this and start strategizing, a competitor has dislodged your market position and eaten a large portion of your pie. And while these business problems seem conventional, marketplace volatility suggests otherwise. Compound this with heavy dependence on dashboards, applications and the like for insights, and you have seen the side-effects – established enterprises biting the dust.

To survive, organizations need to be knowledgeable about the data that matters vis-à-vis the noise. They need to interpret the data deluge with relevance and context; after all, not all data is diamond.

Desire (creating a business case for adoption) – Desire is a basic human instinct. Our insatiable urge to want something more, something better, accentuates this instinct. When it comes to enterprises, this desire is no different: to stay ahead of the curve, to make more profit, to be leaders. But there is no lock-and-key fix to achieve this. Realizing corporate “desire” requires a cultural and mindset shift across the organization, top-down. And so one of the most opportune times could be when there is a change in leadership, followed by reorganization in the rungs below.

Gamification could be a great starting point to drive adoption in such cases. Allow scope for experimentation to creep in; invest consciously in simmer projects; give analysts a free hand to look for the missing piece of the puzzle outside their firewall; incentivize them accordingly. Challenge business leaders to up their appreciation of the insights generated, encourage them to get their hands dirty when it comes to knowing their data sources, ask the right questions and challenge the status quo – not just rely on familiarity and past experience.

Knowledge and Ability (from adoption to implementation) – In a business context, “desire” typically translates into business goals – revenue, process adoption, automation, expansion into newer markets, the launch of a new product or solution, and so on. Mere awareness of the changes taking place does not translate into achievement. Change needs to be studied, and change management needs to be initiated.

But how can you execute your day job and learn to change?

The trick here will be to make analytics seamless, almost second nature. Just as your bank alerts you to any suspicious transaction on your account, any deviation from the set course of business action should trigger an alert.
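
As a minimal illustration of this kind of deviation alerting – the metrics, figures and threshold below are hypothetical and not tied to any particular product – a simple rule might compare actuals against plan and flag any metric drifting beyond a set tolerance:

```python
# Illustrative deviation alerting: flag any business metric that drifts more than
# `threshold` (as a fraction of plan) away from its planned value.
def deviation_alerts(actuals, plan, threshold=0.10):
    alerts = []
    for metric, planned in plan.items():
        actual = actuals.get(metric, 0.0)
        drift = (actual - planned) / planned
        if abs(drift) > threshold:
            alerts.append(f"{metric}: {drift:+.0%} vs plan")
    return alerts

# Hypothetical weekly figures: revenue is 18% below plan, conversion is on track.
print(deviation_alerts({"weekly_revenue": 82_000, "lead_conversion": 0.031},
                       {"weekly_revenue": 100_000, "lead_conversion": 0.030}))
# ['weekly_revenue: -18% vs plan']
```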

Such technology-assisted decisions are the need of today and of the future. Tredence’s CHA solution is an example of a step in this direction: it is intuitive, convenient and evolving, mirroring aspects of Robotic Process Automation (RPA).

Reinforcement (stickiness will be key) – Your business problems are yours to know and yours to solve. As my colleague mentioned in his blog, a one-size-fits-all solution does not exist. Solving the business challenges of today requires going to their root cause, understanding the data sources available to you, and being knowledgeable about the other data combinations (within the firewall or beyond it) that matter. Match this stream of data with the relevant tools and techniques that can give you the “desired” results.

A point to keep in mind during this drill is to marry the old and the new. Replacing a legacy system with something totally new could leave a bad taste in your mouth – lower adoption and greater resistance. Embedded analytics will be key – analytics that lets you seamlessly time-travel between the past, present and future.

To conclude, whether it is about implementing change in time, improving customer service, reducing inefficiencies, or mitigating the negative effects of volatile markets, Change Management will be pivotal. It is a structured, ongoing process to ensure you are not merely surviving change, but thriving in it.

Key to bridging the analytics-software chasm: iterative approach + customized solutions, leading to self-service BI

Ganesh Moorthy
Director – Engineering, Tredence

The world of software development and IT services has operated through well-defined requirements, scope and outcomes. Twenty-five years of experience in software development have enabled IT services companies to learn a great deal and achieve a high level of maturity. There are enough patterns and standards one can leverage to avoid scope creep and make on-time delivery and quality a reality. This world has a fair order to it.

The Analytics world we operate in is quite the contrary. Analytics as an industry is itself a relatively new kid on the block. Analytical outcomes are usually insights generated from historical data – that is, descriptive and inquisitive analysis. With the advent of machine learning, the focus is gradually shifting towards predictive and prescriptive analysis. What takes months or weeks in software development often takes just days in the Analytics world. At best, this chaotic world calls for continuous experimentation.

The questions enterprises need to ask are: “How do we leverage the best of both worlds to achieve the desired outcomes?” and “How do we bridge this analytics-software chasm?”

The answers require a fundamental shift in perception and in the approach towards problem solving and solution building. The time to move from what is generally PPTware (in the world of analytics) to dashboards, and further to a robust machine learning platform for predictive and prescriptive analyses, needs to be as short as possible. The market is already moving towards this goal in the following ways:

  1. Data Lakes – On-premise platforms built mostly from an amalgamation of open-source technologies and existing COTS software – a homegrown approach that provides a single unified platform for rapid experimentation on data, along with the capability to move quickly towards scaled solutions
  2. Data Cafes / Hubs – A cloud-based, SaaS approach that covers everything from data consolidation and analysis to visualization
  3. Custom niche solutions that serve a specific purpose

Over a series of blogs, we will explore the above approaches in detail. These blogs will give you an understanding of how integrated and interoperable systems allow you to take your experiments to scaled solutions rapidly, in a matter of days and in a collaborative manner.

The beauty and the beast are finally coming together!

SOLUTIONS, WHAT’S NEW?

Sagar Balan
Associate Director – Analytics, Tredence

Dell, HP and IBM have all tried to transform themselves from box sellers into solution providers. In the world of Uber, many traditional products are fast mutating into services. At Walmart, it is no longer just about grocery shopping: their pick and go service tries to understand more about your journey as a customer, and grocery shopping is just one piece of the puzzle.

There’s a common thread that runs across all three examples, and it’s about how to break through the complexity of your end customer’s life. Statistics, machine learning and artificial intelligence cannot, by themselves, make the lives of store managers at over 2,000 Kroger stores across the country any simpler. It all sounds way too complex.

Before I get to the main point, let me belabor a bit and humor you on other paradigms floating around. Meta software, Software as a Service, cloud computing, Service as a Software… Err! Did I just go to randomgenerator dot com and get those names out? I swear I did not.

The cliché in the recent past has been about how industries are racing to unlock the value of big data and create big insights. And with this herd mentality comes all the jargons in an effort to differentiate. Ultimately, it is about solving problems.

In the marketplace abstraction of problem solving, there’s a supply side and a demand side.

The demand side is an overflowing pot of problems. Driven by accelerating change, problems evolve really fast and newer ones keep popping up. Across Fortune 500 firms, there are very busy individuals and teams running businesses the world over, grappling with these problems: store managers in retail stores, trade promotion managers and decision engineers in CPG firms, district sales managers in pharma firms, and so on. For these individuals, time is a very precious commodity. Analytics is valuable to them only when it is actionable.

On the supply side, there is complex math (read: algorithms), advanced technology, and smart people to interpret the complexities. For the geek in you, this is a candy-store situation. But how do we make this complex math – machine learning, AI and everything else – actionable?

To help teams and individuals embrace the complexity and thrive in it, nature has evolved the concept of solutions. Solutions aim to translate supply-side intelligence into simple visual concepts. This approach takes intelligence to the edge, thereby scaling decision making.

So how do solutions differ from products, meta-software, service-as-a-software and the rest of the gibberish?

Fundamentally, a solution is meant to exist as a standalone, atomic unit with a singular purpose: making the lives of decision makers easy and simple. It is not created to scale the creation of analytics. For example, a solution created to detect anomalies in pharmacy billing will be designed to do just that; its design will not be bent by the efficiency-driven urge to apply it to a fraud-detection problem as well. Because the design of a solution is driven by the needs of the individual dealing with the problem, it should be driven not by the motivation to scale the creation of analytics, but by the motivation to scale the consumption of analytics – to push all the power of machine learning and AI to the edge.

In Tredence you have a partner who can execute the entire analytical value chain and deliver a solution at the end – no more running to the IT department with a deck or SAS/R/Python code, asking them to create a technology solution. Read more about our offerings here.

This blog is the first of a two-part series. The second part will be about spelling out the S.O.L.U.T.I.O.N.