Why Hadoop Plateaued -- And What’s in its Place

Scuba Insights

Facebook was an early adopter of Apache Hadoop, and remnants of Hadoop are still evident in the company’s architecture.

Early product developers at Facebook were challenged to scale the infrastructure not only to store all the data the company needed, but also to make that data quickly accessible so that important product and business decisions could be based on it.

Facebook first deployed Hadoop clusters in 2007, right as it was beginning to hit tens of millions of users and converging on 1 PB of data. 

Hadoop was an attractive tool at this time for two main reasons:

  • Cost-effective scalability. Hadoop was an open-source framework that ran on commodity hardware, so it could scale with Facebook’s growth -- it was already being used at the petabyte scale.
  • No schema on write. Developers could store everything without having to decide beforehand what to do with it. All the data was saved in its original form and nothing was thrown away.

It wouldn’t be an exaggeration to say that Hadoop initially changed the lives of product developers at Facebook. 

The team at Facebook spent years improving Hadoop processes and building on top of it so that it could serve as a source of analytics.

The problem with Hadoop

But, unfortunately, Hadoop never quite lived up to its promise for analytic workloads. It worked well for storage and batch processing but fell short for analytics in two key areas.

Poor performance

The promise of big data analytics isn’t that you can save all the data that you want. It’s that you can save all the data so that every person has the power to answer every kind of question, quickly. And since it’s impossible to ask perfect questions of your data on the first go, the ideal analytics stack should be fast enough to allow for continuous iteration and refinement of your questions.

Hadoop’s major strengths are storing massive amounts of data and processing massive extract, transform, load (ETL) jobs, but its processing layers suffer in both efficiency and end-user latency. This is because MapReduce, the programming paradigm at the heart of Hadoop, is a batch processing system. While batch processing is great for some things, like machine learning models or very large, complex jobs, it’s unsuitable for analytics questions that are more ad hoc in nature.
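To make the contrast concrete, here is a minimal sketch of the map-shuffle-reduce pattern that MapReduce implements -- plain Python with a toy event log, not actual Hadoop APIs. Notice that even a trivial count requires a full pass over every input record; there is no index and no shortcut, which is why iterating on ad hoc questions is so slow on a batch system.

```python
from collections import defaultdict

# Toy event log; in Hadoop this would be terabytes spread across HDFS.
events = [
    {"user": "alice", "action": "click"},
    {"user": "bob", "action": "view"},
    {"user": "alice", "action": "view"},
    {"user": "bob", "action": "click"},
    {"user": "alice", "action": "click"},
]

def map_phase(record):
    # Map: emit (key, 1) for the dimension we want to count.
    yield (record["action"], 1)

def shuffle(pairs):
    # Shuffle: group intermediate pairs by key
    # (Hadoop does this across the whole cluster).
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: aggregate the grouped values for one key.
    return (key, sum(values))

# Every question -- even "how many clicks?" -- is a full batch pass over all input.
pairs = (pair for record in events for pair in map_phase(record))
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)  # {'click': 3, 'view': 2}
```

Changing the question (say, counting by user instead of by action) means rewriting the map function and rerunning the entire job over all the data, which is exactly the iteration loop that batch systems make painful.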

A variety of tools such as Spark address this problem by using memory more aggressively and designing for lower latency, but they have been only partially successful.

The Hadoop stack has many layers with deep architectural assumptions about batch versus ad hoc workloads; you really have to unwind all of them to achieve performance that matches a system built ground-up for low latency.

Plus, the whole reason to use Hadoop is scale -- but once you force things back into memory, you lose that. And if your data fits in memory, you might as well use a traditional data warehouse, which is more mature and full-featured.

In fact, the dirty secret of Hadoop is that for the vast majority of analytics workloads, it still performs much worse than the legacy tools it claims to displace.

Stuck in the developer world

For the data revolution to be an actual revolution, the power to ask questions of data is one every person should have -- not just computer scientists.

Though the product developers at Facebook are clearly analytically minded, it was impossible for those who did not know how to write MapReduce jobs to ask even the simplest questions. And those who did spent hours to days writing programs instead of exploring their data. (At Google, no one has used MapReduce for analytics in a very long time; it has been replaced by multiple generations of tools built specifically for analytics.)

Despite Hadoop’s flexibility, non-programmers have even less access to data stored in a data lake like Hadoop than they would with “old school” business intelligence tools.

This is the fundamental reason Hadoop plateaued as the “next big thing” in analytics without ever becoming “the” big thing: it never grew beyond the developer world.

In the case of Facebook, the developers actually built a suite of custom tools to make the data in Hadoop useful for non-programmers. For example, Hive sat on top of Hadoop and let people query the data with a SQL-like language instead of writing Java. A number of other internal tools were intended to help with adoption, but ultimately, Hadoop never really broke out of the developer community.
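The gap Hive closed can be sketched as follows (hypothetical table and column names; the HiveQL is shown as a string for comparison): a one-line declarative query replaces a hand-written Java job, because Hive compiles the query down to roughly the map-and-reduce steps shown below it.

```python
from collections import Counter

# Hypothetical HiveQL: one declarative line instead of a custom Java program.
HIVEQL = "SELECT action, COUNT(*) FROM events GROUP BY action;"

# Roughly what Hive compiles that query into: a map step that emits
# the GROUP BY key, and a reduce step that counts records per key.
events = [("alice", "click"), ("bob", "view"), ("alice", "click")]
mapped = (action for _user, action in events)  # map: emit the grouping key
counts = Counter(mapped)                       # reduce: count per key
print(dict(counts))  # {'click': 2, 'view': 1}
```

The declarative form lowered the bar considerably, but the query still runs as a batch MapReduce job underneath, so Hive inherited Hadoop’s latency problem rather than solving it.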

Data insights for the masses

There are millions of people who are capable of analyzing data and who strive to be analytical in their workplace. 

But the fact that many of our existing tools require data analysts and data scientists to act as computer scientists is a failure on the part of software engineers.

For data to be useful for the masses, scalability isn’t the only challenge that needs to be addressed. Making sure an analytics stack is both powerful and flexible -- and also accessible to anyone who needs the data -- is an important challenge that I and many others are still tackling.

That’s why the product developers from Facebook wrote Scuba, the technology we now provide to organizations across the world to make data more accessible. Scuba is purpose-built for ad hoc analysis by non-developers and developers alike. It provides a web user interface and is tuned to return results at interactive speed.

Want to learn more? Learn how Scuba can help your organization use interactive analytics to grow your business strategically.