Common Hadoop Pitfalls

Hadoop is a great data platform, as long as you understand what it is not. There is hype exaggerating Hadoop’s capabilities, leading to Hadoop being used in places where it does not excel. In this post we will look at:

  • What Hadoop is
  • What Hadoop is not
  • The common pitfalls and misconceptions around it

What is Hadoop?
Hadoop is a Data Operating System. With Hadoop 2.0, Hadoop graduated from a MapReduce/HDFS Batch Processing platform to a more complete Data Operating System.
In Hadoop 2.0, MapReduce, Pig and Hive are just a few of many Data Applications. This is due to the advent of the YARN layer in Hadoop 2.0.
YARN transformed the earlier MapReduce-based Batch Processing Engine into a Distributed Data Processing Platform. Newer applications like HBase, Storm, Spark, Solr, etc. now run on top of YARN. Hadoop 2.0 now supports the following:

  • Batch Processing
  • Interactive Queries
  • Stream Processing
  • Machine Learning
  • Graph Processing

What Hadoop is NOT
Much confusion around Hadoop arises due to the fact that Hadoop is increasingly used to replace/augment previous Data Warehouse Systems, Data Integration Tools or ETL Systems.

Pitfalls of Hadoop

Using Hadoop for Data Integration/ETL
Traditional Data Warehouse Systems, Data Integration or ETL Systems were more commercial in nature and were meant to solve the problems of:

  • Moving large data from various sources to a central place
  • Processing large data by streaming it through a processing unit

Hadoop, on the other hand, is good at the following:

  • Storing Big Data in a cost effective manner
  • Processing Big Data already stored in Hadoop

This is depicted below:

[Figure: Hadoop's strengths]

While Hadoop has tools for Data Integration, it is not a Data Integration tool. Hadoop has limited Data Integration capabilities through tools like Sqoop and Flume. Sqoop is the tool to extract data from databases and move it to HDFS. Flume is the tool to collect data, such as log files, and move it to HDFS. Sqoop is limited in its capability to extract data from databases, especially for a scenario like Change Data Capture (incrementally detecting and loading changes from a database to HDFS).
Once the data is stored, it is transformed for various reasons. One reason is Change Data Capture. In this case, data that changed in the database within the last 24 hours is extracted, and the data already stored in Hadoop is patched to sync it with the data in the source. This is called Change Data Capture at Source.
Another scenario is that the complete data is loaded on a daily basis and a diff is done to determine the changes. This is called Change Data Capture at Hadoop.
Traditional Data Warehouse/ETL tools provided out-of-the-box transformation steps, which made it very easy to transform data. Hadoop, on the other hand, provides options like MapReduce, Hive, Pig, etc. to transform the data, but these transformations have to be hand coded (see the sketch below).
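To give a feel for what such hand coding looks like, here is a minimal sketch of the Change Data Capture at Hadoop approach using Spark's Python API. The table name and snapshot paths are hypothetical; the point is simply that the diff logic is yours to write and maintain:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cdc-at-hadoop").getOrCreate()

# Two full snapshots of the same source table, loaded on consecutive days
# (hypothetical paths)
yesterday = spark.read.parquet("/data/customers/2014-10-27")
today = spark.read.parquet("/data/customers/2014-10-28")

# Rows that are new or modified since yesterday's load
inserts_and_updates = today.subtract(yesterday)

# Rows that disappeared from the source since yesterday's load
deletes = yesterday.subtract(today)

# Persist the computed delta for downstream consumers
inserts_and_updates.write.mode("overwrite").parquet("/data/customers_delta/2014-10-28")
```

A commercial ETL tool would typically offer this kind of delta detection as a built-in step; here, every edge case (keys, schema changes, late-arriving data) is yours to handle.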
In the world of Change Data Capture and Data Integration, Hadoop's tools fall short. Many of the traditional data integration products, such as Informatica PowerCenter, IBM Data Sphere, etc., excel at addressing this problem, and many of them have now started supporting Hadoop too. There are also newer solutions, such as LinkedIn's Databus, which handle this problem better than Hadoop's own tools. These tools provide Change Data Capture at Source, while Syncsort provides Change Data Capture at Hadoop.
Hadoop is Cheap (Inexpensive)
The statement “Hadoop is Cheap” (meaning inexpensive) applies when comparing the cost of infrastructure required to set up a Hadoop cluster vs. traditional systems. While this is true, note that Hadoop is based on Open Source technology.
While there are commercial distributions (Cloudera, Hortonworks, MapR, Pivotal, EMR, etc.), these are wrappers over Open Source Hadoop that add the following:

  • Bug fixes on top of Open Source
  • Visual tools to manage Hadoop
  • Tools around security, governance, provisioning and monitoring

These commercial distributions come with the additional cost of licenses or support.

In addition, even with the above options, you have to do quite a lot of custom code development to extract real value out of Hadoop. Hadoop is a complicated ecosystem, and it is difficult to find seasoned experts in MapReduce, Spark, etc.
Add to that the DevOps role(s) needed to manage, deploy and monitor the Hadoop cluster. You need either an in-house team or a Software-as-a-Service provider offering you Hadoop.
Add up all these costs and you now realize that Hadoop is not “cheap” (inexpensive).
Using Hadoop in Silos
Consider that you don't use a school bus to drive to work; nor should you use Hadoop to do small things. In reality nothing stops you from doing either, but both are quite wasteful.
Think of a car vs. a bus: a car serves your individual needs, while a bus serves a greater purpose, such as carrying more people to a common destination. In this comparison, Hadoop is a “bus” and you need to treat it like one. With Hadoop 2.0, it is really a Data Operating System, so you need to think in terms of a platform.
Each product silo is focused on its own needs; hence, the organization ends up with products using different distributions of Hadoop, some on premise and some in the cloud. The reasoning is that “things will evolve and they will all ride a single platform in the future”.
This is where I caution people to understand the word “platform” better. A platform is shared across products; a platform is available for everyone to ride on.
Avoid going with silos. Evaluate the need for defining your platform with the future in mind and develop it iteratively.
Remember, Hadoop is a bus, and it works best when it is connected to various data sources and targets. The effort in doing so is large, but if you decide on a single platform, the effort is incurred only once. Once the connections are in place, Hadoop will have all the data required for your various applications. Then it is just a matter of running different distributed applications on the platform.
Also remember, different applications have different needs: some will be batch, some streaming, some interactive, some in the form of a database. The positive thing is that Hadoop 2.0 can support them all.
My conclusion: don't start on Hadoop in a silo. Evaluate together, plan together, ride together. It's a bus taking multiple people to their destinations.
Limiting yourself to Java
Product development companies often adhere to a single technology stack. There are many reasons to do so, mainly because it keeps the focus on the business.
If you are a Java shop, the good news is that the Hadoop stack is a Java stack. This makes it easier to get started. But if you look closely at the nature of emerging technologies like Hadoop and others, they are really polyglot in nature.
New platforms support different languages for a single reason: certain languages are better at a particular problem than others. Here are a few examples:
  • Scala is better for functional needs like Map and Reduce; a MapReduce program can be written in Scala in just a few lines.
  • R is better at analytics, and so is Python.
  • Apache Spark is an emerging technology whose strength lies in transforming RDDs using functional paradigms. Apache Spark uses Scala as its primary language. It also supports Java, but if you take the Java route the readability is largely lost (see the sketch after this list).
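To illustrate that functional style, here is a word count expressed as a chain of RDD transformations using Spark's Python API (the Scala version is comparably compact); the input and output paths are hypothetical:

```python
from pyspark import SparkContext

sc = SparkContext(appName="word-count")

# Classic word count expressed as a chain of functional transformations
counts = (sc.textFile("hdfs:///data/input.txt")    # hypothetical input path
            .flatMap(lambda line: line.split())    # split lines into words
            .map(lambda word: (word, 1))           # emit (word, 1) pairs
            .reduceByKey(lambda a, b: a + b))      # sum counts per word

counts.saveAsTextFile("hdfs:///data/word_counts")
```

The equivalent program written against the raw Java MapReduce API runs to dozens of lines of boilerplate.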
Also remember that, since Hadoop 1.0, MapReduce has supported other languages. This is called Streaming in MapReduce: traditional pipes are used to pass data into and out of any program, and that program is used in the map and reduce phases. This is how MapReduce jobs can be written in Python and not only Java.
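As a rough sketch of how Streaming works, a mapper can be any executable that reads records from standard input and writes key/value pairs to standard output. For example, a word-count mapper in Python might look like this (the file name is just illustrative):

```python
#!/usr/bin/env python
# mapper.py -- a word-count mapper for Hadoop Streaming.
# Hadoop pipes each input line to stdin; we emit tab-separated "word<TAB>1"
# pairs on stdout, which the framework sorts and feeds to the reducer.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t1" % word)
```

A matching reducer simply sums the counts per word; both scripts are wired into a job through the hadoop-streaming jar's -mapper and -reducer options.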
When you are using Hadoop you are solving different problems than in the traditional web application or desktop application world. Once you have opened up to Hadoop, be open to newer languages that fit these problem domains.