Friday, April 29, 2016

A Newbie’s adventure into the Hadoop Ecosystem - Week 2



This week was pretty intense, with three main topics covered: a) the Hadoop Stack, b) the Execution Framework, and c) a survey of key Hadoop-based applications.

Key Takeaways

  1. Learned about the evolution of Hadoop framework from 1.0 to 2.0.
  2. We looked at various execution layers within Hadoop – Yarn, Tez and Spark and how they are layered on top of HDFS. We identified where it makes sense to use Tez/Spark vs Yarn.
  3. We reviewed Hadoop resource scheduling using Fairshare and Capacity Scheduling.
  4. We reviewed in detail what Pig, Hive and HBase are. We did a demo of each of these applications in its corresponding interactive environment (Grunt, Beeline and the HBase shell), where we loaded data, created tables, displayed or queried the tables and wrote/stored output.


This week was much more demo driven, with the instructor asking us to familiarize ourselves by working hands-on in a Cloudera Virtual Machine rather than just showing slides.

Hadoop Stack 


We started off by revisiting the basic Hadoop framework: HDFS, MapReduce etc. We learned about the evolution from Hadoop 1.0 to Hadoop 2.0, where Yarn replaced MapReduce as the basic resource allocation engine. It showed how the evolution to Hadoop 2.0 led to a much more scalable architecture that can handle 100s to 1000s of nodes and also improved resiliency.

Hadoop Execution Framework


The second lesson focused on the Execution Framework. It showed how HDFS, Yarn, Tez and Spark are layered on top of each other. Application layers such as Pig and Hive run through the Tez or Spark layer. There was a detailed discussion of Yarn, Tez and Spark.
We looked at examples where it is much more efficient to use Tez or Spark for cyclic data flows. In some cases Spark runs up to 100x faster than plain MapReduce due to its ability to cache data in memory instead of writing it to HDFS every time. Spark's ability to be called from higher-level languages such as Java, R, Python and Scala (I have never used Scala or R, and only sparingly Java, so I was glad to see a familiar name) is also very convenient.
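The caching point is easiest to see in a tiny PySpark sketch. This is my own illustration, not from the course slides, and the file path and "ERROR" filter are hypothetical:

```python
# A minimal sketch, assuming PySpark is installed and an HDFS path exists.
from pyspark import SparkContext

sc = SparkContext("local", "cache-demo")
lines = sc.textFile("hdfs:///data/app.log")        # hypothetical path
errors = lines.filter(lambda l: "ERROR" in l).cache()  # keep the RDD in memory

errors.count()   # first action reads from HDFS and populates the cache
errors.first()   # second action reuses the cached data, no re-read from disk
```

The second action is where the speedup comes from: without `cache()`, each action would recompute the filter from HDFS.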

There was an in-depth discussion of Hadoop resource scheduling. The traditional First In First Out (FIFO) model may not work well if a series of small jobs is scheduled after a big job. Two other scheduling approaches were described: Fairshare, which balances resources across applications over time, and Capacity Scheduling, where queues and sub-queues are created and jobs are assigned to queues.
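To make the Capacity Scheduling idea concrete, here is a minimal configuration sketch of my own (the queue names and percentages are made up) showing two queues splitting cluster capacity in YARN's capacity-scheduler.xml:

```xml
<!-- Sketch of capacity-scheduler.xml: two hypothetical queues, prod and dev -->
<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>prod,dev</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.prod.capacity</name>
  <value>70</value> <!-- prod jobs are guaranteed 70% of the cluster -->
</property>
<property>
  <name>yarn.scheduler.capacity.root.dev.capacity</name>
  <value>30</value> <!-- dev jobs get the remaining 30% -->
</property>
```

Jobs submitted to a queue share that queue's guaranteed capacity, so a big job in `dev` cannot starve the small jobs waiting in `prod`.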

Even though some of this technical stuff was deep, I have been able to mostly keep up, scoring close to 100% on the quizzes. I am learning a lot about the actual mapping of data and tasks onto individual compute nodes, issues in execution and scheduling, and how the ability to cache data improves performance.

Hadoop Based Applications



The last lesson was spent on an introduction to the Hadoop-based applications. We covered the main applications running under Hadoop for databases and querying (Hive, Cassandra), machine learning (Mahout) and graph processing (Giraph). Some we explored in greater detail.

Pig – is a platform for data processing that uses a higher-level scripting language called Pig Latin. We demoed Pig using an interactive environment called ‘grunt’ (Pig – Grunt, I get it). We loaded a file from HDFS, manipulated it by extracting a subset of the data, and then saved the result back to HDFS.
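A sketch of what that grunt session looked like, with hypothetical file and field names (not the course's actual dataset):

```pig
-- Run inside grunt; the path and schema below are illustrative
ratings = LOAD 'data/ratings.csv' USING PigStorage(',')
          AS (userId:int, movieId:int, rating:float);
high    = FILTER ratings BY rating >= 4.0;       -- take a subset of the data
STORE high INTO 'output/high_ratings' USING PigStorage(',');
```

Each statement builds up a data flow; nothing actually runs until the `STORE` (or a `DUMP`) triggers execution.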

Hive – We then studied Hive for interactive querying of data. It is a useful tool for data mining, learning and so on. We used an interactive environment called Beeline. We created a table with a series of column headers and loaded a file from HDFS into it. Using MapReduce under the hood, we selected 4 of the 5-6 columns and displayed them in ascending order of one of the column fields.
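The Beeline session followed roughly this shape; again the table name, columns and path are my own stand-ins, not the course's exact exercise:

```sql
-- Run inside Beeline; names and path are illustrative
CREATE TABLE ratings (userId INT, movieId INT, rating FLOAT, ts BIGINT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

LOAD DATA INPATH '/user/cloudera/ratings.csv' INTO TABLE ratings;

-- Select a subset of the columns, sorted ascending on one field;
-- Hive compiles this into MapReduce (or Tez/Spark) jobs behind the scenes
SELECT userId, movieId, rating FROM ratings ORDER BY rating ASC;
```

The nice part is that it looks like ordinary SQL even though the execution is distributed.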

HBase – is a non-relational database that sits on top of HDFS and provides a scalable data store. We used the interactive HBase shell to create a table. We were asked to manually put 3 rows and 3 columns into the table and then display one column of all 3 rows. I got tired of all the commands needed to enter data, so I never finished the interactive assignment but took their word for it that it would work.
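For the record, the shell session I abandoned would have looked something like this (table, column-family and value names are hypothetical). You can see why entering data cell by cell gets old fast: each `put` sets exactly one cell.

```text
# Run inside the hbase shell; names are illustrative
create 'mytable', 'cf'                          # table with one column family
put 'mytable', 'row1', 'cf:col1', 'value1'      # one put per cell...
put 'mytable', 'row2', 'cf:col1', 'value2'
put 'mytable', 'row3', 'cf:col1', 'value3'
scan 'mytable', {COLUMNS => ['cf:col1']}        # display one column, all rows
```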

There was a six-question quiz at the end. I assumed it was open book and went back to the slides several times, so I scored 100% – a testament to knowing at least where to look, if not knowing per se.

PS: One of the benefits of writing a blog every week is that it is forcing me to go back and review the slides – if I cannot write a simple summary of what I have learned, I must not have learned it right. This, along with the quizzes, is keeping me intellectually honest and somewhat accountable. This is a unique experience, as this is the first time I am taking an online course – I always thrived on peer competition, and there isn’t any – it’s just you vs. your laziness.

