A Newbie’s adventure into the Hadoop Ecosystem - Week 2
This week was pretty intense, with three main topics covered:
(a) the Hadoop stack, (b) the execution frameworks, and (c) a survey of key Hadoop-based
applications.
Key Takeaways
- Learned about the evolution of the Hadoop framework from 1.0 to 2.0.
- We looked at the execution layers within Hadoop (YARN, Tez, and Spark), how they are layered on top of HDFS, and where it makes sense to use Tez or Spark rather than plain MapReduce on YARN.
- We reviewed Hadoop resource scheduling using Fairshare and Capacity Scheduling.
- We reviewed in detail what Pig, Hive, and HBase are. We did a demo of each in its corresponding interactive environment (Grunt, Beeline, and the HBase shell), typically loading data, creating tables, displaying or querying the table, and writing/storing output.
This week was much more demo-driven, with the instructor asking us to familiarize ourselves by working hands-on in a Cloudera virtual machine rather than just showing slides.
Hadoop Stack
We started off by revisiting the basic Hadoop framework:
HDFS, MapReduce, etc. We learned about the evolution from Hadoop 1.0 to Hadoop 2.0,
where YARN took over resource allocation from the MapReduce engine. It showed
how Hadoop 2.0 led to a much more scalable architecture that
can handle hundreds to thousands of nodes, with improved resiliency as well.
Hadoop Execution Framework
The second lesson focused on the execution framework. It
showed how HDFS, YARN, Tez, and Spark are layered on top of each other:
YARN sits on HDFS, Tez and Spark run on YARN, and application layers such as
Pig and Hive execute through the Tez or Spark layer. There was a detailed
discussion of YARN, Tez, and Spark.
We looked at examples where it is much more efficient to use
Tez or Spark, such as cyclic data flows. In
some cases Spark runs up to 100x faster than plain MapReduce, thanks to its ability
to cache data in memory instead of writing it back to HDFS every time. Spark's
ability to be called from higher-level languages such as Java, Scala, R, and
Python (I have never used Scala or R, and have used Java only sparingly, so I
was glad to see a familiar name) is also very convenient.
There was an in-depth discussion of Hadoop resource
scheduling. The traditional First In, First Out (FIFO) model may not work well
if a series of small jobs is scheduled after a big job. Two other scheduling
approaches were described: Fairshare, which balances resources across
applications over time, and Capacity
Scheduling, where queues and sub-queues
are created and jobs are assigned to queues.
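To make Capacity Scheduling a bit more concrete: in Hadoop it is typically configured in capacity-scheduler.xml, where the cluster is divided into named queues with guaranteed shares. The queue names and percentages below are invented for illustration, not from the course:

```xml
<!-- Two top-level queues under root; names and capacities are illustrative -->
<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>prod,dev</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.prod.capacity</name>
  <value>70</value> <!-- prod queue is guaranteed 70% of cluster capacity -->
</property>
<property>
  <name>yarn.scheduler.capacity.root.dev.capacity</name>
  <value>30</value> <!-- dev queue is guaranteed the remaining 30% -->
</property>
```

A job is then submitted to a particular queue (for MapReduce jobs, via the `mapreduce.job.queuename` property), and a queue can borrow capacity the other queue is not currently using.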
Even though some of this technical material was deep, I have
mostly been able to keep up, scoring close to 100% on the quizzes. I am learning a lot about how
data and tasks actually map onto individual compute nodes, about issues in execution and
scheduling, and about how the ability to cache data improves performance.
Hadoop Based Applications
The last lesson was an introduction to the Hadoop-based
applications. We covered the main applications running under Hadoop for
databases and querying (Hive, Cassandra), machine learning (Mahout), and graph
processing (Giraph). Some of these we covered in greater detail.
Pig – is a
platform for data processing that uses a higher-level scripting language called Pig Latin. We demoed Pig using an
interactive environment called ‘grunt’ (Pig – Grunt, I get it). We loaded a
file from HDFS, extracted a subset of the data, and then
saved the result back to HDFS.
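For flavor, a grunt session along the lines described might look like this. The file path, schema, and filter condition are my own placeholders, not the actual assignment:

```pig
-- Load a comma-separated file from HDFS (path and schema are hypothetical)
data = LOAD '/user/cloudera/input/sales.csv' USING PigStorage(',')
       AS (id:int, region:chararray, amount:double);

-- Keep a subset of the data
subset = FILTER data BY amount > 100.0;

-- Peek at the result, then store it back to HDFS
DUMP subset;
STORE subset INTO '/user/cloudera/output/sales_subset' USING PigStorage(',');
```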
Hive – we then
studied Hive for interactive querying of data. It is a useful tool for data
mining, machine learning, etc. We used an interactive environment called Beeline. We created a table with a
series of column headers and loaded a file from HDFS into the table. We then
selected four of the five or six columns (the query ran as a MapReduce job
under the hood) and displayed them in ascending order on one of the columns.
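In Beeline, the steps described map to HiveQL along these lines. The table name, columns, and file path are hypothetical stand-ins for whatever the assignment actually used:

```sql
-- Create a table with a series of column headers (schema is illustrative)
CREATE TABLE orders (
  id INT, customer STRING, region STRING, amount DOUBLE, order_date STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- Load a file from HDFS into the table
LOAD DATA INPATH '/user/cloudera/input/orders.csv' INTO TABLE orders;

-- Select four of the columns, sorted ascending on one of them;
-- Hive compiles this into a MapReduce job behind the scenes
SELECT id, customer, region, amount
FROM orders
ORDER BY amount ASC;
```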
HBase – is a
non-relational database that sits on top of HDFS. It is a scalable data store.
We used the interactive HBase shell to create a table. We were asked to manually
put three rows and three columns into the table, and then to display one
column of all three rows. I got tired of all the commands needed to enter data, so I never
finished the interactive assignment, but took their word for it that it would
work.
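The HBase shell exercise would have gone roughly like this (the table, column family, and cell values are my own made-up examples):

```
# Create a table with one column family
create 'people', 'info'

# Manually put in rows and columns, one cell per command --
# this is the part that gets tedious fast
put 'people', 'row1', 'info:name', 'Alice'
put 'people', 'row1', 'info:city', 'Austin'
put 'people', 'row2', 'info:name', 'Bob'
put 'people', 'row2', 'info:city', 'Boston'
put 'people', 'row3', 'info:name', 'Carol'
put 'people', 'row3', 'info:city', 'Chicago'

# Display one column across all three rows
scan 'people', {COLUMNS => ['info:name']}
```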
There was a six-question quiz at the end. I assumed it was
open book and went back to the slides several times, so I scored 100%, which is
a testament at least to knowing where to look, if not to knowing the material per se.
PS: One of the benefits of writing a blog every week is that
it is forcing me to go back and review the slides – if I cannot write a simple
summary of what I have learned I must not have learned it right. This along
with the quizzes is keeping me intellectually honest and somewhat accountable.
This is a unique experience, as this is the first time I am taking an online
course. I always thrived on peer competition, and here there isn’t any; it’s just
you vs. your laziness.