Friday, April 29, 2016

A Newbie’s adventure into the Hadoop Ecosystem - Week 2



This week was pretty intense, with three main topics covered: a) the Hadoop stack, b) the execution framework, and c) a survey of key Hadoop-based applications.

Key Takeaways

  1. Learned about the evolution of the Hadoop framework from 1.0 to 2.0.
  2. We looked at the various execution layers within Hadoop – Yarn, Tez and Spark – and how they are layered on top of HDFS. We identified where it makes sense to use Tez/Spark vs plain MapReduce on Yarn.
  3. We reviewed Hadoop resource scheduling using Fairshare and Capacity Scheduling.
  4. We reviewed in detail what Pig, Hive and HBase are. We demoed each of these applications in its corresponding interactive environment (Grunt, Beeline and the HBase shell), and in each case loaded data, created tables, displayed or queried the tables, and wrote/stored output.


This week was much more demo driven, with the instructor asking us to familiarize ourselves with the tools by working hands-on in a Cloudera Virtual Machine rather than just showing slides.

Hadoop Stack 


We started off by revisiting the basic Hadoop framework: HDFS, MapReduce etc. We learned about the evolution from Hadoop 1.0 to Hadoop 2.0, where Yarn replaced MapReduce as the basic resource allocation engine. The lesson showed how this evolution led to a much more scalable architecture that can handle hundreds to thousands of nodes, with improved resiliency as well.

Hadoop Execution Framework


The second lesson focused on the execution framework. It showed how HDFS, Yarn, Tez and Spark are layered on top of each other; application layers such as Pig and Hive run through the Tez or Spark layer. There was a detailed discussion of Yarn, Tez and Spark.
We looked at examples where it is much more efficient to use Tez or Spark, such as cyclic data flows. In some cases Spark runs up to 100x faster than MapReduce on Yarn, thanks to its ability to cache data in memory instead of writing it to HDFS every time. Spark's ability to be called from higher-level languages such as Java, R, Python and Scala (I have never used Scala or R, and Java only sparingly, so I was glad to see a familiar name) is also very convenient.

There was an in-depth discussion of Hadoop resource scheduling. The traditional First In, First Out (FIFO) model may not work well if a series of small jobs is scheduled after a big job. Two other scheduling approaches were described: Fairshare, which balances resources across applications over time, and Capacity Scheduling, where queues and sub-queues are created and jobs are assigned to queues.
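To make Capacity Scheduling a bit more concrete, here is a minimal sketch of what a `capacity-scheduler.xml` fragment might look like; the queue names and percentages are made-up examples of mine, not values from the course:

```xml
<!-- Hypothetical fragment: split cluster capacity between two queues -->
<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>batch,adhoc</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.batch.capacity</name>
  <value>70</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.adhoc.capacity</name>
  <value>30</value>
</property>
```

Jobs submitted to the `batch` queue get roughly 70% of cluster resources, while interactive jobs in `adhoc` get the rest, so a big batch job can no longer starve everything behind it.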

Even though some of this technical material was deep, I have mostly been able to keep up, scoring close to 100% on the quizzes. I am learning a lot about the actual mapping of data and tasks onto individual compute nodes, the issues in execution and scheduling, and how the ability to cache data improves performance.

Hadoop Based Applications



The last lesson was spent on an introduction to Hadoop-based applications. We covered the main applications running under Hadoop for databases and querying (Hive, Cassandra), machine learning (Mahout) and graph processing (Giraph), and went into some of them in greater detail.

Pig – a platform for data processing that uses a higher-level scripting language called Pig Latin. We demoed Pig using an interactive environment called ‘grunt’ (Pig – Grunt, I get it). We loaded a file from HDFS, manipulated it by copying out a subset of the data, and then saved the result back to HDFS.
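A Grunt session along the lines described above might look like this; the file path, field names and filter condition are all hypothetical, not the course's actual exercise:

```pig
-- Load a comma-separated file from HDFS (path and schema are made up)
data = LOAD '/user/cloudera/input.txt' USING PigStorage(',')
       AS (name:chararray, qty:int);
-- Keep a subset of the records
subset = FILTER data BY qty > 10;
-- Store the subset back to HDFS
STORE subset INTO '/user/cloudera/output' USING PigStorage(',');
```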

Hive – We then studied Hive for interactive querying of data. It is a useful tool for data mining, learning etc. We used an interactive environment called Beeline. We created a table with a series of column headers and loaded a file from HDFS into it. Then, with MapReduce running underneath, we selected 4 of the 5-6 columns and displayed them in ascending order on one of the column fields.
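In HiveQL, the Beeline steps above would look roughly like the sketch below; the table name, columns and sort field are invented for illustration:

```sql
-- Create a table and load a file from HDFS into it (names are hypothetical)
CREATE TABLE orders (id INT, product STRING, price DOUBLE, qty INT, region STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
LOAD DATA INPATH '/user/cloudera/orders.csv' INTO TABLE orders;
-- Project 4 of the columns, sorted ascending on one field (runs as a MapReduce job)
SELECT id, product, price, qty FROM orders ORDER BY price ASC;
```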

HBase – a non-relational database that sits on top of HDFS; it is a scalable data store. We used the interactive HBase shell to create a table and were asked to manually put in 3 rows and 3 columns, then to display one column across all 3 rows. I got tired of all the commands needed to enter the data, so I never finished the interactive assignment, but took their word for it that it would work.
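The shell exercise described above would look something like this; the table, column-family and value names are my own placeholders:

```
create 'demo', 'cf'
put 'demo', 'row1', 'cf:c1', 'v1'
put 'demo', 'row1', 'cf:c2', 'v2'
put 'demo', 'row1', 'cf:c3', 'v3'
# ...repeat the puts for 'row2' and 'row3'...
scan 'demo', {COLUMNS => ['cf:c1']}   # display one column for all 3 rows
```

You can see why it gets tedious: each cell is a separate `put`, so 3 rows by 3 columns means nine commands before you can scan anything.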

There was a six-question quiz at the end. I assumed it was open book and went back to the slides several times, so I scored 100% – a testament to at least knowing where to look, if not to knowing per se.

PS: One of the benefits of writing a blog every week is that it forces me to go back and review the slides – if I cannot write a simple summary of what I have learned, I must not have learned it right. This, along with the quizzes, is keeping me intellectually honest and somewhat accountable. It is a unique experience, as this is the first time I am taking an online course – I always thrived on peer competition, and here there isn't any; it's just you vs. your laziness.


Monday, April 25, 2016

A newbie's adventures in the Hadoop ecosystem - Week 1




This blog covers my experience doing the Coursera online course
‘Hadoop Platform and Application Framework’ by UC San Diego.

Week 1 Big Data Hadoop Stack (4/18/2016).

KEY TAKEAWAYS


  1. There is a very large and sophisticated open-source tool set available under the Hadoop ecosystem.
  2. The tools form a layered architecture called a ‘stack’, with data storage at the bottom, resource management and programming tools such as MapReduce at the next layer, and higher-level tools for accumulating and manipulating structured and unstructured data, such as Hive, Pig, HBase, Sqoop etc., above that.
  3. Using virtual machines running on either Oracle VirtualBox or VMware, one can very quickly get a feel for how a real-life application can work.
  4. The VM tutorials give you a pretty good feel for what a whole problem-to-solution implementation looks like – however, you need to understand individual command syntax and get lots of practice before you can create your own applications.
  5. It is neat to see how structured data such as order sales history, when integrated with unstructured data such as web click streams, can lead to interesting value-creating insights.
  6. There were a lot of times when the VM filled my screen with errors, but most of them were caused by typos at my end rather than problems with the tutorials or documentation. The folks at Coursera and Cloudera have done a fine job.



THE DETAILS

After doing the ‘Introduction to Big Data’ course and associated exercises with Hadoop, I found Week 1 tougher than I thought it would be. There was a lot of basic description of the Hadoop stack, including historical information. (Hadoop was the name of a toy elephant belonging to the son of developer Doug Cutting.) The Hadoop stack (or layers of software tools) consists of (but is not limited to):

HDFS – the distributed file system, which stores data on the various nodes of a commodity cluster.
MapReduce – the programming model, which distributes work to the nodes holding the data (map) and then aggregates the results (reduce).
Yarn – a higher-level resource scheduler.
Pig – a scripting platform for data processing.
Sqoop – a translator of SQL data to Hadoop, so that one can import structured data seamlessly into the parallel processing architecture.
Hive – a query environment.
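To make the MapReduce model concrete, here is a minimal sketch of the map/shuffle/reduce pattern in plain Python – a toy word count on one machine, standing in for what Hadoop distributes across a cluster (the data and function names are my own, not from the course):

```python
from collections import defaultdict

def map_phase(lines):
    # Mapper: emit a (word, 1) pair for every word, as each node would
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle/sort: group values by key (the framework does this between phases)
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reducer: sum the counts for each word
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["the quick brown fox", "the lazy dog", "the fox"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts["the"])  # 3
```

Hadoop runs the same three phases, except the mappers and reducers execute in parallel on the nodes where the data actually lives.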

 

We looked at various implementations of the Hadoop stack at major Internet companies such as Yahoo, Google, Facebook, LinkedIn etc.

As part of Week 1 we had to pass a 10-question quiz covering the items we had studied. I got 1 wrong out of 10 and passed – I needed at least 80% right, so I felt good!


We downloaded the Cloudera distribution of Hadoop in the form of a prebuilt virtual machine into Oracle VirtualBox. I am running it on an iMac with an i5 2.7 GHz processor and 8 GB of RAM on MacOS 10.11.4 (El Capitan). Given the quad-core CPU and the 8 GB of RAM, I felt none of the performance hit that people can sometimes experience operating in a virtual machine.

The course goes through a dizzying array of application descriptions. While your head is still abuzz with a menagerie of animal terms, you are asked to work through tutorial examples on the Cloudera VM.

The first step was getting familiarized with the VM, the browser, the terminal etc., which didn't look that different from a standard Unix setup. However, things got dicey pretty quickly.

As a first step we were asked to use Sqoop to import data from a SQL database – a fictitious business' order history, products, customers etc. – into Hive. Unfortunately, I couldn't figure out how to copy and paste the commands from the tutorial web page to the terminal, so I had to manually type in long strings of commands. This generated a lot of typos and a string of errors. But I persevered and finally was able to create a series of files in HDFS, transferring the data in its raw format from the SQL tables to HDFS. We also copied the data schema into a separate set of files called Avro schema files. One benefit of manually typing all the commands is that it slowed me down and made me think about what each term was doing rather than blindly copy-pasting – a blessing in disguise!
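For reference, the kind of Sqoop command typed in this step looks roughly like the following; the connection string, username and directory are placeholders of my own, not the tutorial's exact values:

```shell
# Import all tables from a SQL database into HDFS as Avro data files
sqoop import-all-tables \
  --connect jdbc:mysql://localhost/retail_db \
  --username retail_user -P \
  --as-avrodatafile \
  --warehouse-dir /user/hive/warehouse
```

You can see why typing this by hand invites typos – every flag and path has to be exact, and a single wrong character produces a screenful of Java stack traces.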

We then queried the data using a SQL-type script in Hive to generate the top 10 products sold by the business. It was neat to see a real-life business application working on a Hadoop platform.
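A top-10 query of this kind might look like the sketch below; the table and column names are assumptions on my part, not the tutorial's actual schema:

```sql
SELECT p.product_name, COUNT(*) AS times_sold
FROM order_items oi
JOIN products p ON oi.product_id = p.product_id
GROUP BY p.product_name
ORDER BY times_sold DESC
LIMIT 10;
```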

About this point I figured out how to copy and paste commands from the tutorial to the Cloudera VM – Hallelujah!

The next step was to demonstrate the power of Big Data integration. We copied about 1 month of previously collected web logs showing what items visitors were clicking on, in the form of web-stream access log data. The file has almost 20 million lines. This kind of data is collected by Flume, a service that collects, aggregates and moves large volumes of data. We used a command-line JDBC (Java Database Connectivity) environment called ‘Beeline’ (to connect to (Bee) Hive – get it!) to create tables in Hive over the access log files.

BENEFITS OF  COMBINING STRUCTURED AND UNSTRUCTURED DATA


We then ran a query on the top items accessed by users on the company's online web site (the same company whose customer order data we were originally looking at). Most of the top 10 items sold matched the top 20 or so items visited on the web site, with one notable exception: a pair of kids' sneakers was the 2nd most frequently accessed item, yet none were being sold. Further review showed that the price had been miskeyed; once it was corrected, the item started selling well. Thus integrating unstructured data, such as web access logs, with the organization's structured data provided valuable insights – and resulted in a significant increase in sales.