Monday, April 25, 2016

A newbie's adventures in the Hadoop ecosystem - Week 1




This blog covers my experience taking the Coursera online course ‘Hadoop Platform and Application Framework’ by the University of California, San Diego.

Week 1: Big Data Hadoop Stack (4/18/2016)

KEY TAKEAWAYS


  1. There is a very large and sophisticated open-source tool set available in the Hadoop ecosystem.
  2. The tools form a layered architecture called a ‘stack’, with data storage at the bottom, resource management and programming frameworks such as MapReduce at the next layer, and higher-level tools for accumulating and manipulating structured and unstructured data, such as Hive, Pig, HBase, Sqoop etc., above that.
  3. Using virtual machines running on either Oracle VirtualBox or VMware, one can very quickly get a flavor for how a real-life application can work.
  4. The VM tutorial gives you a pretty good feel for what a whole problem-to-solution implementation looks like – however, you need to understand individual command syntax and get lots of practice before you can create your own applications.
  5. It is neat to see how structured data such as order sales history, when integrated with unstructured data such as web click streams, can lead to interesting value-creating insights.
  6. There were a lot of times when the VM filled my screen with errors, but most of them were caused by typos at my end rather than problems with the tutorials or documentation. The folks at Coursera and Cloudera have done a fine job.



THE DETAILS

After doing the ‘Introduction to Big Data’ course and associated exercises with Hadoop, I found Week 1 tougher than I thought it would be. There was a lot of basic description of the Hadoop stack, including historical information. (Hadoop was the name of a toy elephant belonging to the son of developer Doug Cutting.) The Hadoop stack (or layers of software tools) includes, but is not limited to:

HDFS – the distributed file system, which stores data across the nodes of a commodity cluster.
MapReduce – the programming model, which distributes work to the various nodes (map) and aggregates the results (reduce).
YARN – the higher-level resource manager and scheduler.
Pig – a high-level scripting language for data flows.
Sqoop – a translator of SQL data to Hadoop, so that one can import structured data seamlessly into the parallel processing architecture.
Hive – a SQL-like query environment.
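The map/shuffle/reduce idea in the list above can be mimicked with ordinary Unix pipes, which is conceptually how a Hadoop Streaming word count is structured. This is a toy single-machine sketch of the model, not an actual cluster job:

```shell
# Toy "MapReduce" word count with Unix pipes:
#   map:     emit one key (word) per line
#   shuffle: sort brings identical keys together
#   reduce:  uniq -c counts each group of keys
echo "big data big hadoop data big" |
  tr ' ' '\n' |   # map
  sort |          # shuffle
  uniq -c |       # reduce
  sort -rn        # rank words by count, highest first
```

This prints each distinct word with its count, most frequent first – the same shape of result as a Hadoop word-count job, just without the distribution across nodes.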


We looked at various implementations of the Hadoop stack at major Internet companies such as Yahoo, Google, Facebook, LinkedIn etc.

As part of week 1 we had to pass a 10-question quiz covering the material so far. I got 1 wrong out of 10 and passed – I had to get at least 80% right, so I felt good!


We downloaded the Cloudera distribution of Hadoop in the form of a prebuilt virtual machine and loaded it into Oracle VirtualBox. I am running it on an iMac with a 2.7 GHz quad-core i5 processor and 8 GB of RAM, on MacOS 10.11.4 (El Capitan). Given the quad-core CPU and the 8 GB of RAM, I felt none of the performance hits that people can sometimes experience when operating inside a virtual machine.

The course goes through a dizzying array of application descriptions. While your head is abuzz with a menagerie of animal terms, you are now asked to work through tutorial examples on the Cloudera VM.

The first step was getting familiarized with the VM, the browser, the terminal etc., which didn’t look that different from a standard Unix setup. However, things got dicey pretty quickly.

As a first step we were asked to use Sqoop to import data from a SQL database – a fictitious business’ order history, products, customers etc. – into Hive. Unfortunately, I couldn’t figure out how to copy and paste the commands from the tutorial web page into the terminal, so I had to manually type in long strings of commands. This generated a lot of typos and a string of errors. But I persevered and finally was able to create a series of files in HDFS, transferring the data in its raw format from the SQL tables to HDFS. We also copied the data schema into a separate set of files called Avro schema files. One of the benefits of manually typing all the commands is that it slowed me down and made me think about what each of the terms was doing, rather than blindly copy-pasting – a blessing in disguise!
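For flavor, a Sqoop invocation of this kind looks roughly like the following. This is a hedged sketch, not the tutorial’s exact command: the database URL, username and directories are illustrative placeholders, and it needs a running cluster to do anything.

```shell
# Import every table from a relational database into HDFS as Avro
# data files. All connection details are illustrative placeholders.
sqoop import-all-tables \
  --connect jdbc:mysql://dbhost/retail_db \
  --username retail_user \
  -P \
  --as-avrodatafile \
  --warehouse-dir /user/hive/warehouse \
  -m 1
```

The `--as-avrodatafile` flag is what produces the raw data plus the separate Avro schema files mentioned above; `-P` prompts for the database password interactively.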

We then queried the data using a SQL-like script in Hive to generate the top 10 products sold by the business. It was neat to see a real-life business application working on a Hadoop platform.
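The top-10 query itself is ordinary SQL. A hedged sketch – the table and column names here are illustrative, not the tutorial’s exact schema:

```sql
-- Rank products by number of order line items (illustrative schema).
SELECT p.product_name, COUNT(*) AS times_sold
FROM order_items oi
JOIN products p ON oi.product_id = p.product_id
GROUP BY p.product_name
ORDER BY times_sold DESC
LIMIT 10;
```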

About this point I figured out how to copy and paste commands from the tutorial to the Cloudera VM – Hallelujah!

The next step was to demonstrate the power of Big Data integration. We copied about 1 month of previously collected web-stream access logs showing what items visitors were clicking on. The file has almost 20 million lines. This kind of data is collected by Flume, a service that collects, aggregates and moves large volumes of data. We used a command line JDBC (Java Database Connectivity) environment called “Beeline” (to connect to the (Bee) Hive – get it!) to create tables in Hive over the access log files.
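Beeline connects with a JDBC URL (something like `beeline -u jdbc:hive2://localhost:10000`), and from its prompt a table can be declared over files already sitting in HDFS. A hedged sketch with illustrative names and paths:

```sql
-- Declare a Hive table over raw access logs already in HDFS.
-- Table name and path are illustrative; parsing the individual log
-- fields would also need a SerDe rather than a single string column.
CREATE EXTERNAL TABLE web_logs (line STRING)
LOCATION '/user/hive/warehouse/original_access_logs';
```

Because the table is EXTERNAL, Hive reads the log files in place instead of copying them, which is handy for data that Flume is already delivering into HDFS.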

BENEFITS OF COMBINING STRUCTURED AND UNSTRUCTURED DATA


We then ran a query for the top items accessed by users on the company’s online web site (the same company whose customer order data we were originally looking at). Most of the top 10 items sold matched the top 20 or so items visited on the web site, with one notable exception – a pair of kids’ sneakers was the 2nd most frequently accessed item, yet none were being sold. Further review showed that the price had been miskeyed. Once the price was corrected, the item started selling well, resulting in a significant increase in sales. Thus integrating unstructured data, such as web access logs, with an organization’s structured data provided valuable insights.
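The flavor of the web-side comparison is again plain SQL. A sketch with illustrative table and column names:

```sql
-- Rank product pages by number of visits in the parsed logs
-- (table and column names are illustrative).
SELECT url, COUNT(*) AS visits
FROM tokenized_logs
WHERE url LIKE '%/product/%'
GROUP BY url
ORDER BY visits DESC
LIMIT 20;
```

Lining this result up against the top-10 sales query is what exposed the heavily visited but never-sold sneakers.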





