This blog post covers my experience with Week 1, 'Big Data Hadoop Stack', of the Coursera online course 'Hadoop Platform and Application Framework' by the University of California, San Diego (4/18/2016).
KEY TAKEAWAYS
- There is a very large and sophisticated public domain tool set available under the HADOOP eco-system.
- The tools are arranged in a layered architecture called a 'stack', with data storage at the bottom; resource management and programming tools such as MapReduce at the next layer; and higher-level tools for accumulating and manipulating structured and unstructured data, such as Hive, Pig, HBase, Sqoop, etc., above that.
- Using virtual machines running on either Oracle VirtualBox or VMware, one can very quickly get a flavor for how a real-life application can work.
- The VM tutorials give you a pretty good feel for what a whole problem-to-solution implementation looks like. However, you need to understand individual command syntax, and it takes lots of practice, before you can create your own applications.
- It is neat to see how structured data, such as order sales history, when integrated with unstructured data, such as web click streams, can lead to interesting value-creating insights.
- There were a lot of times when the VM filled my screen with errors, but most of them were caused by typos at my end rather than problems with the tutorials or documentation. The folks at Coursera and Cloudera have done a fine job.
THE DETAILS
After doing the 'Introduction to Big Data' course and
associated exercises with Hadoop, I found Week 1 tougher than I thought it would
be. There was a lot of basic description of the Hadoop stack, including
historical information. (Hadoop was the name of a toy elephant belonging to the son of developer Doug
Cutting.) The Hadoop stack (or layers of software tools) consists of (but is not
limited to):
HDFS – the distributed file system, which stores data on
various nodes of a commodity cluster.
MapReduce – the programming model, which distributes work and data
across the nodes (map) and then aggregates the results (reduce).
YARN – the higher-level resource manager and job scheduler.
Pig – a scripting language for data analysis.
Sqoop – a translator of SQL data to Hadoop, so that one can
import structured data seamlessly into the parallel processing architecture.
Hive – a SQL-like query environment.
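As a minimal illustration of the HDFS layer, the standard `hdfs dfs` client exposes familiar Unix-style file operations against the distributed file system (the paths and file name below are my own examples, not from the course):

```shell
# Copy a local file into HDFS and inspect it there. Assumes a running
# cluster; the /user/cloudera paths are illustrative placeholders.
hdfs dfs -mkdir -p /user/cloudera/demo
hdfs dfs -put orders.csv /user/cloudera/demo/

# List the directory and view the first few lines straight from HDFS.
hdfs dfs -ls /user/cloudera/demo
hdfs dfs -cat /user/cloudera/demo/orders.csv | head
```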
We looked at various implementations of the Hadoop stack at
major Internet companies such as Yahoo, Google, Facebook, and LinkedIn.
As part of Week 1 we had to pass a 10-question quiz covering
the material we had studied. I got 1 wrong out of 10 and passed – I had to get at
least 80% right, so I felt good!
We downloaded the Cloudera distribution of Hadoop in the
form of a prebuilt virtual machine and ran it in Oracle VirtualBox. I am running it
on an iMac with a 2.7 GHz quad-core i5 processor and 8 GB of RAM, on Mac OS X 10.11.4 (El
Capitan). Given the quad-core CPU and the 8 GB of RAM, I felt none of the
performance hits that people can sometimes feel operating in a virtual
machine.
The course goes through a dizzying array of application
descriptions. While your head is abuzz with a menagerie of animal terms, you are
now asked to work through tutorial examples on the Cloudera VM.
The first step was getting familiarized with the VM, the
browser, the terminal, etc., which didn't look much different from a standard
Unix setup. However, things got dicey
pretty quickly.
As a first step we were asked to use Sqoop to import data from a SQL
database holding a fictitious business' order history, products,
customers, etc. into HDFS, for later querying with Hive. Unfortunately, I couldn't figure out how to
copy and paste the commands from the tutorial web page into the terminal, so I
had to manually type in long strings of commands. This generated a lot of typos
and a string of errors. But I persevered and finally was able to create a series of files in HDFS,
transferring the data in its raw format from the SQL tables to HDFS. We also
copied the data schema into a separate set of files called Avro schema files. One of the benefits of manually typing all
the commands is that it slowed me down and made me think about what each
term was doing, rather than blindly copying and pasting – a blessing in disguise!
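For reference, the Sqoop import step looked roughly like the following. The JDBC URL, database name, and credentials are the Cloudera tutorial defaults as I recall them, so treat every value here as a placeholder for your own environment:

```shell
# Import every table from the MySQL 'retail_db' database into HDFS as
# Avro data files; Sqoop also writes each table's schema to a local
# .avsc file. Connection details are assumptions -- adjust for your VM.
sqoop import-all-tables \
    --connect jdbc:mysql://localhost/retail_db \
    --username retail_dba \
    --password cloudera \
    --as-avrodatafile \
    --warehouse-dir /user/hive/warehouse \
    -m 1
```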
We then queried the data using a SQL-like script in Hive
to generate the top 10 products sold by the business. It was neat to see a real-life
business application working on a Hadoop platform.
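The top-products query itself was plain HiveQL along these lines; the table and column names below are assumptions based on the tutorial's retail schema, not the exact script from the course:

```shell
# Run a HiveQL aggregation from the shell. 'order_items' and 'products'
# are assumed table names from the imported retail database.
hive -e "
SELECT p.product_name, COUNT(o.order_item_id) AS sales
FROM order_items o
JOIN products p ON o.order_item_product_id = p.product_id
GROUP BY p.product_name
ORDER BY sales DESC
LIMIT 10;
"
```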
About this point I figured out how to copy and paste
commands from the tutorial to the Cloudera VM – Hallelujah!
The next step was to demonstrate the power of Big Data
integration. We copied about 1 month of previously collected web logs that
showed which items visitors were clicking on, in the form of web-stream access log
data. The file has almost 20 million lines.
This kind of data is collected by Flume, a service that collects, aggregates, and
moves large volumes of data. We used a command-line JDBC (Java Database
Connectivity) environment called "Beeline" (to connect to (Bee) Hive – get it!) to
create tables in Hive over the access log files.
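The Beeline step was along these lines. The JDBC URL and the table definition are illustrative only; the actual tutorial DDL uses a regex SerDe to parse the raw log format into columns:

```shell
# Connect to HiveServer2 with Beeline and define an external table over
# the raw web logs already sitting in HDFS. The hostname, port, user,
# and table layout are assumptions for illustration.
beeline -u jdbc:hive2://localhost:10000 -n cloudera -e "
CREATE EXTERNAL TABLE IF NOT EXISTS web_logs_raw (line STRING)
LOCATION '/user/hive/warehouse/original_access_logs';
"
```

An external table like this just overlays a schema on files that are already in HDFS, so no data is copied or moved.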
BENEFITS OF COMBINING STRUCTURED AND UNSTRUCTURED DATA
We then ran a query on the top items accessed by users of
the company's online web site (the same company whose customer order data we were
originally looking at). Most of the top 10 items sold matched the top 20 or so
items visited on the web site. There was a notable exception: a pair of
kids' sneakers was the 2nd most frequently accessed item, yet
none were being sold. Further review showed that the price had been
miskeyed. On correction of the price, the item started selling well. Thus
integrating unstructured data, such as web access logs, with an organization's
structured data provided valuable insights. This resulted in a significant increase
in sales.
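The comparison boiled down to running one aggregation over each data set and setting the results side by side. A simplified sketch of the click-stream half in HiveQL (the table and column names here are assumed; the real tutorial extracts product names from the request URL):

```shell
# Count page hits per product URL in the parsed click-stream table.
# 'tokenized_access_logs' and its 'url' column are assumed names; the
# result is compared against the top-sellers query on the order data.
hive -e "
SELECT url, COUNT(*) AS hits
FROM tokenized_access_logs
WHERE url LIKE '%/product/%'
GROUP BY url
ORDER BY hits DESC
LIMIT 20;
"
```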