Sunday, May 22, 2016

Week 5: Introduction to Spark - and it's (almost) over!



OK, to be honest, I am pretty tired of writing weekly summaries of what I learned, especially with the heavy programming workload of the past two weeks. But for the sake of completeness I will post a few highlights.

  1. Spark is an interactive environment that is far more interactive than Hadoop Streaming and provides a superior framework to MapReduce (which we learned last week), especially when iterative algorithms are involved, such as in machine learning. It also supports a number of languages such as Python and Scala rather than just Java.
  2. We looked at the Spark architecture, for example in a Cloudera VM on Amazon EC2. There is a cluster manager and a set of worker nodes, each running executors that operate on partitions of the data.
  3. We learned to fire off PySpark, the interactive Spark shell for Python, from the command line.
  4. Lesson 2 was all about Resilient Distributed Datasets (RDDs). These are Spark's data constructs, which store blocks of data across the cluster. They are resilient in that they can be recreated if a node fails. RDDs are immutable; one creates a new RDD by performing transform operations.
  5. We talked about transform operations, or transformations. There are narrow transformations, which are light on network usage, such as map, flatMap, coalesce and select. There are wide transformations such as groupByKey, reduceByKey and repartition; these cause huge movement of data between nodes and are memory/compute intensive. As part of this lecture's homework we did a simple JOIN of two sets of data in Spark using the PySpark environment. I had no trouble passing this.
  6. The third lesson in the series was all about job scheduling, actions, caching data and the use of shared variables. We learned about the Directed Acyclic Graph (DAG) scheduler and how it allows Spark to keep track of the execution pipeline. Interestingly, most Spark transformations are lazy, in the sense that they are not triggered until an action such as collect() is called. (A short PySpark sketch pulling items 4 through 7 together appears right after this list.)
  7. The final assignment was a programming assignment where we had to go back to the channel/show/viewer data from week 4 and redo the assignment using the Spark interactive framework. This assignment was tough and took the better part of my Sunday, but in the end I was able to upload my answer (total viewership for a particular channel) and get 100/100. In getting this one right I had to be shameless and try a lot of things many times, ending in 'epic fails.' I also had to repeatedly go back to past lessons and identify course material relevant to the problem at hand. In the end this assignment was way more complicated than should be expected of a business leader, but I am glad I raised my work to meet the expectations of this week.
  8. I still have a final quiz to pass for Section 3. After numerous tries (and repeated review of the course notes and videos) I am only able to get 4 out of 6 right, and I need at least 5/6. I have (I think) until the end of the week to get this right so that I can pass the course with 16/16 assignments. So far I have 15/16 done.
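
For my own reference, here is a minimal PySpark sketch pulling together the ideas in items 4 through 7: creating RDDs, narrow vs. wide transformations, a join, and the fact that nothing runs until an action. The file names and record layout are made up for illustration; the actual assignment data was different.

    # Run inside the pyspark shell, where sc (the SparkContext) already exists.
    # File names and formats below are invented for illustration.

    # Creating RDDs: immutable, partitioned, and rebuilt from lineage if a node fails
    show_views = sc.textFile("show_views.txt")       # e.g. "SomeShow,1234"
    show_chans = sc.textFile("show_channels.txt")    # e.g. "SomeShow,ABC"

    # Narrow transformations (map): no shuffle of data between nodes
    views = show_views.map(lambda line: line.split(",")) \
                      .map(lambda kv: (kv[0], int(kv[1])))   # (show, viewers)
    chans = show_chans.map(lambda line: line.split(",")) \
                      .map(lambda kv: (kv[0], kv[1]))        # (show, channel)

    # Wide transformations (join, reduceByKey): data is shuffled across nodes
    joined = views.join(chans)                                # (show, (viewers, channel))
    by_channel = joined.map(lambda kv: (kv[1][1], kv[1][0])) \
                       .reduceByKey(lambda a, b: a + b)       # (channel, total viewers)

    # Nothing has actually executed yet -- transformations are lazy.
    # An action such as collect() makes the DAG scheduler run the whole pipeline.
    print(by_channel.collect())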

Overall Observation




Overall I learned a lot in this course and really pushed my ability to code and apply the course material to actual programming problems. Even after such grueling assignments it is humbling to know that I am still far from having a working knowledge of Spark/Hadoop for rigorous data modeling. I just know what questions to ask!

Tuesday, May 17, 2016

Week 4: MapReduce and a Challenging yet Successful Week Coding a Mapper/Reducer for a JOIN Operation

Week 4 – MapReduce


KEY TAKEAWAYS


  1. Learned about the MapReduce framework on Hadoop.
  2. We looked at a simple word count program to understand how a mapper works, how a reducer works, and how you get different results with different numbers of reducers.
  3. We learned about JOIN, where you take two sets of data, join them on common key values, and get meaningful insights out of the combined data.
  4. There was a (what now feels like) pretty easy assignment that was used to demonstrate a JOIN for calculating network viewership by show. We had to code the mapper and reducer in Python and also devise the algorithm. After several days of struggle I was finally able to make it work.
  5. Sometimes bugs/difficulties are a blessing, as they really force you to learn all the tricks of the trade: how to debug code, really understand what each line in example code does, and marshal resources.
  6. There isn't a lot of TA support on the Coursera online courses.

General Info on MapReduce


MapReduce is a key framework for handling and processing large amounts of unconnected data that is relatively static. Hadoop does this by distributing the data across a number of nodes and applying a map function, which assigns a key and a value to each record to create a key/value pair.

In this lesson we studied how the map function works, and then how, once all the data is mapped, the reducer works: the framework shuffles the mapped data so that values for the same key come together, and the reducer sums up the values for each key. We applied this to a simple example of counting words in a line from Star Wars.
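
For reference, a Hadoop Streaming word count mapper and reducer in Python look roughly like this. This is my reconstruction, not the exact course code:

    #!/usr/bin/env python
    # wordcount_mapper.py -- reads lines from stdin, emits "word<TAB>1" for each word
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print("%s\t%d" % (word, 1))

    #!/usr/bin/env python
    # wordcount_reducer.py -- Hadoop sorts the mapper output by key before this runs,
    # so all the counts for a given word arrive on consecutive lines
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.strip().split("\t")
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print("%s\t%d" % (current_word, current_count))
            current_word, current_count = word, int(count)
    if current_word is not None:
        print("%s\t%d" % (current_word, current_count))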

We ran the word count program by writing the mapper and reducer in Python and specifying them as the mapper and reducer files in the Hadoop Streaming command.
We ran the word count with a single reducer, with zero reducers (the job just does the mapping), and with two reducers (which generates two output files).
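
The Hadoop Streaming invocation looked roughly like the following. The jar path and the HDFS paths vary by distribution and setup, so treat this as a sketch rather than the exact command from the course:

    hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
        -input  /user/cloudera/wordcount/input \
        -output /user/cloudera/wordcount/output \
        -mapper wordcount_mapper.py -reducer wordcount_reducer.py \
        -file wordcount_mapper.py -file wordcount_reducer.py \
        -numReduceTasks 2
    # -numReduceTasks 0 gives a map-only job; 1 gives a single output file; 2 gives two output files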

I had reviewed the lectures on Monday and ran the code on Thursday. I took the quiz on Thursday and was happy to see that I had retained most of the material, getting 7/7 right.

The time to fool myself is over – onward to more complex examples of MapReduce.


 JOIN and Programming Assignment


We learned how to join data from two separate data sets using one common key to get some meaningful insight. As part of the assignment we ran an example program with a mapper and a reducer, to practice testing the mapper and reducer code individually and to understand the Hadoop Streaming command syntax. The final assignment was to do a join that built on the previous join examples. There were two sets of data: one set listed TV shows with viewership numbers, and the second listed TV shows with the channels they air on. We had to count the total viewership by channel and then report it for a particular channel, in this case "ABC."
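
The shape of my eventual solution was roughly the following. The line formats are my reconstruction of the assignment data, and the code is a simplified sketch rather than my exact submission:

    #!/usr/bin/env python
    # join_mapper.py -- both data sets arrive on stdin; emit the show as the key and
    # either a viewer count or a channel name as the value
    import sys

    for line in sys.stdin:
        show, value = line.strip().split(",")   # assumed format: "Show,1234" or "Show,ABC"
        print("%s\t%s" % (show, value))

    #!/usr/bin/env python
    # join_reducer.py -- input is sorted by show, so each show's lines are contiguous;
    # if a show airs on the target channel, add its viewer counts to the running total
    import sys

    TARGET = "ABC"
    grand_total = 0
    current_show, show_viewers, airs_on_target = None, 0, False

    for line in sys.stdin:
        show, value = line.strip().split("\t")
        if show != current_show:
            if airs_on_target:
                grand_total += show_viewers
            current_show, show_viewers, airs_on_target = show, 0, False
        if value.isdigit():
            show_viewers += int(value)
        elif value == TARGET:
            airs_on_target = True

    if airs_on_target:
        grand_total += show_viewers
    print("Total viewership for %s: %d" % (TARGET, grand_total))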

It took me a couple of days (many hours each day) just to understand the assignment and figure out the structure of the mapper and reducer programs and how they function in the map reduce framework. I had a very busy weekend and the assignment was due at midnight on Sunday. I worked till midnight but still couldn’t get the totals of the first two shows to match up with the hint provided in the assignment. It was a long and frustrating evening but I also learnt a lot.

On Monday I was a day late but kept plodding; my totals still kept coming up short. Finally, on Tuesday, I exported the data file and my mapper output to Excel and verified that the totals from my reducer were the same as what Excel was calculating.


I then made sure that the data generated (using a pseudo-random number generator) was identical no matter what platform I used. My nephew Pranav Dandekar (who I suspect has done a lot of MapReduce work in his life) also volunteered to verify the data generator on his machine as well as on AWS. I ran the data generator on my virtual machine (VM), my Mac and my Raspberry Pi: same data, so I knew the problem was with my mapper. After looking at my code for the 1000th time, I realized that my mapper was only handling 3-digit viewership numbers, so it was not outputting lines that had 4-digit viewer counts. Once I fixed it, everything went like clockwork. I submitted the result and got 100/100 instantly. All's well that ends well!
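
I did not keep the broken version, but the bug was of this flavor (the record format and the regular expressions here are purely illustrative):

    import re

    line = "SomeShow,1021"   # made-up record with a 4-digit viewer count

    # What I effectively had: matches only a 3-digit count, so this line is silently dropped
    bad = re.search(r"(\w+),(\d{3})$", line)
    # What it should have been: one or more digits
    good = re.search(r"(\w+),(\d+)$", line)

    print(bad)             # None -- the 4-digit line never made it out of the mapper
    print(good.group(2))   # '1021'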

(See the picture below, from the blog at http://xiaochongzhang.me, which demonstrates the MapReduce steps for a simple word count.)

Wednesday, May 4, 2016

Week 3: Introduction to Hadoop Distributed File System (HDFS)



HDFS Architecture 



Started off with a review of the HDFS architecture: what the design considerations for HDFS are (high scalability, robustness to hardware failure, portability across software/hardware platforms) and how they are implemented (using a name node/data node architecture with data replication).

We looked at the impact of file size on processing efficiency. A large number of small files is bad for performance: it increases network traffic, requires more name node memory and dramatically increases the number of map tasks.

We studied how a write operation to HDFS takes place through the name node and data nodes. The lecture stressed the importance of replication and of the pipeline process used to speed up writes. It similarly covered how a read operation is accomplished.

HDFS Performance, Tuning Parameters and Robustness


We then looked into HDFS performance and robustness. HDFS allows one to set parameters such as the replication factor (how many copies of each block are kept for redundancy) and the block size (typically 64 MB, but it could be 128 MB).

The default replication factor is 3. If the replication factor is reduced, you gain performance and use less storage, but you give up robustness: with a replication factor of one, you have no backup if a block is corrupted or a data node fails. They showed an example where reducing the replication factor from 3 to 1 increased the data rate from 779 MB/sec to 2995 MB/sec.
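
For reference, these knobs live in hdfs-site.xml (dfs.replication and dfs.blocksize in Hadoop 2.x) and can also be set per file from the command line. The paths below are placeholders:

    hdfs dfs -setrep -w 1 /user/cloudera/benchmark.dat            # change replication of an existing file
    hdfs dfs -D dfs.replication=1 -put local.dat /user/cloudera/  # write a new file with a single replica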

The lecture showed how the system recovers when a data node fails: the name node stops receiving that node's heartbeat, stops all I/O to it, switches to a replicated block and creates another replica to maintain the replication factor. If the name node itself fails, the operator has to manually start a replacement name node using the backed-up metadata.


This was a very interesting lesson as for the first time I learned how HDFS deals with data node and name node failure. It also highlighted the tradeoffs between performance and robustness and some of the parameters that can be changed in HDFS.

HDFS Access: Command Line and APIs

I falsely assumed that there were only 2 lessons in Week 3. In fact there were 3! The last lesson talked about how one can access HDFS through the command line interface and through APIs, so that it can be used programmatically from Java and over Web REST. One can also move data into and out of HDFS with Apache Flume (collecting streaming data) and Apache Sqoop (bulk transfer of data between Hadoop and structured data stores such as relational databases).

We practiced some command-line HDFS commands such as ls (directory listing), mkdir (create a subdirectory), put (store data in HDFS) and get (retrieve data from HDFS to a local directory). We also looked at some HDFS administration report-generating commands.
We saw examples of the HDFS classes needed and the additional classes and methods available. We also looked at implementing certain commands through the Web REST API using 'curl', and saw that they are equivalent to the 'get', 'put' and 'ls' commands in the command line interface; a rough side-by-side sketch follows.
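
Roughly, the correspondence looks like this. The host, port and paths are placeholders, and the WebHDFS HTTP port differs between Hadoop versions and distributions:

    hdfs dfs -ls /user/cloudera                     # directory listing
    hdfs dfs -mkdir /user/cloudera/newdir           # create a subdirectory
    hdfs dfs -put localfile.txt /user/cloudera/     # store a local file in HDFS
    hdfs dfs -get /user/cloudera/file.txt .         # retrieve a file to the local directory
    hdfs dfsadmin -report                           # administration/usage report

    # the same kinds of operations through the WebHDFS REST API with curl
    curl -i "http://namenode-host:50070/webhdfs/v1/user/cloudera?op=LISTSTATUS"             # ~ ls
    curl -i -L "http://namenode-host:50070/webhdfs/v1/user/cloudera/file.txt?op=OPEN"       # ~ get
    curl -i -X PUT "http://namenode-host:50070/webhdfs/v1/user/cloudera/file.txt?op=CREATE" # ~ put (the name node redirects to a data node for the actual write)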
Learned some interesting lingo, such as let's 'spin up the VM.' I had a hard time working on the Java section, as I have not done much Java programming, and the REST API is also alien to me.


In the end I took the quiz and got 5 out of 6 right – not bad! Onward and upward to MapReduce.