Tuesday, May 17, 2016

Week 4: MapReduce and the Challenging yet Successful Week Coding a Mapper/Reducer for a JOIN Operation

Week 4 – MapReduce


KEY TAKEAWAYS


  1. Learned about the MapReduce framework on Hadoop.
  2. We looked at a simple word count program to understand how a mapper and a reducer work, and how the results differ with different numbers of reducers.
  3. We learnt about JOIN, where you take two sets of data and join them on common key values to get meaningful insights out of the data.
  4. There was a (what now feels like) pretty easy assignment that demonstrated a JOIN for calculating network viewership by show. We had to code the mapper and reducer in Python and also devise the algorithm. After several days of struggle I was finally able to make it work.
  5. Sometimes bugs/difficulties are a blessing, as they really force you to learn all the tricks of the trade for debugging code, really understand what each line in the example code does, and also marshal resources.
  6. There isn’t a lot of TA support on the Coursera online courses.

General Info on MapReduce


MapReduce is a key framework for processing large amounts of unconnected data that is relatively static. Hadoop does this by distributing the data across a number of nodes and running a map function on each piece, which assigns a key and a value to every record to create a pair.

In this lesson we studied how the map function works and then, once all the data is mapped, how the framework shuffles and sorts the pairs by key so that a reducer can sum up the values for each key. We applied this to a simple example: counting the words in a Star Wars line.
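As a rough sketch of the three steps described above (not the course's actual code), the whole flow can be simulated in plain Python. The function names and the sample line are placeholders chosen for illustration, and `shuffle_and_sort` is only a stand-in for what Hadoop does between the map and reduce phases.

```python
from itertools import groupby

def mapper(lines):
    """Map step: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.strip().split():
            yield word, 1

def shuffle_and_sort(pairs):
    """Stand-in for Hadoop's shuffle: sort so equal keys sit next to each other."""
    return sorted(pairs, key=lambda kv: kv[0])

def reducer(sorted_pairs):
    """Reduce step: sum the values for each run of identical keys."""
    for word, group in groupby(sorted_pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

def word_count(lines):
    """Chain map, shuffle/sort and reduce on one machine."""
    return dict(reducer(shuffle_and_sort(mapper(lines))))

# Placeholder input line, not necessarily the one used in the course.
sample = ["a long time ago in a galaxy far far away"]
counts = word_count(sample)
```

The reducer relies on its input being sorted by key, which is why the sort step has to come first.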

We ran the Word Count program by writing the mapper and reducer in Python and specifying those Python files as the mapper and reducer in the Hadoop Streaming command.
We ran the word count with a single reducer, with zero reducers (it just does the mapping) and with two reducers (which generates two output files).
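To see why the number of reducers changes the output, here is a small local simulation, assuming keys are assigned to reducers by hashing the key modulo the reducer count. `run_job` and `partition` are hypothetical helpers, and `zlib.crc32` is only a deterministic stand-in for Hadoop's own hash function.

```python
import zlib
from collections import defaultdict

def map_words(lines):
    """Emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.strip().split():
            yield word, 1

def partition(key, num_reducers):
    """Pick a reducer for a key (crc32 is an assumed stand-in hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def run_job(lines, num_reducers):
    """Return one output 'file' per reducer, or the raw map output if there are none."""
    pairs = list(map_words(lines))
    if num_reducers == 0:
        # Map-only job: the pairs are written out unsorted and unsummed.
        return [pairs]
    parts = [defaultdict(int) for _ in range(num_reducers)]
    for word, count in pairs:
        parts[partition(word, num_reducers)][word] += count
    return [dict(sorted(part.items())) for part in parts]

sample = ["a long time ago in a galaxy far far away"]
map_only = run_job(sample, 0)     # one list of (word, 1) pairs
one_reducer = run_job(sample, 1)  # one output with every word's total
two_reducers = run_job(sample, 2) # two outputs, each holding a subset of the words
```

With two reducers every word still lands in exactly one output, so combining the two outputs gives the same totals as the single-reducer run.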

I had reviewed the lectures on Monday and ran the code on Thursday. I took the quiz on Thursday and was happy to see that I had retained most of my knowledge, getting 7/7 right.

Time to fool myself is over – onward to more complex examples of Map/Reduce.


 JOIN and Programming Assignment


We learnt how to join data from two separate data sets on one common key to get some meaningful insight. As part of the assignment we ran an example program with a mapper and a reducer, to practice testing the mapper and reducer code individually and to understand the Hadoop Streaming command syntax. The final assignment was to do a join that built on the previous join examples. There were two sets of data: one listed TV programs with viewership data, and the second listed the TV programs carried by many channels. We had to count the total viewership by channel and then summarize it for a particular channel, in this case “ABC.”
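A minimal sketch of this kind of join, under some loud assumptions: both inputs are taken to be `show,value` lines, where the value is either a viewer count or a channel name, and the output is the total viewers per show for shows that air on ABC. The show names and numbers below are made up for illustration and are not the assignment's data or solution.

```python
from itertools import groupby

def mapper(lines):
    """Emit (show, value); value is a viewer count or a channel name (assumed format)."""
    for line in lines:
        show, value = line.strip().split(",")
        yield show, value

def reducer(sorted_pairs, channel="ABC"):
    """For each show, sum the viewer counts and emit the total only if it airs on `channel`."""
    for show, group in groupby(sorted_pairs, key=lambda kv: kv[0]):
        total = 0
        on_channel = False
        for _, value in group:
            if value.isdigit():
                total += int(value)   # record from the viewership data set
            elif value == channel:
                on_channel = True     # record from the channel data set
        if on_channel:
            yield show, total

# Made-up sample records for the two data sets.
viewers = ["Show_A,150", "Show_B,1200", "Show_A,75", "Show_C,30"]
channels = ["Show_A,ABC", "Show_B,ABC", "Show_B,CBS", "Show_C,NBC"]

# Sorting stands in for Hadoop's shuffle, which groups records by show.
pairs = sorted(mapper(viewers + channels))
abc_totals = dict(reducer(pairs))
```

The common key here is the show name: it is the only thing both data sets share, so it is what the mapper emits as the key.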

It took me a couple of days (many hours each day) just to understand the assignment and figure out the structure of the mapper and reducer programs and how they function in the MapReduce framework. I had a very busy weekend and the assignment was due at midnight on Sunday. I worked till midnight but still couldn’t get the totals of the first two shows to match up with the hint provided in the assignment. It was a long and frustrating evening but I also learnt a lot.

On Monday I was already a day late, but I kept plodding along and my totals kept coming up short. Finally, on Tuesday, I created the data file and the mapper output, exported them to Excel, and verified that the totals from my reducer were the same as what Excel was calculating.


I then made sure that the data generated (using a pseudo-random number data generator) was identical no matter what platform I used. My nephew Pranav Dandekar (who I suspect has done a lot of MapReduce work in his life) also volunteered to verify the data generator on his machine as well as on AWS. So I ran the data on my Virtual Machine (VM), my Mac and also on my Raspberry Pi. Same data, so I knew the problem was with my mapper. After looking at my code for the 1000th time I realized that I was only outputting 3-digit viewership numbers; as a result, the mapper was not outputting lines that had 4-digit viewer numbers. Once I fixed it everything went like clockwork. I submitted the result and got 100/100 instantly. All's well that ends well!
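For illustration only, here is a hypothetical reconstruction of that kind of bug (not the actual assignment code): a mapper that only accepts 3-digit counts silently drops 4-digit ones, so the totals come up short without any error message. The `show,count` line format and the sample numbers are assumptions.

```python
def buggy_mapper(lines):
    """Hypothetical bug: only lines with exactly 3-digit counts are emitted."""
    for line in lines:
        show, value = line.strip().split(",")
        if value.isdigit() and len(value) == 3:
            yield show, int(value)

def fixed_mapper(lines):
    """Accept a numeric count of any length."""
    for line in lines:
        show, value = line.strip().split(",")
        if value.isdigit():
            yield show, int(value)

def total(pairs):
    """Sum the counts, as a reducer would for a single show."""
    return sum(count for _, count in pairs)

# Made-up records for one show; the middle one has a 4-digit count.
data = ["Show_A,150", "Show_A,1200", "Show_A,975"]
buggy_total = total(buggy_mapper(data))  # misses the 1200 record
fixed_total = total(fixed_mapper(data))
```

Comparing the number of lines going into the mapper with the number coming out is a quick way to catch this kind of silent drop.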

(See the picture below, from the blog at http://xiaochongzhang.me, which demonstrates the MapReduce steps for a simple word count.)
