Ok, to be honest, I am pretty tired of writing weekly summaries
of what I learned, especially with the heavy programming workload the past two
weeks have entailed. But I think for the sake of completeness I will post a few
highlights.
- Spark is an interactive environment, much more so than Hadoop Streaming, and provides a superior framework to MapReduce (which we learnt last week). This is especially true when iterative algorithms are involved, such as in machine learning. It also supports a number of languages such as Python and Scala rather than just Java.
- We looked at the Spark architecture in a Cloudera VM on Amazon EC2, for example. There is a cluster manager and a bunch of worker nodes, each of which runs Python executors over many partitions of the data.
- We learnt to fire off PySpark, Spark's interactive environment built on the Python command-line interpreter.
- Lesson 2 was all about Resilient Distributed Datasets (RDDs). These are Spark's data constructs, which store blocks of data across the cluster. They are resilient in the sense that they can be recreated if a node fails. RDDs are immutable; one creates a new RDD by applying transform operations.
- We talked about transform operations, or transformations. There are narrow transformations, which are light on network usage, such as map, flatMap, coalesce and select. And there are wide transformations, such as 'groupByKey', 'reduceByKey' and 'repartition'. These cause huge movement of data between nodes and are memory/compute intensive. As part of this lecture's homework we did a simple 'Join' operation on two sets of data in Spark using the PySpark environment. I had no trouble passing this.
- The third lesson in the series was all about job scheduling, actions, caching data and the use of shared variables. We learned about the Directed Acyclic Graph (DAG) scheduler and how it allows Spark to keep track of the execution pipeline. Interestingly, most Spark transformations are lazy in the sense that they are not triggered until certain actions, such as collect(), are called.
- The final assignment was a programming assignment where we had to go back to the channel, shows and viewers data from week 4 and redo the assignment using the Spark interactive framework. This assignment was tough and took the better part of my Sunday. But in the end I was able to upload my answer – the total viewership for a particular channel – and get 100/100. In getting this one right I had to be shameless and try a lot of things many times, ending in 'epic fails.' I also had to repeatedly go back to past lessons and identify course material relevant to the problem at hand. In the end this assignment was way more complicated than should be expected for a business leader to solve – but I am glad I raised my work to meet the expectations of this week.
- I still have a final quiz to pass on Section 3. After numerous tries (and repeated review of course notes and videos) I am only able to get 4 out of 6 right. I need at least 5/6. I have (I think) until the end of the week to get this right so that I can pass the course with 16/16 assignments. So far I have 15/16 done.
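Since I can't paste a whole cluster into a blog post, here is a plain-Python sketch (not Spark code, and no cluster needed) of the RDD immutability idea from Lesson 2: a transformation returns a new dataset and leaves the original untouched, just as rdd.map(...) returns a brand-new RDD.

```python
# Plain-Python analogy (NOT Spark code): a "transformation" returns a
# new collection and leaves the original alone, just as rdd.map(...)
# returns a brand-new RDD instead of mutating the old one.
views = [1, 4, 2, 7]              # pretend this is an RDD of view counts

doubled = [v * 2 for v in views]  # analogous to rdd.map(lambda v: v * 2)

print(views)    # original is unchanged: [1, 4, 2, 7]
print(doubled)  # new dataset: [2, 8, 4, 14]
```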
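To make the narrow-vs-wide distinction concrete, here is a plain-Python sketch of what 'reduceByKey' does logically (the key names are made up, and this is not actual Spark code): values sharing a key get grouped and folded together, and on a real cluster that grouping step is the expensive shuffle between nodes.

```python
from collections import defaultdict
from functools import reduce

# Plain-Python sketch of reduceByKey semantics (NOT Spark code):
# group values by key, then fold each group with the supplied function.
# On a real cluster the grouping step is the costly shuffle across nodes,
# which is what makes reduceByKey a "wide" transformation.
def reduce_by_key(pairs, fn):
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: reduce(fn, values) for key, values in grouped.items()}

# Hypothetical (channel, count) pairs:
pairs = [("BAT", 3), ("CNO", 5), ("BAT", 4), ("CNO", 1)]
print(reduce_by_key(pairs, lambda a, b: a + b))  # {'BAT': 7, 'CNO': 6}
```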
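The laziness point from Lesson 3 has a neat everyday analogy: Python's own map() is lazy too. Nothing runs until you consume the result, much like Spark transformations do no work until an action such as collect() fires the DAG. (Again, plain Python below, not Spark.)

```python
calls = []

def track(x):
    calls.append(x)  # record when the function actually executes
    return x * x

lazy = map(track, [1, 2, 3])  # "transformation": nothing has run yet
print(calls)                  # [] -- no work done so far

result = list(lazy)           # "action": forces evaluation, like collect()
print(calls)                  # [1, 2, 3] -- now the work happened
print(result)                 # [1, 4, 9]
```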
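For flavor, here is roughly the shape of the final-assignment computation, sketched in plain Python with made-up records (the real assignment joined the course's channel/show/viewer files in PySpark; the show names, channel code and counts below are all hypothetical): join each show to its channel, then sum the views for one channel.

```python
# Hypothetical miniature of the assignment's logic (plain Python, made-up
# data): a show -> channel mapping plus per-show view counts; the answer
# is total viewership for one channel after "joining" on the show name.
show_channel = {
    "Morning_Sports": "DEF",
    "Evening_News": "BAT",
    "Late_Cooking": "BAT",
}
show_views = [
    ("Morning_Sports", 21),
    ("Evening_News", 10),
    ("Late_Cooking", 5),
    ("Evening_News", 3),
]

target = "BAT"
total = sum(views for show, views in show_views
            if show_channel.get(show) == target)
print(total)  # 18
```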
Overall Observation
Overall I learnt a lot in this course and really
pushed my ability to code and to apply the course material to actual programming
problems. Even after such grueling assignments, it is humbling to know that I
am still far from having a working knowledge of Spark/Hadoop for rigorous data
modeling. I just know what questions to ask!
