Wednesday, May 4, 2016

Week 3: Introduction to Hadoop Distributed File System (HDFS)



HDFS Architecture 



Started off with a review of the HDFS architecture: what the design considerations for HDFS are (high scalability, robustness to hardware failure, and portability across software/hardware platforms) and how they are implemented (a name node/data node architecture with data replication).

We looked at the impact of file size on processing efficiency. A large number of small files is bad for performance: it increases network traffic, requires more name node memory, and dramatically increases the number of map tasks.
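The small-files point can be seen with some back-of-envelope arithmetic (my own sketch, assuming a 128 MB block size and that each file occupies at least one block and one name node object):

```shell
# Same 1 GB of data stored two ways; each map task typically processes one block.
BLOCK_MB=128
LARGE_FILE_MB=1024
SMALL_FILES=10000

# One 1 GB file: ceil(1024 / 128) = 8 blocks -> roughly 8 map tasks
large_blocks=$(( (LARGE_FILE_MB + BLOCK_MB - 1) / BLOCK_MB ))

# 10,000 tiny files: each takes at least one block (and one name node
# metadata entry) no matter how little data it holds -> 10,000 map tasks
small_blocks=$SMALL_FILES

echo "large file blocks: $large_blocks"
echo "small files blocks: $small_blocks"
```

Same data, three orders of magnitude more blocks to track and maps to schedule.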

We studied how a write operation to HDFS takes place through the name node and data nodes. The lecture stressed the importance of replication and how a pipeline process speeds up write tasks. It similarly covered how a read operation is accomplished.

HDFS Performance, Tuning Parameters and Robustness


We then looked into HDFS performance and robustness.  HDFS allows one to set parameters such as replication factor (how many copies of the block data are made for redundancy) and block size (typically 64 MB but could be 128 MB).

The default replication factor is 3. If replication is reduced, you gain performance and use less storage, but you give up robustness. If the replication factor is 1, you don't have any backup if a block is corrupted or a data node fails for some reason. The lecture showed an example where reducing the replication factor from 3 to 1 increased the data rate from 779 MB/sec to 2995 MB/sec.
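For reference, the replication factor can be set per file from the HDFS command line. A minimal sketch (the paths are my own placeholders, and this of course needs a running cluster):

```shell
# Write a file with replication factor 1 instead of the default 3
hdfs dfs -D dfs.replication=1 -put bigdata.dat /user/demo/bigdata.dat

# Raise an existing file's replication back to 3 (-w waits until done)
hdfs dfs -setrep -w 3 /user/demo/bigdata.dat

# Confirm: %r in the stat format string prints the replication factor
hdfs dfs -stat %r /user/demo/bigdata.dat
```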

The lecture showed how the system recovers when a data node fails: the name node stops receiving that node's heartbeat, so it stops all I/O to the node, switches reads to a replicated block, and creates another replica to maintain the replication factor. If the name node itself fails, the operator has to manually start a duplicate name node from the backed-up metadata.


This was a very interesting lesson as for the first time I learned how HDFS deals with data node and name node failure. It also highlighted the tradeoffs between performance and robustness and some of the parameters that can be changed in HDFS.

HDFS Access Command Line and APIs

I falsely assumed that there were only 2 lessons in Week 3. In fact there were 3! The last lesson covered how one can access HDFS through the command line interface and through APIs, so that it can be used programmatically from Java and via the Web REST interface. We can also move data into HDFS through Apache Flume (collecting streaming data) and Apache Sqoop (bulk transfer of data between Hadoop and relational databases).
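To make the Sqoop side concrete, a bulk import into HDFS looks roughly like this (my own sketch; the MySQL host, database, table, and paths are all hypothetical):

```shell
# Pull the 'orders' table from a relational database into HDFS files
sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --username demo --password-file /user/demo/.password \
  --table orders \
  --target-dir /user/demo/orders
```

Under the hood Sqoop runs this as a MapReduce job, so the import is parallelized across the cluster.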

We practiced some command line HDFS commands such as ls (directory listing), mkdir (create a subdirectory), put (store data in HDFS from a local directory) and get (retrieve data from HDFS to a local directory). We also looked at some HDFS administration report-generating commands.
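For my own notes, the commands we practiced look like this (paths are placeholders, and a running cluster is assumed):

```shell
hdfs dfs -ls /                                 # directory listing
hdfs dfs -mkdir /user/demo/input               # create a subdirectory
hdfs dfs -put words.txt /user/demo/input/      # store: local -> HDFS
hdfs dfs -get /user/demo/input/words.txt .     # retrieve: HDFS -> local

# Administration report: capacity, live/dead data nodes, per-node usage
hdfs dfsadmin -report
```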
We saw examples of the Java classes needed for HDFS access, plus additional classes and methods available. We also looked at implementing certain operations through the Web REST API using the 'curl' tool, and saw that they are equivalent to the 'get', 'put' and 'ls' commands in the command line interface.
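The REST (WebHDFS) equivalents, as best I can reconstruct them — the name node host and port here are placeholders for a real cluster's HTTP address:

```shell
NN=http://namenode:50070   # name node HTTP address (placeholder)

# ls equivalent
curl -i "$NN/webhdfs/v1/user/demo?op=LISTSTATUS"

# get equivalent (-L follows the redirect to the data node holding the data)
curl -i -L "$NN/webhdfs/v1/user/demo/words.txt?op=OPEN"

# put equivalent: two steps — the name node replies with a redirect to a
# data node URL, and the file body is then uploaded to that URL
curl -i -X PUT "$NN/webhdfs/v1/user/demo/new.txt?op=CREATE"
curl -i -X PUT -T new.txt "<datanode-url-from-Location-header>"
```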
Learned some interesting lingo such as let's 'spin up the VM.' I had a hard time with the Java section as I have not done much Java programming, and the REST API is also alien to me.


In the end I took the quiz and got 5 out of 6 right – not bad! Onward and upward to MapReduce.
