
HDFS Architecture
We started off with a review of the HDFS architecture: what are the design considerations for HDFS
(high scalability, robustness to hardware failure, portability across
software/hardware platforms), and how are they implemented (using a name
node/data node architecture with data replication)?
We looked at the impact of file size on processing efficiency. A
large number of small files is bad for performance: it increases network
traffic, requires more name node memory, and dramatically increases the number
of map tasks.
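To make the small-files point concrete, here is a rough back-of-the-envelope sketch (the sizes are my own assumed numbers, not from the lecture): each file occupies at least one block, and each block typically becomes one name node entry and one map task, so many tiny files mean far more tasks and metadata than one large file holding the same data.

```shell
# Illustrative arithmetic only (assumed sizes, not from the lecture).
total_mb=10240                              # 10 GB of input data
block_mb=128                                # a common HDFS block size

small_files=$total_mb                       # stored as 10240 one-MB files:
                                            # ~10240 blocks, name node entries, map tasks
large_file_blocks=$((total_mb / block_mb))  # stored as one 10 GB file: only 80 blocks

echo "map tasks: $small_files (small files) vs $large_file_blocks (one big file)"
```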
We studied how a write operation to HDFS takes place
using the name node and data nodes. The lesson stressed the importance of replication, using
a pipeline process to speed up write tasks. It similarly covered how a read
task is accomplished.
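The write pipeline reminded me of Unix 'tee': each data node in the pipeline keeps its own copy of the block while streaming the data on to the next node, so the client only sends the data once. A toy analogy in plain shell (this is just an illustration, not anything HDFS actually runs):

```shell
# Toy analogy for the replication pipeline: the writer emits the data once,
# and each stage stores a copy while passing the stream downstream.
echo "block data" | tee replica1.txt | tee replica2.txt > replica3.txt

# All three "data nodes" now hold identical copies.
cat replica1.txt replica2.txt replica3.txt
```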
HDFS Performance, Tuning Parameters and Robustness
We then looked into HDFS performance and robustness. HDFS allows one to set parameters such as the
replication factor (how many copies of each block are made for redundancy)
and the block size (typically 64 MB, but it could be 128 MB).
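For reference, both parameters can be inspected and changed from the command line. The commands below are real HDFS commands, but the paths and values are my own hypothetical examples, and they assume a running cluster:

```shell
# Show the configured block size and default replication factor.
hdfs getconf -confKey dfs.blocksize
hdfs getconf -confKey dfs.replication

# Change the replication factor of an existing file (path is hypothetical);
# -w waits until the change is complete.
hdfs dfs -setrep -w 2 /user/me/data/big.csv

# Override the block size for a single upload (value in bytes: 128 MB).
hdfs dfs -D dfs.blocksize=134217728 -put big.csv /user/me/data/
```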
The default replication factor is 3. If replication is
reduced, you gain performance and use less storage, but you give up robustness.
If the replication factor is one, then you don't have any backup in case a file is
corrupted or a data node fails for some reason. The lecture showed an example where
the replication factor was reduced from 3 to 1 and the data rate increased from
779 MB/sec to 2995 MB/sec.
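The storage side of that tradeoff is easy to quantify (the numbers here are my own illustrative ones): physical disk usage scales linearly with the replication factor.

```shell
# Illustrative arithmetic (assumed numbers, not from the lecture).
logical_mb=1000                 # 1000 MB of logical data
rep3_mb=$((logical_mb * 3))     # replication factor 3 -> 3000 MB on disk
rep1_mb=$((logical_mb * 1))     # replication factor 1 -> 1000 MB, but no redundancy
echo "disk used: $rep3_mb MB at rep=3 vs $rep1_mb MB at rep=1"
```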
The lecture showed how the system recovers when a data node
fails: the name node stops detecting the node's heartbeat, stops all I/O to it,
switches to a replicated block, and creates another replica to maintain the
replication factor. If the name node fails, the operator has
to manually start a duplicate name node using backed-up metadata.
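A couple of real administration commands are handy for watching this behavior (again assuming a running cluster; the path below is hypothetical): dfsadmin reports live and dead data nodes, and fsck shows whether any blocks are under-replicated.

```shell
# Summarize the cluster: capacity, live data nodes, dead data nodes.
hdfs dfsadmin -report

# Check the health of a directory's blocks, including under-replicated ones.
hdfs fsck /user/me/data -files -blocks -locations
```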
This was a very interesting lesson, as for the first time I
learned how HDFS deals with data node and name node failures. It also
highlighted the tradeoffs between performance and robustness and some of the
parameters that can be changed in HDFS.
HDFS Access Command Line and APIs
I falsely assumed that there were only 2 lessons in Week 3.
In fact there were 3! The last lesson talked about how one can access HDFS
through the command line interface and APIs, so that it can be accessed programmatically
through Java and web REST calls. We can also access HDFS through Apache Flume
(for collecting streaming data) and Apache Sqoop (for bulk transfer of data between
Hadoop and relational databases).
We practiced some command line HDFS commands such as ls
(directory listing), mkdir (create a subdirectory), put (store data in HDFS), and get
(retrieve data from HDFS to a local directory). We also looked at some HDFS
administration report-generating commands.
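For my own notes, a typical session with those commands might look like this (the paths and file names are made up, and this assumes a running cluster):

```shell
# Create a directory in HDFS and copy a local file into it.
hdfs dfs -mkdir -p /user/me/demo
hdfs dfs -put localfile.txt /user/me/demo/

# List the directory, then copy the file back to the local disk.
hdfs dfs -ls /user/me/demo
hdfs dfs -get /user/me/demo/localfile.txt ./copy-of-localfile.txt
```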
We saw examples of the HDFS classes needed and the additional
classes and methods available. We also looked at the implementation of certain
commands using the Web REST API via the 'curl' tool, and saw that they are
equivalent to the 'get', 'put', and 'ls' commands in the command line
interface.
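For example, the WebHDFS REST equivalents of 'ls', 'get', and 'put' look roughly like this with curl (the host name "namenode" and the port are my assumptions; newer Hadoop releases serve WebHDFS on 9870, older ones on 50070):

```shell
# 'ls' equivalent: list a directory's contents.
curl -i "http://namenode:9870/webhdfs/v1/user/me/demo?op=LISTSTATUS"

# 'get' equivalent: read a file; -L follows the redirect to a data node.
curl -i -L "http://namenode:9870/webhdfs/v1/user/me/demo/file.txt?op=OPEN"

# 'put' equivalent: CREATE is a two-step PUT; this first call returns a
# redirect pointing at the data node that will actually receive the data.
curl -i -X PUT "http://namenode:9870/webhdfs/v1/user/me/demo/new.txt?op=CREATE"
```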
I learned some interesting lingo, such as let's 'spin up the VM.' I had a hard time working on the Java section,
as I have not done much Java programming, and the REST API is also alien to me.
In the end I took the quiz and got 5 out of 6 right, not bad! Onward and upward to MapReduce.