Big Data Analytics – Week 1.
Key Takeaways
- Data analysis requires you to access, manipulate/transform, and query/explore data.
- For data analysis you need something more interactive and versatile than MapReduce.
- There are a number of higher-level tools that sit on top of Hadoop/HDFS, such as HBase, Hive, Pig, Spark, and Splunk.
- In week 1 we went into great detail on HBase: why it is relevant and what its architecture is. We did some data exploration using Hue, which uses HBase to explore a large dataset.
General Introduction
Data analysis involves analyzing data, gaining insights, drawing inferences/decisions, and ultimately taking some action. Data analysis is a search for meaningful things, which requires easy access to the data, good data-manipulation functions, and the ability to explore/query the data. The basic MapReduce framework doesn't allow us to do that, so we need higher-level tools that sit on top of Hadoop and let us work interactively with the data. A database management system (DBMS) is useful for structured data, for transactional uses, and for SQL. Hadoop is used for unstructured data, where the data is growing and the schema must stay flexible. Over time the traditional tools and the Hadoop overlay tools have come to overlap. This course is about tools other than MapReduce that help us access, transform, and query/explore Big Data.
HBase – stores data in 'big tables', modeled on Google's Bigtable. HBase organizes data in a column-oriented format, with indexing on a row and column basis rather than a single primary key.
Hive – a metastore that holds the schemas of the data in HDFS. In that way it resembles a DBMS and allows for an SQL-like interface.
Pig – a scripting language that provides more flexibility than SQL.
Spark – an engine with a programming interface that works directly with HDFS and provides a richer set of programming functions than MapReduce.
Splunk – useful for working with machine data, such as web logs.
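The contrast between Spark and basic MapReduce can be hinted at in plain Python: Spark exposes chainable operations such as map, filter, and reduce rather than a single map/reduce pair. This is a conceptual sketch only (made-up data, not the actual PySpark API, and nothing is distributed here):

```python
from functools import reduce

# Conceptual sketch of a Spark-style chained pipeline in plain Python;
# real Spark would distribute each of these steps across a cluster.
visits = [("us", 3), ("uk", 1), ("us", 4), ("de", 2)]  # (country, count)

total_us = reduce(
    lambda a, b: a + b,                               # reduce: sum the counts
    map(lambda kv: kv[1],                             # map: keep only the count
        filter(lambda kv: kv[0] == "us", visits)),    # filter: keep US rows
    0,
)
print(total_us)  # 7
```

The point is the composability: each stage feeds the next, which is closer to how you explore data interactively than writing a full MapReduce job per question.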
HBase
Hadoop is good for sequential access and batch processing. HBase adds fast random access to big data and allows updates. HBase files are stored on HDFS, and the data is organized in big row-and-column tables. HBase is not a relational database; it is sparse and highly distributed. HBase datasets can range from 100 TB to petabyte scale. HBase is organized along three dimensions:
- Rows
- Columns
- Timestamps (storing changes over time)
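These three dimensions can be sketched with a plain Python dictionary, where each cell is addressed by (row key, column, timestamp). This is a conceptual illustration only, not the HBase client API; the names `put` and `get_latest` are just chosen to echo HBase's read/write operations:

```python
# Conceptual sketch of HBase's data model: a cell is addressed by
# (row key, "family:qualifier", timestamp), and each cell keeps versions.
table = {}

def put(row, column, timestamp, value):
    """Store a versioned value in the cell (row, column)."""
    table.setdefault((row, column), {})[timestamp] = value

def get_latest(row, column):
    """Return the most recent version of a cell, like a default HBase read."""
    versions = table.get((row, column), {})
    return versions[max(versions)] if versions else None

put("com.example", "visits:US", 1000, 42)
put("com.example", "visits:US", 2000, 57)   # newer version of the same cell
print(get_latest("com.example", "visits:US"))  # 57 (the latest version wins)
```

Older versions are retained, which is what "stored changes over time" means: a read by default returns the newest timestamp, but earlier versions remain addressable.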
HBase Data Model in detail
Columns are grouped into column families, and rows run across the column families. A deployment has several tables, each with multiple column families. HBase is designed to run on tens if not thousands of servers, and to deal with the challenges of distributed computing: coordination, management, data distribution, and network latency. The architecture has three components:
- Regions – a subset of a table's rows (a horizontal range)
- Region servers – the slaves that serve regions
- Master server – coordinates the actions of the region servers
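How a row key maps to a region can be sketched as a sorted-range lookup. The split points below are hypothetical (real HBase chooses and rebalances splits automatically), but the idea is the same: each region covers a contiguous range of row keys, and a region server hosts it:

```python
from bisect import bisect_right

# Hypothetical region split points: region i holds row keys in
# [splits[i-1], splits[i]). Real HBase manages splits automatically.
splits = ["g", "p"]  # three regions: [-inf,"g"), ["g","p"), ["p",+inf)
region_servers = ["rs1", "rs2", "rs3"]  # one region per server in this sketch

def region_for(row_key):
    """Return the server hosting the region whose range covers row_key."""
    return region_servers[bisect_right(splits, row_key)]

print(region_for("apple.com"))   # rs1
print(region_for("google.com"))  # rs2
print(region_for("yahoo.com"))   # rs3
```

Because row keys are stored sorted, locating the right region is a cheap range lookup, which is what makes random reads by row key fast at scale.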
We looked at a large dataset of web visits to various domains (rows) from different countries (columns). We were asked to select data and identify the countries or domains with the highest or lowest number of visitors. This was an exercise to familiarize ourselves with data selection/exploration in HBase.
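The kind of query in that exercise can be mimicked in plain Python over a tiny row/column table (made-up domain names and visit counts, purely to illustrate the shape of the question, not the actual exercise data or the HBase API):

```python
# Toy version of the exercise: rows are domains, columns are countries,
# values are visit counts (made-up data).
visits = {
    "alpha.example": {"US": 120, "UK": 30},
    "beta.example":  {"US": 15,  "DE": 90},
}

# Country with the highest total number of visitors across all domains.
totals = {}
for row in visits.values():
    for country, n in row.items():
        totals[country] = totals.get(country, 0) + n

top_country = max(totals, key=totals.get)
print(top_country, totals[top_country])  # US 135
```

Swapping `max` for `min` gives the lowest-traffic country, and iterating over `visits.items()` instead gives the same question per domain.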