Tuesday, June 7, 2016

Introduction to Hive – week 2


Key Takeaways


  • Hive is a powerful infrastructure tool, sitting on top of Hadoop, that lets us query large datasets in an interactive environment.
  • Hive lets us query very large structured and unstructured datasets using SQL-like commands.
  • The commands are easy to use and do not require programming.
  • We used Hive to analyze data from San Francisco's bike-share program. This was the first time we imported public data in the form of CSV files and created tables in Hive. We worked in the Hue environment.

Hive is an infrastructure tool that allows us to process structured and unstructured data in Hadoop. Hive sits on top of HDFS and was originally created at Facebook. It provides an interactive environment to Extract, Transform and Load (ETL) Big Data.

Why Hive?

If you run MapReduce on HDFS, you have to run it in batch. The map and reduce functions are written in Java, and the process is slow and not interactive. Hive provides an SQL-like environment with simple commands that can be run interactively by non-Java programmers who are more interested in looking at data than in programming mappers. Hive is written for Online Analytical Processing (OLAP). It is fast, scalable and extensible, and it allows custom mappers and reducers if required.

We discussed the sequence of execution steps in Hive, from the UI to the Driver, the Compiler, the Metastore and the Execution Engine. The execution engine interacts with Hadoop and returns the results. Hive queries can be executed locally or compiled into MapReduce jobs, and Hive can be run interactively or in batch mode.

Hive Data Structure


Hive tables are HDFS directories. Sub-directories within a table's directory are called partitions in Hive, and buckets are files under each partition. By organizing data into buckets you are able to sample large datasets and significantly reduce CPU usage.
Hive supports primitive types such as integers, floats, doubles and strings, as well as complex types such as lists and structs.
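
As a minimal sketch of how this looks in practice (the table and column names here are invented for illustration, not from the class), a partitioned, bucketed table can be declared and sampled like this:

    CREATE TABLE sales (
        sale_id INT,
        amount DOUBLE,
        customer_id STRING
    )
    PARTITIONED BY (sale_date STRING)            -- each partition is an HDFS sub-directory
    CLUSTERED BY (customer_id) INTO 16 BUCKETS;  -- each bucket is a file under the partition

    -- Sample the data by reading a single bucket instead of scanning the whole table
    SELECT * FROM sales TABLESAMPLE(BUCKET 1 OUT OF 16 ON customer_id);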

Hive comes with a number of table commands such as CREATE, SHOW, ALTER and DROP TABLE. It lets you load data from HDFS or from a local file. Hive also lets you do a JOIN: one JOIN command can replace 60 or more lines of Java code. JOIN is easy to use; it combines tables on fields that share a common value.
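
For example (the file path and table names below are assumptions for illustration, not the class exercise), loading data and joining two tables looks roughly like this:

    -- Load a CSV file from the local file system into an existing table
    LOAD DATA LOCAL INPATH '/tmp/employees.csv' INTO TABLE employees;

    -- One JOIN statement in place of pages of Java MapReduce code
    SELECT e.name, d.dept_name
    FROM employees e
    JOIN departments d ON (e.dept_id = d.dept_id);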

We did a number of Hive assignments. We used an example data table in the Hue environment and wrote SQL-like queries to select fields from a table based on certain criteria (such as Price > 10) and order them in descending order. We used these interactive queries to find the highest salary, the lowest salary, the most product you can buy for $10, and so on.
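
The queries looked roughly like this (the table and column names are approximations of the Hue sample data, not the exact schema):

    -- Products costing more than $10, most expensive first
    SELECT name, price
    FROM products
    WHERE price > 10
    ORDER BY price DESC;

    -- The highest salary in the sample table
    SELECT MAX(salary) FROM employees;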

We then moved on to a more elaborate exercise where we loaded public data from the San Francisco bike-share program. We used commands such as COUNT to calculate the number of people borrowing bikes at each station. The assignments were relatively easy after having gone through the Map and Reduce programs in Python in a previous course. See: http://www.bayareabikeshare.com/datachallenge
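
A sketch of that exercise, with assumed file locations and column names (the actual dataset has more fields than shown here):

    -- An external table over the trip CSV files uploaded to HDFS
    CREATE EXTERNAL TABLE trips (
        trip_id INT,
        duration INT,
        start_station STRING,
        end_station STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION '/user/hive/bikeshare/trips'
    TBLPROPERTIES ('skip.header.line.count'='1');

    -- Number of trips starting at each station, busiest first
    SELECT start_station, COUNT(*) AS num_trips
    FROM trips
    GROUP BY start_station
    ORDER BY num_trips DESC;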


Saturday, June 4, 2016

Big Data Analytics – Week 1.




Key Takeaways

  1. Data analysis requires you to access, manipulate/transform and query/explore data.
  2. For data analysis you need something more interactive and versatile than MapReduce.
  3. There are a number of higher-level tools that sit on top of Hadoop HDFS, such as HBase, Hive, Pig, Spark and Splunk.
  4. In week 1 we went into great detail on HBase, why it is relevant and what its architecture is. We did some data exploration on a large dataset using Hue, which sits on top of HBase.

 General Introduction


Data analysis involves analyzing data, gaining insights, drawing inferences/decisions and ultimately taking some action. Data analysis is therefore a search for meaningful things, which requires easy access to the data, good data manipulation functions and the ability to explore/query the data. The basic MapReduce framework doesn't allow us to do that, so we need higher-level tools that sit on top of Hadoop and let us work with the data interactively.

A database management system (DBMS) is useful for structured data, for transactional uses and for SQL. Hadoop is used for unstructured data, where the data keeps growing, and allows flexible mapping. Over time the traditional tools and the Hadoop overlay tools have come to overlap. This course is about the tools other than MapReduce that help us access, transform and query/explore Big Data.

HBase – stores data in 'big tables', based on Google's Bigtable. HBase organizes data in a column-oriented format, with indexing done on a row and column basis instead of with a single key.
Hive – maintains a metastore that holds the schemas of data in HDFS. In that way it is like a DBMS and allows for an SQL-like interface (see the sketch after this list).
Pig – a scripting language that provides more flexibility than SQL.
Spark – an engine for programming, with an interface that works directly with HDFS and a richer set of programming functions than MapReduce.
Splunk – useful for working with machine data, such as web logs.
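
As a small illustration of the metastore at work (the table name here is hypothetical), Hive answers schema questions through ordinary SQL-like commands:

    SHOW TABLES;       -- tables known to the metastore
    DESCRIBE sales;    -- column names and types recorded for one table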

HBase


Hadoop is good for sequential access and batch processing. HBase adds fast random access to big data and allows updates. HBase files are stored on HDFS. The data is organized in big row-and-column tables. HBase is not a relational database; it is sparse and highly distributed. HBase datasets can run from 100 TB to 1 PB or more. HBase is organized along three dimensions:
  • Rows
  • Columns
  • Time stamps (stored changes over time)

 HBase Data Model in detail


Columns are grouped into column families, and the rows run across the column families. There are several tables, and each table has multiple column families. HBase is designed to run on tens if not thousands of servers and to deal with the challenges of distributed computing: coordination, management, data distribution and network latency. There are three components in the architecture:
  • Regions – a subset of table rows, a horizontal range
  • Region servers – the slaves
  • Master server – coordinates the actions of the region servers

HBase depends on ZooKeeper. The most basic unit is a column, which holds a distinct value for each time stamp. Column keys are grouped into families, and each column family is stored as an HFile. A cell can hold multiple versions of data, and the data is stored in such a way that the most recent data is fastest to read.

We looked at a large dataset of web visits to various domains (rows) from different countries (columns). We were asked to select data and identify the countries or domains with the highest or lowest number of visitors. This was an exercise to familiarize ourselves with data selection/exploration in HBase.