Introduction to Hive – week 2
Key Takeaways
- Hive is a powerful infrastructure tool that lets us query large datasets interactively, sitting on top of Hadoop.
- Hive allows us to query very large structured and unstructured datasets using SQL-like commands.
- The commands are easy to use and do not require programming.
- We used Hive to analyze data from the San Francisco bike-share program. This was the first time we imported public data in the form of CSV files and created tables in Hive. We used the Hue environment.
Hive is an infrastructure tool that allows us to process
structured and unstructured data in Hadoop. Hive sits on top of HDFS. It was
originally created at Facebook. It provides an interactive environment to
Extract, Transform and Load (ETL) Big Data.
Why Hive?
If you run MapReduce directly on HDFS, you have to run it
in batch mode. The map and reduce functions
are written in Java, and the process is slow and not interactive. Hive provides
an SQL-like environment with simple commands that can be run interactively
by non-Java programmers who are more interested in looking at data than in
programming mappers. Hive is designed for Online Analytical Processing (OLAP).
It is fast, scalable and extensible, and it allows custom mappers and reducers if
required.
We discussed the sequence of execution steps in Hive, from
the UI to the Driver to the Compiler to the Metastore and the Execution engine. The
execution engine interacts with Hadoop and returns the results. Hive queries can
execute in local mode or as MapReduce jobs, and Hive can be run interactively
or in batch mode.
Hive Data structure
Hive tables are stored as HDFS directories. Sub-directories
within a table's directory are called partitions in Hive, and buckets are files
under each partition. By organizing data into buckets you can sample large
datasets and significantly reduce CPU usage.
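As a minimal sketch of how partitions and buckets are declared (the `trips` table, its columns, and the bucket count here are hypothetical, not from the course material):

```sql
-- Each value of dt becomes a sub-directory (partition) under the
-- table's HDFS directory; rows are hashed on user_id into 32
-- files (buckets) within each partition.
CREATE TABLE trips (
  trip_id  INT,
  user_id  INT,
  duration INT
)
PARTITIONED BY (dt STRING)
CLUSTERED BY (user_id) INTO 32 BUCKETS;

-- Bucketing enables cheap sampling: this reads roughly 1 of the
-- 32 buckets instead of scanning the whole table.
SELECT AVG(duration)
FROM trips TABLESAMPLE(BUCKET 1 OUT OF 32 ON user_id);
```

Because sampling works at the file level, Hive can skip most of the data entirely, which is where the CPU savings come from.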
Hive supports primitive types such as integers, floats,
doubles and strings, as well as complex types such as lists and structs.
Hive comes with a number of table commands such as CREATE,
SHOW, ALTER and DROP TABLE. It allows you to
load data from HDFS or from a local file. Hive also lets you do
JOINs: one JOIN command can replace 60 or more lines of Java code. A join is easy
to use; it combines rows from tables based on a common field value.
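A short sketch of these commands in sequence; the table names, columns and file path are assumptions for illustration, not from the assignments:

```sql
-- Create a table whose rows come from a comma-delimited file.
CREATE TABLE employees (name STRING, dept_id INT, salary DOUBLE)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- Load from a local file; drop the LOCAL keyword to load from HDFS.
LOAD DATA LOCAL INPATH '/tmp/employees.csv' INTO TABLE employees;

-- One JOIN replaces many lines of hand-written MapReduce Java:
-- rows are matched where the dept_id values agree.
SELECT e.name, d.dept_name
FROM employees e
JOIN departments d ON e.dept_id = d.dept_id;
```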
We did a number of Hive assignments. We used an example data
table in the Hue environment. We used SQL-like queries to select fields from a
table based on certain criteria (such as Price > 10) and ordered them in
descending order. We used these interactive queries to find the highest salary,
the lowest salary, or the most product you can buy for $10, etc.
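The queries above follow a common pattern. A sketch under assumed table and column names (`products`, `employees`, `price`, `salary` are illustrative, not the actual assignment schema):

```sql
-- Filter rows by a condition and sort in descending order.
SELECT product, price
FROM products
WHERE price > 10
ORDER BY price DESC;

-- The highest and lowest salary.
SELECT MAX(salary), MIN(salary) FROM employees;

-- The most units of any product you can buy for $10.
SELECT product, FLOOR(10 / price) AS units
FROM products
ORDER BY units DESC
LIMIT 1;
```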
We then moved to a more elaborate exercise where we loaded
public data from the San Francisco bike-share program. We used commands such as COUNT to calculate
the number of people borrowing bikes at each station. The assignments were
relatively easy after going through the map and reduce programs in Python in a
previous course. See: http://www.bayareabikeshare.com/datachallenge
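The per-station count amounts to a GROUP BY aggregation. A sketch assuming the bike-share CSV was loaded into a table named `trips` with a `start_station` column (both names are assumptions):

```sql
-- Number of trips starting at each station, busiest first.
SELECT start_station, COUNT(*) AS num_trips
FROM trips
GROUP BY start_station
ORDER BY num_trips DESC;
```

This single query plays the role of an entire map (emit station) and reduce (sum counts) pair written by hand in Python or Java.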