04 Bigdata Hive
Two problems motivated an additional data warehousing system on top of Hadoop: coping with the growing inflow of data, and the limited expressiveness of hand-written MapReduce programs.
It was difficult to develop MapReduce programs to express even simple queries over the data, so the data lacked expressiveness.
Hence, Hive came into the Big Data ecosystem.
What is HIVE?
Hive is a data warehouse system for Hadoop that facilitates ad-hoc queries and the analysis of large data sets stored in Hadoop.
HIVE FACTS
Hive provides an SQL-like flavour for Big Data, i.e. HiveQL (HQL). Because of that, it is a popular choice for Big Data analytics on the Hadoop platform.
It provides massive scale-out and fault-tolerance capabilities for data storage and processing on commodity hardware.
Because it relies on MapReduce for execution, Hive is batch-oriented and has high latency for query execution.
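To illustrate the SQL-like flavour, a minimal HiveQL query might look like the following sketch (the `page_views` table and its columns are hypothetical, not from the slides):

```sql
-- Hypothetical table: page_views(user_id INT, url STRING, view_time STRING)
-- Count page views per URL, exactly as one would in standard SQL
SELECT url, COUNT(*) AS views
FROM page_views
GROUP BY url
ORDER BY views DESC
LIMIT 10;
```

Under the hood, Hive compiles this statement into one or more MapReduce jobs, which is why even a small query pays batch-job latency.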
HIVE | Characteristics
Hive is a system for managing and querying data: it imposes a structured, table-like format on data stored in Hadoop.
It builds on two concepts:
It uses HDFS for the storage and retrieval of data.
Hive scripts use MapReduce for execution.
[Architecture diagram] JDBC/ODBC clients, the Web Interface, and the Command Line Interface connect through the Thrift Server; the MetaStore and the Driver (Compiler, Optimizer, Executor) sit on top of Hadoop (MapReduce + HDFS, with the JobTracker and NameNode).
METASTORE
The Metastore is the component that stores the system catalog and metadata about tables, columns, partitions, and so on.
Metadata is stored in a traditional RDBMS; by default, Apache Hive uses the embedded Derby database.
Derby is not compulsory, though: if you wish, you can configure any JDBC-compliant database, such as MySQL, instead.
With the default embedded setup, the metadata lives in a local metastore_db directory.
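The catalog kept by the Metastore can be inspected from HiveQL itself; a short sketch (the table name is hypothetical):

```sql
-- Show the metadata the Metastore keeps for a table:
-- columns, HDFS location, table type (MANAGED/EXTERNAL), SerDe, etc.
DESCRIBE FORMATTED page_views;

-- List the databases and tables registered in the catalog
SHOW DATABASES;
SHOW TABLES;
```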
DRIVER
The Hive Driver is the component that:
Manages the lifecycle of a Hive Query Language (HQL) statement as it moves through Hive;
Maintains a session handle and any session statistics.
It includes three basic components:
Compiler
Optimizer
Executor
QUERY COMPILER
The Query Compiler is a Driver component of Hive. It checks the HiveQL statement for errors; if no error is encountered, it compiles the statement into a Directed Acyclic Graph (DAG) of tasks.
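The plan the compiler produces can be inspected with EXPLAIN, which prints the stage DAG instead of executing the query (the table name is hypothetical):

```sql
-- Print the stage plan (DAG) the compiler generates for a query
EXPLAIN
SELECT url, COUNT(*) AS views
FROM page_views
GROUP BY url;
```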
QUERY OPTIMIZER
The Query Optimizer optimizes HiveQL scripts for faster execution.
It consists of a chain of transformations, so that the operator DAG resulting from one transformation is passed as input to the next transformation.
QUERY EXECUTOR
The Hive Execution Engine:
Executes the tasks produced by the compiler in proper dependency order;
Interacts with the underlying Hadoop interfaces to ensure correct synchronization with Hadoop services.
Hive Server
The Hive Server is the main connectivity component: it provides a Thrift interface along with JDBC/ODBC (Java Database Connectivity / Open Database Connectivity) servers.
It enables the integration of Hive with other applications.
HIVE
DATA MODEL & HIVE QUERY LANGUAGE
HIVE | DATA TYPES
Hive has three data model components that are involved in table creation: tables, partitions, and buckets.
Tables in Hive are analogous to tables in relational databases. Tables can be filtered, projected, joined, and unioned. Additionally, all the data of a table is stored in a directory in HDFS. Hive also supports the notion of external tables, wherein a table can be created on pre-existing files or directories in HDFS by providing the appropriate location at table creation.
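A plain (managed) table with common Hive column types might be created as follows (the table and column names are illustrative, not from the slides):

```sql
-- A managed table: Hive owns the data, which is stored
-- under the warehouse directory in HDFS
CREATE TABLE IF NOT EXISTS weather
( wban     INT,       -- station identifier
  obs_date STRING,    -- observation date
  temp     DOUBLE     -- temperature reading
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
```

Dropping a managed table like this one deletes both the metadata and the underlying data, which is the key contrast with the external tables discussed next.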
The EXTERNAL keyword lets you create a table and provide a LOCATION, so that Hive does not use its default location for the table. This comes in handy if you already have the data generated.
When an EXTERNAL table is dropped, the data in the table is NOT deleted from the file system.
An EXTERNAL table points to any HDFS location for its storage, rather than being stored in the folder specified by the configuration property.
HQL command to create an external table:
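The slide's own command is not reproduced here; a minimal sketch of a plain external table (the table name and HDFS path are hypothetical) would be:

```sql
-- External table: Hive manages only the metadata;
-- the data stays at the given HDFS location
CREATE EXTERNAL TABLE IF NOT EXISTS weather
( wban INT, obs_date STRING, temp DOUBLE )
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/weather';  -- hypothetical pre-existing HDFS directory
```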
Hive can store tables in partitions. Partitions are used to divide a table into related parts and make data querying more efficient. For example, the weather table above can be partitioned by year and month; the partition columns can then be used like ordinary columns in queries, and a filter on them restricts the scan to the matching partitions.
HQL command to create the partitioned table:
CREATE EXTERNAL TABLE IF NOT EXISTS weatherext
( wban INT, date STRING)
PARTITIONED BY (year INT, month STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 'location of text_file';
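Once the partitioned table exists, partitions can be registered and queried; a sketch, assuming hypothetical paths and values:

```sql
-- Register a partition, pointing it at an existing HDFS directory
ALTER TABLE weatherext ADD PARTITION (year = 2021, month = 'jan')
LOCATION '/data/weather/2021/jan';  -- hypothetical path

-- A filter on the partition columns prunes the scan
-- to the matching partition directories only
SELECT wban, date
FROM weatherext
WHERE year = 2021 AND month = 'jan';
```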
HIVE | FLOW OF CSV FILE INTO HIVE
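In outline, the flow is: place the CSV file somewhere readable (local disk or HDFS), create a table whose columns match the file layout, then load and query the data. A sketch with hypothetical names:

```sql
-- 1. Create a table whose columns match the CSV layout
CREATE TABLE IF NOT EXISTS sales_csv
( id INT, product STRING, amount DOUBLE )
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- 2. Load the CSV file into the table
--    (LOCAL reads from the client file system; omit it for an HDFS path)
LOAD DATA LOCAL INPATH '/tmp/sales.csv' INTO TABLE sales_csv;

-- 3. Query it
SELECT product, SUM(amount) FROM sales_csv GROUP BY product;
```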
That’s All for Theory
LET'S PRACTICE NOW . . . .