0% found this document useful (0 votes)
99 views

04 Bigdata Hive

Hive is a data warehousing system for Hadoop that allows users to query large datasets stored in Hadoop files using a SQL-like language called HiveQL. It addresses the issues of lack of structure and expressiveness in MapReduce programs for analyzing large datasets. Hive provides structure to data stored in files on HDFS and the ability to express queries using a simple SQL-like language. It uses a metastore to store metadata about tables, columns and partitions. The Hive driver compiles HiveQL queries into a directed acyclic graph of MapReduce jobs that are executed by the execution engine on Hadoop clusters.

Uploaded by

Rohit Uppal
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
99 views

04 Bigdata Hive

Hive is a data warehousing system for Hadoop that allows users to query large datasets stored in Hadoop files using a SQL-like language called HiveQL. It addresses the issues of lack of structure and expressiveness in MapReduce programs for analyzing large datasets. Hive provides structure to data stored in files on HDFS and the ability to express queries using a simple SQL-like language. It uses a metastore to store metadata about tables, columns and partitions. The Hive driver compiles HiveQL queries into a directed acyclic graph of MapReduce jobs that are executed by the execution engine on Hadoop clusters.

Uploaded by

Rohit Uppal
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 22

HIVE

DATA WAREHOUSING USING HIVE QUERY


Additional Data Warehousing System

 The table shows the problems related to data inflow & expressiveness, and
the solutions adapted to address the need for an additional data
warehousing system.
 It was difficult to develop MapReduce program to express the
data , hence data lacked expressiveness.
 Hence, HIVE came into BIG Data . .
What is HIVE ?

HIVE is defined as warehouse system for Hadoop that facilitates ad-hoc Queries
and the analysis of large data sets stored in Hadoop . .
FACTS !!
 Hive provides SQL- like flavour to Big data
i.e. HiveQL(HQL). Because of that it’s a
popular choice for Big data Analytics on
Hadoop platform.
HIVE FACTS  It provides massive scale-out & fault
tolerance capabilities for data storage
and processing of commodity hardware.
 Relying on MapReduce for execution,
hive is batch-oriented and high latency
for query execution.
HIVE | Characteristics

Hive is a system for managing and querying unstructured data into structured data format.
It uses the concepts :
 It uses HDFS for storage and retrieval of data.
 The scripts of Hive uses MapReduce for execution

Hive commands are similar to SQL which Interoperability (extensible framework to


is data warehousing tool. support different files and data formats).
Principles
of
Hive Performance is better in Hive since Hive
Extensibility (pluggable MapReduce scripts
engine uses the best in-built script to
in the language of your choice – rich, user
reduce the execution time while enabling
def data types & user def functions.
high output.
System Architecture and Components of Hive

JDBC ODBC
Web
Command Line
Interface
Interface Thrift Server
MetaStore
Driver
(Compiler, Optimizer, Executor)

Hadoop
MapReduce + HDFS
Job Name
Tracker Node
 Metastore is the component that stores the
system catalog and metadata about tables,
columns, partitions, & so on . .
 Metadata is stored in traditional RDMS format,
Apache Hive uses by default DERBY database.
But its not compulosory its complimentary. If
METADATA you wish you can add any JDBC database like
MySQL.
 Metadata client : metastore_db

MetaStore
Hive driver is the component that:
 Manages the lifecycle of a HIVE Query
Language(HQL) statement as it moves through
Hive;
 It maintains a session handle and any session
statistics.
DRIVER  It includes three basic components:
 Compiler
 Optimizer
 Executer

Driver
(Compiler, Optimizer, Executor)
Query Compiler is the driver components of Hive
and checks for error, if no error encountered it
converts HiveQL to Directed Acyclic Graph(DAG)

Query COMPILER of MapReduce tasks.

Driver
(Compiler, Optimizer, Executor)
 Query optimizer optimizes the HiveQL scripts for
faster execution.
 It consists of a chain of transformations, so that
the operator DAG resulting from on
transformations is passed as an input to the
Query OPTIMIZER next transformations.

Driver
(Compiler, Optimizer, Executor)
 Hive Execution Engine:
 Executes the tasks produced by the compiler in
proper dependency order.
 Interacts with the underlying Hadoop Interface to
ensure perfect synchronization with Hadoop
Query EXECUTOR services.

Driver
(Compiler, Optimizer, Executor)
Hive Server

 Hive Server is the main component, providing a thrift interface & it provides connectivity
to modules Java DB Connectivity/Open DB Connectivity server namely JDBC/ODBC.
 It enables the integration of HIVE with other applications.

JDBC ODBC
Web
Command Line
Interface
Interface Thrift Server
HIVE
DATA MODEL & HIVE QUERY LANGUAGE
HIVE | DATA TYPES

HIVE have three different Data types that are involved in Table Creation.

PREMITIVE TYPES COMPLEX TYPES USER-DEFINED TYPES

Integers: TINYINT, SMALLINT, INT and BIGINT


Boolean: BOOLEAN
Floating Types: FLOAT , DOUBLE
String: STRING (VARCHAR, CHAR)
HIVE | DATA TYPES

HIVE have three different Data types that are involved in Table Creation.

PREMITIVE TYPES COMPLEX TYPES USER-DEFINED TYPES

Structs: {a INT; b:INT}


Maps: M[‘group’]
Arrays: [‘a’, ‘b’, ‘c’], A[1] returns ‘b’
HIVE | DATA TYPES

HIVE have three different Data types that are involved in Table Creation.

PREMITIVE TYPES COMPLEX TYPES USER-DEFINED TYPES

Structures with Attributes


Attributes can be of Any Type
HIVE | DATA MODELS

Tables in HIVE are analogous to Tables in Relational Databases. Tables can be filtered, projected,
joined and unioined. Additionally all the data of a table is stored in a directory in HDFS. Hive also
supports the notion of external tables wherein a table can be created on pre-existing files or
directories in HDFS by providing the appropriate location to the table creation.

Two Types of tables in HIVE

Managed Tables External Tables


HIVE | DATA MODELS TABLE

 HQL Command used to create Tables:


CREATE [TEMPORARY] [EXTERNAL] TABLE IF NOT EXISTS [db_name.]
tab_name[(col_name data_type [COMMENT col_comment], . . . )]
[COMMENT table_comment]
[ROW FORMAT row_format_type]
[STORED AS file_format_type]

COMMENT ‘db_details’ ROW FORMAT DELIMITED


FIELDS TERMINATED BY ‘\t’
LINES TERMINATED BY ‘\n’ STORED AS file_format_type

IT SORES TABLES IN HIVE HDFS WAREHOUSE . . . .


HIVE | DATA MODELS EXTERNAL TABLE

 The EXTERNAL keyword lets you create a table and provide a LOCATION so that Hive
does not use a default location for this table. This comes in handy if you
already have data generated.
 Dropping an EXTERNAL table, data in the table is NOT deleted from the
file system.
 An EXTERNAL table points to any HDFS location for its storage, rather
than being stored in a folder specified by the configuration
property.
 HQL Command to create External Commands:

CREATE EXTERNAL TABLE weatherext ( wban INT, date STRING)


COMMENT ‘this is external table view’
ROW FORMAT DELIMITED FIELDS TERMINATED BY ‘,’
STORED AS TEXTFILE LOCATION ‘ /hive/data/weatherext’;
HIVE | PARTITIONING TABLES

 Hive stores tables in partitions. Partitions are used to divide the table into related parts.
Partitions make data querying more efficient. For example in the above weather table
the data can be partitioned on the basis of year and month and when query is fired on
weather table this partition can be used as one of the column.
 HQL Commands to create the Partitioning the table:
CREATE EXTERNAL TABLE IF NOT EXSISTS weatherext
( wban INT, date STRING)
PARTITIONED BY (year INT, month STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ‘,’
LOCATION ‘location of text_file’;
HIVE | FLOW OF CSV FILE INTO HIVE
That’s All for Theory
LETS PRACTICE NOW . . . .

You might also like