Report SQL PDF
Submitted By
G.Datta veerean chowdary-Vtu10096
A.Ram babu-Vtu10116
K.Venkata Ramesh Reddy -Vtu10104
SRC-7
DA38
School of Computing
Department of Computer Science and Engineering
15th MARCH 2019
ABSTRACT
In this Apache Spark SQL tutorial, we will understand the various
components and terminology of Spark SQL: what DataSet and
DataFrame are, what SQLContext and HiveContext are, and what
the features of Spark SQL are.
After understanding what Apache Spark is, in this tutorial we
discuss Apache Spark SQL. Spark SQL is the Spark module for
structured data processing. It runs on top of Spark Core. It offers
much tighter integration between relational and procedural
processing through the declarative DataFrame and Dataset APIs,
which enable users to run SQL queries over Spark.
INTRODUCTION:
Apache Spark SQL integrates relational processing with Spark's
functional programming. It is the Spark module for structured data
processing. Spark SQL blurs the line between RDDs and relational
tables. It also offers much tighter integration between relational
and procedural processing through the declarative DataFrame API,
which integrates with ordinary Spark code and thus enables better
optimization. The DataFrame and Dataset APIs are the ways to
interact with Spark SQL. As a result, Spark SQL makes Apache
Spark accessible to more users and improves optimization for
existing ones.
Spark SQL runs on top of Spark Core. It allows developers to
import relational data from Hive tables and Parquet files, run SQL
queries over the imported data and over existing RDDs, and easily
write RDDs out to Hive tables or Parquet files. Spark SQL provides
the DataFrame API, which performs relational operations on both
external data sources and Spark's built-in distributed collections.
Spark SQL also introduces an extensible optimizer called Catalyst,
which helps it support a wide range of data sources and algorithms
in Big Data.
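To make this workflow concrete, here is a minimal Scala sketch, assuming Spark 2.x or later; the file paths, column names and the salary threshold are hypothetical placeholders, not part of the original text.

import org.apache.spark.sql.SparkSession

object SparkSqlIntroSketch {
  def main(args: Array[String]): Unit = {
    // Entry point for Spark SQL; it runs on top of Spark Core.
    val spark = SparkSession.builder()
      .appName("SparkSqlIntroSketch")
      .master("local[*]")
      .getOrCreate()

    // Import relational data from a Parquet file (path is hypothetical).
    val employees = spark.read.parquet("employees.parquet")

    // Register the DataFrame so it can be queried with SQL.
    employees.createOrReplaceTempView("employees")

    // Run an SQL query over the imported data.
    val highPaid = spark.sql(
      "SELECT name, salary FROM employees WHERE salary > 50000")

    // Write the result back out as Parquet.
    highPaid.write.parquet("high_paid.parquet")

    spark.stop()
  }
}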
FEATURES:
a. Integrated
Integrated simply means combined or merged. Here, Spark SQL
queries are integrated with Spark programs: through Spark SQL we
can query structured data inside Spark programs, using either SQL
or the DataFrame API, from languages such as Java and Scala, as
the sketch below shows.
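A minimal Scala sketch of this integration, assuming Spark 2.x or later; the orders data is made up for illustration.

import org.apache.spark.sql.SparkSession

object IntegratedQuerySketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("IntegratedQuerySketch").master("local[*]").getOrCreate()
    import spark.implicits._

    // A small in-memory DataFrame, created directly inside the program.
    val orders = Seq(("o1", "books", 250.0), ("o2", "toys", 75.0))
      .toDF("id", "category", "amount")
    orders.createOrReplaceTempView("orders")

    // The same question asked two ways: plain SQL ...
    spark.sql("SELECT category, sum(amount) FROM orders GROUP BY category").show()

    // ... and the DataFrame API, mixed freely with normal Scala code.
    orders.groupBy("category").sum("amount").show()

    spark.stop()
  }
}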
b. Support for Structured Streaming
We can run streaming computation through it. Developers write a
batch-style computation against the DataFrame/Dataset API, and
Spark itself incrementalizes that computation to run it in a
streaming fashion. The advantage for developers is that they do not
have to manage state or failures on their own, and there is no need
to keep the application in sync with separate batch jobs. Instead,
the streaming job always gives the same answer as a batch job on
the same data.
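The classic word count illustrates this: the aggregation below is written exactly like a batch query, and Spark maintains the running state. This is a sketch assuming Spark 2.x or later; the socket source on localhost:9999 is a hypothetical input feed.

import org.apache.spark.sql.SparkSession

object StreamingWordCountSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("StreamingWordCountSketch").master("local[*]").getOrCreate()
    import spark.implicits._

    // The batch-style computation: lines of text grouped into word counts.
    val lines = spark.readStream
      .format("socket")               // hypothetical source: a local socket feed
      .option("host", "localhost")
      .option("port", 9999)
      .load()
    val counts = lines.as[String]
      .flatMap(_.split(" "))
      .groupBy("value")
      .count()

    // Spark incrementalizes the same computation and keeps the state for us.
    val query = counts.writeStream
      .outputMode("complete")
      .format("console")
      .start()
    query.awaitTermination()
  }
}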
c. High Compatibility
We are allowed to run unmodified Hive queries on existing
warehouses in Spark SQL. Spark SQL offers full compatibility
with existing Hive data, queries and UDFs, since it reuses the Hive
frontend and MetaStore.
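A minimal Scala sketch of how this is switched on, assuming a Spark build with Hive support; enableHiveSupport() connects the session to the Hive metastore, and the sales table queried here is a hypothetical table already present in the Hive warehouse.

import org.apache.spark.sql.SparkSession

object HiveCompatibilitySketch {
  def main(args: Array[String]): Unit = {
    // enableHiveSupport() lets Spark SQL use the Hive metastore, Hive SerDes
    // and Hive UDFs, so existing Hive queries run unmodified.
    val spark = SparkSession.builder()
      .appName("HiveCompatibilitySketch")
      .master("local[*]")          // local mode only for this demo
      .enableHiveSupport()
      .getOrCreate()

    // 'sales' is a hypothetical table already defined in the Hive warehouse.
    spark.sql("SELECT region, sum(revenue) FROM sales GROUP BY region").show()

    spark.stop()
  }
}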
d. Standard Connectivity
We can easily connect to Spark SQL through JDBC or ODBC, both
of which have become industry norms for connecting business
intelligence tools. In server mode, Spark SQL provides this
industry-standard JDBC and ODBC connectivity.
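As an illustration of the server mode, the sketch below is a plain JDBC client written in Scala. It assumes the Spark Thrift server has already been started and is listening on the default port 10000, that the Hive JDBC driver is on the classpath, and that a table named logs exists; all of these details are assumptions made for the example.

import java.sql.DriverManager

object JdbcClientSketch {
  def main(args: Array[String]): Unit = {
    // Register the Hive JDBC driver used to talk to the Spark Thrift server.
    Class.forName("org.apache.hive.jdbc.HiveDriver")

    val conn = DriverManager.getConnection(
      "jdbc:hive2://localhost:10000", "user", "")
    try {
      val stmt = conn.createStatement()
      // 'logs' is a hypothetical table visible through the server.
      val rs = stmt.executeQuery("SELECT count(*) FROM logs")
      while (rs.next()) println(rs.getLong(1))
    } finally {
      conn.close()
    }
  }
}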
e. Scalability
Spark SQL takes advantage of the RDD model to support large jobs
and mid-query fault tolerance. It uses the same engine for
interactive as well as long-running queries.
f. Performance Optimization
In Spark SQL, the query optimization engine converts each SQL
query into a logical plan and then into many physical execution
plans. At execution time, it selects the most optimal physical plan
among all of them, which ensures fast execution of Hive queries.
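The chosen plans can be inspected directly. The sketch below, assuming Spark 2.x or later and a small made-up people table, uses explain(true) to print the parsed, analyzed and optimized logical plans together with the selected physical plan.

import org.apache.spark.sql.SparkSession

object QueryPlanSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("QueryPlanSketch").master("local[*]").getOrCreate()
    import spark.implicits._

    Seq((1, "alice", 34), (2, "bob", 45))
      .toDF("id", "name", "age")
      .createOrReplaceTempView("people")

    // explain(true) prints the parsed, analyzed and optimized logical plans
    // as well as the physical plan that the optimizer finally selects.
    spark.sql("SELECT name FROM people WHERE age > 40").explain(true)

    spark.stop()
  }
}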
Conclusion
Hence, we have seen the Spark SQL features in detail. As a result,
we have learned that Spark SQL is a module of Spark that analyses
structured data. It offers scalability and ensures high compatibility
of the system, and it allows standard connectivity through JDBC or
ODBC. Therefore, it offers the most natural way to express
structured data, and the features mentioned above enhance its
working efficiency.
Spark SQL Architecture:
The Spark SQL architecture contains three layers: Language API,
Schema RDD, and Data Sources. The usual data sources for Spark
Core are text files, Avro files, etc. However, the data sources for
Spark SQL are different: they include Parquet files, JSON
documents, Hive tables and Cassandra databases.
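The Data Sources layer can be seen in code: the same DataFrame abstraction is produced from quite different formats. A short Scala sketch follows, with hypothetical file paths.

import org.apache.spark.sql.SparkSession

object DataSourcesSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("DataSourcesSketch").master("local[*]").getOrCreate()

    // Different formats, same DataFrame abstraction (paths are hypothetical).
    val fromJson    = spark.read.json("events.json")
    val fromParquet = spark.read.parquet("events.parquet")

    fromJson.printSchema()
    fromParquet.printSchema()

    spark.stop()
  }
}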
Why DataFrame?
DataFrame is one step ahead of RDD, since it provides custom
memory management and an optimized execution plan.
a. Custom Memory Management: This is also known as Project
Tungsten. A lot of memory is saved because the data is stored
off-heap in a binary format. Apart from this, there is no
garbage-collection overhead, and expensive Java serialization is
also avoided, since the data is stored in binary form and the schema
of the data in memory is known.
b. Optimized Execution Plan: This is also known as the query
optimizer. Using it, an optimized execution plan is created for the
execution of each query; once the optimized plan is created, the
final execution takes place on Spark's RDDs.
You can refer to this guide to learn the Spark SQL optimization
phases in detail.
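Both ideas can be seen in a short Scala sketch. It assumes Spark 2.x or later; the off-heap settings use the standard spark.memory.offHeap.enabled and spark.memory.offHeap.size configuration keys, and the sizes and sample data are arbitrary examples.

import org.apache.spark.sql.SparkSession

object WhyDataFrameSketch {
  case class Person(name: String, age: Int)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("WhyDataFrameSketch")
      .master("local[*]")
      // Project Tungsten: let Spark keep data in off-heap binary buffers.
      .config("spark.memory.offHeap.enabled", "true")
      .config("spark.memory.offHeap.size", "256m")
      .getOrCreate()
    import spark.implicits._

    // Start from a plain RDD of Scala objects ...
    val peopleRdd = spark.sparkContext.parallelize(
      Seq(Person("alice", 34), Person("bob", 15)))

    // ... and move one step ahead to a DataFrame: the schema is now known,
    // so rows are stored in a compact binary format instead of Java objects.
    val peopleDf = peopleRdd.toDF()

    // The optimizer builds an execution plan; the final run is on Spark's RDDs.
    peopleDf.filter($"age" >= 18).select("name").explain()

    spark.stop()
  }
}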
iv. The DataFrame APIs are available in various programming
languages, for example Java, Scala, Python, and R.
v. It provides Hive compatibility: we can run unmodified Hive
queries on an existing Hive warehouse.
vi. It can scale from kilobytes of data on a single laptop to
petabytes of data on a large cluster.
vii. DataFrame provides easy integration with Big Data tools and
frameworks via Spark Core.
Conclusion
Hence, the DataFrame API in Spark SQL improves the performance
and scalability of Spark. It avoids the garbage-collection cost of
constructing an individual object for each row in the dataset.
The Spark DataFrame API is different from the RDD API because
it is an API for building a relational query plan that Spark's
Catalyst optimizer can then execute. This makes it well suited to
developers who are familiar with building query plans, but less so
for the majority of developers.
DATASET:
In this section on the Apache Spark Dataset, you can read all about
what a Dataset is in Spark: why the Spark Dataset is needed, what
encoders are, and what their significance is for the Dataset. We will
also cover the features of the Dataset in Apache Spark and how to
create a Dataset.
e. Faster Computation
Conclusion
Hence, in conclusion to the Dataset, we can say that it is a strongly
typed data structure in Apache Spark that represents structured
queries. It fuses together the functionality of RDD and DataFrame,
and we can generate optimized queries using a Dataset. A Dataset
also lessens memory consumption and provides a single API for
both Java and Scala.
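To close, here is a minimal Scala sketch of a strongly typed Dataset, assuming Spark 2.x or later; the Employee case class and its values are made up for illustration.

import org.apache.spark.sql.SparkSession

object DatasetSketch {
  // The case class gives the Dataset its static type; an implicit encoder
  // derived from it tells Spark how to lay the objects out in binary form.
  case class Employee(name: String, salary: Double)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("DatasetSketch").master("local[*]").getOrCreate()
    import spark.implicits._   // brings the encoders for case classes into scope

    // Create a strongly typed Dataset from local objects.
    val employees = Seq(Employee("alice", 90000), Employee("bob", 45000)).toDS()

    // RDD-style functional operations ...
    val wellPaid = employees.filter(_.salary > 50000)

    // ... and DataFrame-style relational operations on the same object.
    wellPaid.select("name").show()

    spark.stop()
  }
}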