Suggested Teaching Guidelines for
Big Data Technologies PG-DBDA March 2022
Textbook:
1. Hadoop: The Definitive Guide, SPD
Reference:
1. Big Data, Black Book by DreamTech
2. Programming Hive by O’Reilly (Authors: Edward Capriolo, Dean Wampler, and Jason Rutherglen)
3. Hadoop: The Definitive Guide, 4th Edition by O’Reilly (Author: Tom White)
4. Hadoop in Practice by Manning (Author: Alex Holmes)
5. Pro Hadoop by Apress (Author: Jason Venner)
6. Hadoop with Python
7. Hadoop Real-World Solutions Cookbook by Packt Publishing (Authors: Jonathan R. Owens, Jon Lentz, Brian Femiano)
8. Hadoop in Action by Manning Publications (Author: Chuck Lam)
9. Data Architecture: A Primer for the Data Scientist: Big Data, Data Warehouse and Data Vault
10. Big Data Made Easy: A Working Guide to the Complete Hadoop Toolset
11. Big Data Analytics with Spark: A Practitioner's Guide to Using Spark for Large-Scale Data Processing, Machine Learning, and Graph Analytics, and High-Velocity Data Stream Processing
Session: 1, 2 & 3
Introduction to Big Data
o Big Data - Beyond the Hype,
o Big Data Skills and Sources of Big Data,
o Big Data Adoption,
o Research and Changing Nature of Data Repositories,
o Data Sharing and Reuse Practices and Their Implications for Repository Data
Curation,
o Overlooked and Overrated Data Sharing,
o Data Curation Services in Action,
o Open Exit: Reaching the End of The Data Life Cycle,
o The Current State of Meta-Repositories for Data,
o Curation of Scientific Data at Risk of Loss: Data Rescue And Dissemination
Introduction to Hadoop
o A Brief History of Hadoop,
o Evolution of Hadoop,
o Introduction to Hadoop and its components
o Comparison with Other Systems,
o Hadoop Releases
o Hadoop Distributions and Vendors
Lab-Assignment:
o Run the HDFS commands and write a one-line explanation of each command.
o Execute the provided HDFS code step by step and understand what each step does.
Session: 7
Hadoop Architecture
o Hadoop Architecture,
o Core components of Hadoop,
o Common Hadoop Shell commands.
Session: 8
HDFS Data Storage Process
o HDFS Data storage process,
o Anatomy of writing and reading file in HDFS,
o Handling Read/Write failures
o HDFS user and admin commands,
o HDFS Web Interface.
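As an illustration of the storage process listed above, the following is a minimal sketch of how a file is divided into blocks and replicated. It assumes the common defaults of a 128 MB block size and a replication factor of 3; the function name `hdfs_block_layout` is hypothetical, and a real cluster may be configured differently.

```python
def hdfs_block_layout(file_size_bytes,
                      block_size_bytes=128 * 1024 * 1024,  # assumed default block size
                      replication=3):                      # assumed default replication
    """Estimate how a file of the given size is split into HDFS blocks."""
    if file_size_bytes == 0:
        return {"blocks": 0, "last_block_bytes": 0, "replicas": 0, "raw_bytes": 0}
    # Ceiling division: a final partial block still counts as one block.
    num_blocks = (file_size_bytes + block_size_bytes - 1) // block_size_bytes
    # The last block only occupies as much space as the data it holds.
    last_block = file_size_bytes - (num_blocks - 1) * block_size_bytes
    return {
        "blocks": num_blocks,
        "last_block_bytes": last_block,
        "replicas": num_blocks * replication,       # block copies across DataNodes
        "raw_bytes": file_size_bytes * replication,  # total disk space consumed
    }


if __name__ == "__main__":
    mb = 1024 * 1024
    print(hdfs_block_layout(300 * mb))
```

Under these assumptions a 300 MB file is stored as two full 128 MB blocks plus one 44 MB block, and each block is copied to three DataNodes.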
Session: 10
Basics of Map Reduce Programming
o Hadoop Data Types,
o Java and Map Reduce,
o Map Reduce program structure,
o Map-only program, Reduce-only program,
o Use of combiner and partitioner,
o Counters, Schedulers (Job Scheduling),
o Custom Writables, Compression
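To make the roles of the combiner and partitioner concrete, here is a small in-memory sketch of a word-count job written in plain Python rather than against the Hadoop Java API. All function names are hypothetical, each input line stands in for one mapper's input split, and the toy hash in `partition` is an assumption chosen only because it is deterministic.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in the line."""
    return [(word.lower(), 1) for word in line.split()]


def combine(pairs):
    """Combiner: pre-aggregate one mapper's output before the shuffle."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())


def partition(key, num_reducers):
    """Partitioner: route a key to a reducer using a toy deterministic hash.

    Hadoop's default HashPartitioner applies the same idea to the key's hashCode.
    """
    return sum(key.encode("utf-8")) % num_reducers


def reduce_phase(pairs):
    """Reducer: sum all values that share a key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def run_job(lines, num_reducers=2):
    """Run map -> combine -> partition -> reduce over the input lines."""
    buckets = [[] for _ in range(num_reducers)]
    for line in lines:  # each line plays the part of one mapper's split
        for key, value in combine(map_phase(line)):
            buckets[partition(key, num_reducers)].append((key, value))
    result = {}
    for bucket in buckets:  # every key lands in exactly one bucket
        result.update(reduce_phase(bucket))
    return result


if __name__ == "__main__":
    print(run_job(["a b a", "b c"]))
```

The combiner reduces the number of pairs sent across the network, while the partitioner guarantees that all pairs for a given key reach the same reducer.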
Lab-Assignment:
o Execute the train data example.
o Execute the train data example using chained methods.
Session: 11
Map Reduce Streaming
o Complex Map Reduce programming,
o Map Reduce streaming,
o Python and Map Reduce,
o Map Reduce on image dataset
Session: 12
Hadoop ETL
o Hadoop ETL Development,
o ETL Process in Hadoop,
o Discussion of ETL functions,
o Data Extractions,
o Need of ETL tools,
o Advantages of ETL tools.
Lab-Assignment:
o Understand the file formats and read the provided links
Lab-Assignment:
o Run the HBase shell commands
o Access HBase using the Java client
Session: 17
Working with Hive QL
o Datatypes,
o Operators and Functions,
o Hive Tables (Managed Tables and External Tables),
o Partitions and Buckets,
o Storage Formats,
o Importing data,
o Altering and Dropping Tables.
Lab-Assignment:
o Create a Hive DB and tables (internal and external)
o Load the data into a Hive table (using local inpath and HDFS inpath)
Session: 18
Querying with Hive QL
o Querying Data-Sorting,
o Aggregating,
o Map Reduce Scripts,
o Joins and Subqueries,
o Views,
o Map and Reduce side joins to optimize query.
Lab-Assignment:
o Run all the types of joins in Hive
o Partition the data in a Hive table
Session: 19
More on Hive QL
o Data manipulation with Hive,
o UDFs,
o Appending data into existing Hive table,
o Custom map/reduce in Hive,
o Writing HQL scripts
Lab-Assignment:
o Create an Airflow DAG for Extract -> Transform -> Load
Lab-Assignment:
o Run the provided Hadoop Streaming program using python
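For orientation before running the provided program, the following is a minimal word-count sketch in the Hadoop Streaming style. It is not the provided lab code. It assumes the usual streaming contract: records arrive on standard input, key and value are separated by a tab, and the reducer's input is already sorted by key. The `map` and `reduce` command-line arguments and the file name `wc.py` used below are hypothetical.

```python
import sys
from itertools import groupby


def mapper(lines):
    """Emit 'word<TAB>1' for every word read from the input lines."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(lines):
    """Sum counts per word; input lines must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines if line.strip())
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# Hypothetical command-line convention: pass "map" or "reduce" to pick the stage.
if __name__ == "__main__" and len(sys.argv) == 2 and sys.argv[1] in ("map", "reduce"):
    stage = mapper if sys.argv[1] == "map" else reducer
    for record in stage(sys.stdin):
        print(record)
```

The flow can be checked locally without a cluster, with `sort` standing in for the shuffle phase: `cat input.txt | python wc.py map | sort | python wc.py reduce`.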
Session: 26
o Map Reduce with Spark
o Working with Spark with Hadoop
o Working with Spark without Hadoop, and the differences between the two setups
Lab-Assignment:
o Execute all the provided code step by step, one line at a time
o Set up the JDBC configuration and run the Spark JDBC connectivity program
o Run the Spark integrations using the provided code
Session: 27
o Data preprocessing
o EDA
Session: 28 and 29
o Introduction to Kafka
o Working with Kafka using Spark
o Spark streaming Architecture
o Spark Streaming APIs
o Building Stream Processing Application with Spark
Lab-Assignment:
o Run Spark Streaming with Kafka
Session: 30
o Setting up Kafka Producer and Consumer
o Kafka Connect API
Session: 31
o Spark SQL
Lab-Assignment:
o Run the Spark SQL programs step by step, one line at a time
o Run all the Spark SQL programs
o Analyse the election data using Spark and present your findings
Session: 32 and 33
o Spark MLlib
o Predictive Analysis
Lab-Assignment:
o Deep Learning with Spark
o Connecting DBs with Spark
o Accessing and manipulating the DBs
o Demo: Capstone Project
o Create a complex workflow using the Airflow Bash operator and a simple workflow using Python
o Using the Airflow Python operator, read data from your local drive, ingest the data into HDFS, and run a Spark word count (WC)