
Big Data Technologies PG-DBDA March 2022

The document outlines teaching guidelines for a course on Big Data technologies that includes 66 classroom hours and 84 lab hours. The objective is to teach skills in Hadoop, MapReduce, HBase, Pig, and Spark. The course covers topics such as HDFS, MapReduce, HBase, Hive, Spark, and Apache Airflow. Students learn through lectures and lab exercises that involve installing and configuring the technologies and writing queries and programs. Evaluation includes exams and assignments covering both theoretical and practical concepts. Reference materials include books on Hadoop, Big Data, Hive, and related topics.

Uploaded by

srinivasa helwar
Copyright
© All Rights Reserved

Suggested Teaching Guidelines for

Big Data Technologies PG-DBDA March 2022


Duration: 66 Classroom hours + 84 Lab hours

Objective: To reinforce knowledge of Big Data technologies such as Hadoop, MapReduce, HBase,
Pig, and Spark (PySpark)

Prerequisites: Knowledge of Linux commands, SQL, and Core Java

Evaluation method: Theory exam – 40% weightage
                   Lab exam – 40% weightage
                   Internal exam – 20% weightage

List of Books / Other training material

Textbook:
1. Hadoop: The Definitive Guide, SPD

Reference:
1. Big Data, Black Book by DreamTech
2. Programming Hive by O'Reilly (Authors: Edward Capriolo, Dean Wampler, and Jason
Rutherglen)
3. Hadoop: The Definitive Guide, 4th Edition by O'Reilly (Author: Tom White)
4. Hadoop in Practice by Manning (Author: Alex Holmes)
5. Pro Hadoop by Apress (Author: Jason Venner)
6. Hadoop with Python
7. Hadoop Real-World Solutions Cookbook by Packt Publishing (Authors: Jonathan R.
Owens, Jon Lentz, and Brian Femiano)
8. Hadoop in Action by Manning Publications (Author: Chuck Lam)
9. Data Architecture: A Primer for the Data Scientist: Big Data, Data Warehouse and Data
Vault
10. Big Data Made Easy: A Working Guide to the Complete Hadoop Toolset
11. Big Data Analytics with Spark: A Practitioner's Guide to Using Spark for Large-Scale Data
Processing, Machine Learning, Graph Analytics, and High-Velocity Data Stream
Processing

Note: Each session is 2 hours long

Introduction to Big Data and Hadoop (Theory – 16 Hrs & Lab – 06 Hrs)

Session: 1, 2 & 3
Introduction to Big Data
o Big Data - Beyond the Hype,
o Big Data Skills and Sources of Big Data,
o Big Data Adoption,
o Research and Changing Nature of Data Repositories,
o Data Sharing and Reuse Practices and Their Implications for Repository Data
Curation,
o Overlooked and Overrated Data Sharing,
o Data Curation Services in Action,
o Open Exit: Reaching the End of The Data Life Cycle,
o The Current State of Meta-Repositories for Data,
o Curation of Scientific Data at Risk of Loss: Data Rescue and Dissemination
Introduction to Hadoop
o A Brief History of Hadoop,
o Evolution of Hadoop,
o Introduction to Hadoop and its components
o Comparison with Other Systems,
o Hadoop Releases
o Hadoop Distributions and Vendors

Hadoop Distributed File System (HDFS)


Session: 4 & 5
Hadoop Distributed File System (HDFS)
o Distributed File System,
o What is HDFS,
o Where does HDFS fit in,
o Core components of HDFS,
o HDFS Daemons,
o Hadoop Server Roles: Name Node, Secondary Name Node, and Data Node
HDFS Architecture
o HDFS Architecture,
o Scaling and Rebalancing,
o Replication,
o Rack Awareness,
o Data Pipelining,
o Node Failure Management.
o HDFS High Availability NameNode
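
As a rough illustration of the replication topic above, the following sketch estimates how many blocks a file occupies and how much raw storage its replicas consume. It assumes the commonly used defaults of a 128 MB block size and a replication factor of 3, both of which are configurable per cluster. The function name `hdfs_footprint` is hypothetical and is not part of any Hadoop API.

```python
import math

BLOCK_SIZE_MB = 128   # assumed default block size (configurable per cluster)
REPLICATION = 3       # assumed default replication factor (configurable)


def hdfs_footprint(file_size_mb, block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Return (number of blocks, total block replicas, raw storage in MB)."""
    blocks = math.ceil(file_size_mb / block_size_mb) if file_size_mb > 0 else 0
    replicas = blocks * replication
    # The last block only uses as much disk as it actually holds, so raw
    # storage is file size times replication, not blocks times block size.
    raw_storage_mb = file_size_mb * replication
    return blocks, replicas, raw_storage_mb


if __name__ == "__main__":
    blocks, replicas, raw_mb = hdfs_footprint(1000)
    print(f"blocks={blocks} replicas={replicas} raw_storage_mb={raw_mb}")
```

Under these assumptions a 1000 MB file is split into 8 blocks, stored as 24 block replicas, and consumes about 3000 MB of raw disk across the cluster.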

Lab-Assignment:
o Run the HDFS commands and write a one-line explanation of each command.
o Execute the provided HDFS code step by step and understand each step

Hadoop Installation and Cluster Configuration (Lab – 02 Hrs)


Getting Started: Hadoop Installation
o Hadoop Operation modes
o Setting up a Hadoop Cluster,
o Cluster specification,
o Single and Multi-Node Cluster Setup on Virtual & Physical Machines,
o Remote Login using Putty/Mac Terminal/Ubuntu Terminal.
o Hadoop Configuration, Security in Hadoop, Administering Hadoop,
o HDFS – Monitoring & Maintenance, Hadoop benchmarks,
o Hadoop in the cloud.

Session: 7
Hadoop Architecture
o Hadoop Architecture,
o Core components of Hadoop,

o Common Hadoop Shell commands.

Session: 8
HDFS Data Storage Process
o HDFS Data storage process,
o Anatomy of writing and reading file in HDFS,
o Handling Read/Write failures
o HDFS user and admin commands,
o HDFS Web Interface.

Map Reduce (Theory – 06 Hrs & Lab – 12 Hrs)


Session: 9
Getting in touch with Map Reduce Framework
o Hadoop Map Reduce paradigm,
o Map and Reduce tasks,
o Map Reduce Execution Framework,
o Map Reduce Daemons
o Anatomy of a Map Reduce Job run
More Map Reduce Concepts
o Partitioners and Combiners,
o Input Formats (Input Splits and Records, Text Input, Binary Input, Multiple
Inputs),
o Output Formats (Text Output, Binary Output, Multiple Output).
o Distributed Cache
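
The partitioner and combiner topics above can be sketched in plain Python, without a cluster. The partitioner mirrors the idea behind Hadoop's default hash partitioning (a non-negative hash of the key, modulo the number of reducers); `zlib.crc32` is used here only as a deterministic stand-in for Java's `hashCode()`, so actual reducer assignments on a real cluster will differ. The function names are hypothetical.

```python
import zlib
from collections import Counter


def partition(key, num_reducers):
    """Pick a reducer for a key: non-negative hash modulo the reducer count."""
    # crc32 is an assumed stand-in for the key's Java hashCode().
    return (zlib.crc32(key.encode("utf-8")) & 0x7FFFFFFF) % num_reducers


def combine(map_output):
    """Pre-aggregate (word, count) pairs on the map side before the shuffle."""
    totals = Counter()
    for word, count in map_output:
        totals[word] += count
    return sorted(totals.items())


if __name__ == "__main__":
    map_output = [("data", 1), ("big", 1), ("data", 1), ("data", 1)]
    for word, count in combine(map_output):
        print(word, count, "-> reducer", partition(word, 3))
```

The combiner shrinks four map output records to two before they cross the network, and the partitioner guarantees that every record with the same key reaches the same reducer.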

Session: 10
Basics of Map Reduce Programming
o Hadoop Data Types,
o Java and Map Reduce,
o Map Reduce program structure,
o Map-only program, Reduce-only program,
o Use of combiner and partitioner,
o Counters, Schedulers (Job Scheduling),
o Custom Writables, Compression

Lab-Assignment:
o Execute the train data example.
o Execute the train data example using chained methods.

Session: 11
Map Reduce Streaming
o Complex Map Reduce programming,
o Map Reduce streaming,
o Python and Map Reduce,
o Map Reduce on image dataset
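
For the "Python and Map Reduce" topic, a minimal word-count sketch in the style of a streaming job is given below. It assumes the usual streaming convention of tab-separated key/value text records and simulates the shuffle/sort phase locally with `sorted`. The names `mapper`, `reducer`, and `run_locally` are hypothetical.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' record per word, as a streaming mapper would."""
    for line in lines:
        for word in line.strip().lower().split():
            yield f"{word}\t1"


def reducer(sorted_records):
    """Sum the counts per word; the input must already be sorted by key."""
    parsed = (record.split("\t", 1) for record in sorted_records)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def run_locally(lines):
    """Simulate map -> shuffle/sort -> reduce inside a single process."""
    return list(reducer(sorted(mapper(lines))))


if __name__ == "__main__":
    for record in run_locally(["big data big cluster", "data node"]):
        print(record)
```

On a real cluster the mapper and reducer would normally be two separate scripts that read from standard input and write to standard output, passed to the Hadoop Streaming jar through its `-mapper` and `-reducer` options. The jar location depends on the installation.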

Hadoop ETL
Session: 12
o Hadoop ETL Development,
o ETL Process in Hadoop,
o Discussion of ETL functions,
o Data Extractions,
o Need of ETL tools,
o Advantages of ETL tools.

Lab-Assignment:
o Understand the file formats and read the provided links

HBase (Theory – 06 Hrs & Lab – 06 Hrs)


Session: 13
Introduction to HBase
o Overview of HBase
o HBase architecture
o Installation
Session: 14 and 15
The HBaseAdmin and HBase Security

o Various Operations on Tables
o HBase general commands and shell,
o Java client API for HBase
o Admin API
o CRUD operations
o Client API
o HBase – Scan, Count and Truncate
o HBase Security

Lab-Assignment:
o Run the HBase shell commands
o Run HBase operations using the Java client

Hive (Theory – 08 Hrs & Lab – 18 Hrs)


Session: 16
The Hive Data Warehouse
o Introduction to Hive,
o Hive architecture and Installation,
o Comparison with Traditional Database,
o Basics of Hive Query Language.

Session: 17
Working with Hive QL
o Datatypes,
o Operators and Functions,
o Hive Tables (Managed Tables and External Tables),
o Partitions and Buckets,
o Storage Formats,
o Importing data,
o Altering and Dropping Tables.

Lab-Assignment:
o Create a Hive database and tables (internal and external)
o Load the data into a Hive table (using local inpath and HDFS inpath)

Session: 18
Querying with Hive QL
o Querying Data: Sorting and Aggregating,
o Map Reduce Scripts,
o Joins and Subqueries,
o Views,
o Map and Reduce side joins to optimize query.
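
The difference between reduce-side and map-side joins can be sketched in plain Python. This is a single-process simulation under simplifying assumptions, not Hive code: the reduce-side version groups both inputs by key first (standing in for the shuffle), while the map-side version keeps the smaller input in memory and probes it while streaming the larger one. Function names and sample data are hypothetical.

```python
from collections import defaultdict


def reduce_side_join(left, right):
    """Group both inputs by key (the 'shuffle'), then pair rows per key."""
    grouped = defaultdict(lambda: ([], []))
    for key, value in left:
        grouped[key][0].append(value)
    for key, value in right:
        grouped[key][1].append(value)
    return [(key, l, r)
            for key in sorted(grouped)
            for l in grouped[key][0]
            for r in grouped[key][1]]


def map_side_join(big, small):
    """Hold the small input in memory and probe it while streaming the big one."""
    lookup = defaultdict(list)
    for key, value in small:
        lookup[key].append(value)
    return [(key, value, match)
            for key, value in big
            for match in lookup.get(key, [])]


if __name__ == "__main__":
    orders = [(1, "book"), (2, "pen"), (1, "lamp"), (3, "desk")]
    customers = [(1, "Asha"), (2, "Ravi")]
    print(reduce_side_join(orders, customers))
    print(map_side_join(orders, customers))
```

Both functions return the same inner-join rows in a different order. The map-side variant avoids the grouping step entirely, which is why it is usually preferred when one table fits in memory. In Hive this behaviour is typically controlled through settings such as `hive.auto.convert.join` or a `MAPJOIN` hint, depending on the version.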

Lab-Assignment:
o Run all the types of joins in Hive
o Partition the provided data in Hive

Session: 19
More on Hive QL
o Data manipulation with Hive,
o UDFs,
o Appending data into existing Hive table,
o custom map/reduce in Hive
o Writing HQL scripts

Apache Airflow (Theory – 06 Hrs & Lab – 06 Hrs)


Session: 20, 21 and 22
o Introduction to Data Warehousing and Data Lakes
o Designing Data warehousing for an ETL Data Pipeline
o Designing Data Lakes for an ETL Data Pipeline
o ETL vs ELT
o Fundamentals of Airflow
o Work management with Airflow
o Automating an entire Data Pipeline with Airflow

Lab-Assignment:
o Create an Airflow DAG for Extract -> Transform -> Load
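
To make the Extract -> Transform -> Load dependency chain concrete, here is a minimal sketch that does not use Airflow itself. It models the pipeline as a directed acyclic graph with the standard-library `graphlib` module (Python 3.9 or later) and runs the tasks in dependency order. The task functions and sample rows are hypothetical placeholders.

```python
from graphlib import TopologicalSorter


def extract():
    """Pretend to read raw rows from a source system."""
    return [" 10", "20 ", "x", "30"]


def transform(rows):
    """Trim whitespace and keep only rows that are whole numbers."""
    cleaned = (row.strip() for row in rows)
    return [int(row) for row in cleaned if row.isdigit()]


def load(values, sink):
    """Append the cleaned values to the target and report how many were written."""
    sink.extend(values)
    return len(values)


def run_pipeline():
    # Each task maps to the set of tasks that must finish before it starts.
    dag = {"extract": set(), "transform": {"extract"}, "load": {"transform"}}
    order = list(TopologicalSorter(dag).static_order())
    sink, results = [], {}
    for task in order:
        if task == "extract":
            results[task] = extract()
        elif task == "transform":
            results[task] = transform(results["extract"])
        elif task == "load":
            results[task] = load(results["transform"], sink)
    return order, sink


if __name__ == "__main__":
    print(run_pipeline())
```

In an actual Airflow DAG each function would typically be wrapped in a task, for example with `PythonOperator`, and the dependencies declared with the `>>` operator. That wiring is omitted here because it depends on the Airflow version in use.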

Introduction to Apache Spark & Kafka (Theory – 24 Hrs & Lab – 36 Hrs)

Session: 23, 24 and 25


Apache Spark APIs for large-scale data processing
o Overview, Linking with Spark, Initializing Spark,
o Resilient Distributed Datasets (RDDs), External Datasets
o RDDs vs. DataFrames vs. Datasets
o DataFrame operations
o Spark Structured Streaming
o Passing Functions to Spark, Working with Key-Value Pairs, Shuffle operations,
o RDD Persistence, Removing Data, Shared Variables, Deploying to a Cluster

Lab-Assignment:
o Run the provided Hadoop Streaming program using Python

Session: 26
o Map Reduce with Spark
o Working with Spark with Hadoop
o Working with Spark without Hadoop and their Differences
Lab-Assignment:
o Execute all the provided code step by step, line by line
o Set up the JDBC configuration and run the Spark JDBC connectivity program
o Run the Spark integrations using the provided code

Session: 27
o Data preprocessing
o EDA

Session: 28 and 29
o Introduction to Kafka
o Working with Kafka using Spark
o Spark streaming Architecture
o Spark Streaming APIs
o Building Stream Processing Application with Spark

Lab-Assignment:
o Run the Spark Streaming application with Kafka

Session: 30
o Setting up Kafka Producer and Consumer
o Kafka Connect API

Session: 31
o Spark SQL

Lab-Assignment:
o Run the Spark SQL programs step by step, line by line
o Run all the Spark SQL programs
o Analyse the election data using Spark and report the findings

Session: 32 and 33
o Spark MLlib
o Predictive Analysis

Lab Assignment:
o Deep Learning with Spark
o Connecting databases with Spark
o Accessing and manipulating the databases
o Demo: Capstone Project

o Create a complex workflow using the Bash operator and a simple workflow using Python
o Using the Airflow Python operator, read data from your local drive, ingest the
data into HDFS, and run a Spark word count (WC)
