
UNIT-I

Introduction to Data Engineering


1. Definition of Data Engineering
2. Data Engineering Life Cycle
3. Evolution of Data Engineer
4. Data Engineering Versus Data Science
5. Data Engineering Skills and Activities
6. Data Maturity
7. Data Maturity Model
8. Skills of a Data Engineer
9. Business Responsibilities
10. Technical Responsibilities
11. Data Engineers and Other Technical Roles
Basics of Data Engineering
➢ Data engineering is part of the big data ecosystem and is closely linked
to data science.

➢ Data engineers query data from a source (extract), perform some
modifications to the data (transform), and then put that data in a
location where users can access it and know that it is production quality
(load).

➢ The terms extract, transform, and load will be used a lot throughout this
subject and will often be abbreviated to ETL.
Data Pipeline
The combination of extracting, transforming, and loading data
is accomplished by the creation of a data pipeline.
The data enters the pipeline raw, or dirty, in the sense that it may contain missing
values or typos, and is cleaned as it flows through the pipe. It then comes out the
other side into a data warehouse, where it can be queried.

[Diagram: raw data flows from the source through the pipeline, where it is cleaned, and into a data warehouse where it can be queried.]

Data pipelines are the heart of data engineering: automated
workflows that move data from various sources to a destination for
analysis (a minimal sketch follows the stage list below).
A data pipeline typically includes several stages:
• Data Ingestion: Collecting data from sources such as databases, APIs, and file
systems.
• Data Processing: Transforming and enriching data to fit analysis needs.
• Data Storage: Storing processed data in a data warehouse or data lake.
• Data Analysis: Providing access to data for analysis and reporting.
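
To make these stages concrete, the following is a minimal sketch of a small pipeline in plain Python. The source file sales.csv, its columns, and the warehouse.db SQLite database are hypothetical placeholders, not a reference to any particular tool:

    import csv
    import sqlite3

    # Extract: read raw rows from a source file (hypothetical sales.csv).
    with open("sales.csv", newline="") as f:
        raw_rows = list(csv.DictReader(f))

    # Transform: clean the dirty data -- skip rows with a missing amount,
    # normalize the customer name, and cast the amount to a number.
    clean_rows = [
        (row["customer"].strip().title(), float(row["amount"]))
        for row in raw_rows
        if row.get("amount")
    ]

    # Load: put the cleaned data where users can query it (here, SQLite).
    conn = sqlite3.connect("warehouse.db")
    conn.execute("CREATE TABLE IF NOT EXISTS sales (customer TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", clean_rows)
    conn.commit()
    conn.close()

A production pipeline would typically be scheduled and monitored by an orchestration tool such as Apache Airflow rather than run as a one-off script.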

1. What is Data Engineering?

• Data engineering is the discipline focused on designing, building, and managing
systems that process and store large volumes of data. It involves creating and
maintaining data pipelines, developing infrastructure for data storage, and ensuring
that data is reliable, accessible, and ready for analysis.

• Data engineers work to transform raw data into formats that are usable for data
scientists, analysts, and other stakeholders, facilitating efficient data processing and
analytics.

• Data engineering supports data scientists and analysts by providing clean, reliable
data.

Why Is Data Engineering Important?


• Data engineering ensures that data is accurate, reliable, and accessible, providing a
solid foundation for data analysis, machine learning, and artificial intelligence
applications.

• Without clean, well-organized data, data analysts and data scientists wouldn’t be
able to find patterns or insights.
• Without effective data engineering, organizations may struggle with data
inconsistencies and inefficiencies, hindering their ability to derive meaningful
insights.

2. Data Engineering Life Cycle


➢ The data engineering lifecycle comprises stages that turn raw data ingredients
into a useful end product, ready for consumption by analysts, data scientists, ML
engineers, and others.

➢ The lifecycle comprises five stages: generation, storage, ingestion, transformation,
and serving.

➢ However, these stages do not occur on their own; they are supported by principles
such as security, data management, DataOps, data architecture, orchestration, and
software engineering.

3. Evolution of Data Engineer

Data engineering has always been a cornerstone of the technological world, but its
importance has drastically surged over the decades.

With the advent of big data, IoT, machine learning, and other revolutionary
technologies, the role of data engineers has evolved and expanded.
1. Birth of Databases (1960s-1970s)

In the 60s and 70s, data was primarily stored on magnetic tapes. The era saw the
invention of the first Database Management Systems (DBMS), including IBM’s
Information Management System (IMS).

2. Relational Databases (1980s)

Dr. E.F. Codd’s relational model became a game-changer, leading to the
development of Relational Database Management Systems (RDBMS). SQL
(Structured Query Language) was introduced, transforming how we interact with
databases.

3. Data Warehousing (1990s)


Organizations began to see the potential of data for decision-making. This led to
the creation of data warehouses, central repositories where data from various
sources was consolidated for BI (Business Intelligence) tasks.

4. Big Data & Hadoop (2000s)

The 2000s marked the start of the big data revolution. The volume, velocity, and
variety of data skyrocketed. Apache Hadoop, an open-source framework, was
developed to store and process these massive datasets, allowing for distributed
processing.

5. Rise of Data Lakes & Cloud (2010s)

Organizations started moving away from traditional warehouses to more flexible
data lakes. Cloud providers such as AWS, Azure, and GCP made data storage and
processing more scalable and cost-effective.
6. Real-time Processing & Streaming (Late 2010s-2020s)

Companies required real-time insights. Technologies like Apache Kafka and Spark
Streaming made it possible to process and analyze data streams in real time.

7. Data Engineering for Machine Learning (2020s)

Machine learning models need vast amounts of well-curated data. Data engineers
today play an essential role in ensuring data quality, processing, and availability
for training these models.

4. Data Engineering versus Data Science

➢ Data engineering is what makes data science possible.

➢ Data scientists and data engineers use similar tools (Python, for instance), but they
specialize in different areas.

➢ Data engineers need to understand data formats, models, and structures to efficiently
transport data, whereas data scientists use them to build statistical models and
perform mathematical computation.

➢ Data scientists will connect to the data warehouses built by data engineers. From
there, they can extract the data required for machine learning models and analysis.

Primary Focus
  Data Scientists: Analyzing and interpreting data to extract insights
  Data Engineers: Designing, building, and maintaining data infrastructure

Skills
  Data Scientists: Strong analytical skills; machine learning expertise; proficiency in programming languages (e.g., Python, R)
  Data Engineers: Database management; data modeling; proficiency in programming languages (e.g., SQL, Java, Python)

Tasks and Responsibilities
  Data Scientists: Statistical analysis, predictive modeling, and influencing business strategies based on data
  Data Engineers: Database design and optimization, building data pipelines, and ensuring data reliability

Overlap and Collaboration
  Data Scientists: Collaboration with data engineers for data access and infrastructure
  Data Engineers: Collaboration with data scientists to ensure data availability and quality

Continuous Learning
  Data Scientists: Staying up to date on new algorithms and methodologies
  Data Engineers: Continuous adaptation to evolving database technologies and data management tools

Common Tools
  Data Scientists: Python, R, TensorFlow, and PyTorch
  Data Engineers: SQL, Hadoop, Spark, Apache Kafka

5. Required skills and knowledge to be a data engineer


1. At the start of a data pipeline, data engineers need to know how to extract data
from files in different formats or different types of databases. This means data
engineers need to know several languages used to perform many different
tasks, such as SQL and Python (a short extraction sketch follows this list).
2. During the transformation phase of the data pipeline, data engineers need to
be familiar with data modeling and structures.
They will also need to understand what insights they are hoping to extract
from the data because this will impact the design of the data models.
3. The loading of data into the data warehouse means there needs to be a data
warehouse with a schema to hold the data. This is also usually the
responsibility of the data engineer. Data engineers will need to know the
basics of data warehouse design, as well as the types of databases used in their
construction.
4. Lastly, the entire infrastructure that the data pipeline runs on could be the
responsibility of the data engineer. They need to know how to manage Linux
servers, as well as how to install and configure software such as Apache
Airflow or NiFi.
5. As organizations move to the cloud, the data engineer now needs to be familiar
with spinning up the infrastructure on the cloud platform used by the
organization – Amazon Web Services, Google Cloud Platform, or Azure.
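
As an illustration of point 1 above, here is a minimal Python sketch of extracting data from files in two different formats into one common in-memory structure; the file names orders.csv and orders.json and their shared fields are hypothetical:

    import csv
    import json

    def extract_csv(path):
        # Read a CSV file into a list of dictionaries.
        with open(path, newline="") as f:
            return list(csv.DictReader(f))

    def extract_json(path):
        # Read a JSON file containing a list of records.
        with open(path) as f:
            return json.load(f)

    # Two source formats, one common structure (a list of dicts)
    # that downstream transformation steps can share.
    records = extract_csv("orders.csv") + extract_json("orders.json")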
6. Data Maturity
➢ Data maturity refers to an organization's capability to effectively manage,
utilize, and capitalize on its data.
➢ It involves bringing all of the organization's data sources together, analyzing
them, and extracting insights to make informed business decisions.

7. Data Maturity Model


➢ A data maturity model is a framework that businesses use to analyze their
current level of data capabilities, identify areas for improvement, and
create goals to transition to higher stages.
➢ Understanding the data maturity stages helps companies effectively plan the
development of data usage.
1. The awareness or explorer stage
At this stage, the company relies on internal data—which it collects from company
servers, surveys and other in-house data-measuring tools—rather than sourcing data from
online or external resources.

This data is primarily used to track progress against predefined goals rather than to
influence business operations; it is not yet being used to actively improve day-to-day
business decisions.

The data team jumps from one ad-hoc request to the next, always manually
compiling non-standardized reports. They need to create datasets and ETL workflows
from scratch for each request, since nothing is modeled.

2. The proficiency or user stage


The second stage involves understanding the importance of data and how it can
improve business operations.

At this level, a company is incorporating data into its decision-making.

Employees analyze data to measure the results of business actions and track the
progress of their goals.

Since more data is being used, the company puts strong security measures in place to
protect sensitive information from unauthorized access.

To keep things running smoothly, the company often uses automated processes to
manage and maintain the continuous flow of data, ensuring it’s always available and up-
to-date.

3. Data Savvy
At the advanced stage of data maturity, companies are using data to become more
competitive.

They have learned from previous stages and now focus on making data easily accessible
and understandable for everyone in the organization, not just data experts.
Easy access to data: All employees, not just data scientists, can easily access and use
data to make decisions.

Data skills for everyone: The company helps all employees learn basic data skills, so
they can understand and work with data in their roles.

Sharing data inside and outside the company: Employees can share data both within
the company and with clients, helping to improve customer satisfaction and make the
company more efficient.

In simple terms, at this stage, data is used more effectively throughout the company to
improve performance and gain an edge in the market.

4. Data driven or innovator stage


In the final stage of data maturity, companies are using data to drive real, meaningful
changes within the organization. At this stage, data is about using insights to shape the
future.

Data-driven goals: Company leaders use data to make informed decisions and set clear,
measurable goals for the company’s growth and success.

Innovation and improvement: Data is used to discover new ways to improve business
practices, streamline processes, and stay ahead of the competition. This leads to better
products, services, and overall efficiency.

Boosting productivity: Data, such as production data, is used to set new productivity
goals and find better ways of working.

In short, at this stage, data becomes a strategic tool that drives decisions, innovation, and
continuous improvement, allowing the company to grow in a competitive market.
8. Skills of a Data Engineer
The skills required of a data engineer are those already outlined in Section 5 above:
extracting data from varied file formats and databases (using languages such as SQL
and Python), data modeling for the transformation phase, data warehouse design for
loading, managing the Linux servers and software (such as Apache Airflow or NiFi)
that pipelines run on, and provisioning infrastructure on the organization's cloud
platform (Amazon Web Services, Google Cloud Platform, or Azure).
9. Business Responsibilities

Data engineers play a crucial role in managing, processing, and making data
useful for decision-making within an organization. Their primary
responsibilities include:

1. Collaborating with Stakeholders


Data engineers work closely with teams such as data scientists,
designers, product managers, and executives. They understand the data
infrastructure needs of these teams and design systems that meet those
requirements effectively.
2. Providing Actionable Insights
By building tools and systems for analyzing data, data engineers turn
raw information into actionable insights. These insights help
organizations evaluate key business metrics, such as sales performance,
customer satisfaction, and operational efficiency, thereby supporting
informed decision-making.
3. Ensuring Data Accessibility
Data engineers ensure data is collected from diverse sources, securely
stored, and made accessible to the right people at the right time. This
accessibility enables teams to leverage data effectively for strategic and
operational decisions.

In summary, data engineers are responsible for creating and maintaining data
systems that empower organizations to access, analyze, and utilize data
effectively, driving business growth and innovation.

10. Technical Responsibilities

• Data storage: Build and maintain data warehouses, and
design and construct data pipelines
• Data quality: Implement measures to ensure the accuracy,
consistency, and reliability of data
• Data security: Implement data security protocols and
ensure compliance with regulations
• Database management: Be familiar with various
database management systems, and understand relational
and non-relational databases
• Programming: Master one or more programming
languages, such as PHP, SQL, Java, or Python
• Data modeling: Understand data modeling, algorithms,
and data transformation techniques
• Continuous learning: Stay updated on new tools,
technologies, and best practices in data engineering
11. Data Engineers and Other Technical Roles
1.Pipeline Developer

Description: Pipeline developers create and implement data pipelines that allow for
the extraction, transformation, and loading (ETL) of data from various sources into data
warehouses. These pipelines ensure that data flows smoothly and efficiently from source
to destination, ready for analysis.

Skills Required:
• Programming languages: proficiency in Python, Java, or Scala for scripting
ETL processes.
• ETL tools: knowledge of tools such as Apache NiFi, Talend, Informatica, or
custom ETL scripts.
• Database knowledge: understanding of SQL (e.g., MySQL) and NoSQL
(e.g., MongoDB) databases for data extraction and loading.
• Data modeling: knowledge of data warehousing principles and procedures.
2.Database Architect

Description: Database architects create and manage scalable database solutions for
storing and organizing huge datasets. They ensure that database systems are optimized
for performance, reliability, and scalability.
Skills Required:
• Database design: proficiency in designing SQL and NoSQL database schemas.
• Database management systems: expertise in DBMSs such as
MySQL, PostgreSQL, MongoDB, or Cassandra.
3.Data Integration Specialist

Description: Data integration specialists combine data from several sources to generate
a single view. They ensure that data is consistent, accurate, and available for analysis across
the company's operations (a minimal sketch follows the skills list below).
Skills Required:
• Data mapping: the ability to map data fields from multiple sources into a
single schema.
• Integration platforms: experience with platforms such as Apache Camel,
MuleSoft, and Dell Boomi.
• Data cleansing: skills to ensure data quality and consistency.
• APIs: proficiency in using APIs to combine data from multiple platforms.
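
A minimal sketch of this kind of integration work, assuming the widely used requests library and two hypothetical source APIs (the URLs and field names are illustrative only):

    import requests

    # Pull customer records from two hypothetical source systems.
    crm = requests.get("https://crm.example.com/api/customers", timeout=10).json()
    billing = requests.get("https://billing.example.com/api/accounts", timeout=10).json()

    # Map differently named fields from each source into a single schema,
    # keyed by email address, so consumers see one consistent view.
    unified = {}
    for rec in crm:
        unified[rec["email"]] = {"email": rec["email"], "name": rec["full_name"]}
    for rec in billing:
        entry = unified.setdefault(rec["contact_email"], {"email": rec["contact_email"]})
        entry["balance"] = rec["outstanding_balance"]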
4.Big Data Engineer

Description: Big data engineers use large-scale data processing systems to manage vast
amounts of data. They create and maintain big data infrastructure while ensuring that
data processing operations are efficient and scalable (a minimal sketch follows the skills list below).
Skills Required:
• Big data technologies: proficiency with Hadoop, Spark, Kafka, and other
big data frameworks.
• Programming languages: experience with Java, Scala, or Python for big data
processing.
• Data storage: knowledge of distributed storage systems such as HDFS,
Amazon S3, and Google Cloud Storage.
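
As a brief illustration, here is a minimal PySpark sketch, assuming pyspark is installed and a hypothetical events.csv dataset with a user_id column; the same code would run distributed across a cluster in production:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    # Start a local Spark session; on a cluster this connects to the cluster manager.
    spark = SparkSession.builder.appName("big-data-sketch").getOrCreate()

    # Read a (hypothetically large) CSV dataset in parallel.
    events = spark.read.csv("events.csv", header=True, inferSchema=True)

    # A simple distributed aggregation: event counts per user.
    events.groupBy("user_id").agg(F.count("*").alias("event_count")).show()

    spark.stop()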

5.Data Quality Engineer

Description: Data quality engineers use validation, cleansing, and monitoring methods
to verify that data is accurate and reliable. They use data quality frameworks to discover
and fix data issues before they become problems (a minimal sketch follows the skills list below).
Skills Required:
• Data validation: methods for ensuring data accuracy and completeness.
• Data cleansing: techniques for cleaning and standardizing data.
• Quality assurance: understanding of quality assurance methods and tools.
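
A minimal sketch of validation and cleansing in Python, assuming the pandas library and a hypothetical customers.csv dataset with customer_id, age, and country columns:

    import pandas as pd

    df = pd.read_csv("customers.csv")  # hypothetical input dataset

    # Validation: collect rule violations rather than failing on the first one.
    issues = []
    if df["customer_id"].isna().any():
        issues.append("missing customer_id values")
    if df["customer_id"].duplicated().any():
        issues.append("duplicate customer_id values")
    if ((df["age"] < 0) | (df["age"] > 120)).any():
        issues.append("age values outside the plausible range 0-120")

    # Cleansing: standardize text fields before the data is loaded.
    df["country"] = df["country"].str.strip().str.upper()

    if issues:
        raise ValueError("Data quality checks failed: " + "; ".join(issues))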

6.DataOps Engineer

Description: DataOps engineers are responsible for the operational aspects of managing
data pipelines and infrastructure. They aim to increase the efficiency, reliability, and
scalability of data operations, typically applying DevOps concepts in a data
engineering context.
Skills Required:
• DevOps tools: experience with Docker, Kubernetes, Jenkins, and CI/CD
pipelines.
• Monitoring: experience with tools such as Prometheus, Grafana, and the ELK
stack.
