GCP Data Engineer Course Content
Topics:
• Introduction to data lakes
• Data storage and ETL options on Google Cloud
• Building a data lake using Cloud Storage
• Securing Cloud Storage
• Storing all sorts of data types
• Cloud SQL as a relational data lake
Objectives:
• Understand why Cloud Storage is a great option for building a data lake on Google Cloud
• Learn how to use Cloud SQL for a relational data lake
Activities:
• Lab: Loading Taxi Data into Cloud SQL
Module 03 - Building a Data Warehouse
Topics:
• The modern data warehouse
• Introduction to BigQuery
• Getting started with BigQuery
• Loading data
• Exploring schemas
• Schema design
• Nested and repeated fields
• Optimizing with partitioning and clustering
Objectives:
• Discuss requirements of a modern warehouse
• Understand why BigQuery is the scalable data warehousing solution on Google Cloud
• Understand core concepts of BigQuery and review options of loading data into BigQuery
Activities:
• Lab: Loading Data into BigQuery
• Lab: Working with JSON and Array Data in BigQuery
Topics:
• EL, ELT, ETL
• Quality considerations
• How to carry out operations in BigQuery
• Shortcomings
• ETL to solve data quality issues
Objectives:
• Review different methods of loading data into your data lakes and warehouses: EL, ELT,
and ETL
• Discuss data quality considerations and when to use ETL instead of EL and ELT
Topics:
• The Hadoop ecosystem
• Run Hadoop on Dataproc
• Cloud Storage instead of HDFS
• Optimize Dataproc
Objectives:
• Review the parts of the Hadoop ecosystem
• Learn how to lift and shift your existing Hadoop workloads to the cloud using Dataproc
• Understand considerations around using Cloud Storage instead of HDFS for storage
• Learn how to optimize Dataproc jobs
Activities:
• Lab: Running Apache Spark jobs on Dataproc
Topics:
• Introduction to Dataflow
• Why customers value Dataflow
• Dataflow pipelines
• Aggregating with GroupByKey and Combine
• Side inputs and windows
• Dataflow templates
• Dataflow SQL
Objectives:
• Understand how to decide between Dataflow and Dataproc for processing data pipelines
• Understand the features that customers value in Dataflow
• Discuss core concepts in Dataflow
• Review the use of Dataflow templates and SQL
Activities:
• Lab: A Simple Dataflow Pipeline (Python/Java)
• Lab: MapReduce in Dataflow (Python/Java)
• Lab: Side inputs (Python/Java)
Module 07 - Manage Data Pipelines with Cloud Data Fusion and Cloud Composer
Topics:
• Building batch data pipelines visually with Cloud Data Fusion
• Components
• UI overview
• Building a pipeline
• Exploring data using Wrangler
• Orchestrating work between Google Cloud services with Cloud Composer
• Apache Airflow environment
• DAGs and operators
• Workflow scheduling
• Monitoring and logging
Objectives:
• Discuss how to manage your data pipelines with Data Fusion and Cloud Composer
• Understand Data Fusion’s visual design capabilities
• Learn how Cloud Composer can help to orchestrate the work across multiple Google
Cloud services
Activities:
• Lab: Building and Executing a Pipeline Graph in Data Fusion
• Optional Lab: An introduction to Cloud Composer
Topics:
• Introduction to Pub/Sub
• Pub/Sub push versus pull
• Publishing with Pub/Sub code
Objectives:
• Describe the Pub/Sub service
• Understand how Pub/Sub works
• Gain hands-on Pub/Sub experience with a lab that simulates real-time streaming sensor
data
Activities:
• Lab: Publish Streaming Data into Pub/Sub
Topics:
• Streaming data challenges
• Dataflow windowing
Objectives:
• Understand the Dataflow service
• Build a stream processing pipeline for live traffic data
• Demonstrate how to handle late data using watermarks, triggers, and accumulation
Activities:
• Lab: Streaming Data Pipelines
Topics:
• Streaming into BigQuery and visualizing results
• High-throughput streaming with Cloud Bigtable
• Optimizing Cloud Bigtable performance
Objectives:
• Learn how to perform ad hoc analysis on streaming data using BigQuery and dashboards
• Understand how Cloud Bigtable is a low-latency solution
• Describe how to architect for Bigtable and how to ingest data into Bigtable
• Highlight performance considerations for the relevant services
Activities:
• Lab: Streaming Analytics and Dashboards
• Lab: Streaming Data Pipelines into Bigtable
Topics:
• Analytic window functions
• Using WITH clauses
• GIS functions
• Performance considerations
Objectives:
• Review some of BigQuery’s advanced analysis capabilities
• Discuss ways to improve query performance
Activities:
• Lab: Optimizing your BigQuery Queries for Performance
• Optional Lab: Partitioned Tables in BigQuery
Topics:
• What is AI?
• From ad hoc data analysis to data-driven decisions
• Options for ML models on Google Cloud
Objectives:
• Understand the proposition that ML adds value to your data
• Understand the relationship between ML, AI, and Deep Learning
• Identify ML options on Google Cloud
Topics:
• Unstructured data is hard
• ML APIs for enriching data
Objectives:
• Discuss challenges when working with unstructured data
• Learn the applications of ready-to-use ML APIs on unstructured data
Activities:
• Lab: Using the Natural Language API to Classify Unstructured Text
Topics:
• What’s a notebook?
• BigQuery magic and ties to Pandas
Objectives:
• Introduce Notebooks as a tool for prototyping ML solutions
• Learn to execute BigQuery commands from Notebooks
Activities:
• Lab: BigQuery in Jupyter Labs on AI Platform
Topics:
• Ways to do ML on Google Cloud
• Vertex AI Pipelines
• AI Hub
Objectives:
• Describe options available for building custom ML models
• Understand the use of tools like Vertex AI Pipelines
Activities:
• Lab: Running Pipelines on Vertex AI
Topics:
• BigQuery ML for quick model building
• Supported models
Objectives:
• Learn how to create ML models by using SQL syntax in BigQuery
• Demonstrate building different kinds of ML models using BigQuery ML
Activities:
• Lab option 1: Predict Bike Trip Duration with a Regression Model in BigQuery ML
• Lab option 2: Movie Recommendations in BigQuery ML
Topics:
• Why AutoML?
• AutoML Vision
• AutoML NLP
• AutoML Tables
Objectives:
• Explore various AutoML products used in machine learning
• Learn to use AutoML to create powerful models without coding