Data Engineering on Google Cloud
Get hands-on experience with designing and building data processing systems on Google Cloud. This course uses lectures, demos, and hands-on labs to show you how to design data processing systems, build end-to-end data pipelines, analyze data, and implement machine learning. This course covers structured, unstructured, and streaming data.
This course comprises the following four courses:
- Introduction to Data Engineering on Google Cloud
- Build Data Lakes and Data Warehouses with Google Cloud
- Build Batch Data Pipelines on Google Cloud
- Build Streaming Data Pipelines on Google Cloud

What you will learn
- Design scalable data processing systems on Google Cloud.
- Differentiate data architectures and implement data lakehouse and pipeline concepts.
- Build and manage robust streaming and batch data pipelines.
- Use AI/ML tools to optimize performance and gain insights into processes and data.
Prerequisites
- Understanding of data engineering principles, including ETL/ELT processes, data modeling, and common data formats (Avro, Parquet, JSON).
- Familiarity with data architecture concepts, specifically Data Warehouses and Data Lakes.
- Proficiency in SQL for data querying.
- Proficiency in a common programming language (Python recommended).
- Familiarity with using Command Line Interfaces (CLI).
- Familiarity with core Google Cloud concepts and services (Compute, Storage, and Identity management).
Target audience
- Data Engineers, Data Analysts, Data Architects
Training Program
19 modules to master the fundamentals
Course 1: Introduction to Data Engineering on Google Cloud
Objectives
- Explain the role of a data engineer.
- Understand the differences between a data source and a data sink.
- Explain the different types of data formats.
- Explain the storage solution options on Google Cloud.
- Learn about the metadata management options on Google Cloud.
- Understand how to share datasets with ease using Analytics Hub.
- Understand how to load data into BigQuery using the Google Cloud console or the gcloud CLI.
Topics covered
- The role of a data engineer
- Data sources versus data sinks
- Data formats
- Storage solution options on Google Cloud
- Metadata management options on Google Cloud
- Sharing datasets using Analytics Hub
Activities
Lab: Loading Data into BigQuery
Quiz
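The lab above loads data through the console or CLI; the same load can also be scripted. Here is a minimal sketch using the google-cloud-bigquery Python client, with placeholder bucket and table names:

```python
# Load a CSV file from Cloud Storage into BigQuery.
# All resource names below are placeholders for illustration.
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,   # skip the header row
    autodetect=True,       # infer the schema from the file
)

# Start the load job and block until it finishes.
load_job = client.load_table_from_uri(
    "gs://example-bucket/data.csv",       # placeholder source file
    "my-project.my_dataset.my_table",     # placeholder destination table
    job_config=job_config,
)
load_job.result()

table = client.get_table("my-project.my_dataset.my_table")
print(f"Loaded {table.num_rows} rows.")
```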
Objectives
- Explain the baseline Google Cloud data replication and migration architecture.
- Understand the options and use cases for the gcloud command-line tool.
- Explain the functionality and use cases for Storage Transfer Service.
- Explain the functionality and use cases for Transfer Appliance.
- Understand the features and deployment of Datastream.
Topics covered
- Replication and migration architecture
- The gcloud command-line tool
- Moving datasets
- Datastream
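For small, one-off dataset moves, the pattern covered here can be sketched with the google-cloud-storage Python client; Storage Transfer Service and Transfer Appliance are the module's answer for large or recurring transfers. Bucket names below are placeholders:

```python
# Copy every object under a prefix from one bucket to another.
from google.cloud import storage

client = storage.Client()
src = client.bucket("example-source-bucket")        # placeholder
dst = client.bucket("example-destination-bucket")   # placeholder

for blob in client.list_blobs(src, prefix="datasets/2024/"):
    src.copy_blob(blob, dst, new_name=blob.name)    # keep the same object name
    print(f"Copied {blob.name}")
```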
Objectives
- Explain the baseline extract and load architecture diagram.
- Understand the options of the bq command-line tool.
- Explain the functionality and use cases for BigQuery Data Transfer Service.
- Explain the functionality and use cases for BigLake as a non-extract-load pattern.
Topics covered
- Extract and load architecture
- The bq command-line tool
- BigQuery Data Transfer Service
- BigLake
Activities
Lab: BigLake: Qwik Start
Quiz
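The "query in place" idea behind BigLake can be sketched with a plain BigQuery external table; a managed BigLake table additionally attaches a Cloud resource connection, which is omitted here. All names are placeholders:

```python
# Define an external table over Parquet files in Cloud Storage,
# then query it with standard SQL without loading the data.
from google.cloud import bigquery

client = bigquery.Client()

external_config = bigquery.ExternalConfig("PARQUET")
external_config.source_uris = ["gs://example-bucket/sales/*.parquet"]

table = bigquery.Table("my-project.my_dataset.sales_external")
table.external_data_configuration = external_config
client.create_table(table, exists_ok=True)

rows = client.query(
    "SELECT COUNT(*) AS n FROM `my-project.my_dataset.sales_external`"
).result()
print(next(iter(rows)).n)
```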
Objectives
- Explain the baseline extract, load, and transform architecture diagram.
- Understand a common ELT pipeline on Google Cloud.
- Learn about BigQuery's SQL scripting and scheduling capabilities.
- Explain the functionality and use cases for Dataform.
Topics covered
- Extract, load, and transform (ELT) architecture
- SQL scripting and scheduling with BigQuery
- Dataform
Activities
Lab: Create and Execute a SQL Workflow in Dataform
Quiz
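A minimal sketch of the ELT step this module schedules: data already loaded into BigQuery is transformed in place with SQL (Dataform expresses the same idea declaratively in SQLX). Dataset and table names are placeholders:

```python
# Run an in-warehouse transform: merge cleaned rows from a raw table
# into an analytics table. Placeholders throughout.
from google.cloud import bigquery

client = bigquery.Client()

transform_sql = """
MERGE `my-project.analytics.orders_clean` AS t
USING (
  SELECT order_id,
         CAST(amount AS NUMERIC) AS amount,
         DATE(created_at) AS order_date
  FROM `my-project.raw.orders`
  WHERE amount IS NOT NULL
) AS s
ON t.order_id = s.order_id
WHEN MATCHED THEN
  UPDATE SET amount = s.amount, order_date = s.order_date
WHEN NOT MATCHED THEN
  INSERT (order_id, amount, order_date)
  VALUES (s.order_id, s.amount, s.order_date)
"""

client.query(transform_sql).result()  # blocks until the transform completes
```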
Objectives
- Explain the baseline extract, transform, and load architecture diagram.
- Learn about the GUI tools on Google Cloud used for ETL data pipelines.
- Explain batch data processing using Dataproc.
- Learn how to use Dataproc Serverless for Spark for ETL.
- Explain streaming data processing options.
- Explain the role Bigtable plays in data pipelines.
Topics covered
- Extract, transform, and load (ETL) architecture
- Google Cloud GUI tools for ETL data pipelines
- Batch data processing using Dataproc
- Streaming data processing options
- Bigtable and data pipelines
Activities
Lab: Use Dataproc Serverless for Spark to Load BigQuery (optional)
Lab: Creating a Streaming Data Pipeline for a Real-Time Dashboard with Dataflow
Quiz
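In the spirit of the optional Dataproc Serverless lab, here is a minimal PySpark ETL sketch: read CSV from Cloud Storage, transform, and write to BigQuery. It assumes the spark-bigquery connector is available on the runtime (as on Dataproc); paths and table names are placeholders:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("etl-sketch").getOrCreate()

# Extract: raw CSV files in Cloud Storage (placeholder path).
df = spark.read.option("header", True).csv("gs://example-bucket/raw/*.csv")

# Transform: cast and drop invalid rows.
cleaned = (
    df.withColumn("amount", F.col("amount").cast("double"))
      .filter(F.col("amount").isNotNull())
)

# Load: append to BigQuery via the spark-bigquery connector.
(cleaned.write.format("bigquery")
    .option("table", "my-project.my_dataset.orders")
    .option("temporaryGcsBucket", "example-staging-bucket")  # connector staging
    .mode("append")
    .save())
```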
Objectives
- Explain the automation patterns and options available for pipelines.
- Learn about Cloud Scheduler and Workflows.
- Learn about Cloud Composer.
- Learn about Cloud Run functions.
- Explain the functionality and automation use cases for Eventarc.
Topics covered
- Automation patterns and options for pipelines
- Cloud Scheduler and Workflows
- Cloud Composer
- Cloud Run functions
- Eventarc
Activities
Lab: Use Cloud Run Functions to Load BigQuery (optional)
Quiz
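The event-driven pattern from the optional lab can be sketched as a Cloud Run function that fires, via Eventarc, when a file lands in Cloud Storage and loads it into BigQuery. Names are placeholders:

```python
import functions_framework
from google.cloud import bigquery

@functions_framework.cloud_event
def load_to_bigquery(cloud_event):
    # The Cloud Storage "object finalized" event carries bucket and object names.
    data = cloud_event.data
    uri = f"gs://{data['bucket']}/{data['name']}"

    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,
    )
    client.load_table_from_uri(
        uri, "my-project.my_dataset.landing", job_config=job_config  # placeholder table
    ).result()
    print(f"Loaded {uri}")
```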
Course 2: Build Data Lakes and Data Warehouses with Google Cloud
Objectives
- Compare and contrast data lake, data warehouse, and data lakehouse architectures.
- Evaluate the benefits of the lakehouse approach.
Topics covered
- The classics: Data lakes and data warehouses
- The modern approach: Data lakehouse
- Choosing the right architecture
Activities
Quiz
Objectives
- Discuss data storage options, including Cloud Storage for files, open table formats like Apache Iceberg, BigQuery for analytic data, and AlloyDB for operational data.
- Understand the role of AlloyDB for operational data use cases.
Topics covered
- Building a data lake foundation
- Introduction to Apache Iceberg open table format
- BigQuery as the central processing engine
- Combining operational data in AlloyDB
- Combining operational and analytical data with federated queries
- Real-world use case
Activities
Quiz
Lab: Federated Query with BigQuery
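The federated-query pattern from the lab above, sketched with the Python client: EXTERNAL_QUERY pushes a statement down to an operational database (AlloyDB or Cloud SQL) and joins the result with BigQuery data. The connection ID and tables are placeholders:

```python
from google.cloud import bigquery

client = bigquery.Client()

sql = """
SELECT o.customer_id, o.status, c.lifetime_value
FROM EXTERNAL_QUERY(
  'my-project.us.operational-conn',           -- placeholder connection ID
  'SELECT customer_id, status FROM orders;'   -- runs in the source database
) AS o
JOIN `my-project.analytics.customers` AS c
  ON o.customer_id = c.customer_id
"""

for row in client.query(sql).result():
    print(row.customer_id, row.status, row.lifetime_value)
```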
Objectives
- Explain why BigQuery is a scalable data warehousing solution on Google Cloud.
- Discuss the core concepts of BigQuery.
- Understand BigLake's role in creating a unified lakehouse architecture and its integration with BigQuery for external data.
- Learn how BigQuery natively interacts with Apache Iceberg tables via BigLake.
Topics covered
- BigQuery fundamentals
- Partitioning and clustering in BigQuery
- Introducing BigLake and external tables
Activities
Quiz
Lab: Querying External Data and Iceberg Tables
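Partitioning and clustering, the core optimization concepts in this module, are set at table-creation time; queries that filter on the partition and clustering columns then scan less data. A minimal sketch with placeholder names:

```python
from google.cloud import bigquery

client = bigquery.Client()

schema = [
    bigquery.SchemaField("event_ts", "TIMESTAMP"),
    bigquery.SchemaField("customer_id", "STRING"),
    bigquery.SchemaField("amount", "NUMERIC"),
]

table = bigquery.Table("my-project.my_dataset.events", schema=schema)
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY,
    field="event_ts",                          # partition by event day
)
table.clustering_fields = ["customer_id"]      # co-locate rows per customer

client.create_table(table, exists_ok=True)
```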
Objectives
- Implement robust data governance and security practices across the unified data platform, including sensitive data protection and metadata management.
- Explore advanced analytics and machine learning directly on lakehouse data.
Topics covered
- Data governance and security in a unified platform
- Demo: Data Loss Prevention
- Analytics and machine learning on the lakehouse
- Real-world lakehouse architectures and migration strategies
Activities
Quiz
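The Data Loss Prevention demo in this module boils down to inspecting content for sensitive values before it enters the platform; a minimal sketch with the google-cloud-dlp client (the project ID is a placeholder):

```python
from google.cloud import dlp_v2

client = dlp_v2.DlpServiceClient()

response = client.inspect_content(
    request={
        "parent": "projects/my-project",   # placeholder project
        "inspect_config": {
            "info_types": [{"name": "EMAIL_ADDRESS"}, {"name": "PHONE_NUMBER"}],
            "include_quote": True,
        },
        "item": {"value": "Contact Ada at ada@example.com or 555-0100."},
    }
)

for finding in response.result.findings:
    print(finding.info_type.name, finding.quote)
```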
Objectives
- Reinforce the core principles of Google Cloud's data platform.
Topics covered
- Review
- Best practices
Activities
Lab: Getting Started with BigQuery ML
Lab: Vector Search with BigQuery
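The BigQuery ML lab above trains models entirely in SQL; a minimal sketch, assuming a placeholder training table with a `label` column:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Train a logistic regression model in place.
client.query("""
CREATE OR REPLACE MODEL `my-project.my_dataset.churn_model`
OPTIONS (model_type = 'logistic_reg', input_label_cols = ['label']) AS
SELECT * FROM `my-project.my_dataset.training_data`
""").result()

# Evaluate it with ML.EVALUATE.
for row in client.query(
    "SELECT * FROM ML.EVALUATE(MODEL `my-project.my_dataset.churn_model`)"
).result():
    print(dict(row))
```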
Course 3: Build Batch Data Pipelines on Google Cloud
Objectives
- Explain the critical role of a data engineer in developing and maintaining batch data pipelines.
- Describe the core components and typical lifecycle of batch data pipelines from ingestion to downstream consumption.
- Analyze common challenges in batch data processing, such as data volume, quality, complexity, and reliability, and identify key Google Cloud services that can address them.
Topics covered
- Batch data pipelines and their use cases
- Processing and common challenges
Activities
Quiz
Objectives
- Design scalable batch data pipelines for high-volume data ingestion and transformation.
- Optimize batch jobs for high throughput and cost-efficiency using various resource management and performance tuning techniques.
Topics covered
- Design batch pipelines
- Large scale data transformations
- Dataflow and Serverless for Apache Spark
- Data connections and orchestration
- Execute an Apache Spark pipeline
- Optimize batch pipeline performance
Activities
Quiz
Lab: Build a Simple Batch Data Pipeline with Serverless for Apache Spark (optional)
Lab: Build a Simple Batch Data Pipeline with Dataflow Job Builder UI (optional)
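A minimal Apache Beam batch sketch in the spirit of this module's pipelines: read lines from Cloud Storage, validate, and write the clean records back. It runs locally with the DirectRunner; pointing it at the DataflowRunner executes the same code at scale. Paths and fields are placeholders:

```python
import apache_beam as beam

def parse_and_validate(line):
    # Keep only records with a positive amount (placeholder rule).
    order_id, amount = line.split(",")
    if float(amount) > 0:
        yield f"{order_id},{amount}"

with beam.Pipeline() as pipeline:
    (
        pipeline
        | "Read" >> beam.io.ReadFromText(
            "gs://example-bucket/raw/orders.csv", skip_header_lines=1)
        | "Validate" >> beam.FlatMap(parse_and_validate)
        | "Write" >> beam.io.WriteToText("gs://example-bucket/clean/orders")
    )
```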
Objectives
- Develop data validation rules and cleansing logic to ensure data quality within batch pipelines.
- Implement strategies for managing schema evolution and performing data deduplication in large datasets.
Topics covered
- Batch data validation and cleansing
- Log and analyze errors
- Schema evolution for batch pipelines
- Data integrity and duplication
- Deduplication with Serverless for Apache Spark
- Deduplication with Dataflow
Activities
Lab: Validate Data Quality in a Batch Pipeline with Serverless for Apache Spark (optional)
Quiz
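Deduplication as covered here often means keeping the most recent record per business key when batch loads overlap; a minimal PySpark sketch with placeholder columns and paths:

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dedup-sketch").getOrCreate()

df = spark.read.parquet("gs://example-bucket/staged/orders/")

# Rank duplicates by recency within each order_id, keep the newest row.
w = Window.partitionBy("order_id").orderBy(F.col("updated_at").desc())
deduped = (
    df.withColumn("rn", F.row_number().over(w))
      .filter(F.col("rn") == 1)
      .drop("rn")
)

deduped.write.mode("overwrite").parquet("gs://example-bucket/clean/orders/")
```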
Objectives
- Orchestrate complex batch data pipeline workflows for efficient scheduling and lineage tracking.
- Implement robust error handling, monitoring, and observability for batch data pipelines.
Topics covered
- Orchestration for batch processing
- Cloud Composer
- Unified observability
- Alerts and troubleshooting
- Visual pipeline management
Activities
Lab: Building Batch Pipelines in Cloud Data Fusion
Quiz
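Cloud Composer orchestration reduces to an Airflow DAG; a minimal sketch of a daily BigQuery transform task, with placeholder SQL and identifiers (the operator import assumes the Google provider package is installed, as it is on Composer):

```python
import pendulum
from airflow import DAG
from airflow.providers.google.cloud.operators.bigquery import (
    BigQueryInsertJobOperator,
)

with DAG(
    dag_id="daily_batch_transform",
    schedule="@daily",
    start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
    catchup=False,
) as dag:
    transform = BigQueryInsertJobOperator(
        task_id="run_transform",
        configuration={
            "query": {
                # Placeholder stored procedure holding the transform logic.
                "query": "CALL `my-project.my_dataset.nightly_transform`();",
                "useLegacySql": False,
            }
        },
    )
```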
Course 4: Build Streaming Data Pipelines on Google Cloud
Objectives
- Introduce the course learning objectives and the scenario used to bring hands-on learning to building streaming data pipelines.
- Describe the concept of streaming data pipelines, the challenges associated with them, and the role of these pipelines within the data engineering process.
Topics covered
- Course learning objectives
- Course prerequisites
- The use case
- About the company
- The challenge
- The mission
Objectives
- Understand common streaming use cases and their applications: streaming ETL, streaming AI/ML, streaming applications, and reverse ETL.
- Identify and describe sample architectures for each of these streaming patterns.
Topics covered
- Introduction to streaming data pipelines on Google Cloud
- Streaming ETL
- Streaming AI/ML
- Streaming applications
- Reverse ETL
Activities
Quiz
Objectives
- Pub/Sub and Managed Service for Apache Kafka: Define messaging concepts and know when to choose Pub/Sub versus Managed Service for Apache Kafka.
- Dataflow: Describe the service and the challenges of streaming data, then build and deploy a streaming pipeline.
- BigQuery: Explore data ingestion methods; use continuous queries, ETL, and reverse ETL; configure Pub/Sub-to-BigQuery streaming; and architect BigQuery streaming pipelines.
- Bigtable: Describe the big picture of data movement and interaction, establish a streaming pipeline from Dataflow to Bigtable, analyze the continuous data stream for trends using BigQuery, and synchronize the trend analysis back into the user-facing application.
Topics covered
- Understanding the products
- Architectural considerations for Pub/Sub and Managed Service for Apache Kafka
- Dataflow: The processing powerhouse
- BigQuery: The analytical engine
- Bigtable: The solution for operational data
Activities
Lab: Stream data with pipelines - Esports use case (optional)
Quiz
Lab: Use Apache Beam and Bigtable to enrich esports downloadable content (DLC) data
Quiz
Lab: Stream esports data with Pub/Sub and BigQuery
Quiz
Lab: Monitor esports chat with Streamlit
Quiz
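Tying the module's products together, a minimal streaming sketch: Apache Beam reads game events from Pub/Sub and streams them into BigQuery. The topic, table, and schema are placeholders:

```python
import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "ReadEvents" >> beam.io.ReadFromPubSub(
            topic="projects/my-project/topics/game-events")  # placeholder topic
        | "Parse" >> beam.Map(json.loads)
        | "WriteToBQ" >> beam.io.WriteToBigQuery(
            "my-project:esports.events",                     # placeholder table
            schema="player_id:STRING,score:INTEGER,event_ts:TIMESTAMP",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        )
    )
```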
Topics covered
- What you've accomplished
- Next steps
Quality Process
SFEIR Institute's commitment: an excellence-driven approach to ensure the quality and success of all our training programs.
- Lectures / Theoretical Slides — Presentation of concepts using visual aids (PowerPoint, PDF).
- Technical Demonstration (Demos) — The instructor performs a task or procedure while students observe.
- Guided Labs — Guided practical exercises on software, hardware, or technical environments.
- Quiz / MCQ — Quick knowledge check (paper-based or digital via tools like Kahoot/Klaxoon).
The achievement of training objectives is evaluated at multiple levels to ensure quality:
- Continuous Knowledge Assessment: Verification of knowledge throughout the training via participatory methods (quizzes, practical exercises, case studies) under instructor supervision.
- Progress Measurement: A comparative self-assessment system including an initial diagnostic to determine the starting level, followed by a final evaluation to validate skills development.
- Quality Evaluation: An end-of-session satisfaction questionnaire to measure the relevance and effectiveness of the training as perceived by participants.
Upcoming sessions
Train multiple employees
- Volume discounts (multiple seats)
- Private or custom session
- On-site or remote