Data Engineering on Google Cloud
Get hands-on experience with designing and building data processing systems on Google Cloud. This course uses lectures, demos, and hands-on labs to show you how to design data processing systems, build end-to-end data pipelines, analyze data, and implement machine learning. This course covers structured, unstructured, and streaming data.
This course consists of the following four courses:
- Introduction to Data Engineering on Google Cloud
- Build Data Lakes and Data Warehouses with Google Cloud
- Build Batch Data Pipelines on Google Cloud
- Build Streaming Data Pipelines on Google Cloud

What you will learn
- Design scalable data processing systems in Google Cloud.
- Differentiate data architectures and implement data lakehouse and pipeline concepts.
- Build and manage robust streaming and batch data pipelines.
- Utilize AI/ML tools to optimize performance and gain process and data insights.
Prerequisites
- Understanding of data engineering principles, including ETL/ELT processes, data modeling, and common data formats (Avro, Parquet, JSON).
- Familiarity with data architecture concepts, specifically Data Warehouses and Data Lakes.
- Proficiency in SQL for data querying.
- Proficiency in a common programming language (Python recommended).
- Familiarity with using Command Line Interfaces (CLI).
- Familiarity with core Google Cloud concepts and services (Compute, Storage, and Identity management).
Target audience
- Data Engineers, Data Analysts, Data Architects
Training Program
19 modules to master the fundamentals
Course 1: Introduction to Data Engineering on Google Cloud
Objectives
- Explain the role of a data engineer.
- Understand the differences between a data source and a data sink.
- Explain the different types of data formats.
- Explain the storage solution options on Google Cloud.
- Learn about the metadata management options on Google Cloud.
- Understand how to share datasets with ease using Analytics Hub.
- Understand how to load data into BigQuery using the Google Cloud console or the gcloud CLI.
Topics covered
- The role of a data engineer
- Data sources versus data sinks
- Data formats
- Storage solution options on Google Cloud
- Metadata management options on Google Cloud
- Sharing datasets using Analytics Hub
Activities
Lab: Loading Data into BigQuery
Quiz
Objectives
- Explain the baseline Google Cloud data replication and migration architecture.
- Understand the options and use cases for the gcloud command-line tool.
- Explain the functionality and use cases for Storage Transfer Service.
- Explain the functionality and use cases for Transfer Appliance.
- Understand the features and deployment of Datastream.
Topics covered
- Replication and migration architecture
- The gcloud command-line tool
- Moving datasets
- Datastream
Objectives
- Explain the baseline extract and load architecture diagram.
- Understand the options of the bq command-line tool.
- Explain the functionality and use cases for BigQuery Data Transfer Service.
- Explain the functionality and use cases for BigLake as a non-extract-load pattern.
Topics covered
- Extract and load architecture
- The bq command-line tool
- BigQuery Data Transfer Service
- BigLake
Activities
Lab: BigLake: Qwik Start
Quiz
Objectives
- Explain the baseline extract, load, and transform architecture diagram.
- Understand a common ELT pipeline on Google Cloud.
- Learn about BigQuery's SQL scripting and scheduling capabilities.
- Explain the functionality and use cases for Dataform.
Topics covered
- Extract, load, and transform (ELT) architecture
- SQL scripting and scheduling with BigQuery
- Dataform
Activities
Lab: Create and Execute a SQL Workflow in Dataform
Quiz
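The defining trait of ELT covered in this module is that raw data is loaded first and transformed afterwards with SQL inside the warehouse. A minimal sketch of that pattern, using SQLite as a stand-in for BigQuery (table names and sample values are illustrative; in the course, Dataform orchestrates these SQL transforms at scale):

```python
import sqlite3

# ELT sketch: Extract + Load raw rows as-is, then Transform with SQL
# inside the warehouse. SQLite stands in for BigQuery here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_orders (id INTEGER, amount TEXT)")

# Load: raw data lands untouched, even with messy values.
conn.executemany("INSERT INTO raw_orders VALUES (?, ?)",
                 [(1, "10.50"), (2, "3.25"), (3, "invalid")])

# Transform: SQL in the warehouse cleans and casts the loaded data.
conn.execute("""
    CREATE TABLE orders AS
    SELECT id, CAST(amount AS REAL) AS amount
    FROM raw_orders
    WHERE amount GLOB '[0-9]*.[0-9]*'
""")
total = conn.execute("SELECT ROUND(SUM(amount), 2) FROM orders").fetchone()[0]
print(total)  # 13.75
```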
Objectives
- Explain the baseline extract, transform, and load architecture diagram.
- Learn about the GUI tools on Google Cloud used for ETL data pipelines.
- Explain batch data processing using Dataproc.
- Learn how to use Dataproc Serverless for Spark for ETL.
- Explain streaming data processing options.
- Explain the role Bigtable plays in data pipelines.
Topics covered
- Extract, transform, and load (ETL) architecture
- Google Cloud GUI tools for ETL data pipelines
- Batch data processing using Dataproc
- Streaming data processing options
- Bigtable and data pipelines
Activities
Lab: Use Dataproc Serverless for Spark to Load BigQuery (optional)
Lab: Creating a Streaming Data Pipeline for a Real-Time Dashboard with Dataflow
Quiz
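The streaming lab above feeds a real-time dashboard, which in practice means aggregating an unbounded event stream into fixed time windows. A toy tumbling-window count illustrates the core idea behind such Dataflow transforms (the timestamps and the 60-second window size are made-up assumptions):

```python
from collections import Counter

# Toy tumbling-window aggregation, the basic building block of a
# streaming transform in Dataflow. Events are (timestamp_seconds, payload).
WINDOW = 60  # window size in seconds (illustrative)

def window_counts(events):
    """Count events per tumbling window, keyed by window start time."""
    counts = Counter()
    for ts, _payload in events:
        counts[ts - ts % WINDOW] += 1
    return dict(counts)

events = [(5, "click"), (42, "click"), (61, "view"), (130, "click")]
print(window_counts(events))  # {0: 2, 60: 1, 120: 1}
```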
Objectives
- Explain the automation patterns and options available for pipelines.
- Learn about Cloud Scheduler and Workflows.
- Learn about Cloud Composer.
- Learn about Cloud Run functions.
- Explain the functionality and automation use cases for Eventarc.
Topics covered
- Automation patterns and options for pipelines
- Cloud Scheduler and Workflows
- Cloud Composer
- Cloud Run functions
- Eventarc
Activities
Lab: Use Cloud Run Functions to Load BigQuery (optional)
Quiz
Course 2: Build Data Lakes and Data Warehouses with Google Cloud
Objectives
- Compare and contrast data lake, data warehouse, and data lakehouse architectures.
- Evaluate the benefits of the lakehouse approach.
Topics covered
- The classics: Data lakes and data warehouses
- The modern approach: Data lakehouse
- Choosing the right architecture
Activities
Quiz
Objectives
- Discuss data storage options, including Cloud Storage for files, open table formats like Apache Iceberg, BigQuery for analytic data, and AlloyDB for operational data.
- Understand the role of AlloyDB for operational data use cases.
Topics covered
- Building a data lake foundation
- Introduction to Apache Iceberg open table format
- BigQuery as the central processing engine
- Combining operational data in AlloyDB
- Combining operational and analytical data with federated queries
- Real-world use case
Activities
Quiz
Lab: Federated Query with BigQuery
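The essence of the federated-query lab is one engine joining data that lives in two separate stores without copying it first. A concept sketch using SQLite's `ATTACH` as a stand-in for BigQuery querying an operational database such as AlloyDB (all table and column names are illustrative):

```python
import sqlite3

# Federated-query concept: one SQL statement spans an "analytical" store
# and a separately attached "operational" store. SQLite ATTACH stands in
# for BigQuery federation against AlloyDB / Cloud SQL.
analytics = sqlite3.connect(":memory:")
analytics.execute("CREATE TABLE page_views (user_id INTEGER, views INTEGER)")
analytics.execute("INSERT INTO page_views VALUES (1, 40), (2, 7)")

# The "operational" database lives elsewhere; attach it instead of copying.
analytics.execute("ATTACH DATABASE ':memory:' AS ops")
analytics.execute("CREATE TABLE ops.users (id INTEGER, name TEXT)")
analytics.execute("INSERT INTO ops.users VALUES (1, 'ada'), (2, 'bob')")

# One query joins across both stores.
rows = analytics.execute("""
    SELECT u.name, p.views
    FROM ops.users u JOIN page_views p ON p.user_id = u.id
    ORDER BY p.views DESC
""").fetchall()
print(rows)  # [('ada', 40), ('bob', 7)]
```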
Objectives
- Explain why BigQuery is a scalable data warehousing solution on Google Cloud.
- Discuss the core concepts of BigQuery.
- Understand BigLake's role in creating a unified lakehouse architecture and its integration with BigQuery for external data.
- Learn how BigQuery natively interacts with Apache Iceberg tables via BigLake.
Topics covered
- BigQuery fundamentals
- Partitioning and clustering in BigQuery
- Introducing BigLake and external tables
Activities
Quiz
Lab: Querying External Data and Iceberg Tables
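Partitioning matters in this module because a date-filtered query only reads the partitions that match, which is what cuts scan cost in BigQuery. A toy illustration of that pruning effect (dates and row values are made up):

```python
from collections import defaultdict

# Toy partition pruning: rows are bucketed by date, and a filtered query
# reads only the matching bucket instead of the whole table. This is the
# intuition behind BigQuery partitioned tables.
rows = [("2024-01-01", 10), ("2024-01-01", 12), ("2024-01-02", 7)]

partitions = defaultdict(list)
for day, value in rows:
    partitions[day].append(value)  # one "partition" per day

def scan(day):
    """A date-filtered query touches one partition, not all rows."""
    scanned = partitions[day]  # rows actually read
    return sum(scanned), len(scanned)

total, rows_read = scan("2024-01-01")
print(total, rows_read)  # 22 2 -- the third row is never scanned
```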
Objectives
- Implement robust data governance and security practices across the unified data platform, including sensitive data protection and metadata management.
- Explore advanced analytics and machine learning directly on lakehouse data.
Topics covered
- Data governance and security in a unified platform
- Demo: Data Loss Prevention
- Analytics and machine learning on the lakehouse
- Real-world lakehouse architectures and migration strategies
Activities
Quiz
Objectives
- Reinforce the core principles of Google Cloud's data platform.
Topics covered
- Review
- Best practices
Activities
Lab: Getting Started with BigQuery ML
Lab: Vector Search with BigQuery
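Behind the vector-search lab is a simple idea: rank stored embeddings by similarity to a query vector and return the nearest one. A minimal cosine-similarity sketch of that nearest-neighbor step (the 3-dimensional vectors and document names are toy assumptions; BigQuery exposes this over embedding columns):

```python
import math

# Nearest-neighbor sketch behind vector search: score each stored
# embedding against the query by cosine similarity, keep the best match.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

docs = {
    "gcp": [0.9, 0.1, 0.0],
    "cooking": [0.0, 0.2, 0.9],
    "cloud": [0.8, 0.3, 0.1],
}
query = [1.0, 0.0, 0.0]

best = max(docs, key=lambda name: cosine(query, docs[name]))
print(best)  # gcp
```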
Course 3: Build Batch Data Pipelines on Google Cloud
Objectives
- Explain the critical role of a data engineer in developing and maintaining batch data pipelines.
- Describe the core components and typical lifecycle of batch data pipelines from ingestion to downstream consumption.
- Analyze common challenges in batch data processing, such as data volume, quality, complexity, and reliability, and identify key Google Cloud services that can address them.
Topics covered
- Batch data pipelines and their use cases
- Processing and common challenges
Activities
Quiz
Objectives
- Design scalable batch data pipelines for high-volume data ingestion and transformation.
- Optimize batch jobs for high throughput and cost-efficiency using various resource management and performance tuning techniques.
Topics covered
- Design batch pipelines
- Large-scale data transformations
- Dataflow and Serverless for Apache Spark
- Data connections and orchestration
- Execute an Apache Spark pipeline
- Optimize batch pipeline performance
Activities
Quiz
Lab: Build a Simple Batch Data Pipeline with Serverless for Apache Spark (optional)
Lab: Build a Simple Batch Data Pipeline with Dataflow Job Builder UI (optional)
Objectives
- Develop data validation rules and cleansing logic to ensure data quality within batch pipelines.
- Implement strategies for managing schema evolution and performing data deduplication in large datasets.
Topics covered
- Batch data validation and cleansing
- Log and analyze errors
- Schema evolution for batch pipelines
- Data integrity and duplication
- Deduplication with Serverless for Apache Spark
- Deduplication with Dataflow
Activities
Lab: Validate Data Quality in a Batch Pipeline with Serverless for Apache Spark (optional)
Quiz
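This module pairs two quality steps: rule-based validation, then deduplication that keeps the latest record per key. A small sketch of both steps together, the kind of logic a Dataflow or Spark job applies at scale (field names, the non-negative-amount rule, and the sample records are illustrative):

```python
# Validation + dedup sketch: drop records that fail a rule, then keep
# only the most recent version of each id, ordered by timestamp.
records = [
    {"id": 1, "amount": 10.0, "ts": 100},
    {"id": 1, "amount": 12.0, "ts": 200},  # later duplicate of id 1
    {"id": 2, "amount": -5.0, "ts": 150},  # fails validation
    {"id": 3, "amount": 7.5, "ts": 120},
]

def is_valid(rec):
    """Illustrative validation rule: amounts must be non-negative."""
    return rec["amount"] >= 0

latest = {}
for rec in filter(is_valid, records):
    # Deduplicate: retain the record with the newest timestamp per id.
    if rec["id"] not in latest or rec["ts"] > latest[rec["id"]]["ts"]:
        latest[rec["id"]] = rec

print(sorted(r["id"] for r in latest.values()))  # [1, 3]
print(latest[1]["amount"])  # 12.0
```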
Objectives
- Orchestrate complex batch data pipeline workflows for efficient scheduling and lineage tracking.
- Implement robust error handling, monitoring, and observability for batch data pipelines.
Topics covered
- Orchestration for batch processing
- Cloud Composer
- Unified observability
- Alerts and troubleshooting
- Visual pipeline management
Activities
Lab: Building Batch Pipelines in Cloud Data Fusion
Quiz
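Orchestration here means expressing a pipeline as a DAG of tasks and running them in dependency order, which is the scheduling model Cloud Composer (managed Apache Airflow) provides. A toy sketch of that ordering using the standard library (task names are illustrative, not Airflow API calls):

```python
from graphlib import TopologicalSorter

# DAG orchestration sketch: each task lists the tasks it depends on,
# and a topological sort yields a valid execution order -- the model
# Cloud Composer / Airflow applies to real pipeline operators.
dag = {
    "extract": set(),
    "load": {"extract"},
    "transform": {"load"},
    "report": {"transform"},
    "notify": {"transform"},
}

order = list(TopologicalSorter(dag).static_order())
print(order)  # 'extract' first; 'report' and 'notify' after 'transform'
```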
Course 4: Build Streaming Data Pipelines on Google Cloud
Objectives
- Introduce the course learning objectives and the scenario used to bring hands-on learning to building streaming data pipelines.
- Describe the concept of streaming data pipelines, the challenges associated with them, and the role of these pipelines within the data engineering process.
Topics covered
- Course learning objectives
- Course prerequisites
- The use case
- About the company
- The challenge
- The mission
Objectives
- Understand various streaming use cases and their applications, including Streaming ETL, Streaming AI/ML, Streaming Application, and Reverse ETL.
- Identify and describe common sample architectures for streaming data, including Streaming ETL, Streaming AI/ML, Streaming Application, and Reverse ETL.
Topics covered
- Introduction to streaming data pipelines on Google Cloud
- Streaming ETL
- Streaming AI/ML
- Streaming applications
- Reverse ETL
Activities
Quiz
Objectives
- Pub/Sub and Managed Service for Apache Kafka: Define messaging concepts, know when to use Pub/Sub or Managed Service for Apache Kafka.
- Dataflow: Describe the service and challenges with streaming data, build and deploy a streaming pipeline.
- BigQuery: Explore various data ingestion methods; use BigQuery continuous queries, BigQuery ETL, and reverse ETL; configure Pub/Sub to BigQuery streaming; and architect BigQuery streaming pipelines.
- Bigtable: Describe the big picture of data movement and interaction, establish a streaming pipeline from Dataflow to Bigtable, analyze the continuous data stream in Bigtable for trends using BigQuery, and synchronize the trend analysis back into the user-facing application.
Topics covered
- Understanding the products
- Architectural considerations for Pub/Sub and Managed Service for Apache Kafka
- Dataflow: The processing powerhouse
- BigQuery: The analytical engine
- Bigtable: The solution for operational data
Activities
Lab: Stream data with pipelines - Esports use case (optional)
Quiz
Lab: Use Apache Beam and Bigtable to enrich esports downloadable content (DLC) data
Quiz
Lab: Stream e-sports data with Pub/Sub and BigQuery
Quiz
Lab: Monitor e-sports chat with Streamlit
Quiz
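The labs above all start from the same delivery model: a topic fans each published message out to every subscription, and each subscription consumes from its own queue. A toy sketch of that publish/subscribe pattern, which Pub/Sub (and Kafka topics with consumer groups) implement as managed services (class and subscription names are illustrative, not the Pub/Sub client API):

```python
from collections import deque

# Toy publish/subscribe model: publishing to a topic appends the message
# to every subscription's queue, so consumers progress independently.
class Topic:
    def __init__(self):
        self.subscriptions = {}

    def subscribe(self, name):
        self.subscriptions[name] = deque()

    def publish(self, message):
        for queue in self.subscriptions.values():
            queue.append(message)  # every subscription gets a copy

topic = Topic()
topic.subscribe("dashboard")
topic.subscribe("archive")
topic.publish("goal scored")

print(topic.subscriptions["dashboard"].popleft())  # goal scored
print(len(topic.subscriptions["archive"]))  # 1 -- its queue is untouched
```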
Topics covered
- What you've accomplished
- Next steps
Quality Process
SFEIR Institute's commitment: an approach to excellence that ensures the quality and success of all our training programs. Learn more about our quality approach
- Lectures / Theoretical Slides – Presentation of concepts using visual aids (PowerPoint, PDF).
- Technical Demonstration (Demos) – The instructor performs a task or procedure while students observe.
- Guided Labs – Guided practical exercises on software, hardware, or technical environments.
- Quiz / MCQ – Quick knowledge check (paper-based or digital via tools like Kahoot/Klaxoon).
The achievement of training objectives is evaluated at multiple levels to ensure quality:
- Continuous Knowledge Assessment: Verification of knowledge throughout the training via participatory methods (quizzes, practical exercises, case studies) under instructor supervision.
- Progress Measurement: Comparative self-assessment system including an initial diagnostic to determine the starting level, followed by a final evaluation to validate skills development.
- Quality Evaluation: End-of-session satisfaction questionnaire to measure the relevance and effectiveness of the training as perceived by participants.
Train multiple employees
- Volume discounts (multiple seats)
- Private or custom session
- On-site or remote