Data Engineering on Google Cloud
Get hands-on experience designing and building data processing systems on Google Cloud. Through lectures, demos, and hands-on labs, you will learn how to design data processing systems, build end-to-end data pipelines, and analyze data, covering structured, unstructured, and streaming sources.

What you will learn
- Design and build data processing systems on Google Cloud.
- Process batch and streaming data by implementing autoscaling data pipelines on Dataflow.
- Derive business insights from extremely large datasets using BigQuery.
- Leverage unstructured data using Spark and ML APIs on Dataproc.
- Enable instant insights from streaming data.
Prerequisites
- Prior Google Cloud experience using Cloud Shell and accessing products from the Google Cloud console.
- Basic proficiency with a common query language such as SQL.
- Experience with data modeling and ETL (extract, transform, load) activities.
- Experience developing applications using a common programming language such as Python.
Target audience
- Data engineers, database administrators, and system administrators
Training Program
18 modules to master the fundamentals
Objectives
- Explain the role of a data engineer.
- Understand the differences between a data source and a data sink.
- Explain the different types of data formats.
- Explain the storage solution options on Google Cloud.
- Learn about the metadata management options on Google Cloud.
- Understand how to share datasets with ease using Analytics Hub.
- Understand how to load data into BigQuery using the Google Cloud console and/or the gcloud CLI.
Topics covered
- The role of a data engineer
- Data sources versus data sinks
- Data formats
- Storage solution options on Google Cloud
- Metadata management options on Google Cloud
- Share datasets using Analytics Hub
Activities
Lab: Loading Data into BigQuery
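The lab loads data through the console and the CLI; as a preview, here is a minimal sketch of the same load expressed with the google-cloud-bigquery Python client. The bucket, dataset, and table names are hypothetical.

```python
# A minimal sketch: load a CSV file from Cloud Storage into BigQuery.
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,   # skip the header row
    autodetect=True,       # let BigQuery infer the schema
)

load_job = client.load_table_from_uri(
    "gs://my-bucket/sales.csv",        # hypothetical source file
    "my_project.my_dataset.sales",     # hypothetical destination table
    job_config=job_config,
)
load_job.result()  # wait for the load job to finish

table = client.get_table("my_project.my_dataset.sales")
print(f"Loaded {table.num_rows} rows.")
```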
Objectives
- Explain the baseline Google Cloud data replication and migration architecture.
- Understand the options and use cases for the gcloud command line tool.
- Explain the functionality and use cases for the Storage Transfer Service.
- Explain the functionality and use cases for the Transfer Appliance.
- Understand the features and deployment of Datastream.
Topics covered
- Replication and migration architecture
- The gcloud command line tool
- Moving datasets
- Datastream
Activities
Lab: Datastream: PostgreSQL Replication to BigQuery
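For context on the "moving datasets" topic: large or recurring transfers are the domain of the Storage Transfer Service or the Transfer Appliance, but a small one-off copy can be done directly with the google-cloud-storage client. A minimal sketch, with all bucket names hypothetical:

```python
# Copy every object under a prefix from one bucket to another.
from google.cloud import storage

client = storage.Client()
src_bucket = client.bucket("source-bucket")        # hypothetical
dst_bucket = client.bucket("destination-bucket")   # hypothetical

for blob in client.list_blobs("source-bucket", prefix="exports/"):
    # copy_blob performs a server-side copy; no bytes pass through the client
    src_bucket.copy_blob(blob, dst_bucket, new_name=blob.name)
    print(f"Copied {blob.name}")
```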
Objectives
- Explain the baseline extract and load architecture diagram.
- Understand the options of the bq command line tool.
- Explain the functionality and use cases for the BigQuery Data Transfer Service.
- Explain the functionality and use cases for BigLake as a non-extract-load pattern.
Topics covered
- Extract and load architecture
- The bq command line tool
- BigQuery Data Transfer Service
- BigLake
Activities
Lab: BigLake: Qwik Start
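As a rough illustration of the non-extract-load pattern, here is a sketch of an external table defined over files in Cloud Storage using the BigQuery Python client. A full BigLake table additionally references a Cloud resource connection, which is omitted here; all names are hypothetical.

```python
# Define a table whose data stays in Cloud Storage and is read at query time.
from google.cloud import bigquery

client = bigquery.Client()

external_config = bigquery.ExternalConfig("PARQUET")
external_config.source_uris = ["gs://my-bucket/events/*.parquet"]  # hypothetical

table = bigquery.Table("my_project.my_dataset.events_external")    # hypothetical
table.external_data_configuration = external_config
client.create_table(table)

# Query it like any other table; no data was loaded into BigQuery storage.
rows = client.query("SELECT COUNT(*) AS n FROM my_dataset.events_external").result()
print(list(rows))
```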
Objectives
- Explain the baseline extract, load, and transform architecture diagram.
- Understand a common ELT pipeline on Google Cloud.
- Learn about BigQuery's SQL scripting and scheduling capabilities.
- Explain the functionality and use cases for Dataform.
Topics covered
- Extract, load, and transform (ELT) architecture
- SQL scripting and scheduling with BigQuery
- Dataform
Activities
Lab: Create and Execute a SQL Workflow in Dataform
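To make the ELT idea concrete, here is a minimal sketch of a BigQuery multi-statement SQL script submitted through the Python client; Dataform wraps this style of SQL workflow in version-controlled, dependency-aware pipelines. Dataset and table names are hypothetical.

```python
# ELT: the data is already loaded; the transform runs inside BigQuery as a
# single multi-statement script job.
from google.cloud import bigquery

client = bigquery.Client()

script = """
DECLARE cutoff DATE DEFAULT DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY);

CREATE OR REPLACE TABLE my_dataset.recent_orders AS
SELECT *
FROM my_dataset.orders
WHERE order_date >= cutoff;
"""

client.query(script).result()  # runs as one scripting job
```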
Objectives
- Explain the baseline extract, transform, and load architecture diagram.
- Learn about the GUI tools on Google Cloud used for ETL data pipelines.
- Explain batch data processing using Dataproc.
- Learn to use Dataproc Serverless for Spark for ETL.
- Explain streaming data processing options.
- Explain the role Bigtable plays in data pipelines.
Topics covered
- Extract, transform, and load (ETL) architecture
- Google Cloud GUI tools for ETL data pipelines
- Batch data processing using Dataproc
- Streaming data processing options
- Bigtable and data pipelines
Activities
Lab: Use Dataproc Serverless for Spark to Load BigQuery
Lab: Creating a Streaming Data Pipeline for a Real-Time Dashboard with Dataflow
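As a flavor of the first lab, here is a sketch of the kind of PySpark job you might submit with `gcloud dataproc batches submit pyspark`: read raw CSV from Cloud Storage, transform it, and write to BigQuery through the spark-bigquery connector. All paths, tables, and buckets are hypothetical.

```python
# ETL on Dataproc Serverless for Spark: Cloud Storage -> transform -> BigQuery.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("csv-to-bigquery").getOrCreate()

raw = spark.read.option("header", True).csv("gs://my-bucket/raw/*.csv")

cleaned = (
    raw.dropna(subset=["order_id"])                       # drop malformed rows
       .withColumn("amount", F.col("amount").cast("double"))
)

(cleaned.write.format("bigquery")
    .option("table", "my_dataset.orders")                 # hypothetical table
    .option("temporaryGcsBucket", "my-staging-bucket")    # staging for the load
    .mode("overwrite")
    .save())
```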
Objectives
- Explain the automation patterns and options available for pipelines.
- Learn about Cloud Scheduler and workflows.
- Learn about Cloud Composer.
- Learn about Cloud Run functions.
- Explain the functionality and automation use cases for Eventarc.
Topics covered
- Automation patterns and options for pipelines
- Cloud Scheduler and Workflows
- Cloud Composer
- Cloud Run functions
- Eventarc
Activities
Lab: Use Cloud Run Functions to Load BigQuery
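A sketch of the pattern behind the lab: an event-driven Cloud Run function, written with the Python Functions Framework, that loads a newly arrived Cloud Storage object into BigQuery. The destination table is hypothetical.

```python
# Triggered by a Cloud Storage "object finalized" event: load the new file
# into a BigQuery table.
import functions_framework
from google.cloud import bigquery

client = bigquery.Client()

@functions_framework.cloud_event
def load_to_bigquery(cloud_event):
    data = cloud_event.data
    uri = f"gs://{data['bucket']}/{data['name']}"  # the file that fired the event
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,
    )
    client.load_table_from_uri(
        uri, "my_dataset.landing_table",  # hypothetical destination
        job_config=job_config,
    ).result()
```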
Objectives
- Discuss the challenges of data engineering, and how building data pipelines in the cloud helps to address these.
- Review and understand the purpose of a data lake versus a data warehouse, and when to use which.
Topics covered
- The data engineer's role
- Data engineering challenges
- Introduction to BigQuery
- Data lakes and data warehouses
- Transactional databases versus data warehouses
- Partnering effectively with other data teams
- Managing data access and governance
- Building production-ready pipelines
- Google Cloud customer case study
Activities
Lab: Using BigQuery to Do Analysis
Objectives
- Discuss why Cloud Storage is a great option for building a data lake on Google Cloud.
- Explain how to use Cloud SQL for a relational data lake.
Topics covered
- Introduction to data lakes
- Data storage and ETL options on Google Cloud
- Building a data lake using Cloud Storage
- Securing Cloud Storage
- Storing all sorts of data types
- Cloud SQL as your OLTP system
Activities
Lab: Loading Taxi Data into Cloud SQL
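As a taste of the lab, here is a minimal sketch of connecting to a Cloud SQL for PostgreSQL instance with the Cloud SQL Python Connector and the pg8000 driver; the instance connection name, credentials, and table are all hypothetical.

```python
# Query a Cloud SQL for PostgreSQL instance without managing IP allowlists;
# the connector handles the secure tunnel.
from google.cloud.sql.connector import Connector

connector = Connector()

conn = connector.connect(
    "my-project:us-central1:taxi-instance",  # hypothetical instance
    "pg8000",
    user="postgres",
    password="change-me",
    db="taxis",
)

cur = conn.cursor()
cur.execute("SELECT COUNT(*) FROM trips")
print(cur.fetchone())
conn.close()
connector.close()
```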
Objectives
- Discuss requirements of a modern warehouse.
- Explain why BigQuery is the scalable data warehousing solution on Google Cloud.
- Discuss the core concepts of BigQuery and review options of loading data into BigQuery.
Topics covered
- The modern data warehouse
- Introduction to BigQuery
- Get started with BigQuery
- Loading data into BigQuery
- Exploring schemas
- Schema design
- Nested and repeated fields
- Optimizing with partitioning and clustering
Activities
Lab: Working with JSON and Array Data in BigQuery
Lab: Partitioned Tables in BigQuery
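To tie the schema-design topics together, here is a sketch (submitted via the Python client) of a table with a nested, repeated field that is also partitioned and clustered, followed by an UNNEST query over it. All names are hypothetical.

```python
# Nested/repeated fields plus partitioning and clustering, via DDL.
from google.cloud import bigquery

client = bigquery.Client()

ddl = """
CREATE TABLE IF NOT EXISTS my_dataset.orders (
  order_id STRING,
  order_date DATE,
  customer_id STRING,
  items ARRAY<STRUCT<sku STRING, qty INT64, price NUMERIC>>  -- nested + repeated
)
PARTITION BY order_date   -- prune partitions at query time
CLUSTER BY customer_id;   -- co-locate rows for cheaper filtered scans
"""
client.query(ddl).result()

# UNNEST flattens the repeated field for analysis.
sql = """
SELECT order_id, item.sku, item.qty
FROM my_dataset.orders, UNNEST(items) AS item
WHERE order_date = '2024-01-01'
"""
for row in client.query(sql).result():
    print(row)
```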
Objectives
- Review different methods of loading data into your data lakes and warehouses: EL, ELT, and ETL.
Topics covered
- EL, ELT, ETL
- Quality considerations
- Ways of executing operations in BigQuery
- Shortcomings of ELT
- ETL to solve data quality issues
Objectives
- Review the Hadoop ecosystem.
- Discuss how to lift and shift your existing Hadoop workloads to the cloud using Dataproc.
- Explain when you would use Cloud Storage instead of HDFS storage.
- Explain how to optimize Dataproc jobs.
Topics covered
- The Hadoop ecosystem
- Run Hadoop on Dataproc
- Cloud Storage instead of HDFS
- Optimize Dataproc
Activities
Lab: Running Apache Spark Jobs on Dataproc
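For the lift-and-shift theme, here is a sketch of submitting an existing PySpark job to a Dataproc cluster programmatically with the google-cloud-dataproc client, the rough equivalent of `gcloud dataproc jobs submit pyspark`. The project, cluster, and file names are hypothetical.

```python
# Submit a PySpark job to a running Dataproc cluster and wait for it.
from google.cloud import dataproc_v1

region = "us-central1"
client = dataproc_v1.JobControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

job = {
    "placement": {"cluster_name": "my-cluster"},  # hypothetical cluster
    "pyspark_job": {"main_python_file_uri": "gs://my-bucket/jobs/wordcount.py"},
}

operation = client.submit_job_as_operation(
    request={"project_id": "my-project", "region": region, "job": job}
)
result = operation.result()  # block until the job finishes
print(result.status.state)
```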
Objectives
- Identify features customers value in Dataflow.
- Discuss core concepts in Dataflow.
- Review the use of Dataflow templates and SQL.
- Write a simple Dataflow pipeline and run it both locally and on the cloud.
- Identify Map and Reduce operations, execute the pipeline, and use command-line parameters.
- Read data from BigQuery into Dataflow and use the output of a pipeline as a side input to another pipeline.
Topics covered
- Introduction to Dataflow
- Reasons why customers value Dataflow
- Dataflow pipelines
- Aggregating with GroupByKey and Combine
- Side inputs and windows
- Dataflow templates
Activities
Lab: A Simple Dataflow Pipeline (Python/Java)
Lab: MapReduce in Beam (Python/Java)
Lab: Side Inputs (Python/Java)
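In the spirit of the first two labs, here is a minimal Apache Beam pipeline in Python showing the Map and Reduce phases of a word count; it runs locally on the DirectRunner, or on Dataflow by switching the runner. File paths are hypothetical.

```python
# A word-count pipeline: Read -> Map -> CombinePerKey (Reduce) -> Write.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Pass e.g. ["--runner=DataflowRunner", "--project=...", "--region=..."]
# to run on the cloud instead of locally.
options = PipelineOptions()

with beam.Pipeline(options=options) as p:
    (
        p
        | "Read" >> beam.io.ReadFromText("gs://my-bucket/input/*.txt")
        | "Split" >> beam.FlatMap(str.split)            # Map phase
        | "PairWithOne" >> beam.Map(lambda w: (w, 1))
        | "Count" >> beam.CombinePerKey(sum)            # Reduce phase
        | "Format" >> beam.MapTuple(lambda w, n: f"{w}: {n}")
        | "Write" >> beam.io.WriteToText("gs://my-bucket/output/counts")
    )
```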
Objectives
- Discuss how to manage your data pipelines with Cloud Data Fusion and Cloud Composer.
- Summarize how Cloud Data Fusion allows data analysts and ETL developers to wrangle data and build pipelines in a visual way.
- Describe how Cloud Composer can help to orchestrate the work across multiple Google Cloud services.
Topics covered
- Build batch data pipelines visually with Cloud Data Fusion (components, UI overview, building a pipeline, exploring data using Wrangler)
- Orchestrate work between Google Cloud services with Cloud Composer (Apache Airflow environment, DAGs and operators, workflow scheduling, monitoring and logging)
Activities
Lab: Building and Executing a Pipeline Graph in Data Fusion
Lab: An Introduction to Cloud Composer
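A sketch of the kind of Airflow DAG Cloud Composer runs, using the Google provider's BigQueryInsertJobOperator to schedule a daily BigQuery transform. The DAG, project, and table names are hypothetical.

```python
# A daily DAG with one task: run a BigQuery query into a destination table.
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.bigquery import (
    BigQueryInsertJobOperator,
)

with DAG(
    dag_id="daily_elt",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",   # Airflow handles the cron scheduling
    catchup=False,
) as dag:
    transform = BigQueryInsertJobOperator(
        task_id="transform_orders",
        configuration={
            "query": {
                "query": "SELECT * FROM my_dataset.orders WHERE amount > 0",
                "destinationTable": {
                    "projectId": "my-project",
                    "datasetId": "my_dataset",
                    "tableId": "clean_orders",
                },
                "writeDisposition": "WRITE_TRUNCATE",
                "useLegacySql": False,
            }
        },
    )
```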
Objectives
- Explain streaming data processing.
- Identify the Google Cloud products and tools that can help address streaming data challenges.
Topics covered
- Process streaming data
Objectives
- Describe the Pub/Sub service.
- Explain how Pub/Sub works.
- Simulate real-time streaming sensor data using Pub/Sub.
Topics covered
- Introduction to Pub/Sub
- Pub/Sub push versus pull
- Publishing with Pub/Sub code
Activities
Lab: Publish Streaming Data into Pub/Sub
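A minimal sketch of the lab's publishing side: simulated sensor readings published to a topic with the google-cloud-pubsub client. The project and topic names are hypothetical.

```python
# Publish simulated sensor readings to a Pub/Sub topic.
import json
import time

from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "sensor-readings")  # hypothetical

for i in range(10):
    reading = {"sensor_id": "s-42", "temp_c": 20.0 + i, "ts": time.time()}
    future = publisher.publish(
        topic_path,
        data=json.dumps(reading).encode("utf-8"),  # payload must be bytes
        origin="simulator",                        # optional message attribute
    )
    print(f"Published message {future.result()}")  # blocks for the message ID
```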
Objectives
- Describe the Dataflow service.
- Build a stream processing pipeline for live traffic data.
- Demonstrate how to handle late data using watermarks, triggers, and accumulation.
Topics covered
- Streaming data challenges
- Dataflow windowing
Activities
Lab: Streaming Data Pipelines
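To make the late-data concepts concrete, here is a sketch of Beam windowing in Python: fixed one-minute windows, an early speculative firing, per-element late firings within an allowed lateness, and accumulating panes. It assumes keyed (key, value) elements already parsed upstream, e.g. from Pub/Sub.

```python
# Window a keyed stream and sum per key, handling late data with a
# watermark trigger and accumulating panes.
import apache_beam as beam
from apache_beam import window
from apache_beam.transforms import trigger
from apache_beam.utils.timestamp import Duration

def window_and_sum(events):
    return (
        events
        | beam.WindowInto(
            window.FixedWindows(60),                    # 1-minute windows
            trigger=trigger.AfterWatermark(             # main firing at watermark
                early=trigger.AfterProcessingTime(30),  # speculative early result
                late=trigger.AfterCount(1),             # re-fire per late element
            ),
            accumulation_mode=trigger.AccumulationMode.ACCUMULATING,
            allowed_lateness=Duration(seconds=300),     # accept 5 min of lateness
        )
        | beam.CombinePerKey(sum)
    )
```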
Objectives
- Describe how to perform ad-hoc analysis on streaming data using BigQuery and dashboards.
- Discuss Bigtable as a low-latency solution.
- Describe how to architect for Bigtable and how to ingest data into Bigtable.
- Highlight performance considerations for the relevant services.
Topics covered
- Streaming into BigQuery and visualizing results
- High-throughput streaming with Bigtable
- Optimizing Bigtable performance
Activities
Lab: Streaming Analytics and Dashboards
Lab: Generate Personalized Email Content with BigQuery Continuous Queries and Gemini
Lab: Streaming Data Pipelines into Bigtable
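A minimal sketch of writing one time-series cell to Bigtable with the google-cloud-bigtable client; as the module stresses, row-key design drives Bigtable performance. The instance, table, and column-family names are hypothetical.

```python
# Write a single sensor reading into a Bigtable row.
import datetime

from google.cloud import bigtable

client = bigtable.Client(project="my-project")
table = client.instance("sensors-instance").table("readings")  # hypothetical

# Row keys should spread writes across nodes; a sensor-prefixed key with a
# timestamp suffix is a common pattern.
row = table.direct_row(b"sensor-42#20240101120000")
row.set_cell(
    "measurements",   # column family (must already exist on the table)
    "temp_c",
    b"21.5",
    timestamp=datetime.datetime.utcnow(),
)
row.commit()
```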
Objectives
- Review some of BigQuery's advanced analysis capabilities.
- Discuss ways to improve query performance.
Topics covered
- Analytic window functions
- GIS functions
- Performance considerations
Activities
Lab: Optimizing Your BigQuery Queries for Performance
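As a small taste of the module, here is an analytic window function submitted through the Python client: ranking each customer's orders by amount without collapsing rows. Table and column names are hypothetical.

```python
# RANK() over a partition: per-customer ordering without aggregation.
from google.cloud import bigquery

client = bigquery.Client()

sql = """
SELECT
  customer_id,
  order_id,
  amount,
  RANK() OVER (PARTITION BY customer_id ORDER BY amount DESC) AS amount_rank
FROM my_dataset.orders
-- Performance tip from the module: filter on the partition column first.
WHERE order_date >= '2024-01-01'
"""
for row in client.query(sql).result():
    print(dict(row))
```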
Quality Process
SFEIR Institute's commitment: an excellence-driven approach to ensure the quality and success of all our training programs.
- Lectures / Theoretical Slides — Presentation of concepts using visual aids (PowerPoint, PDF).
- Technical Demonstration (Demos) — The instructor performs a task or procedure while students observe.
- Guided Labs — Guided practical exercises on software, hardware, or technical environments.
- Quiz / MCQ — Quick knowledge check (paper-based or digital via tools like Kahoot/Klaxoon).
The achievement of training objectives is evaluated at multiple levels to ensure quality:
- Continuous Knowledge Assessment: verification of knowledge throughout the training via participatory methods (quizzes, practical exercises, case studies) under instructor supervision.
- Progress Measurement: a comparative self-assessment system including an initial diagnostic to determine the starting level, followed by a final evaluation to validate skills development.
- Quality Evaluation: an end-of-session satisfaction questionnaire to measure the relevance and effectiveness of the training as perceived by participants.
Train multiple employees
- Volume discounts (multiple seats)
- Private or custom session
- On-site or remote