Free Data Engineering course!

Overview

Data Engineering Zoomcamp

Syllabus

Taking the course

Self-paced mode

All the materials of the course are freely available, so you can take the course at your own pace

  • Follow the suggested syllabus (see below) week by week
  • You don't need to fill in the registration form. Just start watching the videos and join Slack
  • Check FAQ if you have problems
  • If you can't find a solution to your problem in FAQ, ask for help in Slack

2022 Cohort

Asking for help in Slack

The best way to get support is to use DataTalks.Club's Slack. Join the #course-data-engineering channel.

To make discussions in Slack more organized:

Syllabus

Week 1: Introduction & Prerequisites

  • Course overview
  • Introduction to GCP
  • Docker and docker-compose
  • Running Postgres locally with Docker
  • Setting up infrastructure on GCP with Terraform
  • Preparing the environment for the course
  • Homework

More details

Week 2: Data ingestion

  • Data Lake
  • Workflow orchestration
  • Setting up Airflow locally
  • Ingesting data to GCP with Airflow
  • Ingesting data to local Postgres with Airflow
  • Moving data from AWS to GCP (Transfer service)
  • Homework

More details

Week 3: Data Warehouse

  • Data Warehouse
  • BigQuery
  • Partitoning and clustering
  • BigQuery best practices
  • Internals of BigQuery
  • Integrating BigQuery with Airflow
  • BigQuery Machine Learning

More details

Week 4: Analytics engineering

  • Basics of analytics engineering
  • dbt (data build tool)
  • BigQuery and dbt
  • Postgres and dbt
  • dbt models
  • Testing and documenting
  • Deployment to the cloud and locally
  • Visualising the data with google data studio and metabase

More details

Week 5: Batch processing

  • Batch processing
  • What is Spark
  • Spark Dataframes
  • Spark SQL
  • Internals: GroupBy and joins

More details

Week 6: Streaming

  • Introduction to Kafka
  • Schemas (avro)
  • Kafka Streams
  • Kafka Connect and KSQL

More details

Week 7, 8 & 9: Project

Putting everything we learned to practice

  • Week 7 and 8: working on your own project
  • Week 9: reviewing your peers

More details

Overview

Architecture diagram

Technologies

  • Google Cloud Platform (GCP): Cloud-based auto-scaling platform by Google
    • Google Cloud Storage (GCS): Data Lake
    • BigQuery: Data Warehouse
  • Terraform: Infrastructure-as-Code (IaC)
  • Docker: Containerization
  • SQL: Data Analysis & Exploration
  • Airflow: Pipeline Orchestration
  • dbt: Data Transformation
  • Spark: Distributed Processing
  • Kafka: Streaming

Prerequisites

To get most out of this course, you should feel comfortable with coding and command line, and know the basics of SQL. Prior experience with Python will be helpful, but you can pick Python relatively fast if you have experience with other programming languages.

Prior experience with data engineering is not required.

Instructors

Tools

For this course you'll need to have the following software installed on your computer:

  • Docker and Docker-Compose
  • Python 3 (e.g. via Anaconda)
  • Google Cloud SDK
  • Terraform

See Week 1 for more details about installing these tools

FAQ

  • Q: I registered, but haven't received a confirmation email. Is it normal? A: Yes, it's normal. It's not automated. But you will receive an email eventually
  • Q: At what time of the day will it happen? A: Office hours will happen on Mondays at 17:00 CET. But everything will be recorded, so you can watch it whenever it's convenient for you
  • Q: Will there be a certificate? A: Yes, if you complete the project
  • Q: I'm 100% not sure I'll be able to attend. Can I still sign up? A: Yes, please do! You'll receive all the updates and then you can watch the course at your own pace.
  • Q: Do you plan to run a ML engineering course as well? A: Glad you asked. We do :)
  • Q: I'm stuck! I've got a technical question! A: Ask on Slack! And check out the student FAQ; many common issues have been answered already. If your issue is solved, please add how you solved it to the document. Thanks!

Our friends

Big thanks to other communities for helping us spread the word about the course:

Check them out - they are cool!

Owner
DataTalksClub
The place to talk about data
DataTalksClub
Return-Parity-MDP - Towards Return Parity in Markov Decision Processes

Towards Return Parity in Markov Decision Processes Code for the AISTATS 2022 pap

Jianfeng Chi 3 Nov 27, 2022
ticguide: quick + painless TESS observing information

ticguide: quick + painless TESS observing information Complementary to the TESS observing tool tvguide (see also WTV), which tells you if your target

Ashley Chontos 5 Nov 05, 2022
Scripts for hosting urbit in production-ish

Urbit Sysops Contains some helpful scripts for hosting Urbit. There are two variants included in this repo: one using docker, and one using plain syst

Jōshin 12 Sep 25, 2022
The refactoring tutorial I wrote for PyConDE 2022. You can also work through the exercises on your own.

Refactoring 101 planet images by Justin Nichol on opengameart.org CC-BY 3.0 Goal of this Tutorial In this tutorial, you will refactor a space travel t

Kristian Rother 9 Jun 10, 2022
Running a complete single-node all-in-one cluster instance of TIBCO ActiveMatrix™ BusinessWorks 6.8.0.

TIBCO ActiveMatrix™ BusinessWorks 6.8 Docker Image Image for running a complete single-node all-in-one cluster instance of TIBCO ActiveMatrix™ Busines

Federico Alpi 1 Dec 10, 2021
Adds a Bake node to Blender's shader node system

Bake to Target This Blender Addon adds a new shader node type capable of reducing the texture-bake step to a single button press. Please note that thi

Thomas 8 Oct 04, 2022
A very simple boarding app with DRF

CRUD project with DRF A very simple boarding app with DRF. About The Project 유저 정보를 갖고 게시판을 다루는 프로젝트 입니다. Version Python: 3.9 DB: PostgreSQL 13 Django

1 Nov 13, 2021
Minimalistic Gridworld Environment (MiniGrid)

Minimalistic Gridworld Environment (MiniGrid) There are other gridworld Gym environments out there, but this one is designed to be particularly simple

Maxime Chevalier-Boisvert 1.7k Jan 03, 2023
Python-Roadmap - Дорожная карта по изучению Python

Python Roadmap Я решил сделать что-то вроде дорожной карты (Roadmap) для изучения языка Python. Возможно, если найдутся желающие дополнять ее, модифиц

Ruslan Prokhorov 1.2k Dec 28, 2022
Ergonomic option parser on top of dataclasses, inspired by structopt.

oppapī Ergonomic option parser on top of dataclasses, inspired by structopt. Usage from typing import Optional from oppapi import from_args, oppapi @

yukinarit 4 Jul 19, 2022
This is a repository built by the community for the community.

Nutshell Machine Learning Machines can see, hear and learn. Welcome to the future 🌍 The repository was built with a tree-like structure in mind, it c

Edem Gold 82 Nov 18, 2022
Simple Kahoot Botter.

Kahoot A simple Botter made in Python 3 for Kahoot.com. Also sorry for the shitty code lol. How to Run You need Python 3 installed on your device. Aft

7 Jun 29, 2022
Python library to natively send files to Trash (or Recycle bin) on all platforms.

Send2Trash -- Send files to trash on all platforms Send2Trash is a small package that sends files to the Trash (or Recycle Bin) natively and on all pl

Andrew Senetar 224 Jan 04, 2023
Battery conservation Python script for ubuntu to enable battery conservation mode at 60% 80% or 90%

Description Batteryconservation is a small python script wich creates an appindicator for ubuntu which can be used to enable / disable battery conserv

3 Jan 04, 2022
YourX: URL Clusterer With Python

YourX | URL Clusterer Screenshots Instructions for running Install requirements

ARPSyndicate 1 Mar 11, 2022
Python plugin/extra to load data files from an external source (such as AWS S3) to a local directory

Data Loader Plugin - Python Table of Content (ToC) Data Loader Plugin - Python Table of Content (ToC) Overview References Python module Python virtual

Cloud Helpers 2 Jan 10, 2022
This is a pretty basic but relatively nice looking Python Pomodoro Timer.

Python Pomodoro-Timer This is a pretty basic but relatively nice looking Pomodoro Timer. Currently its set to a very basic mode, but the funcationalit

EmmHarris 2 Oct 18, 2021
Python program that generates random user from API

RandomUserPy Author kirito sate #modules used requests, json, tkinter, PIL, urllib, io, install requests and PIL modules from pypi pip install Pillow

kiritosate 1 Jan 05, 2022
An Agora Python Flask token generation server

A Flask Starter Application with Login and Registration About A token generation Server using the factory pattern and Blueprints. A forked stripped do

Nii Ayi 1 Jan 21, 2022
Open source stenotype engine

Plover Bringing stenography to everyone. Homepage Releases Wiki Blog Google Group Discord Chat About Installation Getting help Contributing Donations

Open Steno Project 2k Jan 09, 2023