A data engineering project with Kafka, Spark Streaming, dbt, Docker, Airflow, Terraform, GCP and much more!

Overview

Streamify

A data pipeline with Kafka, Spark Streaming, dbt, Docker, Airflow, Terraform, GCP and much more!

Description

Objective

The project will stream events generated from a fake music streaming service (like Spotify) and create a data pipeline that consumes the real-time data. The data coming in would be similar to an event of a user listening to a song, navigating on the website, authenticating. The data would be processed in real-time and stored to the data lake periodically (every two minutes). The hourly batch job will then consume this data, apply transformations, and create the desired tables for our dashboard to generate analytics. We will try to analyze metrics like popular songs, active users, user demographics etc.

Dataset

Eventsim is a program that generates event data to replicate page requests for a fake music web site. The results look like real use data, but are totally fake. The docker image is borrowed from viirya's fork of it, as the original project has gone without maintenance for a few years now.

Eventsim uses song data from Million Songs Dataset to generate events. I have used a subset of 10000 songs.

Tools & Technologies

Architecture

streamify-architecture

Final Result

dashboard

Setup

WARNING: You will be charged for all the infra setup. You can avail 300$ in credit by creating a new account on GCP.

Pre-requisites

If you already have a Google Cloud account and a working terraform setup, you can skip the pre-requisite steps.

Get Going!

A video walkthrough of how I run my project - YouTube Video

  • Procure infra on GCP with Terraform - Setup
  • (Extra) SSH into your VMs, Forward Ports - Setup
  • Setup Kafka Compute Instance and start sending messages from Eventsim - Setup
  • Setup Spark Cluster for stream processing - Setup
  • Setup Airflow on Compute Instance to trigger the hourly data pipeline - Setup

Debug

If you run into issues, see if you find something in this debug guide.

How can I make this better?!

A lot can still be done :).

  • Choose managed Infra
    • Cloud Composer for Airflow
    • Confluent Cloud for Kafka
  • Create your own VPC network
  • Build dimensions and facts incrementally instead of full refresh
  • Write data quality tests
  • Create dimensional models for additional business processes
  • Include CI/CD
  • Add more visualizations

Special Mentions

I'd like to thank the DataTalks.Club for offering this Data Engineering course for completely free. All the things I learnt there, enabled me to come up with this project. If you want to upskill on Data Engineering technologies, please check out the course. :)

Owner
Ankur Chavda
Data Engineer-ish
Ankur Chavda
A program that takes Python classes and turns them into CSS classes.

PyCSS What is it? PyCSS is a micro-framework to speed up the process of writing bulk CSS classes. How does it do it? With Python!!! First download the

T.R Batt 0 Aug 03, 2021
A novel dual model approach for categorization of unbalanced skin lesion image classes (Presented technical paper 📃)

A novel dual model approach for categorization of unbalanced skin lesion image classes (Presented technical paper 📃)

1 Jan 19, 2022
The parser of a timetable of tennis matches for Flashscore website

FlashscoreParser The parser of a timetable of tennis matches for Flashscore website. The program collects the schedule of tennis matches for two days

Valendovsky 1 Jul 15, 2022
Install Firefox from Mozilla.org easily, complete with .desktop file creation.

firefox-installer Install Firefox from Mozilla.org easily, complete with .desktop file creation. Dependencies Python 3 Python LXML Debian/Ubuntu: sudo

rany 7 Nov 04, 2022
Decipher using Markov Chain Monte Carlo

Decipher using Markov Chain Monte Carlo

Science étonnante 43 Dec 24, 2022
Estimate the Market Size for Electic and Plug-In Hybrid Vehicles In Africa

Estimate the Market Size for Electic and Plug-In Hybrid Vehicles In Africa The goal of this repository is to use open data repositories to answer the

Leonce Nshuti 0 Feb 21, 2022
Project Guide for ASAM OpenX standards

ASAM Project Guide Important This guide is a work in progress and subject to change! Hosted version available at: ASAM Project Guide (Link) Includes:

ASAM e.V. 2 Mar 17, 2022
This is the improvised version of Dobot Magician which can be implemented for Dobot M1

pydobotM1 This is the edited driver for Dobot M1 version of the original pydobot library intended for use with the Dobot Magician. Here's what you nee

Shaik Abdullah 2 Jul 11, 2022
ioztat is a storage load analysis tool for OpenZFS

ioztat is a storage load analysis tool for OpenZFS. It provides iostat-like statistics at an individual dataset/zvol level.

Jim Salter 116 Nov 25, 2022
Null safe support for Python

Null Safe Python Null safe support for Python. Installation pip install nullsafe Quick Start Dummy Class class Dummy: pass Normal Python code: o =

Paaksing 13 Nov 17, 2022
Very simple encoding scheme that will encode data as a series of OwOs or UwUs.

OwO Encoder Very simple encoding scheme that will encode data as a series of OwOs or UwUs. The encoder is a simple state machine. Still needs a decode

1 Nov 15, 2021
A code ecosystem that helps to find the equate any formula.

A code ecosystem that helps to find the equate any formula. The good part here is that the code finds the formula needed and/or operates on a formula (performs algebra) on it to give you an answer.

SubtleCoder 1 Nov 23, 2021
A minimalist starknet amm adapted from StarkWare's amm.

viscus • A minimalist starknet amm adapted from StarkWare's amm. Directory Structure contracts

Alucard 4 Dec 27, 2021
Open HW & SW for Scanning Electron Microscopes

OpenSEM Project Status: Preliminary The purpose of this project is to create a modern and open-source hardware and software platform for using vintage

Steven Lovegrove 7 Nov 01, 2022
Extend the maya channel box with searchability and colour

channel-box-plus will add search-ability over its attributes, and it will colour user defined attributes, making them easier to distinguish.

Robert Joosten 12 Jun 08, 2022
Manage Procfile-based applications

Foreman Manage Procfile-based applications Installation $ gem install foreman Ruby users should take care not to install foreman in their project's G

David Dollar 5.8k Jan 03, 2023
Lock a program and kills it indefinitely if it is started.

Kill By Lock Lock a program and kills it indefinitely if it is started. How start it? It' simple, you just have to double-click on the python file (.p

1 Jan 12, 2022
Return-Parity-MDP - Towards Return Parity in Markov Decision Processes

Towards Return Parity in Markov Decision Processes Code for the AISTATS 2022 pap

Jianfeng Chi 3 Nov 27, 2022
Semester long, web application project for CSCI 4370/6370 (Database Management)

Database_Project Prototype ideas for website: Computer Science library (Sells books, products, etc.) Code editor Graph visualizer / creator (can save

Jordan Harman 4 Feb 17, 2022
Hack CMU Go Local Project

GoLocal A submission for the annual HackCMU Hackathon. We built a website which connects shopper with local businesses. The goal is to drive consumers

2 Oct 02, 2021