Projects that implement various aspects of Data Engineering.

Last update: Oct 14, 2021

Overview

DATA WAREHOUSE ON AWS

The purpose of this project is to build a data warehouse that holds user activity data for the music streaming application 'Sparkify'. The data model is implemented on AWS cloud infrastructure with the following components -

AWS S3 - Source datasets.

AWS Redshift
>for staging extracted data
>for storing the resultant data model (facts and dimensions)

The data model designed for this project is a star schema.

Table and attribute details are -

Fact Table
songplays: songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent

Dimension Tables
users: user_id, first_name, last_name, gender, level
songs: song_id, title, artist_id, year, duration
artists: artist_id, name, location, latitude, longitude
time: start_time, hour, day, week, month, year, weekday
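
The tables above could be written as DDL strings in the style of sql_queries.py. This is only a minimal sketch: the column names come from the listing above, but the data types, keys, and constraints are assumptions, and the variable names are hypothetical.

```python
# Hypothetical DDL for the fact table and one dimension table.
# Column names follow the listing above; types and constraints are assumed.
songplay_table_create = """
CREATE TABLE IF NOT EXISTS songplays (
    songplay_id INT IDENTITY(0, 1) PRIMARY KEY,
    start_time  TIMESTAMP NOT NULL,
    user_id     INT NOT NULL,
    level       VARCHAR,
    song_id     VARCHAR,
    artist_id   VARCHAR,
    session_id  INT,
    location    VARCHAR,
    user_agent  VARCHAR
);
"""

user_table_create = """
CREATE TABLE IF NOT EXISTS users (
    user_id    INT PRIMARY KEY,
    first_name VARCHAR,
    last_name  VARCHAR,
    gender     VARCHAR,
    level      VARCHAR
);
"""

# A list like this lets a script such as create_tables.py loop over the statements.
create_table_queries = [songplay_table_create, user_table_create]
```

The remaining dimension tables (songs, artists, time) would follow the same pattern.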

Source datasets to be extracted into the dimensional model are -

There are two JSON datasets:

Song data: s3://udacity-dend/song_data - Data for all songs, with their respective artists, available in the application library.

Log data: s3://udacity-dend/log_data - Data for user events and activity on the application.
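
Staging these datasets in Redshift is typically done with the COPY command. The sketch below only builds the statement text; the helper name, the staging table name, the IAM role ARN, and the region are placeholders assumed for illustration, not values taken from this project.

```python
def build_copy_query(table, s3_path, iam_role_arn, region, json_option="auto"):
    """Return a Redshift COPY statement that loads JSON files from S3 into a table."""
    return (
        f"COPY {table}\n"
        f"FROM '{s3_path}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        f"REGION '{region}'\n"
        f"FORMAT AS JSON '{json_option}';"
    )


# staging_songs, the role ARN, and the region are hypothetical placeholders.
staging_songs_copy = build_copy_query(
    table="staging_songs",
    s3_path="s3://udacity-dend/song_data",
    iam_role_arn="arn:aws:iam::123456789012:role/example-redshift-role",
    region="us-west-2",
)
print(staging_songs_copy)
```

The 'auto' option asks Redshift to match JSON keys to column names; if the log files use different key names, a JSONPaths file would be passed instead.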

The data warehouse is implemented with PostgreSQL-compatible SQL on Redshift.

The ETL pipeline that extracts and loads data from source to target is implemented in Python.
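
As one example of a transform step in such a pipeline, the columns of the time dimension can be derived from a single event timestamp. This sketch assumes the log events carry the timestamp as epoch milliseconds, which the text above does not state; the function name is hypothetical.

```python
from datetime import datetime, timezone


def time_row(ts_ms):
    """Break an epoch-millisecond timestamp into the columns of the time table."""
    start_time = datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc)
    return {
        "start_time": start_time,
        "hour": start_time.hour,
        "day": start_time.day,
        "week": start_time.isocalendar()[1],  # ISO week number
        "month": start_time.month,
        "year": start_time.year,
        "weekday": start_time.weekday(),  # Monday is 0
    }


# Illustrative timestamp only, not a value taken from the log dataset.
row = time_row(1541121934796)
print(row)
```

In practice the same breakdown could also be done in SQL while loading from the staging table.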

TODO steps:

Create sql_queries.py - to design and build tables for the proposed data model

Run create_tables.py - to create the tables by executing the queries from sql_queries.py

Run etl.py - to run the data pipeline built over the data model, which extracts, stages and loads data from AWS S3 into the DWH on AWS Redshift

Design and run analytical queries on the populated data model to gain insights into user events on the streaming application
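
The steps above can be sketched end to end as follows. To keep the example runnable without a cluster, sqlite3 stands in for Redshift here; the helper name, the reduced table definition, the sample rows, and the analytical query are all made up for illustration and are not taken from the project.

```python
import sqlite3


def run_queries(conn, queries):
    """Execute each SQL statement in order on a DB-API connection, then commit."""
    cur = conn.cursor()
    for query in queries:
        cur.execute(query)
    conn.commit()
    cur.close()


# Hypothetical analytical query: number of song plays per subscription level.
plays_by_level = """
SELECT level, COUNT(*) AS plays
FROM songplays
GROUP BY level
ORDER BY plays DESC;
"""

# sqlite3 is a local stand-in for Redshift; the rows below are invented samples.
conn = sqlite3.connect(":memory:")
run_queries(conn, [
    "CREATE TABLE songplays (songplay_id INTEGER PRIMARY KEY, user_id INTEGER, level TEXT)",
    "INSERT INTO songplays (user_id, level) VALUES (1, 'paid'), (2, 'free'), (1, 'paid')",
])
result = conn.execute(plays_by_level).fetchall()
print(result)
conn.close()
```

Against Redshift, the same run_queries helper would receive a connection from a PostgreSQL driver instead of sqlite3, with the create, staging, and insert statements supplied from sql_queries.py.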


Approximate Nearest Neighbor Search for Sparse Data in Python!

Bamboolib - a GUI for pandas DataFrames

Elementary is an open-source data reliability framework for modern data teams. The first module of the framework is data lineage.

International Space Station data with Python research 🌎

Generates a simple report about the current Covid-19 cases and deaths in Malaysia

A Pythonic introduction to methods for scaling your data science and machine learning work to larger datasets and larger models, using the tools and APIs you know and love from the PyData stack (such as numpy, pandas, and scikit-learn).

Spectacular AI SDK fuses data from cameras and IMU sensors and outputs an accurate 6-degree-of-freedom pose of a device.

Implementation in Python of the reliability measures such as Omega.

ETL pipeline on movie data using Python and postgreSQL

A variant of LinUCB bandit algorithm with local differential privacy guarantee

Pipeline to convert a haploid assembly into diploid

Picka: A Python module for data generation and randomization.

Python Kalman filtering and optimal estimation library. Implements Kalman filter, particle filter, Extended Kalman filter, Unscented Kalman filter, g-h (alpha-beta), least squares, H Infinity, smoothers, and more. Has companion book 'Kalman and Bayesian Filters in Python'.

Hue Editor: Open source SQL Query Assistant for Databases/Warehouses

Tools for analyzing data collected with a custom unity-based VR for insects.

Developed for analyzing the covariance for OrcVIO

Supply a wrapper ``StockDataFrame`` based on the ``pandas.DataFrame`` with inline stock statistics/indicators support.

Integrate bus data from a variety of sources (batch processing and real time processing).

A Big Data ETL project in PySpark on the historical NYC Taxi Rides data

AptaMat is a simple script which aims to measure differences between DNA or RNA secondary structures.