ETL pipeline on movie data using Python and PostgreSQL

Last update: Jul 07, 2021

Movies-ETL

Overview

This project consisted of an automated Extraction, Transformation, and Load (ETL) pipeline. The pipeline extracted movie data from Wikipedia, Kaggle, and MovieLens, then cleaned, transformed, and merged it using Pandas. The product was a merged table of movies and ratings loaded into PostgreSQL.
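The step that combines movies with ratings could look roughly like the sketch below. It is a minimal illustration, not the project's actual code: the column names (`kaggle_id`, `movieId`, `rating`) and the toy values are assumptions made for the example. It counts how often each movie received each rating level, then left-merges those counts onto the movie table so that movies without ratings are kept.

```python
import pandas as pd

# Toy stand-ins for the cleaned movie table and the ratings data.
# Column names and values are hypothetical, chosen only for illustration.
movies = pd.DataFrame({
    "kaggle_id": [1, 2, 3],
    "title": ["Movie A", "Movie B", "Movie C"],
})
ratings = pd.DataFrame({
    "movieId": [1, 1, 1, 2],
    "rating": [5.0, 5.0, 3.0, 4.0],
})

# Count how many times each movie received each rating level,
# producing one row per movie and one column per rating level.
rating_counts = (
    ratings.groupby(["movieId", "rating"]).size()
    .unstack(fill_value=0)
)
rating_counts.columns = [f"rating_{level}" for level in rating_counts.columns]

# Left merge keeps every movie, including those with no ratings at all.
movies_with_ratings = movies.merge(
    rating_counts, how="left", left_on="kaggle_id", right_index=True
)

# Movies absent from the ratings data get NaN counts; treat those as zero.
rating_cols = list(rating_counts.columns)
movies_with_ratings[rating_cols] = (
    movies_with_ratings[rating_cols].fillna(0).astype(int)
)

print(movies_with_ratings)
```

A left merge is used here so that a movie with no ratings still appears in the final table, with zero counts rather than missing values.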

Resources

Data sources:
- movies_metadata.csv
- ratings.csv
- wikipedia_movies.json
Software:
- Python
- PostgreSQL
- Pandas
- SQLAlchemy
- Regular Expressions

Results

Final output table: FINAL_Merged_Movies_and_Ratings.csv
Datasets uploaded to PostgreSQL for other users to analyze movie data (hackathon).

Summary

The pipeline was created under the following assumptions:

- I was able to join the Wikipedia, Kaggle, and ratings movie data on the IMDb ID column.
- The Wikipedia dataset didn't have an IMDb ID column, so I had to extract the ID from the URL given for each movie.
- Each dataset had to be cleaned on its own because they had overlapping columns (such as 'Director' and 'Directed By'), unnecessary columns, many null values, TV shows, outliers, duplicates, incorrect data types, inconsistent formatting, and other errors.
- The Wikipedia movie data was in JSON format.
- Not every movie had a rating for each rating level.
- The ratings dataset had more than 26 million entries, which created both a time constraint and a data processing challenge.
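Extracting the IMDb ID from a link, as described in the list above, might be done with a regular expression along these lines. This is a sketch under stated assumptions: the field name `imdb_link`, the sample records, and the helper `extract_imdb_id` are hypothetical, and the pattern assumes IDs have the form `tt` followed by at least seven digits.

```python
import re

# Assumed ID format: "tt" followed by at least seven digits.
IMDB_ID_PATTERN = re.compile(r"(tt\d{7,})")


def extract_imdb_id(imdb_link):
    """Return the IMDb ID found in a link, or None if no ID is present."""
    # Missing links may show up as None or NaN, so accept only strings.
    if not isinstance(imdb_link, str):
        return None
    match = IMDB_ID_PATTERN.search(imdb_link)
    return match.group(1) if match else None


# Hypothetical records shaped like entries from the Wikipedia JSON file.
records = [
    {"title": "Movie A", "imdb_link": "https://www.imdb.com/title/tt0098987/"},
    {"title": "Movie B", "imdb_link": None},
]

ids = [extract_imdb_id(record.get("imdb_link")) for record in records]
print(ids)
```

Records for which the function returns `None` have no usable join key, so they would need to be dropped or handled separately before merging on the IMDb ID.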

Owner

Juan Nicolas Serrano