Pyspark Spotify ETL

Description

This is my first Data Engineering project. It extracts the user's recently played tracks from Spotify's API, transforms the data, and loads it into PostgreSQL using a SQLAlchemy engine. The data is shown as a Spark DataFrame before loading, and the whole ETL job is scheduled with crontab. The script starts with an HTTP POST to Spotify's token API to request a fresh access token, so the token never has to be replaced by hand.
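
The token request at the start of the script could look roughly like the sketch below. It assumes the refresh-token grant of Spotify's Accounts service (a POST to https://accounts.spotify.com/api/token with HTTP Basic client credentials). The function names, and the use of the standard-library urllib rather than whatever HTTP client the project actually uses, are illustrative assumptions and not the project's own code.

```python
import base64
import json
import urllib.parse
import urllib.request

TOKEN_URL = "https://accounts.spotify.com/api/token"


def build_refresh_request(client_id, client_secret, refresh_token):
    """Build the POST request that trades a refresh token for an access token."""
    # Client id and secret are sent as HTTP Basic credentials.
    credentials = f"{client_id}:{client_secret}".encode("utf-8")
    headers = {
        "Authorization": "Basic " + base64.b64encode(credentials).decode("ascii"),
        "Content-Type": "application/x-www-form-urlencoded",
    }
    # The form body names the grant type and carries the long-lived refresh token.
    body = urllib.parse.urlencode(
        {"grant_type": "refresh_token", "refresh_token": refresh_token}
    ).encode("ascii")
    return urllib.request.Request(TOKEN_URL, data=body, headers=headers, method="POST")


def fetch_access_token(client_id, client_secret, refresh_token):
    """Send the request and return the short-lived access token (needs network access)."""
    request = build_refresh_request(client_id, client_secret, refresh_token)
    with urllib.request.urlopen(request, timeout=30) as response:
        payload = json.load(response)
    return payload["access_token"]
```

Calling fetch_access_token once at the top of each run means every scheduled execution works with a fresh access token, while the refresh token itself stays the same.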

The purpose of this is to help those that want to become Data Engineers, like myself, create their first project.

Essentials

Extra libraries that must be imported: sys, json, datetime.

ETL Execution

Install all the necessary libraries from the Pipfile.
Read the "Token_request_instructions" to get your own refresh token. If you prefer not to, you can get a temporary token from https://developer.spotify.com/console/get-recently-played/, but it has to be replaced every hour.
Add your PostgreSQL credentials to the engine variable. If you use another RDBMS, see https://docs.sqlalchemy.org/en/14/core/engines.html for the matching connection URL.
Create SQL Database/Table (Optional).
Create a bash file. This file is where you write down the paths to Spark, Python and your script. Without it, you'll get a "ModuleNotFoundError" for each module you import inside your script, because cron does not load your shell profile. (Think of this as the ETL's own ~/.bash_profile.)
Create a new crontab entry, or use the existing one if you want the job to run at midnight every day (schedule "0 0 * * *").
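
As a rough illustration of the transform and load steps, the sketch below flattens a recently-played response into rows, builds a PostgreSQL connection URL, and inserts the rows through a SQLAlchemy engine. The table name recently_played, its three columns, and all function names are hypothetical. The load step also assumes that the table already exists and that a PostgreSQL driver such as psycopg2 is installed.

```python
from datetime import datetime
from urllib.parse import quote


def flatten_recently_played(payload):
    """Turn the recently-played JSON into one flat row per play."""
    rows = []
    for item in payload.get("items", []):
        track = item["track"]
        rows.append(
            {
                "track_name": track["name"],
                # Only the first listed artist is kept (a simplification).
                "artist_name": track["artists"][0]["name"],
                # played_at is an ISO 8601 UTC timestamp ending in "Z".
                "played_at": datetime.fromisoformat(
                    item["played_at"].replace("Z", "+00:00")
                ),
            }
        )
    return rows


def postgres_url(user, password, host, port, database):
    """Build a SQLAlchemy URL, percent-encoding special characters in the password."""
    return f"postgresql://{user}:{quote(password, safe='')}@{host}:{port}/{database}"


def load_rows(rows, url):
    """Insert rows into the hypothetical recently_played table."""
    if not rows:
        return
    # Imported lazily so the helpers above work without SQLAlchemy installed.
    from sqlalchemy import create_engine, text

    engine = create_engine(url)
    insert = text(
        "INSERT INTO recently_played (track_name, artist_name, played_at) "
        "VALUES (:track_name, :artist_name, :played_at)"
    )
    # engine.begin() commits on success and rolls back on error.
    with engine.begin() as connection:
        connection.execute(insert, rows)
```

Keeping the flattening step free of database code makes it easy to inspect the rows (for example as a Spark DataFrame) before anything is written.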

Extras

To verify that your scheduled job is working, you can temporarily change the crontab schedule to "* * * * *", which runs the job every minute.
Other Spotify scopes are listed at https://developer.spotify.com/documentation/general/guides/scopes/, in case you don't want to use "recently played tracks".
Thank you Karolina Sowinska for your DE Beginners guide.
