Processing NYC Taxi Data using PySpark ETL pipeline

Description

This project extracts, transforms, and loads a large amount of data from the NYC Taxi Rides database (hosted on AWS S3). It reads large CSV files (~2 GB per month) and applies transformations such as datatype conversions and dropping unusable rows and columns. Finally, the data is written back in Parquet format. This saves time in downstream tasks such as machine learning. It also saves a huge amount of space (~97% reduction from CSV to Parquet), which makes the data easy to store for those tasks.

How to use it (Using GCP as the cloud service of choice)

Set up a bucket on Google Cloud Storage
Use get_raw_data.sh to download the raw data from S3 as CSV files to the GCS bucket
Set up a GCP Dataproc cluster
SSH into the master node and copy the entire project folder to the persistent disk
Edit the configuration file for the application
Submit the job: spark-submit main.py --filename [raw_data_filename], or execute submit_job.sh with the appropriate arguments
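For orientation, the --filename argument in the last step could be handled in main.py along these lines. This is a sketch under assumptions, not the project's actual entry point: parse_args and parquet_name are hypothetical helpers, and the file name in the usage note only illustrates the shape of a monthly file name.

```python
import argparse
from pathlib import Path


def parse_args(argv=None):
    """Parse the command-line arguments passed to the job."""
    parser = argparse.ArgumentParser(description="NYC Taxi ETL job")
    parser.add_argument(
        "--filename",
        required=True,
        help="name of the raw CSV file in the bucket to process",
    )
    return parser.parse_args(argv)


def parquet_name(csv_filename):
    """Derive an output name by swapping the .csv suffix for .parquet (assumed convention)."""
    return Path(csv_filename).with_suffix(".parquet").name
```

With these helpers, parse_args(["--filename", "yellow_tripdata_2019-01.csv"]).filename returns the name unchanged, and parquet_name maps it to yellow_tripdata_2019-01.parquet.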

Project structure

root/
|---bash/
|   |---create_cluster.sh
|   |---install.sh
|---configs/
|   |---app_config.json
|   |---cols_config.json
|---jobs/
|   |---etl_tasks.py
|   |---transformations.py
|---get_raw_data.sh
|---main.py
|---requirements.txt
|---submit_job.sh
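The two JSON files under configs/ are read by the application at startup. Their contents are not documented here, so the sketch below only shows one plausible way to load and sanity-check such a file. The helper names and the keys input_bucket and output_bucket are assumptions made for illustration, not the project's real settings.

```python
import json
from pathlib import Path


def load_config(path):
    """Read a JSON config file into a dict."""
    with Path(path).open(encoding="utf-8") as handle:
        return json.load(handle)


def validate_app_config(config, required=("input_bucket", "output_bucket")):
    """Fail early if a required key is missing (the key names are hypothetical)."""
    missing = [key for key in required if key not in config]
    if missing:
        raise KeyError(f"missing config keys: {', '.join(missing)}")
    return config
```

Validating the config before the Spark session starts surfaces a typo in the file immediately rather than after the cluster has begun reading data.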

A Big Data ETL project in PySpark on the historical NYC Taxi Rides data

Owner

Unnikrishnan
