Vaex library for Big Data Analytics of an Airline dataset

Overview

Vaex-Big-Data-Analytics-for-Airline-data

A Python notebook (ipynb) created in Jupyter Notebook, which utilizes the Vaex library for Big Data Analytics of an Airline dataset.

Author: Nikolas Petrou, MSc in Data Science

Overview

The main part of the work focuses on the exploration a big dataset of 17 GB. Specifically, the dataset contains information on flights within the United States between 1988 and 2018. It can be directly downloaded from: vaex.s3.us-east-2.amazonaws.com.

In addition, in this project the Out-of-Core DataFrames Python library Vaex is employed, in order to visualize, explore acalculate statistics of this big tabular dataset.

The goal of this project is to utilize Vaex to perform an Exploratory Data Analysis (EDA), as well as to predict the arrival delay of a flight using Machine Learning models (regression task).

What is Vaex and why Vaex?

Vaex is a Python library for lazy Out-of-Core DataFrames (similar to Pandas), to visualize and explore big tabular datasets. It can calculate statistics such as mean, sum, count, standard deviation etc, on an N-dimensional grid up to a billion (10^9) objects/rows per second. Visualization is done using histograms, density plots and 3d volume rendering, allowing interactive exploration of big data. Furthermore, Vaex provides wrappers to powerful libraries for predictive models (e.g. Scikit-learn, xgboost) and make them work efficiently with Vaex. Vaex does implement a variety of standard data transformers (e.g. PCA, numerical scalers, categorical encoders) and a very efficient KMeans algorithm that take full advantage. Finally, Vaex uses memory mapping, a zero memory copy policy, and lazy computations for best performance (no memory wasted).

Advantage of using Vaex over using Pandas with a more powerful machine

Switching to a more powerful machine (with more RAM and/or better CPU) may solve some memory issues, but still, Pandas will only use one out of the 32 cores of your fancy machine. With Vaex, all operations are out of the core and executed in parallel and lazily evaluated, allowing for crunching through a billion-row dataset effortlessly.

Data

The dataset has a relatively big size (17 GB), and contains information on flights within the United States between 1988 and 2018. It can be directly downloaded from: vaex.s3.us-east-2.amazonaws.com

Each row-record of the dataset represents an individual flight. Specifically, each record contains information of the airline (UniqueCarrier), airports (origin airport, destination airport) and flight level information such as time schedule (day of week, day of month, month, year), flight distance, departure time and delay, arrival time and delay.

Owner
Nikolas Petrou
M.Sc. Data Science student, University of Cyprus (UCY) Research Assistant at the Laboratory of Internet Computing (LInC) B.Sc degree in Computer Science
Nikolas Petrou
MEAD: A Large-scale Audio-visual Dataset for Emotional Talking-face Generation [ECCV2020]

MEAD: A Large-scale Audio-visual Dataset for Emotional Talking-face Generation [ECCV2020] by Kaisiyuan Wang, Qianyi Wu, Linsen Song, Zhuoqian Yang, Wa

112 Dec 28, 2022
Produces a summary CSV report of an Amber Electric customer's energy consumption and cost data.

Amber Electric Usage Summary This is a command line tool that produces a summary CSV report of an Amber Electric customer's energy consumption and cos

Graham Lea 12 May 26, 2022
Tuplex is a parallel big data processing framework that runs data science pipelines written in Python at the speed of compiled code

Tuplex is a parallel big data processing framework that runs data science pipelines written in Python at the speed of compiled code. Tuplex has similar Python APIs to Apache Spark or Dask, but rather

Tuplex 791 Jan 04, 2023
Pandas and Dask test helper methods with beautiful error messages.

beavis Pandas and Dask test helper methods with beautiful error messages. test helpers These test helper methods are meant to be used in test suites.

Matthew Powers 18 Nov 28, 2022
Feature engineering and machine learning: together at last

Feature engineering and machine learning: together at last! Lambdo is a workflow engine which significantly simplifies data analysis by unifying featu

Alexandr Savinov 14 Sep 15, 2022
Option Pricing Calculator using the Binomial Pricing Method (No Libraries Required)

Binomial Option Pricing Calculator Option Pricing Calculator using the Binomial Pricing Method (No Libraries Required) Background A derivative is a fi

sammuhrai 1 Nov 29, 2021
Python scripts aim to use a Random Forest machine learning algorithm to predict the water affinity of Metal-Organic Frameworks

The following Python scripts aim to use a Random Forest machine learning algorithm to predict the water affinity of Metal-Organic Frameworks (MOFs). The training set is extracted from the Cambridge S

1 Jan 09, 2022
BErt-like Neurophysiological Data Representation

BENDR BErt-like Neurophysiological Data Representation This repository contains the source code for reproducing, or extending the BERT-like self-super

114 Dec 23, 2022
My solution to the book A Collection of Data Science Take-Home Challenges

DS-Take-Home Solution to the book "A Collection of Data Science Take-Home Challenges". Note: Please don't contact me for the dataset. This repository

Jifu Zhao 1.5k Jan 03, 2023
This tool parses log data and allows to define analysis pipelines for anomaly detection.

logdata-anomaly-miner This tool parses log data and allows to define analysis pipelines for anomaly detection. It was designed to run the analysis wit

AECID 32 Nov 27, 2022
peptides.py is a pure-Python package to compute common descriptors for protein sequences

peptides.py Physicochemical properties and indices for amino-acid sequences. 🗺️ Overview peptides.py is a pure-Python package to compute common descr

Martin Larralde 32 Dec 31, 2022
Single-Cell Analysis in Python. Scales to >1M cells.

Scanpy – Single-Cell Analysis in Python Scanpy is a scalable toolkit for analyzing single-cell gene expression data built jointly with anndata. It inc

Theis Lab 1.4k Jan 05, 2023
Tablexplore is an application for data analysis and plotting built in Python using the PySide2/Qt toolkit.

Tablexplore is an application for data analysis and plotting built in Python using the PySide2/Qt toolkit.

Damien Farrell 81 Dec 26, 2022
Cleaning and analysing aggregated UK political polling data.

Analysing aggregated UK polling data The tweet collection & storage pipeline used in email-service is used to also collect tweets from @britainelects.

Ajay Pethani 0 Dec 22, 2021
Convert monolithic Jupyter notebooks into Ploomber pipelines.

Soorgeon Join our community | Newsletter | Contact us | Blog | Website | YouTube Convert monolithic Jupyter notebooks into Ploomber pipelines. soorgeo

Ploomber 65 Dec 16, 2022
Probabilistic reasoning and statistical analysis in TensorFlow

TensorFlow Probability TensorFlow Probability is a library for probabilistic reasoning and statistical analysis in TensorFlow. As part of the TensorFl

3.8k Jan 05, 2023
Provide a market analysis (R)

market-study Provide a market analysis (R) - FRENCH Produisez une étude de marché Prérequis Pour effectuer ce projet, vous devrez maîtriser la manipul

1 Feb 13, 2022
This is a tool for speculation of ancestral allel, calculation of sfs and drawing its bar plot.

superSFS This is a tool for speculation of ancestral allel, calculation of sfs and drawing its bar plot. It is easy-to-use and runing fast. What you s

3 Dec 16, 2022
Python package for analyzing behavioral data for Brain Observatory: Visual Behavior

Allen Institute Visual Behavior Analysis package This repository contains code for analyzing behavioral data from the Allen Brain Observatory: Visual

Allen Institute 16 Nov 04, 2022
BigDL - Evaluate the performance of BigDL (Distributed Deep Learning on Apache Spark) in big data analysis problems

Evaluate the performance of BigDL (Distributed Deep Learning on Apache Spark) in big data analysis problems.

Vo Cong Thanh 1 Jan 06, 2022