Spark-DeltaLake-Demo

Reliable, Scalable Machine Learning (2022)

I built this project to become better acquainted with the latest big data tools. Further details can be found on my blog.

The world is producing an exponentially increasing amount of digital data, and the tools we use to derive insights from data are evolving just as rapidly.

In recent years, a new architecture called the Data Lakehouse has begun to gain prominence as an enterprise solution to storing and processing big data. This trend piqued my interest and led to my exploration of some of the key underlying technologies fueling the revolution.

Two open-source technologies are of particular focus: Delta Lake and Apache Spark. Delta Lake adds a metadata layer on top of data lakes, bringing ACID transaction guarantees and time travel (the ability to query earlier versions of a table) to what has often been a messy approach to data science at scale. Apache Spark offers a distributed processing engine for a diverse set of workloads (e.g., SQL queries, machine learning, stream processing), and can be programmed in Python, R, Scala, Java, or SQL.
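To make the idea of a metadata layer with time travel more concrete, here is a minimal, pure-Python sketch of an append-only commit log. It is a toy illustration only: `ToyTableLog`, `commit`, and `files_as_of` are hypothetical names invented for this example, not part of the Delta Lake or Spark APIs, and the real transaction log records far more than file names. The sketch assumes the simplified model in which each commit lists data files added and removed, and a reader reconstructs a given table version by replaying commits up to that version.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTableLog:
    """Toy stand-in for a table's transaction log (hypothetical, not the Delta Lake API)."""

    commits: list = field(default_factory=list)

    def commit(self, add=(), remove=()):
        """Record one commit as a single log entry and return its version number."""
        self.commits.append({"add": set(add), "remove": set(remove)})
        return len(self.commits) - 1

    def files_as_of(self, version=None):
        """Replay commits 0..version to find the data files visible at that version."""
        if version is None:
            version = len(self.commits) - 1  # default to the latest version
        if not 0 <= version < len(self.commits):
            raise ValueError(f"unknown version {version}")
        live = set()
        for entry in self.commits[: version + 1]:
            live -= entry["remove"]
            live |= entry["add"]
        return live


log = ToyTableLog()
v0 = log.commit(add=["part-0.parquet"])
v1 = log.commit(add=["part-1.parquet"])
# A later commit replaces part-0 with a rewritten file.
v2 = log.commit(add=["part-2.parquet"], remove=["part-0.parquet"])

latest_files = sorted(log.files_as_of())    # current view of the table
files_at_v1 = sorted(log.files_as_of(v1))   # "time travel" back to version 1
print(latest_files)
print(files_at_v1)
```

Because old commits are never rewritten, an earlier version stays readable even after later commits remove files from the current view. That is the intuition behind time travel; the notebook in this repository uses the real libraries rather than this toy model.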

I believe these technologies, along with several others covered in more detail on my blog, will play a major role in how businesses use data going forward. This research therefore prepares me to confront many emerging data engineering and data science challenges.

The demonstration linked below is deployed using the Binder service, which runs a Jupyter notebook in the cloud from a custom Docker image described by the supporting files in this repository.

Live Link:

Contained in this repository:

- Jupyter notebook demonstrating Apache Spark and Delta Lake
- Files to construct a custom Docker image deployed using Binder
  - Dockerfile
  - docker-compose.yml
  - requirements.txt

Containerized Demo of Apache Spark MLlib on a Data Lakehouse (2022)
