A solution designed to extract, transform and load Chicago crime data from an RDS instance to other services in AWS.

Overview

Crime data- Batch Processing:

RDBMS Data Extraction Implementation

This project is intended to implement a solution designed to extract, transform and load Chicago crime data from an RDS instance to other services in AWS.

  • There is an airflow dag script, 2 pyspark application scripts, and a bootstrap actions script in this project which are explained below.

Deployment

Preparation:

  • An AWS RDS MySQL instance is created to store the batch of data.
    • An EC2 instance is created to communicate with the RDS instance.
    • The data is loaded onto the EC2 instance.
    • The database and table are created on the RDS instance with the help of the above created EC2 instance. The data is loaded in the table created above.
    • The create&Load.sql file contains the code for the above table data preparation step.
    • A secret on the Secrets Manager console is stored to communicate with the RDS instance secretly. Also, password rotation after 30 days has been configured for security purposes.
  • The following dag loads the data created from the above step into the AWS environment.

Implementation:

  • The airflow dag is put in the s3://yavula-da-capstone/dag/ location in the S3 bucket. An environment is created on the Amazon Managed Workflows for Apache Airflow(MWAA) console in a specific VPC.
  • The dag is scheduled to run on a daily basis along with SLA monitoring to trigger an alarm if the tasks take more than 36 minutes to finish the whole ETL process.
  • It usually takes 32-34 minutes to finish the dag processes. But if it takes, more than that, it means that something has interrupted the dag from finishing its process and we can check the logs accordingly.

emr_job_flow_manual_steps_dag.py

This script is used to create an airflow dag.

Description

  • The script has steps for the airflow to create an EMR cluster on AWS for a process which is explained later in the next steps.
  • It runs the STEPS that process the spark script on the EMR along with the bootstrap actions present in the bootstrap_actions.sh script which is in an s3 bucket that will install the required package like boto3 onto the EMR instance.
  • Then the step checker is also added to watch this process. This step sensor will periodically check if that last step is completed or skipped or terminated.

spark_ingest_script.py

The spark script which is put into S3 manually, is used to ingest the required data from a table which is present on an RDS isntance and store the data into a raw s3 bucket and catalog into Glue.

Description

  • The ingest script connects to the RDS instance using the mysql-connector.
  • It takes the required crime data from the table and puts it into a spark dataframe which is then written to the AWS S3 and Glue data catalog.
  • S3 File Structure where the snapshot data is saved
    • (bucket)
    • (key)
    • (db-name)
    • (table-name)
  • Glue Data Catalog table pointing to the latest partition

spark_process_script.py

The spark script which is put into S3 manually, is used to query the latest target table, filter required crime details from it, then store the query results into a new final table and further save it to a latest partition.

Description

  • The spark script uses the crime data and performs some query processing using it.
  • It queries the required crime data from the table, performs some processing and puts it into a spark dataframe which is then written to the AWS S3 and Glue data catalog.
  • S3 File Structure where the snapshot data is saved
    • (bucket)
    • (key)
    • (db-name)
    • (table-name)
  • Glue Data Catalog table pointing to the latest partition

bootstrap_actions.sh

Required for the bootstrap actions.

Description

Used to install the packages and dependencies on the cluster that are required for the processes inside the spark script to run.

Deployment

  • This bootstrap script is put manually in an S3 bucket.
  • The location of this bucket is used inside the airflow dag to mention in the bootstrap actions that the required actions are present in the script which is in this particular s3 location.

Business Analysis

The final processed table had the crime type details for all the crimes for which the arrest is not made yet. This business analysis can be viewed from Athena and also has been imported into QuickSight Spice to view the details of different types of crimes and their comparisions.

Owner
Yesaswi Avula
An Applied Data Science student with an escalating learning and performance graph Data analytics, Data engineering, Business Intelligence, ML, Big Data & Cloud
Yesaswi Avula
A Telegram bot to upload files from Telegram or Direct links to Google Drive.

Google Drive Uploader Telegram Bot A Telegram bot to upload files from Telegram or Direct links to Google Drive. Features Telegram files support. Dire

IDNCoderX 21 Dec 05, 2022
Predict the Site EUI, given the characteristics of the building and the weather data for the location of the building.

wids_datathon_2022 Description: Contains a data pipeline used to predict energy EUI Goals: Dataset exploration Automating the parameter fitting, gener

1 Mar 25, 2022
pyDuinoCoin is a simple python integration for the DuinoCoin REST API, that allows developers to communicate with DuinoCoin Master Server

PyDuinoCoin PyDuinoCoin is a simple python integration for the DuinoCoin REST API, that allows developers to communicate with DuinoCoin Main Server. I

BackrndSource 6 Jul 14, 2022
Aria/qBittorrent Telegram mirror/leech bot

This Telegram Bot written in Python for mirroring files on the Internet to our Google Drive or Telegram. Based on python-aria-mirror-bot Features: qBi

Anas 2.1k Jan 04, 2023
칼만 필터는 어렵지 않아(저자 김성필) 파이썬 코드(Unofficial)

KalmanFilter_Python 칼만 필터는 어렵지 않아(저자 김성필) 책을 공부하면서, Matlab 코드를 Python으로 변환한 것입니다. Contents Part01. Recursive Filter Chapter01. Average Filter Chapter0

Donghun Park 20 Oct 28, 2022
A Simple Google Translate Bot By VndGroup ❤️ Made With Python

VndGroup Google Translator Heroku Deploy ❤️ Functions This Bot Can Translate 95 Languages We Can Set Custom Language Group Support Mandatory Vars [+]

Venuja Sadew 1 Oct 09, 2022
Telegram bot to provide Telegram user/group/channel information

Whois-TeLeTiPs Telegram bot to provide Telegram user/group/channel information Deployment Methods Heroku Config Vars API_ID : Telegram API_ID, get it

11 Oct 21, 2022
A Telegram Bot to return Youtube Video Tags Using YoutubeTags API

YouTube-TagFind-Bot A Telegram Bot to return Youtube Video Tags Using YoutubeTags API YoutubeTags API Wrapper YoutubeTags is a python third-party api

Nuhman Pk 9 Aug 25, 2022
The elegance of Airflow + the power of AWS

Orkestra The elegance of Airflow + the power of AWS

Stephan Fitzpatrick 42 Nov 01, 2022
YuuScource - A Discord bot made with Pycord

Yuu A Discord bot made with Pycord Features Not much lol • Easy to use commands

kekda 9 Feb 17, 2022
Bomber-X - A SMS Bomber made with Python

Bomber-X A SMS Bomber made with Python Linux/Termux apt update apt upgrade apt i

S M Shahriar Zarir 2 Mar 10, 2022
A repository of publicly verifiable token Sale contracts

Token-Sale-Plutus-Contract A repository of publicly verifiable token sale and royalty contracts. This will be the storage solution since it is easily

Logical Mechanism 29 Aug 18, 2022
A file-based quote bot written in Python

Let's Write a Python Quote Bot! This repository will get you started with building a quote bot in Python. It's meant to be used along with the Learnin

0 Jan 20, 2022
This repo provides the source code for "Cross-Domain Adaptive Teacher for Object Detection".

Cross-Domain Adaptive Teacher for Object Detection This is the PyTorch implementation of our paper: Cross-Domain Adaptive Teacher for Object Detection

Meta Research 91 Dec 12, 2022
Typed interactions with the GitHub API v3

PyGitHub PyGitHub is a Python library to access the GitHub API v3 and Github Enterprise API v3. This library enables you to manage GitHub resources su

5.7k Jan 06, 2023
Gathers data and displays metrics related to climate change and resource depletion on a PowerBI report.

Apocalypse Status Dashboard Purpose Climate change and resource depletion are grave long-term dangers. The code in this repository will pull data from

Summer Is Here 1 Nov 12, 2021
A collection of tools for managing Jira issues for the RHODS project

RHODS-Jira-Tools A collection of tools for managing Jira issues for the RHODS project move_to_qa.py This script handles transitioning a given Jira iss

Alex Corvin 1 Sep 20, 2022
Discord group chat leaver.

Discord group chat leaver I know many people who have fallen victim to these weird group chat spammers including me. I made this script to help those

cliphd 3 Feb 27, 2022
Pydf: A modular Telegram Bot which provides Pdf Tools using PyPdf2

pyDF-Bot 🌍 Pydf - Pyrogram Document File Bot, a modular Telegram Bot which prov

HyDrix 2 Feb 18, 2022
Dante, my discord bot. Open source project in development and not optimized for other filesystems, install and setup script in development

DanteMode (In private development for ~6 months) Dante, my discord bot. Open source project in development and not optimized for other filesystems, in

2 Nov 05, 2021