A solution designed to extract, transform and load Chicago crime data from an RDS instance to other services in AWS.

Overview

Crime data- Batch Processing:

RDBMS Data Extraction Implementation

This project is intended to implement a solution designed to extract, transform and load Chicago crime data from an RDS instance to other services in AWS.

  • There is an airflow dag script, 2 pyspark application scripts, and a bootstrap actions script in this project which are explained below.

Deployment

Preparation:

  • An AWS RDS MySQL instance is created to store the batch of data.
    • An EC2 instance is created to communicate with the RDS instance.
    • The data is loaded onto the EC2 instance.
    • The database and table are created on the RDS instance with the help of the above created EC2 instance. The data is loaded in the table created above.
    • The create&Load.sql file contains the code for the above table data preparation step.
    • A secret on the Secrets Manager console is stored to communicate with the RDS instance secretly. Also, password rotation after 30 days has been configured for security purposes.
  • The following dag loads the data created from the above step into the AWS environment.

Implementation:

  • The airflow dag is put in the s3://yavula-da-capstone/dag/ location in the S3 bucket. An environment is created on the Amazon Managed Workflows for Apache Airflow(MWAA) console in a specific VPC.
  • The dag is scheduled to run on a daily basis along with SLA monitoring to trigger an alarm if the tasks take more than 36 minutes to finish the whole ETL process.
  • It usually takes 32-34 minutes to finish the dag processes. But if it takes, more than that, it means that something has interrupted the dag from finishing its process and we can check the logs accordingly.

emr_job_flow_manual_steps_dag.py

This script is used to create an airflow dag.

Description

  • The script has steps for the airflow to create an EMR cluster on AWS for a process which is explained later in the next steps.
  • It runs the STEPS that process the spark script on the EMR along with the bootstrap actions present in the bootstrap_actions.sh script which is in an s3 bucket that will install the required package like boto3 onto the EMR instance.
  • Then the step checker is also added to watch this process. This step sensor will periodically check if that last step is completed or skipped or terminated.

spark_ingest_script.py

The spark script which is put into S3 manually, is used to ingest the required data from a table which is present on an RDS isntance and store the data into a raw s3 bucket and catalog into Glue.

Description

  • The ingest script connects to the RDS instance using the mysql-connector.
  • It takes the required crime data from the table and puts it into a spark dataframe which is then written to the AWS S3 and Glue data catalog.
  • S3 File Structure where the snapshot data is saved
    • (bucket)
    • (key)
    • (db-name)
    • (table-name)
  • Glue Data Catalog table pointing to the latest partition

spark_process_script.py

The spark script which is put into S3 manually, is used to query the latest target table, filter required crime details from it, then store the query results into a new final table and further save it to a latest partition.

Description

  • The spark script uses the crime data and performs some query processing using it.
  • It queries the required crime data from the table, performs some processing and puts it into a spark dataframe which is then written to the AWS S3 and Glue data catalog.
  • S3 File Structure where the snapshot data is saved
    • (bucket)
    • (key)
    • (db-name)
    • (table-name)
  • Glue Data Catalog table pointing to the latest partition

bootstrap_actions.sh

Required for the bootstrap actions.

Description

Used to install the packages and dependencies on the cluster that are required for the processes inside the spark script to run.

Deployment

  • This bootstrap script is put manually in an S3 bucket.
  • The location of this bucket is used inside the airflow dag to mention in the bootstrap actions that the required actions are present in the script which is in this particular s3 location.

Business Analysis

The final processed table had the crime type details for all the crimes for which the arrest is not made yet. This business analysis can be viewed from Athena and also has been imported into QuickSight Spice to view the details of different types of crimes and their comparisions.

Owner
Yesaswi Avula
An Applied Data Science student with an escalating learning and performance graph Data analytics, Data engineering, Business Intelligence, ML, Big Data & Cloud
Yesaswi Avula
A Telegram Bot to display Codeforces Contest Ranklist

CFRankListBot A bot that displays the top ranks for a Codeforces contest. Participants' Details All the details of a participant is in the utils/__ini

Code IIEST 5 Dec 25, 2021
This is a script to export logs from AWS CloudTrail to a local file.

cloudtrail-export-logs This is a script to export logs from AWS CloudTrail to a local file. Getting Started Prerequisites python 3 boto3 pip Installin

Claick Assunção de Oliveira 2 Jan 02, 2022
Programa de código abierto para probar el API de Bitso, el exchange más importante de América Latina.

Bitso Semiautomático Programa de código abierto para probar el API de Bitso, el exchange más importante de América Latina. Desarrollador Fernando Mire

Fernando Mireles 17 Dec 07, 2022
A python script to acquire multiple aws ec2 instances in a forensically sound-ish way

acquire_ec2.py The script acquire_ec2.py is used to automatically acquire AWS EC2 instances. The script needs to be run on an EC2 instance in the same

Deutsche Telekom Security GmbH 31 Sep 10, 2022
This bot will pull a stream of tweets based on rules you set and automatically reply to them.

Twitter reply bot This bot will pull a stream of tweets based on rules you set and automatically reply to them. I built this bot in order to help comb

Brains 1 Feb 13, 2022
New developed moderation discord bot by archisha

Monitor42 New developed moderation discord bot by αrchιshα#5518. Details Prefix: 42! Commands: Moderation Use 42!help to get command list. Invite http

Kamilla Youver 0 Jun 29, 2022
HelpDESK Dynamics

Helpdesk Application The project is a Helpdesk application (Helpdesk dynamics) where staff of an organization can raise and assign job/trouble tickets

Okeoma Ihunwo 0 Nov 14, 2021
Backend for Indipe client

Betsushi Betsu (別), the japanese word meaning "another" and Shiharai (支払い) meaning "payment". Hence the name Betsushi was derived. Introduction This i

Sudodevs 3 Feb 09, 2022
An api, written in Python, for Investopedia's paper trading stock simulator.

investopedia-trading-api An API, written in Python, for Investopedia's paper trading stock simulator. Pull requests welcome. This library is now Pytho

Kirk Thaker 178 Jan 06, 2023
HTTP Calls to Amazon Web Services Rest API for IoT Core Shadow Actions 💻🌐💡

aws-iot-shadow-rest-api HTTP Calls to Amazon Web Services Rest API for IoT Core Shadow Actions 💻 🌐 💡 This simple script implements the following aw

AIIIXIII 3 Jun 06, 2022
A modular telegram Python bot running on python3 with an sqlalchemy database.

Saber A modular telegram Python bot running on python3 with an sqlalchemy database. Originally a marie fork - Saber has evolved further and was built

ZERO • アクバル . 4 Nov 09, 2021
Telegram Vc Video Player Bot

Telegram Video Player Bot Telegram bot project for streaming video on telegram video chat, powered by tgcalls and pyrogram Deploy to Heroku 👨‍🔧 The

Dihan Official 11 Dec 25, 2022
Home Assistant for Opendata CWB. Get the weather forecast of the city in Taiwan.

Home assistant support for Opendata CWB. The readme in Traditional Chinese. This integration is based on OpenWeatherMap (@csparpa, pyowm) to develop.

11 Sep 30, 2022
Microsoft Azure Storage Library for Python

Microsoft Azure Storage Library for Python

Microsoft Azure 329 Dec 16, 2022
A Discord Token Spammer, multi webhooks compatibility, made in python +3.7. By Ezermoz

DiscordWebhookSpammer A Discord Token Spammer, multi webhooks compatibility, made in python +3.7. By Ezermoz Put you webhook in webhooks.txt if you wa

3 Nov 24, 2021
NFTs Upload to OpenSea CuseEdition

NFTs-Upload-to-OpenSea-CuseEdition YOUTUBE VIDEO - Soon... Download Python and

Lil Cuse 2 Jan 04, 2022
Okaeri Robot: a modular bot running on python3 with anime theme and have a lot features

OKAERI ROBOT Okaeri Robot is a modular bot running on python3 with anime theme a

Dream Garden (rey) 2 Jan 19, 2022
Connect your Nintendo Switch playing status to Discord!

Disclaimer: Unfortunately, it appears that Nintendo has removed returning self-Presence in their API as of recently, making this project near obsolete

Deltaion Lee 145 Dec 30, 2022
An Simple Advance Auto Filter Bot Complete Rewritten Version Of Adv-Filter-Bot

Adv Auto Filter Bot V2 This Is Just An Simple Advance Auto Filter Bot Complete Rewritten Version Of Adv-Filter-Bot.. Just Sent Any Text As Query It Wi

0 Dec 18, 2021
Anchor Protocol Script that can save you from being liquidated!

Why My day job requires a fairly good amount of automation from time to time. Besides, I do like computers to work on what I cannot while I'm sleeping

126 Oct 16, 2022