A solution designed to extract, transform and load Chicago crime data from an RDS instance to other services in AWS.

Overview

Crime data- Batch Processing:

RDBMS Data Extraction Implementation

This project is intended to implement a solution designed to extract, transform and load Chicago crime data from an RDS instance to other services in AWS.

  • There is an airflow dag script, 2 pyspark application scripts, and a bootstrap actions script in this project which are explained below.

Deployment

Preparation:

  • An AWS RDS MySQL instance is created to store the batch of data.
    • An EC2 instance is created to communicate with the RDS instance.
    • The data is loaded onto the EC2 instance.
    • The database and table are created on the RDS instance with the help of the above created EC2 instance. The data is loaded in the table created above.
    • The create&Load.sql file contains the code for the above table data preparation step.
    • A secret on the Secrets Manager console is stored to communicate with the RDS instance secretly. Also, password rotation after 30 days has been configured for security purposes.
  • The following dag loads the data created from the above step into the AWS environment.

Implementation:

  • The airflow dag is put in the s3://yavula-da-capstone/dag/ location in the S3 bucket. An environment is created on the Amazon Managed Workflows for Apache Airflow(MWAA) console in a specific VPC.
  • The dag is scheduled to run on a daily basis along with SLA monitoring to trigger an alarm if the tasks take more than 36 minutes to finish the whole ETL process.
  • It usually takes 32-34 minutes to finish the dag processes. But if it takes, more than that, it means that something has interrupted the dag from finishing its process and we can check the logs accordingly.

emr_job_flow_manual_steps_dag.py

This script is used to create an airflow dag.

Description

  • The script has steps for the airflow to create an EMR cluster on AWS for a process which is explained later in the next steps.
  • It runs the STEPS that process the spark script on the EMR along with the bootstrap actions present in the bootstrap_actions.sh script which is in an s3 bucket that will install the required package like boto3 onto the EMR instance.
  • Then the step checker is also added to watch this process. This step sensor will periodically check if that last step is completed or skipped or terminated.

spark_ingest_script.py

The spark script which is put into S3 manually, is used to ingest the required data from a table which is present on an RDS isntance and store the data into a raw s3 bucket and catalog into Glue.

Description

  • The ingest script connects to the RDS instance using the mysql-connector.
  • It takes the required crime data from the table and puts it into a spark dataframe which is then written to the AWS S3 and Glue data catalog.
  • S3 File Structure where the snapshot data is saved
    • (bucket)
    • (key)
    • (db-name)
    • (table-name)
  • Glue Data Catalog table pointing to the latest partition

spark_process_script.py

The spark script which is put into S3 manually, is used to query the latest target table, filter required crime details from it, then store the query results into a new final table and further save it to a latest partition.

Description

  • The spark script uses the crime data and performs some query processing using it.
  • It queries the required crime data from the table, performs some processing and puts it into a spark dataframe which is then written to the AWS S3 and Glue data catalog.
  • S3 File Structure where the snapshot data is saved
    • (bucket)
    • (key)
    • (db-name)
    • (table-name)
  • Glue Data Catalog table pointing to the latest partition

bootstrap_actions.sh

Required for the bootstrap actions.

Description

Used to install the packages and dependencies on the cluster that are required for the processes inside the spark script to run.

Deployment

  • This bootstrap script is put manually in an S3 bucket.
  • The location of this bucket is used inside the airflow dag to mention in the bootstrap actions that the required actions are present in the script which is in this particular s3 location.

Business Analysis

The final processed table had the crime type details for all the crimes for which the arrest is not made yet. This business analysis can be viewed from Athena and also has been imported into QuickSight Spice to view the details of different types of crimes and their comparisions.

Owner
Yesaswi Avula
An Applied Data Science student with an escalating learning and performance graph Data analytics, Data engineering, Business Intelligence, ML, Big Data & Cloud
Yesaswi Avula
Block Telegram's new

Telegram Channel Blocker Bot Channel go away! This bot is used to delete and ban message sent by channel How this appears? The reason this appears ple

16 Feb 15, 2022
Quack-SMS-BOMBER - Quack Toolkit By IkigaiHack

Quack Toolkit By IkigaiHack About Quack Toolkit Quack Toolkit is a set of tools

Marcel 2 Aug 19, 2022
This is a bot which you can use in telegram to spam without flooding and enjoy being in the leaderboard

Telegram-Count-spamming-Bot This is a bot which you can use in telegram to spam without flooding and enjoy being in the leaderboard You can avoid the

Lalan Kumar 1 Oct 23, 2021
PRNT.sc Image Grabber

PRNTSender PRNT.sc Image Grabber PRNTSender is a script that takes images posted on PRNT.sc and sends them to a Discord webhook, if you want to know h

neox 2 Dec 10, 2021
A Bot Telegram Anti Users Channel to automatic ban users who using channel to send message in group.

Tg_Anti_UsersChannel A Bot Telegram Anti Users Channel to automatic ban users who using channel to send message in group. Features: Automatic ban Whit

idzeroid 6 Dec 26, 2021
WatonAPI is an API used to connect to spigot servers with the WatonPlugin to communicate.

WatonAPI is an API used to connect to spigot servers with the WatonPlugin to communicate. You can send messages to the server and read messages, making it useful for cross-chat programs.

Waton 1 Nov 22, 2021
It connects to Telegram's API. It generates JSON files containing channel's data, including channel's information and posts.

It connects to Telegram's API. It generates JSON files containing channel's data, including channel's information and posts. You can search for a specific channel, or a set of channels provided in a

Esteban Ponce de Leon 75 Jan 02, 2023
→ Comando Básico para Python Discord

Discord.py · Código @client.event async def on_ready(): print('He iniciado sessión en: {0.user}'.format(client)) @client.event async def on_messa

Panda.xyz 4 Mar 12, 2022
DIAL(Did I Alert Lambda?) is a centralised security misconfiguration detection framework which completely runs on AWS Managed services like AWS API Gateway, AWS Event Bridge & AWS Lambda

DIAL(Did I Alert Lambda?) is a centralised security misconfiguration detection framework which completely runs on AWS Managed services like AWS API Gateway, AWS Event Bridge & AWS Lambda

CRED 71 Dec 29, 2022
A script to forward mass number of media to another group/channel. Heroku deploy

Telegram Forward Script 😇 This is a Script to Forward Large Number of Files to Another Telegram Channel. Star එකක් දාල fork එකක් ගහපියව් 🥴 If You Tr

Anjana Madu 17 Oct 21, 2022
A Telegram Bot which will ask new Group Members to verify them by solving an emoji captcha.

Emoji-Captcha-Bot A Telegram Bot which will ask new Group Members to verify them by solving an emoji captcha. About API: Using api.abirhasan.wtf/captc

Abir Hasan 52 Dec 11, 2022
Changes the Telegram bio, profile picture, first and last name to the song that the user is currently listening to.

TGBIOFY - Telegram & Spotify integration Changes the Telegram bio, profile picture, first and last name to the song that the user is currently listeni

elpideus 26 Dec 07, 2022
Clisd.py - UI framework with client side rendering for python

clisd.py Clisd is UI framework with client side rendering for python. It uses WA

2 Mar 25, 2022
ВКонтакте бот для управления Sugar кошельком

Sugarchain VK ВКонтакте бот для управления Sugar кошельком Установка Установить зависимости можно командой: pip install -r requirements.txt Запуск (из

Vladimir 4 Jun 06, 2021
Visualize size of directories, s3 buckets.

Dir Sizer This is a work in progress, right now consider this an Alpha or Proof of Concept level. dir_sizer is a utility to visualize the size of a di

Scott Seligman 13 Dec 08, 2022
A Code that can make your Discord Account 24/7 on Voice Channels!

Voicecord Make your Discord Account Online 24/7 on Voice Channels! A Code written in Python that helps you to keep your account 24/7 on Voice Channels

Phantom 229 Jan 07, 2023
The aim is to contain multiple models for materials discovery under a common interface

Aviary The aviary contains: - roost, - wren, cgcnn. The aim is to contain multiple models for materials discovery under a common interface Environment

Rhys Goodall 20 Jan 06, 2023
TORNADO CASH Pancakeswap Sniper BOT 2022-V1 (MAC WINDOWS ANDROID LINUX)

TORNADO CASH Pancakeswap Sniper BOT 2022-V1 (MAC WINDOWS ANDROID LINUX)

Crypto Trader 1 Jan 06, 2022
WhatsApp Status Tracker With Python

Warning!! This Repo is Purly educational purpose Don't use this to stalk on others, which is subjective to crime Pre-Req: Telegram bot of your own wit

Vignesh Karunagaran 10 Dec 09, 2022
Send song lyrics to iMessage users using the Genius lyrics API

pyMessage Send song lyrics to iMessage users using the Genius lyrics API. Setup 1.) Open the main.py file, and add your API key on line 7. 2.) Install

therealkingnull 1 Jan 23, 2022