A solution designed to extract, transform and load Chicago crime data from an RDS instance to other services in AWS.

Overview

Crime Data - Batch Processing:

RDBMS Data Extraction Implementation

This project implements a solution that extracts, transforms, and loads Chicago crime data from an RDS instance into other AWS services.

  • The project contains an Airflow DAG script, two PySpark application scripts, and a bootstrap actions script, each explained below.

Deployment

Preparation:

  • An AWS RDS MySQL instance is created to store the batch data.
    • An EC2 instance is created to communicate with the RDS instance.
    • The data is loaded onto the EC2 instance.
    • The database and table are created on the RDS instance from that EC2 instance, and the data is then loaded into the new table.
    • The create&Load.sql file contains the code for this table creation and data loading step.
    • A secret is stored in AWS Secrets Manager so that the RDS instance can be accessed securely. Password rotation every 30 days is also configured for security.
  • The DAG described below loads the data prepared in these steps into the AWS environment.
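
As a minimal sketch of how the stored secret could be turned into connection settings: the helper below parses a Secrets Manager secret string and builds keyword arguments for `mysql.connector.connect`. The field names (`host`, `port`, `username`, `password`, `dbname`) follow the usual layout of RDS secrets, but the actual secret in this project may differ, and the sample values are placeholders, not real credentials.

```python
import json


def connection_settings(secret_string: str) -> dict:
    """Convert a Secrets Manager SecretString (JSON) into mysql-connector kwargs."""
    secret = json.loads(secret_string)
    return {
        "host": secret["host"],
        "port": int(secret.get("port", 3306)),  # MySQL default port if absent
        "user": secret["username"],
        "password": secret["password"],
        "database": secret["dbname"],
    }


# Placeholder payload (assumption). In a real run the string would come from
# boto3.client("secretsmanager").get_secret_value(SecretId=...)["SecretString"].
example_secret = json.dumps({
    "host": "example-db.abc123.us-east-1.rds.amazonaws.com",
    "port": 3306,
    "username": "example_user",
    "password": "example_password",
    "dbname": "crime_db",
})

settings = connection_settings(example_secret)
```

Because the password is read at run time, a rotated password is picked up on the next run without changing the scripts.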

Implementation:

  • The Airflow DAG is placed in the s3://yavula-da-capstone/dag/ location in the S3 bucket. An Amazon Managed Workflows for Apache Airflow (MWAA) environment is created in a specific VPC.
  • The DAG is scheduled to run daily, with SLA monitoring that triggers an alarm if the whole ETL process takes more than 36 minutes.
  • A run usually takes 32-34 minutes. If it takes longer than that, something has interrupted the DAG, and the logs can be checked to find the cause.
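
The SLA threshold above can be sketched as follows. In Airflow 2.x an `sla` value is a `timedelta` that is commonly set on tasks through `default_args`; the dictionary below and the `exceeds_sla` helper are illustrative assumptions, not the contents of the project's DAG file.

```python
from datetime import timedelta

# 36-minute limit taken from the description above.
SLA_LIMIT = timedelta(minutes=36)

# Hypothetical default_args; in an Airflow 2.x DAG file this dictionary would
# typically be passed as DAG(..., schedule_interval="@daily", default_args=default_args).
default_args = {
    "owner": "airflow",
    "retries": 0,
    "sla": SLA_LIMIT,
}


def exceeds_sla(run_duration: timedelta, limit: timedelta = SLA_LIMIT) -> bool:
    """Return True when a run took longer than the SLA limit."""
    return run_duration > limit
```

A typical 34-minute run stays inside the limit, while a 37-minute run would be flagged for a log review.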

emr_job_flow_manual_steps_dag.py

This script defines the Airflow DAG.

Description

  • The script defines the steps Airflow uses to create an EMR cluster on AWS for the processing described in the following sections.
  • It runs the steps that execute the Spark scripts on EMR, together with the bootstrap actions in bootstrap_actions.sh. That script is stored in an S3 bucket and installs required packages such as boto3 on the EMR instances.
  • A step sensor is also added to watch this process. It periodically checks whether the last step has completed, been skipped, or been terminated.
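
A rough sketch of the configuration such a DAG needs is shown below. In the Amazon provider for Airflow, dictionaries of this shape are typically passed to `EmrCreateJobFlowOperator` (as `job_flow_overrides`) and `EmrAddStepsOperator` (as `steps`), with an `EmrStepSensor` watching the last step. All S3 paths, the cluster name, the release label, and the instance type are assumptions for illustration only.

```python
# Hypothetical S3 locations; the real bucket and keys are not given in this README.
BOOTSTRAP_SCRIPT = "s3://example-bucket/scripts/bootstrap_actions.sh"
INGEST_SCRIPT = "s3://example-bucket/scripts/spark_ingest_script.py"
PROCESS_SCRIPT = "s3://example-bucket/scripts/spark_process_script.py"


def spark_step(name: str, script_uri: str) -> dict:
    """Build one EMR step that submits a PySpark script with spark-submit."""
    return {
        "Name": name,
        "ActionOnFailure": "CANCEL_AND_WAIT",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster", script_uri],
        },
    }


# Ingest first, then process the ingested data.
SPARK_STEPS = [
    spark_step("ingest_crime_data", INGEST_SCRIPT),
    spark_step("process_crime_data", PROCESS_SCRIPT),
]

JOB_FLOW_OVERRIDES = {
    "Name": "crime-data-batch",        # assumed cluster name
    "ReleaseLabel": "emr-6.10.0",      # assumed EMR release
    "Applications": [{"Name": "Spark"}],
    "BootstrapActions": [
        {
            "Name": "install-dependencies",
            "ScriptBootstrapAction": {"Path": BOOTSTRAP_SCRIPT},
        }
    ],
    "Instances": {
        "InstanceGroups": [
            {
                "Name": "Primary node",
                "Market": "ON_DEMAND",
                "InstanceRole": "MASTER",
                "InstanceType": "m5.xlarge",  # assumed instance type
                "InstanceCount": 1,
            }
        ],
        # Keep the cluster alive so steps can be added after creation.
        "KeepJobFlowAliveWhenNoSteps": True,
        "TerminationProtected": False,
    },
    "JobFlowRole": "EMR_EC2_DefaultRole",
    "ServiceRole": "EMR_DefaultRole",
}
```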

spark_ingest_script.py

This Spark script, which is uploaded to S3 manually, ingests the required data from a table on an RDS instance, stores it in a raw S3 bucket, and catalogs it in Glue.

Description

  • The ingest script connects to the RDS instance using mysql-connector.
  • It reads the required crime data from the table into a Spark DataFrame, which is then written to S3 and registered in the Glue Data Catalog.
  • S3 file structure where the snapshot data is saved:
    • (bucket)
    • (key)
    • (db-name)
    • (table-name)
  • Glue Data Catalog table pointing to the latest partition
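
The S3 layout listed above could be assembled with a small helper like the one below. This is only a sketch: the `snapshot_date=` partition name and the sample bucket, key, database, and table names are assumptions, since the README does not state how partitions are named.

```python
from datetime import date


def snapshot_path(bucket: str, key: str, db_name: str, table_name: str,
                  snapshot_date: date) -> str:
    """Build the S3 prefix (bucket)/(key)/(db-name)/(table-name)/<partition>/."""
    # Hypothetical partition naming; the project's actual scheme may differ.
    partition = f"snapshot_date={snapshot_date.isoformat()}"
    return f"s3://{bucket}/{key.strip('/')}/{db_name}/{table_name}/{partition}/"


example_path = snapshot_path(
    "example-raw-bucket", "snapshots", "crime_db", "crimes", date(2022, 1, 15)
)
```

A Glue Data Catalog table that points to the latest partition would then reference the prefix produced for the most recent date.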

spark_process_script.py

This Spark script, which is uploaded to S3 manually, queries the latest target table, filters the required crime details, stores the query results in a new final table, and saves them to the latest partition.

Description

  • The Spark script runs query processing on the ingested crime data.
  • It queries the required crime data from the table, processes it, and loads it into a Spark DataFrame, which is then written to S3 and registered in the Glue Data Catalog.
  • S3 file structure where the snapshot data is saved:
    • (bucket)
    • (key)
    • (db-name)
    • (table-name)
  • Glue Data Catalog table pointing to the latest partition
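
To illustrate the kind of filtering this stage performs, the sketch below counts crimes by type for records where no arrest has been made. It uses plain Python instead of PySpark so the logic can be run anywhere; the column names `primary_type` and `arrest` and the sample rows are assumptions, not taken from the project's table.

```python
from collections import Counter


def open_cases_by_type(rows):
    """Count crimes per type, keeping only rows where no arrest was made."""
    return Counter(row["primary_type"] for row in rows if not row["arrest"])


# Made-up sample rows for illustration only.
sample_rows = [
    {"primary_type": "THEFT", "arrest": False},
    {"primary_type": "THEFT", "arrest": True},
    {"primary_type": "BATTERY", "arrest": False},
    {"primary_type": "THEFT", "arrest": False},
]

summary = open_cases_by_type(sample_rows)
```

In a Spark job the same idea would typically be expressed as a filter on the arrest column followed by a group-by on the crime type.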

bootstrap_actions.sh

This script is required for the EMR bootstrap actions.

Description

Installs the packages and dependencies that the cluster needs to run the processes in the Spark scripts.

Deployment

  • The bootstrap script is uploaded manually to an S3 bucket.
  • The Airflow DAG references this S3 location in its bootstrap actions configuration, so the cluster knows where to find the script containing the required actions.

Business Analysis

The final processed table contains the crime type details for all crimes in which no arrest has been made yet. This analysis can be viewed in Athena and has also been imported into QuickSight SPICE to explore the different crime types and compare them.

Owner
Yesaswi Avula