AWS Glue PySpark - Apache Hudi Quick Start Guide

Overview

AWS Glue PySpark - Apache Hudi Quick Start Guide

Disclaimer:

This is a quick start guide for the Apache Hudi Python Spark connector, running on AWS Glue.

It's also specifically configured for the following Glue version:

  • AWS Glue 3.0
    • Spark 3.1.1
    • Python 3.7

Glue Configuration Reference: https://docs.aws.amazon.com/glue/latest/dg/add-job.html

Apache Hudi Reference: https://hudi.apache.org/docs/quick-start-guide/ for more information

Prerequisites:

- Python 3.6 or higher
- AWS CLI - Profile named 'dev' with Administrator Access (https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-profiles.html)

Folder Structure:

glue-hudi-hello
├── README.md
├── cloud-formation
│   ├── command.md
│   └── GlueJobPySparkHudi.yaml
├── jars
│   ├── command.md
│   ├── hudi-spark3-bundle_2.12-0.9.0.jar
│   └── spark-avro_2.12-3.0.1.jar
├── job
│   ├── command.md
│   └── job.py
│   └── upload_job.py
├── requirements.txt

Step 1: Create and activate a virtualenv:

Create a new virtual environment for the project in its root directory:

python3 -m venv venv

Activate it:

source venv/bin/activate

Run from the root directory the pip install to get boto3.

pip install -r requirements.txt

Step 2: Create the AWS Resources:

Now, with a aws configured profile named as dev, cd into the cloud-formation folder and run the command in command.md.

As a AWS Cloud Formation exercise, read the command Parameters and how they are used on the GlueJobPySparkHudi.yaml file to dynamically create the Glue Job and S3 Bucket.

Step 3: Upload the Job and Jars to S3:

cd into the job folder and run the command in command.md.

cd into the jars folder and run the commands in command.md. Note: There is one command for each jar.

Step 4: Check AWS Resources results:

Log into aws console and check the Glue Job and S3 Bucket.

On the AWS Glue console, you can run the Glue Job by clicking on the job name.

After the job is finished, you can check the Glue Data Catalog and query the new database from AWS Athena.

On AWS Athena check for the database: hudi_demo and for the table: hudi_trips.

Owner
Gabriel Amazonas Mesquita
Gabriel Amazonas Mesquita
Gnosis-py includes a set of libraries to work with Ethereum and Gnosis projects

Gnosis-py Gnosis-py includes a set of libraries to work with Ethereum and Gnosis projects: EthereumClient, a wrapper over Web3.py Web3 client includin

Gnosis 93 Dec 23, 2022
Project glow is an open source bot worked on by many people to create a good and safe moderation bot for all

Project Glow Greetings, I see you have stumbled upon project glow. Project glow is an open source bot worked on by many people to create a good and sa

Glowstikk 24 Sep 29, 2022
Python written Rule34 API

Python written Rule34 API

1 Nov 11, 2021
My telegram bot to download Instagram Profiles

Instagram Profile Get for Telegram My telegram bot to download Instagram Profiles First you have to get a telegrm bot api key from @BotFather Then you

Ali Yoonesi 2 Sep 22, 2022
A fully decentralized protocol for private transactions FAST snipe BUY token on LUANCH after add LIQUIDITY

TORNADO CASH Pancakeswap Sniper BOT 2022-V1 (MAC WINDOWS ANDROID LINUX) ⭐️ A fully decentralized protocol for private and safe transactions ⭐️ AUTO DO

Crypto Trader 2 Jan 06, 2022
A program that automates the boring parts of completing the Daily accounting spreadsheet at Taos Ski Valley

TSV_Daily_App A program that automates the boring parts of completing the Daily accounting spreadsheet at my old job. To see how it works you will nee

Devin Beck 2 Jan 01, 2022
A python tool to Automate Whatsapp through Whatsapp web

This python tool is used to Automate Whatsapp through Whatsapp web. We can add number of contacts whom we want to send text messages on perticular time

5 Jul 21, 2022
MCNameBot is a fast discord bot that is used to check the availability of a Minecraft name with a simple command.

MCNameBot MCNameBot is a fast discord bot that is used to check the availability of a Minecraft name with a simple command. If you would like to just

Killin 2 Oct 11, 2022
A pixeldrain python package using pixeldrain official api

Made with Python3 (C) @FayasNoushad Copyright permission under MIT License License - https://github.com/FayasNoushad/Pixeldrain/blob/main/LICENSE In

Fayas Noushad 6 Jan 26, 2022
A simple telegram Bot, Upload Media File| video To telegram using the direct download link. (youtube, Mediafire, google drive, mega drive, etc)

URL-Uploader (Bot) A Bot Upload file|video To Telegram using given Links. Features: 👉 Only Auth Users (AUTH_USERS) Can Use The Bot 👉 Upload YTDL Sup

Hash Minner 18 Dec 17, 2022
Git Plan - a better workflow for git

git plan A better workflow for git. Git plan inverts the git workflow so that you can write your commit message first, before you start writing code.

Rory Byrne 178 Dec 11, 2022
Python client for Arista eAPI

Arista eAPI Python Library The Python library for Arista's eAPI command API implementation provides a client API work using eAPI and communicating wit

Arista Networks EOS+ 124 Nov 23, 2022
AWS Auto Inventory allows you to quickly and easily generate inventory reports of your AWS resources.

Photo by Denny Müller on Unsplash AWS Automated Inventory ( aws-auto-inventory ) Automates creation of detailed inventories from AWS resources. Table

AWS Samples 123 Dec 26, 2022
An API that uses NLP and AI to let you predict possible diseases and symptoms based on a prompt of what you're feeling.

Disease detection API for MediSearch An API that uses NLP and AI to let you predict possible diseases and symptoms based on a prompt of what you're fe

Sebastian Ponce 1 Jan 15, 2022
This is a Discord script that will provide a QR Code to your scholars for Axie Infinity.

DiscordQRCodeBot This is a Discord script that will provide a QR Code to your Axie Infinity scholars. Setup Run Ubuntu on AWS ec2 instance Dowloads al

ZracheSs | xyZ 24 Oct 05, 2022
AWS SQS event redrive Lambda With Python

AWS SQS event redrive Lambda This repository contains one simple AWS Lambda function in Python to redrive AWS SQS events from source queue to destinat

1 Oct 19, 2021
Source Code for our bot that manages time and other functions of the server <3

Komi San wants you to study This repo contains the source code for our bot that manages time and other functions of the server 3 Features Your study

Komi San wants you to study 8 Nov 08, 2021
Trading bot - A Trading bot With Python

Trading_bot Trading bot intended for 1) Tracking current prices of tokens 2) Set

Tymur Kotkov 29 Dec 01, 2022
How to add reaction on message discord.py

BA / HR / RS: Python (discord.py) skripta pomocu koje dodajete reakciju na vasu poruku putem komande !v ili da se dodaje samo u nekoj odredjenoj sobi.

Seekii 3 Dec 23, 2021
MemeBot - A discord bot that tracks how good people's memes are

MemeBot A discord Meme "Karma" Tracking bot Dependancies Make sure you have pymongo installed and a mongodb cluster setup with two collections. pip in

Uday Sharma 3 Aug 10, 2022