Demonstrate a Dataflow pipeline that saves data from an API into BigQuery table

Overview

Overview

dataflow-mvp provides a basic example pipeline that pulls data from an API and writes it to a BigQuery table using GCP's Dataflow (i.e., Apache Beam)

Table of Contents

File Description
main.py Main Python code for the Dataflow pipeline. The function defineBQSchema defines the BQ table schema
setup.py When the pipeline is deployed in GCP as a template, GCP uses setup.py to set up the worker nodes (e.g., install required Python dependencies).
build.bat Bash script to deploy the pipeline as a reusable template in GCP.

Environment

  • Local machine running Microsoft Windows 10 Home
  • Python 3.6.8
    • As of 12/1/21, Apache Beam only supports 3.6, 3.7, and 3.8 (not 3.9). However, orjson only supports 3.6.

Getting Started

Pre-Requisites

The following instructions assume that the project ID is dataflow-mvp and you have owner access to it.

  1. If you don't have it already, install the Google Cloud SDK:
    https://cloud.google.com/sdk/docs/install

  2. Authenticate your Google account:
    gcloud auth login

  3. Create a virtual environment for Python:
    py -3.8 venv venv

  4. Activate the virtual environment, upgrade pip, and install the Apache Beam library for GCP:

"./venv/Scripts/activate.bat"
python -m pip install --upgrade pip
python -m pip install apache_beam[gcp]

Run Build

  1. To make our lives easier later, set environment variables for the following:
  • DATAFLOW_BUCKET - the name of the bucket from step 10
  • DF_TEMPLATE_NAME - the name (of your choosing) for the Dataflow template (e.g., dataflow-mvp-dog)
  • PROJECT_ID - the name of the GCP project from step 4 (e.g., dataflow-mvp)
  • GCP_REGION - the GCP region (I like to choose the region closest to me e.g., useast-1)

For instance, to set the PROJECT_ID variable in the Windows CLI, use:
set PROJECT_ID=dataflow-mvp

On Linux machines, use
export PROJECT_ID=dataflow-mvp

The instructions below assume you're working on a Windows machine. Therefore, if you're working in a Linux environment, you'll have to use $PROJECT_ID instead of %PROJECT_ID% where appropriate in the instructions below.

  1. Set the GCP project via config:
    gcloud config set project %PROJECT_ID%
  • You can verify the project is correctly set using:
    gcloud config list
  1. Enable the necessary APIs:
gcloud services enable dataflow.googleapis.com && ^
gcloud services enable cloudscheduler.googleapis.com && ^
gcloud services enable bigquery.googleapis.com && ^
gcloud services enable cloudresourcemanager.googleapis.com  && ^
gcloud services enable appengine.googleapis.com
  1. Create a service account for the Dataflow runner:
gcloud iam service-accounts create dataflow-runner --display-name "Dataflow Runner service account"
  1. Add the required IAM roles to the Dataflow runner's service account:
gcloud projects add-iam-policy-binding %PROJECT_ID% --member serviceAccount:dataflow-runner@%PROJECT_ID%.iam.gserviceaccount.com --role roles/owner
  1. Create a GCS bucket to store Dataflow code, staging files and templates:
gsutil mb -p %PROJECT_ID% -l %GCP_REGION% gs://%DATAFLOW_BUCKET%

Build the Dataflow Template

  1. In build.bat, edit the variables in lines 1 through 4:
  • DATAFLOW_BUCKET - the name of the bucket from step 10
  • DF_TEMPLATE_NAME - the name (of your choosing) for the Dataflow template (e.g., dataflow-mvp-dog)
  • PROJECT_ID - the name of the GCP project from step 4 (e.g., dataflow-mvp)
  • GCP_REGION - the GCP region (I like to choose the region closest to me e.g., useast-1)
  1. Run the build.bat script:
build.bat

This will create the template for the Dataflow job in a the specified GCS bucket.

  1. Verify that the template has been uploaded to the GCS bucket:
    gsutil ls gs://%DATAFLOW_BUCKET%/templates/%DF_TEMPLATE_NAME%

Create the Cloud Scheduler Job

  1. Finally, submit a Cloud Scheduler job to run Dataflow on a desired schedule:
gcloud scheduler jobs create http api-to-gbq-scheduler ^
--schedule="0 */3 * * *" ^
--uri="https://dataflow.googleapis.com/v1b3/projects/%PROJECT_ID%/locations/%GCP_REGION%/templates:launch?gcsPath=gs://%DATAFLOW_BUCKET%/templates/%DF_TEMPLATE_NAME%" ^
--http-method="post" ^
--oauth-service-account-email="dataflow-runner@%PROJECT_ID%.iam.gserviceaccount.com" ^
--oauth-token-scope="https://www.googleapis.com/auth/cloud-platform" ^
--message-body="{""jobName"": ""api-to-bq-df"", ""parameters"": {""region"": ""%GCP_REGION%""}, ""environment"": {""numWorkers"": ""3""}}" ^
--time-zone=America/Chicago 

Notes:

  • Alternatively, you could use the message-body-from-file argument. However, you'll need to manually specify the GCP region since we can't use environment variables within the JSON.
  • The cron string 0 */3 * * * executes the job every 3 hours.
  • The jobName parameter, api-to-bq-df, names the job as it will be listed in the Cloud Scheduler app.

Resources

Warranty

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Owner
Chris Carbonell
Chris Carbonell
Collections of pydantic models

pydantic-collections The pydantic-collections package provides BaseCollectionModel class that allows you to manipulate collections of pydantic models

Roman Snegirev 20 Dec 26, 2022
cLoops2: full stack analysis tool for chromatin interactions

cLoops2: full stack analysis tool for chromatin interactions Introduction cLoops2 is an extension of our previous work, cLoops. From loop-calling base

YaqiangCao 25 Dec 14, 2022
This repo contains a simple but effective tool made using python which can be used for quality control in statistical approach.

📈 Statistical Quality Control 📉 This repo contains a simple but effective tool made using python which can be used for quality control in statistica

SasiVatsal 8 Oct 18, 2022
Average time per match by division

HW_02 Unzip matches.rar to access .json files for matches. Get an API key to access their data at: https://developer.riotgames.com/ Average time per m

11 Jan 07, 2022
In this project, ETL pipeline is build on data warehouse hosted on AWS Redshift.

ETL Pipeline for AWS Project Description In this project, ETL pipeline is build on data warehouse hosted on AWS Redshift. The data is loaded from S3 t

Mobeen Ahmed 1 Nov 01, 2021
Udacity - Data Analyst Nanodegree - Project 4 - Wrangle and Analyze Data

WeRateDogs Twitter Data from 2015 to 2017 Udacity - Data Analyst Nanodegree - Project 4 - Wrangle and Analyze Data Table of Contents Introduction Proj

Keenan Cooper 1 Jan 12, 2022
Common bioinformatics database construction

biodb Common bioinformatics database construction 1.taxonomy (Substance classification database) Download the database wget -c https://ftp.ncbi.nlm.ni

sy520 2 Jan 04, 2022
A lightweight, hub-and-spoke dashboard for multi-account Data Science projects

A lightweight, hub-and-spoke dashboard for cross-account Data Science Projects Introduction Modern Data Science environments often involve many indepe

AWS Samples 3 Oct 30, 2021
Orchest is a browser based IDE for Data Science.

Orchest is a browser based IDE for Data Science. It integrates your favorite Data Science tools out of the box, so you don’t have to. The application is easy to use and can run on your laptop as well

Orchest 3.6k Jan 09, 2023
Data collection, enhancement, and metrics calculation.

l3_data_collection Data collection, enhancement, and metrics calculation. Summary Repository containing code for QuantDAO's JDT data collection task.

Ruiwyn 3 Dec 23, 2022
Used for data processing in machine learning, and help us to construct ML model more easily from scratch

Used for data processing in machine learning, and help us to construct ML model more easily from scratch. Can be used in linear model, logistic regression model, and decision tree.

ShawnWang 0 Jul 05, 2022
This is a repo documenting the best practices in PySpark.

Spark-Syntax This is a public repo documenting all of the "best practices" of writing PySpark code from what I have learnt from working with PySpark f

Eric Xiao 447 Dec 25, 2022
Data cleaning tools for Business analysis

Datacleaning datacleaning tools for Business analysis This program is made for Vicky's work. You can use it, too. 数据清洗 该数据清洗工具是为了商业分析 这个程序是为了Vicky的工作而

Lin Jian 3 Nov 16, 2021
ForecastGA is a Python tool to forecast Google Analytics data using several popular time series models.

ForecastGA is a tool that combines a couple of popular libraries, Atspy and googleanalytics, with a few enhancements.

JR Oakes 36 Jan 03, 2023
MetPy is a collection of tools in Python for reading, visualizing and performing calculations with weather data.

MetPy MetPy is a collection of tools in Python for reading, visualizing and performing calculations with weather data. MetPy follows semantic versioni

Unidata 971 Dec 25, 2022
Py-price-monitoring - A Python price monitor

A Python price monitor This project was focused on Brazil, so the monitoring is

Samuel 1 Jan 04, 2022
A pipeline that creates consensus sequences from a Nanopore reads. I

A pipeline that creates consensus sequences from a Nanopore reads. It clusters reads that are similar to each other and creates a consensus that is then identified using BLAST.

Ada Madejska 2 May 15, 2022
ASTR 302: Python for Astronomy (Winter '22)

ASTR 302, Winter 2022, University of Washington: Python for Astronomy Mario Jurić Location When: 2:30-3:50, Monday & Wednesday, Winter quarter 2022 Wh

UW ASTR 302: Python for Astronomy 4 Jan 12, 2022
Finding project directories in Python (data science) projects, just like there R rprojroot and here packages

Find relative paths from a project root directory Finding project directories in Python (data science) projects, just like there R here and rprojroot

Daniel Chen 102 Nov 16, 2022
Handle, manipulate, and convert data with units in Python

unyt A package for handling numpy arrays with units. Often writing code that deals with data that has units can be confusing. A function might return

The yt project 304 Jan 02, 2023