Template for a Dataflow Flex Template in Python

Overview

This repository contains a template for a Dataflow Flex Template written in Python. Use it as a starting point for building Dataflow jobs that run in STOIX using Dataflow runner.

The code uses the same example data as the Google Cloud Python quickstart: the text of "King Lear", a tragedy written by William Shakespeare.

The Dataflow job reads the file content, counts the occurrences of each word, and inserts the counts into a BigQuery table. The scheduled date is also added to the table name, producing a sharded output table.
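The two steps described above can be sketched in plain Python. This is only an illustration of the logic, not the template's Apache Beam code: the function names, the word-splitting regular expression, and the PREFIX_YYYYMMDD shard format (the usual BigQuery convention for date-sharded tables) are assumptions.

```python
import re
from collections import Counter
from datetime import datetime


def count_words(text: str) -> Counter:
    """Count occurrences of each word (hypothetical helper, case-sensitive)."""
    # Assumption: a word is a run of letters and apostrophes.
    return Counter(re.findall(r"[A-Za-z']+", text))


def sharded_table_name(prefix: str, scheduled: str) -> str:
    """Build a table name from the prefix and the scheduled date.

    Assumption: the shard suffix is the date formatted as YYYYMMDD.
    """
    # datetime.fromisoformat in Python 3.7 rejects a trailing "Z",
    # so replace it with an explicit UTC offset first.
    moment = datetime.fromisoformat(scheduled.replace("Z", "+00:00"))
    return f"{prefix}_{moment:%Y%m%d}"


if __name__ == "__main__":
    print(count_words("the king and the fool").most_common(1))
    print(sharded_table_name("kinglear", "2021-01-01T00:00:00Z"))
```

Under these assumptions, one scheduled run per day would write to one table per day, all sharing the same prefix.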

Source data:

Template maintained by STOIX.

Configuration

The job is configured with the following pipeline options:

  • stoix_scheduled - Scheduled datetime as RFC3339
  • input_file - Text to read
  • output_dataset - BigQuery dataset for output table
  • output_table_prefix - BigQuery output table name prefix
  • project - Google Cloud project id

When using Dataflow runner, stoix_scheduled is automatically set and other pipeline options can be added as described in the Dataflow runner README.
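As a rough sketch of how these options could be declared, the example below uses only the standard-library argparse module. The template itself presumably declares them through Apache Beam's pipeline options; the function name parse_options and the choice to mark every option as required are assumptions made for illustration.

```python
import argparse


def parse_options(argv):
    """Parse the job's options and return (known options, remaining args)."""
    parser = argparse.ArgumentParser(description="Count words (sketch)")
    parser.add_argument("--stoix_scheduled", required=True,
                        help="Scheduled datetime as RFC3339")
    parser.add_argument("--input_file", required=True,
                        help="Text to read")
    parser.add_argument("--output_dataset", required=True,
                        help="BigQuery dataset for output table")
    parser.add_argument("--output_table_prefix", required=True,
                        help="BigQuery output table name prefix")
    parser.add_argument("--project", required=True,
                        help="Google Cloud project id")
    # Unknown flags (for example --runner) are returned untouched so that
    # they can be passed on to the pipeline runner.
    return parser.parse_known_args(argv)


if __name__ == "__main__":
    import sys
    options, remaining = parse_options(sys.argv[1:])
    print(options, remaining)
```

Returning the unrecognised arguments separately keeps job-specific options apart from runner options such as the region or temp location.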

Test the code

Tox is used to format, test and lint the code. Install it with pip install tox, then run tox in the project folder.

Run pipeline

In order to work with the code locally, you can use Python virtual environments. Make sure to use Python version 3.7.10 as it is the version supported by Google Dataflow.

$ python3 -m venv venv
$ source venv/bin/activate
$ pip install -e .

Run on local machine

See the Python quickstart for a further description of the arguments.

python -m main \
    --region europe-north1 \
    --runner DirectRunner \
    --stoix_scheduled 2021-01-01T00:00:00Z \
    --input_file gs://dataflow-samples/shakespeare/kinglear.txt \
    --output_table_prefix kinglear \
    --output_dataset <dataset> \
    --project <project> \
    --temp_location gs://<bucket>/tmp/

Build Docker image for STOIX

To run the pipeline, the Flex Template must be packaged in a Docker image and pushed to a Docker image repository. This example uses Docker Hub.

Set the tag to the name and version of your pipeline, e.g. stoix/count-words:1.0.0.

$ docker build --tag stoix/count-words:1.0.0 .

Then upload the image to the Docker image repository.

$ docker push stoix/count-words:1.0.0

Run Dataflow on STOIX

Now the Dataflow Flex Template job can be run using Dataflow runner. Add a new job with the image stoix/dataflow-runner and the following environment variables:

  • GCP_PROJECT_ID:
  • GCP_REGION: europe-north1
  • GCP_SERVICE_ACCOUNT: BASE64 encoded service account JSON
  • JOB_IMAGE: stoix/count-words:1.0.0
  • JOB_NAME_PREFIX: count-words
  • JOB_PARAM_INPUT_FILE: gs://dataflow-samples/shakespeare/kinglear.txt
  • JOB_PARAM_OUTPUT_DATASET: dataflow
  • JOB_PARAM_OUTPUT_TABLE_PREFIX: kinglear
  • JOB_SDK_LANGUAGE: python

Note: When running this in production, set GCP_SERVICE_ACCOUNT as a secret instead of an environment variable.
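The sketch below shows one way to produce the BASE64 value for GCP_SERVICE_ACCOUNT, and how the JOB_PARAM_ variables appear to line up with the pipeline options listed under Configuration. Both helper names are hypothetical, and the mapping rule (strip the JOB_PARAM_ prefix, lowercase the rest) is inferred from the variable names above rather than taken from the Dataflow runner source.

```python
import base64
import json


def encode_service_account(key: dict) -> str:
    """Return the service account JSON as a BASE64 string (hypothetical helper)."""
    raw = json.dumps(key).encode("utf-8")
    return base64.b64encode(raw).decode("ascii")


def job_params(env: dict) -> dict:
    """Map JOB_PARAM_* variables to pipeline option names.

    Assumption: JOB_PARAM_INPUT_FILE corresponds to the input_file option,
    and likewise for the other JOB_PARAM_ variables.
    """
    prefix = "JOB_PARAM_"
    return {
        name[len(prefix):].lower(): value
        for name, value in env.items()
        if name.startswith(prefix)
    }


if __name__ == "__main__":
    # Placeholder key content, not a real credential.
    print(encode_service_account({"type": "service_account"}))
    print(job_params({
        "JOB_PARAM_INPUT_FILE": "gs://dataflow-samples/shakespeare/kinglear.txt",
        "JOB_PARAM_OUTPUT_DATASET": "dataflow",
        "GCP_REGION": "europe-north1",
    }))
```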

License

MIT
