Data Orchestration Platform

Related tags

Miscellaneousdop
Overview

Table of contents

What is DOP

Design Concept

DOP is designed to simplify the orchestration effort across many connected components using a configuration file without the need to write any code. We have a vision to make orchestration easier to manage and more accessible to a wider group of people.

Here are some of the key design concept behind DOP,

  • Built on top of Apache Airflow - Utilises it’s DAG capabilities with interactive GUI
  • DAGs without code - YAML + SQL
  • Native capabilities (SQL) - Materialisation, Assertion and Invocation
  • Extensible via plugins - DBT job, Spark job, Egress job, Triggers, etc
  • Easy to setup and deploy - fully automated dev environment and easy to deploy
  • Open Source - open sourced under the MIT license

Please note that this project is heavily optimised to run with GCP (Google Cloud Platform) services which is our current focus. By focusing on one cloud provider, it allows us to really improve on end user experience through automation

A Typical DOP Orchestration Flow

Typical DOP Flow

Prerequisites - Run in Docker

Note that all the IAM related prerequisites will be available as a Terraform template soon!

For DOP Native Features

  1. Download and install Docker https://docs.docker.com/get-docker/ (if you are on Windows, please follow instruction here as there are some additional steps required for it to work https://docs.docker.com/docker-for-windows/install/)
  2. Download and install Google Cloud Platform (GCP) SDK following instructions here https://cloud.google.com/sdk/docs/install.
  3. Create a dedicated service account for docker with limited permissions for the development GCP project, the Docker instance is not designed to be connected to the production environment
    1. Call it dop-docker-user@<your GCP project id> and create it in https://console.cloud.google.com/iam-admin/serviceaccounts?project=<your GCP project id>
    2. Assign the roles/bigquery.dataEditor and roles/bigquery.jobUser role to the service account under https://console.cloud.google.com/iam-admin/iam?project=<your GCP project id>
  4. Your GCP user / group will need to be given the roles/iam.serviceAccountUser and the roles/iam.serviceAccountTokenCreator role on thedevelopment project just for the dop-docker-user service account in order to enable Service Account Impersonation.
    Grant service account user
  5. Authenticating with your GCP environment by typing in gcloud auth application-default login in your terminal and following instructions. Make sure you proceed to the stage where application_default_credentials.json is created on your machine (For windows users, make a note of the path, this will be required on a later stage)
  6. Clone this repository to your machine.

For DBT

  1. Setup a service account for your GCP project called dop-dbt-user in https://console.cloud.google.com/iam-admin/serviceaccounts?project=<your GCP project id>
  2. Assign the roles/bigquery.dataEditor and roles/bigquery.jobUser role to the service account at project level under https://console.cloud.google.com/iam-admin/iam?project=<your GCP project id>
  3. Your GCP user / group will need to be given the roles/iam.serviceAccountUser and the roles/iam.serviceAccountTokenCreator role on the development project just for the dop-dbt-user service account in order to enable Service Account Impersonation.

Instructions for Setting things up

Run Airflow with DOP in Docker - Mac

See README in the service project setup and follow instructions.

Once it's setup, you should see example DOP DAGs such as dop__example_covid19 Airflow in Docker

Run Airflow with DOP in Docker - Windows

This is currently working in progress, however the instructions on what needs to be done is in the Makefile

Run on Composer

Prerequisites

  1. Create a dedicate service account for Composer and call it dop-composer-user with following roles at project level
    • roles/bigquery.dataEditor
    • roles/bigquery.jobUser
    • roles/composer.worker
    • roles/compute.viewer
  2. Create a dedicated service account for DBT with limited permissions.
    1. [Already done in here if it’s DEV] Call it dop-dbt-user@<GCP project id> and create in https://console.cloud.google.com/iam-admin/serviceaccounts?project=<your GCP project id>
    2. [Already done in here if it’s DEV] Assign the roles/bigquery.dataEditor and roles/bigquery.jobUser role to the service account at project level under https://console.cloud.google.com/iam-admin/iam?project=<your GCP project id>
    3. The dop-composer-user will need to be given the roles/iam.serviceAccountUser and the roles/iam.serviceAccountTokenCreator role just for the dop-dbt-user service account in order to enable Service Account Impersonation.

Create Composer Cluster

  1. Use the service account already created dop-composer-user instead of the default service account
  2. Use the following environment variables
    DOP_PROJECT_ID : {REPLACE WITH THE GCP PROJECT ID WHERE DOP WILL PERSIST ALL DATA TO}
    DOP_LOCATION : {REPLACE WITH GCP REGION LOCATION WHRE DOP WILL PERSIST ALL DATA TO}
    DOP_SERVICE_PROJECT_PATH := {REPLACE WITH THE ABSOLUTE PATH OF THE Service Project, i.e. /home/airflow/gcs/dags/dop_{service project name}
    DOP_INFRA_PROJECT_ID := {REPLACE WITH THE GCP INFRASTRUCTURE PROJECT ID WHERE BUILD ARTIFACTS ARE STORED, i.e. a DBT docker image stored in GCR}
    
    and optionally
    DOP_GCR_PULL_SECRET_NAME:= {This maybe needed if the project storing the gcr images are not he same as where Cloud Composer runs, however this might be a better alternative https://medium.com/google-cloud/using-single-docker-repository-with-multiple-gke-projects-1672689f780c}
    
  3. Add the following Python Packages
    dataclasses==0.7
    
  4. Finally create a new node pool with the following k8 label
    key: cloud.google.com/gke-nodepool
    value: kubernetes-task-pool
    

Deployment

See Service Project README

Misc

Service Account Impersonation

Impersonation is a GCP feature allows a user / service account to impersonate as another service account.
This is a very useful feature and offers the following benefits

  • When doing development locally, especially with automation involved (i.e using Docker), it is very risky to interact with GCP services by using your user account directly because it may have a lot of permissions. By impersonate as another service account with less permissions, it is a lot safer (least privilege)
  • There is no credential needs to be downloaded, all permissions are linked to the user account. If an employee leaves the company, access to GCP will be revoked immediately because the impersonation process is no longer possible

The following diagram explains how we use Impersonation in DOP when it runs in Docker DOP Docker Account Impersonation

And when running DBT jobs on production, we are also using this technique to use the composer service account to impersonate as the dop-dbt-user service account so that service account keys are not required.

There are two very google articles explaining how impersonation works and why using it

You might also like...
Cross-platform config and manager for click console utilities.

climan Help the project financially: Donate: https://smartlegion.github.io/donate/ Yandex Money: https://yoomoney.ru/to/4100115206129186 PayPal: https

YourCity is a platform to match people to their prefect city.
YourCity is a platform to match people to their prefect city.

YourCity YourCity is a city matching App that matches users to their ideal city. It is a fullstack React App made with a Redux state manager and a bac

A multi-platform fuzzer for poking at userland binaries and servers

litefuzz A multi-platform fuzzer for poking at userland binaries and servers litefuzz intro why how it works what it does what it doesn't do support p

A platform for developers 👩‍💻  who wants to share their programs and projects.
A platform for developers 👩‍💻 who wants to share their programs and projects.

Fest-Practice-2021 This project is excluded from Hacktoberfest 2021. Please use this as a testing repo/project. A platform for developers 👩‍💻 who wa

Speed up your typing by some exercises in the multi-platform(Windows/Ubuntu).

Introduction This project purpose is speed up your typing by some exercises in the multi-platform(Windows/Ubuntu). Build Environment Software Environm

An Airdrop alternative for cross-platform users only for desktop with Python

PyDrop An Airdrop alternative for cross-platform users only for desktop with Python, -version 1.0 with less effort, just as a practice. ##############

Platform Tree for Xiaomi Redmi Note 7/7S (lavender)
Platform Tree for Xiaomi Redmi Note 7/7S (lavender)

The Xiaomi Redmi Note 7 (codenamed "lavender") is a mid-range smartphone from Xiaomi announced in January 2019. Device specifications Device Xiaomi Re

A Classroom Engagement Platform

Project Introduction This is project introduction Setup Setting up Postgres This is the most tricky part when setting up the application. You will nee

Traffic flow test platform, especially for reinforcement learning
Traffic flow test platform, especially for reinforcement learning

Traffic Flow Test Platform Traffic flow test platform, especially for reinforcement learning, named TFTP. A traffic signal control framework that can

Comments
  • Release DOP v0.3.0

    Release DOP v0.3.0

    A number of new features where added in this version

    DOP v0.3.0 — 2021-08-11

    Features

    • Support for "generic" airflow operators: you can now use regular python operators as part of your config files.

    • Support for “dbt docs” command to generate documentation for all dbt tasks: Users can now add “docs generate” as a target in their DOP configuration and additionally specify a GCS bucket with the --bucket and --bucket-path options where documents are copied to.

    • Serve dbt docs: Documents generated by dbt can be served as a web page by deploying the provided app on GAE. Note that deploying is an additional step that needs to be carried out after docs have been generated. See infrastructure/dbt-docs/README.md for details.

    • dbt tasks artifacts run_results created by dbt tasks saved to BigQuery: This json file contains information on completed dbt invocations and is saved in the BQ table “run_results” for analysis and debugging.

    • Add support for Airflow v1.10.14 and v1.10.15 local environments: Users can specify which version they want to use by setting the AIRFLOW_VERSION environment variable.

    • Pre-commit linters: added pre-commit hooks to ensure python, yaml and some support for plain text file consistency in formatting and style throughout DOP codebase.

    Changes

    • Ensure DAGs using the same DBT project do not run concurrently: Safety feature to safely allow selective execution of workflows by calling specific commands or tags (e.g. dbt run --m) within a single dbt project. This avoids creating inter-dependant workflows to avoid overriding each other's artifacts, since they will share the same target location (within the dbt container).

    • Test time-partitioning: Time-partitioning of datetime type properly validated as part of schema validation.

    • Use Python 3.7 and dbt 0.19.1 in Composer K8s Operator

    • Add Dataflow example task: with the introduction of "regular" in the yaml config Airflow Operators, it is now possible to run compute intensive Dataflow jobs. Check example_dataflow_template for an example on how to implement a Dataflow pipeline.

    opened by dinigo 0
Releases(v0.3.0)
  • v0.3.0(Aug 17, 2021)

    Features

    • Support for "generic" airflow operators: you can now use regular python operators as part of your config files.

    • Support for “dbt docs” command to generate documentation for all dbt tasks: Users can now add “docs generate” as a target in their DOP configuration and additionally specify a GCS bucket with the --bucket and --bucket-path options where documents are copied to.

    • Serve dbt docs: Documents generated by dbt can be served as a web page by deploying the provided app on GAE. Note that deploying is an additional step that needs to be carried out after docs have been generated. See infrastructure/dbt-docs/README.md for details.

    • dbt tasks artifacts run_results created by dbt tasks saved to BigQuery: This json file contains information on completed dbt invocations and is saved in the BQ table “run_results” for analysis and debugging.

    • Add support for Airflow v1.10.14 and v1.10.15 local environments: Users can specify which version they want to use by setting the AIRFLOW_VERSION environment variable.

    • Pre-commit linters: added pre-commit hooks to ensure python, yaml and some support for plain text file consistency in formatting and style throughout DOP codebase.

    Changes

    • Ensure DAGs using the same DBT project do not run concurrently: Safety feature to safely allow selective execution of workflows by calling specific commands or tags (e.g. dbt run --m) within a single dbt project. This avoids creating inter-dependant workflows to avoid overriding each other's artifacts, since they will share the same target location (within the dbt container).

    • Test time-partitioning: Time-partitioning of datetime type properly validated as part of schema validation.

    • Use Python 3.7 and dbt 0.19.1 in Composer K8s Operator

    • Add Dataflow example task: with the introduction of "regular" in the yaml config Airflow Operators, it is now possible to run compute intensive Dataflow jobs. Check example_dataflow_template for an example on how to implement a Dataflow pipeline.

    Source code(tar.gz)
    Source code(zip)
  • v0.2.0(Mar 30, 2021)

Owner
Datatonic
We accelerate business impact through Machine Learning and Analytics
Datatonic
Pylexa - Artificial Assistant made with Python

Pylexa - Artificial Assistant made with Python Alexa is a famous artificial assistant used massively across the world. It is a substitute of Alexa whi

\_PROTIK_/ 4 Nov 03, 2021
Calculator in command line using python programming language

Calculator in command line using python programming language University of the People Python fundamental Chapter 5 Conditionals and recursion The main

mark sikaundi 3 Dec 09, 2021
This is the course project of AI3602: Data Mining of SJTU

This is the course project of AI3602: Data Mining of SJTU. Group Members include Jinghao Feng, Mingyang Jiang and Wenzhong Zheng.

2 Jan 13, 2022
1 May 12, 2022
Python tools for experimenting with differentiable intonation cost measures

Differentiable Intonation Tools The Differentiable Intonation Tools (dit) are a collection of Python functions to analyze the intonation in multitrack

Simon Schwär 2 Mar 27, 2022
Explore related sequences in the OEIS

OEIS explorer This is a tool for exploring two different kinds of relationships between sequences in the OEIS: mentions (links) of other sequences on

Alex Hall 6 Mar 15, 2022
Minterpy - Multidimensional interpolation in Python.

minterpy is an open-source Python package for a multivariate generalization of the classical Newton and Lagrange interpolation schemes as well as related tasks.

Center for Advanced Systems Understanding 18 Jan 06, 2023
Svg-turtle - Use the Python turtle to write SVG files

SaVaGe Turtle Use the Python turtle to write SVG files If you're using the Pytho

Don Kirkby 7 Dec 21, 2022
Library support get vocabulary from MEM

Features: Support scraping the courses in MEM to take the vocabulary Translate the words to your own language Get the IPA for the English course Insta

Joseph Quang 4 Aug 13, 2022
Lightweight and Modern kernel for VK Bots

This is the kernel for creating VK Bots written in Python 3.9

Yrvijo 4 Nov 21, 2021
Collection of tools to be more productive in your work environment and to avoid certain repetitive tasks. 💛💙💚

Collection of tools to be more productive in your work environment and to avoid certain repetitive tasks. 💛💙💚

Raja Rakotonirina 2 Jan 10, 2022
Terminal compatible with ansi-bbs. Meant to be a prototype, but published because why not.

pybbsterm: Terminal emulator for calling BBSs. Use cases (non-exhaustive) Explore terminal protocols. Connect to BBSs. Highlights Python 3.8+ code. Bu

Roc Vallès i Domènech 9 Apr 29, 2022
🌌A Python library to exhaustively enumerate a combinatorial space represented by a function

exhaust A Python library to exhaustively enumerate a combinatorial space represented by a function. The API is modelled after Python's random module a

Maik Riechert 1 Dec 05, 2021
sumCulator Это калькулятор, который умеет складывать 2 числа.

sumCulator Это калькулятор, который умеет складывать 2 числа. Но есть условия: Эти 2 числа не могут быть отрицательными (всё-таки это вычитание, а не

0 Jul 12, 2022
Xkcd.py - Script to generate wallpapers based on XKCD comics

xkcd.py Script to generate wallpapers based on XKCD comics Usage python3 xkcd.py

Gideon Wolfe 11 Sep 06, 2022
This is where I learn machine learning

This is where I learn machine learning🤷‍ This means that this repo covers no specific topic of machine learning or a project - I work in here when I want to learn/try something

Wilhelm Berghammer 47 Nov 16, 2022
Demo of a WAM Prolog implementation in Python

Prol: WAM demo This is a simplified Warren Abstract Machine (WAM) implementation for Prolog, that showcases the main instructions, compiling, register

Bruno Kim Medeiros Cesar 62 Dec 26, 2022
Toppr Os Auto Class Joiner

Toppr Os Auto Class Joiner Toppr os is a irritating platform to work with especially for students it takes a while and is problematic most of the time

1 Dec 18, 2021
PKU team for 2021 project 'Guangchangwu detection'.

PKU team for 2021 project 'Guangchangwu detection'.

Helin Wang 3 Feb 21, 2022