Overview

redun

yet another redundant workflow engine

redun aims to be a more expressive and efficient workflow framework, built on top of the popular Python programming language. It takes the somewhat contrarian view that writing dataflows directly is unnecessarily restrictive, and that by doing so we lose abstractions we have come to rely on in most modern high-level languages (control flow, composability, recursion, higher-order functions, etc.). redun's key insight is that workflows can be expressed as lazy expressions that are then evaluated by a scheduler, which performs automatic parallelization, caching, and data provenance logging.

redun's key features are:

  • Workflows are defined by lazy expressions that when evaluated emit dynamic directed acyclic graphs (DAGs), enabling complex data flows.
  • Incremental computation that is reactive to both data changes as well as code changes.
  • Workflow tasks can be executed on a variety of compute backends (threads, processes, AWS Batch jobs, Spark jobs, etc.).
  • Data changes are detected for in-memory values as well as external data sources, such as files and object stores, using file hashing.
  • Code changes are detected by hashing individual Python functions and comparing against historical call graph recordings.
  • Past intermediate results are cached centrally and reused across workflows.
  • Past call graphs can be used as a data lineage record and can be queried for debugging and auditing.
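To make the caching features above more concrete, here is a minimal sketch of how a cache key could combine a hash of a task's code with hashes of its arguments. This is a toy model, not redun's implementation: the helper names (sha1, cache_key) and the example task source and file contents are all hypothetical, and it assumes plain string hashing stands in for both code and data hashing.

```python
import hashlib
from typing import List


def sha1(text: str) -> str:
    """Return a short hex digest of a string."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()[:8]


def cache_key(task_source: str, arg_hashes: List[str]) -> str:
    """Combine a hash of the task's code with the hashes of its arguments."""
    return sha1("|".join([sha1(task_source)] + list(arg_hashes)))


# Hypothetical task source and file contents, for illustration only.
source_v1 = "def compile(c_file): return run('gcc -c', c_file)"
source_v2 = "def compile(c_file): return run('gcc -O2 -c', c_file)"
lib_v1 = sha1("int add(int a, int b) { return a + b; }")
lib_v2 = sha1("int add(int a, int b) { return b + a; }")

key = cache_key(source_v1, [lib_v1])
same_again = cache_key(source_v1, [lib_v1])
after_data_change = cache_key(source_v1, [lib_v2])
after_code_change = cache_key(source_v2, [lib_v1])

print(key == same_again)         # True: nothing changed, so the result can be reused
print(key == after_data_change)  # False: the input data changed
print(key == after_code_change)  # False: the task code changed
```

Because the key depends on both inputs, a lookup misses whenever either the code or the data differs from what was recorded earlier.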

See the docs, tutorial, and influences for more.

About the name: The name "redun" is self-deprecating (there are A LOT of workflow engines), but it is also a reference to its original inspiration, the redo build system.

Install

pip install redun

See developing for more information on working with the code.

Postgres backend

To use Postgres as a recording backend, install redun with the postgres extra:

pip install redun[postgres]

The above assumes the following dependencies are installed:

  • pg_config (in the postgresql-devel package; on Ubuntu: apt-get install libpq-dev)
  • gcc (on Ubuntu or similar: sudo apt-get install gcc)

Use cases

redun's general approach to defining workflows makes it a good choice for implementing workflows for a wide variety of use cases.

Small taste

Here is a quick example of using redun for a familiar workflow, compiling a C program (full example). In general, any kind of data processing could be done within each task (e.g. reading and writing CSVs, DataFrames, databases, APIs).

# make.py

import os
from typing import Dict, List

from redun import task, File


redun_namespace = "redun.examples.compile"


@task()
def compile(c_file: File) -> File:
    """
    Compile one C file into an object file.
    """
    os.system(f"gcc -c {c_file.path}")
    return File(c_file.path.replace(".c", ".o"))


@task()
def link(prog_path: str, o_files: List[File]) -> File:
    """
    Link several object files together into one program.
    """
    o_files=" ".join(o_file.path for o_file in o_files)
    os.system(f"gcc -o {prog_path} {o_files}")
    return File(prog_path)


@task()
def make_prog(prog_path: str, c_files: List[File]) -> File:
    """
    Compile one program from its source C files.
    """
    o_files = [
        compile(c_file)
        for c_file in c_files
    ]
    prog_file = link(prog_path, o_files)
    return prog_file


# Definition of programs and their source C files.
files = {
    "prog": [
        File("prog.c"),
        File("lib.c"),
    ],
    "prog2": [
        File("prog2.c"),
        File("lib.c"),
    ],
}


@task()
def make(files: Dict[str, List[File]] = files) -> List[File]:
    """
    Top-level task for compiling all the programs in the project.
    """
    progs = [
        make_prog(prog_path, c_files)
        for prog_path, c_files in files.items()
    ]
    return progs

Notice that, besides the @task decorator, the code follows typical Python conventions and is organized like a sequential program.

We can run the workflow using the redun run command:

redun run make.py make

[redun] redun :: version 0.4.15
[redun] config dir: /Users/rasmus/projects/redun/examples/compile/.redun
[redun] Upgrading db from version -1.0 to 2.0...
[redun] Start Execution 69c40fe5-c081-4ca6-b232-e56a0a679d42:  redun run make.py make
[redun] Run    Job 72bdb973:  redun.examples.compile.make(files={'prog': [File(path=prog.c, hash=dfa3aba7), File(path=lib.c, hash=a2e6cbd9)], 'prog2': [File(path=prog2.c, hash=c748e4c7), File(path=lib.c, hash=a2e6cbd9)]}) on default
[redun] Run    Job 096be12b:  redun.examples.compile.make_prog(prog_path='prog', c_files=[File(path=prog.c, hash=dfa3aba7), File(path=lib.c, hash=a2e6cbd9)]) on default
[redun] Run    Job 32ed5cf8:  redun.examples.compile.make_prog(prog_path='prog2', c_files=[File(path=prog2.c, hash=c748e4c7), File(path=lib.c, hash=a2e6cbd9)]) on default
[redun] Run    Job dfdd2ee2:  redun.examples.compile.compile(c_file=File(path=prog.c, hash=dfa3aba7)) on default
[redun] Run    Job 225f924d:  redun.examples.compile.compile(c_file=File(path=lib.c, hash=a2e6cbd9)) on default
[redun] Run    Job 3f9ea7ae:  redun.examples.compile.compile(c_file=File(path=prog2.c, hash=c748e4c7)) on default
[redun] Run    Job a8b21ec0:  redun.examples.compile.link(prog_path='prog', o_files=[File(path=prog.o, hash=4934098e), File(path=lib.o, hash=7caa7f9c)]) on default
[redun] Run    Job 5707a358:  redun.examples.compile.link(prog_path='prog2', o_files=[File(path=prog2.o, hash=cd0b6b7e), File(path=lib.o, hash=7caa7f9c)]) on default
[redun]
[redun] | JOB STATUS 2021/06/18 10:34:29
[redun] | TASK                             PENDING RUNNING  FAILED  CACHED    DONE   TOTAL
[redun] |
[redun] | ALL                                    0       0       0       0       8       8
[redun] | redun.examples.compile.compile         0       0       0       0       3       3
[redun] | redun.examples.compile.link            0       0       0       0       2       2
[redun] | redun.examples.compile.make            0       0       0       0       1       1
[redun] | redun.examples.compile.make_prog       0       0       0       0       2       2
[redun]
[File(path=prog, hash=a8d14a5e), File(path=prog2, hash=04bfff2f)]

This should have taken three C source files (lib.c, prog.c, and prog2.c), compiled them to three object files (lib.o, prog.o, prog2.o), and then linked them into two binaries (prog and prog2). Specifically, redun automatically determined the dataflow DAG and performed the compiling and linking steps in separate threads.

Using the redun log command, we can see the full job tree of the most recent execution (denoted -):

redun log -

Exec 69c40fe5-c081-4ca6-b232-e56a0a679d42 [ DONE ] 2021-06-18 10:34:28:  run make.py make
Duration: 0:00:01.47

Jobs: 8 (DONE: 8, CACHED: 0, FAILED: 0)
--------------------------------------------------------------------------------
Job 72bdb973 [ DONE ] 2021-06-18 10:34:28:  redun.examples.compile.make(files={'prog': [File(path=prog.c, hash=dfa3aba7), File(path=lib.c, hash=a2e6cbd9)], 'prog2': [File(path=prog2.c, hash=c748e4c7), Fil
  Job 096be12b [ DONE ] 2021-06-18 10:34:28:  redun.examples.compile.make_prog('prog', [File(path=prog.c, hash=dfa3aba7), File(path=lib.c, hash=a2e6cbd9)])
    Job dfdd2ee2 [ DONE ] 2021-06-18 10:34:28:  redun.examples.compile.compile(File(path=prog.c, hash=dfa3aba7))
    Job 225f924d [ DONE ] 2021-06-18 10:34:28:  redun.examples.compile.compile(File(path=lib.c, hash=a2e6cbd9))
    Job a8b21ec0 [ DONE ] 2021-06-18 10:34:28:  redun.examples.compile.link('prog', [File(path=prog.o, hash=4934098e), File(path=lib.o, hash=7caa7f9c)])
  Job 32ed5cf8 [ DONE ] 2021-06-18 10:34:28:  redun.examples.compile.make_prog('prog2', [File(path=prog2.c, hash=c748e4c7), File(path=lib.c, hash=a2e6cbd9)])
    Job 3f9ea7ae [ DONE ] 2021-06-18 10:34:28:  redun.examples.compile.compile(File(path=prog2.c, hash=c748e4c7))
    Job 5707a358 [ DONE ] 2021-06-18 10:34:29:  redun.examples.compile.link('prog2', [File(path=prog2.o, hash=cd0b6b7e), File(path=lib.o, hash=7caa7f9c)])

Notice that redun automatically detected that lib.c only needed to be compiled once and that its result can be reused (a form of common subexpression elimination).
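The reuse of the lib.c result can be pictured with a small memoization sketch. It is only an analogy under simplifying assumptions: the names run_once, compile_c, and calls are hypothetical, and the cache is keyed on the function name and arguments, whereas redun keys on hashes of the task and its arguments.

```python
from typing import Callable, Dict, List, Tuple

calls: List[str] = []                      # Records which calls actually ran.
_memo: Dict[Tuple[str, tuple], str] = {}   # (task name, args) -> result


def run_once(fn: Callable[..., str], *args: str) -> str:
    """Run fn(*args) only if the same call has not been seen before."""
    key = (fn.__name__, args)
    if key not in _memo:
        _memo[key] = fn(*args)
    return _memo[key]


def compile_c(c_path: str) -> str:
    """Stand-in for compiling: records the call and returns the object path."""
    calls.append(f"compile {c_path}")
    return c_path.replace(".c", ".o")


def make_prog(prog: str, c_paths: List[str]) -> List[str]:
    return [run_once(compile_c, p) for p in c_paths]


make_prog("prog", ["prog.c", "lib.c"])
make_prog("prog2", ["prog2.c", "lib.c"])
print(calls)  # ['compile prog.c', 'compile lib.c', 'compile prog2.c']
```

Both programs ask for lib.c to be compiled, but the second request is answered from the memo table, so only three compile calls run.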

Using the --file option, we can see all files (or URLs) that were read (r) or written (w) by the workflow:

redun log --file

File 2b6a7ce0 2021-06-18 11:41:42 r  lib.c
File d90885ad 2021-06-18 11:41:42 rw lib.o
File 2f43c23c 2021-06-18 11:41:42 w  prog
File dfa3aba7 2021-06-18 10:34:28 r  prog.c
File 4934098e 2021-06-18 10:34:28 rw prog.o
File b4537ad7 2021-06-18 11:41:42 w  prog2
File c748e4c7 2021-06-18 10:34:28 r  prog2.c
File cd0b6b7e 2021-06-18 10:34:28 rw prog2.o

We can also look at the provenance of a single file, such as the binary prog:

redun log prog

File 2f43c23c 2021-06-18 11:41:42 w  prog
Produced by Job a8b21ec0

  Job a8b21ec0-e60b-4486-bcf4-4422be265608 [ DONE ] 2021-06-18 11:41:42:  redun.examples.compile.link('prog', [File(path=prog.o, hash=4934098e), File(path=lib.o, hash=d90885ad)])
  Traceback: Exec 4a2b624d > (1 Job) > Job 2f8b4b5f make_prog > Job a8b21ec0 link
  Duration: 0:00:00.24

    CallNode 6c56c8d472dc1d07cfd2634893043130b401dc84 redun.examples.compile.link
      Args:   'prog', [File(path=prog.o, hash=4934098e), File(path=lib.o, hash=d90885ad)]
      Result: File(path=prog, hash=2f43c23c)

    Task a20ef6dc2ab4ed89869514707f94fe18c15f8f66 redun.examples.compile.link

      def link(prog_path: str, o_files: List[File]) -> File:
          """
          Link several object files together into one program.
          """
          o_files=" ".join(o_file.path for o_file in o_files)
          os.system(f"gcc -o {prog_path} {o_files}")
          return File(prog_path)


    Upstream dataflow:

      result = File(path=prog, hash=2f43c23c)

      result <-- <6c56c8d4> link(prog_path, o_files)
        prog_path = 'prog'
        o_files   = [File(path=prog.o, hash=4934098e), File(path=lib.o, hash=d90885ad)]

      prog_path <-- argument of make_prog(prog_path, c_files)
                <-- origin

      o_files <-- derives from
        compile_result   = File(path=lib.o, hash=d90885ad)
        compile_result_2 = <4934098e> File(path=prog.o, hash=4934098e)

      compile_result <-- <45054a8f> compile(c_file)
        c_file = <2b6a7ce0> File(path=lib.c, hash=2b6a7ce0)

      c_file <-- argument of make_prog(prog_path, c_files)
             <-- argument of make(files)
             <-- origin

      compile_result_2 <-- <8d85cebc> compile(c_file_2)
        c_file_2 = File(path=prog.c, hash=dfa3aba7)

      c_file_2 <-- argument of <74cceb4e> make_prog(prog_path, c_files)
               <-- argument of <45400ab5> make(files)
               <-- origin

This output shows the original link task source code responsible for creating the program prog, as well as the full derivation, denoted "upstream dataflow". See the full example for a deeper explanation of this output. To understand more about the data structure that powers these kinds of queries, see call graphs.
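As a rough intuition for how an upstream dataflow query can work, the sketch below walks a tiny record of past calls backwards from a result to its origins. The call_records dictionary and the upstream function are hypothetical and far simpler than redun's call graph, which records hashes of tasks, arguments, and results.

```python
from typing import Dict, List, Tuple

# Hypothetical record of past calls: result -> (task name, arguments).
call_records: Dict[str, Tuple[str, List[str]]] = {
    "prog": ("link", ["prog.o", "lib.o"]),
    "prog.o": ("compile", ["prog.c"]),
    "lib.o": ("compile", ["lib.c"]),
}


def upstream(value: str, depth: int = 0) -> List[str]:
    """Return indented lines describing how `value` was derived."""
    indent = "  " * depth
    if value not in call_records:
        # No recorded call produced this value, so it is an original input.
        return [f"{indent}{value} <-- origin"]
    task_name, args = call_records[value]
    lines = [f"{indent}{value} <-- {task_name}({', '.join(args)})"]
    for arg in args:
        lines.extend(upstream(arg, depth + 1))
    return lines


lineage = upstream("prog")
print("\n".join(lineage))
```

Each value is either traced to the call that produced it or reported as an origin, which mirrors the "<-- origin" markers in the output above.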

We can change one of the input files, such as lib.c, and rerun the workflow. Due to redun's automatic incremental compute, only the minimal set of tasks is rerun:

redun run make.py make

[redun] redun :: version 0.4.15
[redun] config dir: /Users/rasmus/projects/redun/examples/compile/.redun
[redun] Start Execution 4a2b624d-b6c7-41cb-acca-ec440c2434db:  redun run make.py make
[redun] Run    Job 84d14769:  redun.examples.compile.make(files={'prog': [File(path=prog.c, hash=dfa3aba7), File(path=lib.c, hash=2b6a7ce0)], 'prog2': [File(path=prog2.c, hash=c748e4c7), File(path=lib.c, hash=2b6a7ce0)]}) on default
[redun] Run    Job 2f8b4b5f:  redun.examples.compile.make_prog(prog_path='prog', c_files=[File(path=prog.c, hash=dfa3aba7), File(path=lib.c, hash=2b6a7ce0)]) on default
[redun] Run    Job 4ae4eaf6:  redun.examples.compile.make_prog(prog_path='prog2', c_files=[File(path=prog2.c, hash=c748e4c7), File(path=lib.c, hash=2b6a7ce0)]) on default
[redun] Cached Job 049a0006:  redun.examples.compile.compile(c_file=File(path=prog.c, hash=dfa3aba7)) (eval_hash=434cbbfe)
[redun] Run    Job 0f8df953:  redun.examples.compile.compile(c_file=File(path=lib.c, hash=2b6a7ce0)) on default
[redun] Cached Job 98d24081:  redun.examples.compile.compile(c_file=File(path=prog2.c, hash=c748e4c7)) (eval_hash=96ab0a2b)
[redun] Run    Job 8c95f048:  redun.examples.compile.link(prog_path='prog', o_files=[File(path=prog.o, hash=4934098e), File(path=lib.o, hash=d90885ad)]) on default
[redun] Run    Job 9006bd19:  redun.examples.compile.link(prog_path='prog2', o_files=[File(path=prog2.o, hash=cd0b6b7e), File(path=lib.o, hash=d90885ad)]) on default
[redun]
[redun] | JOB STATUS 2021/06/18 11:41:43
[redun] | TASK                             PENDING RUNNING  FAILED  CACHED    DONE   TOTAL
[redun] |
[redun] | ALL                                    0       0       0       2       6       8
[redun] | redun.examples.compile.compile         0       0       0       2       1       3
[redun] | redun.examples.compile.link            0       0       0       0       2       2
[redun] | redun.examples.compile.make            0       0       0       0       1       1
[redun] | redun.examples.compile.make_prog       0       0       0       0       2       2
[redun]
[File(path=prog, hash=2f43c23c), File(path=prog2, hash=b4537ad7)]

Notice that two of the compile jobs are cached (prog.c and prog2.c), but compiling the library lib.c and the downstream link steps correctly rerun.
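The way a change to one input propagates to downstream steps can be simulated in a few lines. This is a simplified model rather than redun's scheduler: the cache dictionary, the digest and call helpers, and the source strings are hypothetical, and a derived hash stands in for the real compile and link work.

```python
import hashlib
from typing import Dict, List

cache: Dict[str, str] = {}  # cache key -> result hash, kept across runs


def digest(*parts: str) -> str:
    """Short hash of the given strings."""
    return hashlib.sha1("|".join(parts).encode("utf-8")).hexdigest()[:8]


def call(task_name: str, inputs: List[str], status: Dict[str, str], label: str) -> str:
    """Return the task's result hash, reusing the cache when inputs are unchanged."""
    key = digest(task_name, *inputs)
    if key in cache:
        status[label] = "CACHED"
    else:
        status[label] = "RUN"
        cache[key] = digest("result", key)  # stand-in for doing the real work
    return cache[key]


def run_workflow(sources: Dict[str, str]) -> Dict[str, str]:
    """Compile every source, then link both programs; report each job's status."""
    status: Dict[str, str] = {}
    objs = {
        name: call("compile", [digest(text)], status, f"compile {name}")
        for name, text in sources.items()
    }
    call("link", [objs["prog.c"], objs["lib.c"]], status, "link prog")
    call("link", [objs["prog2.c"], objs["lib.c"]], status, "link prog2")
    return status


sources = {"prog.c": "main v1", "prog2.c": "main2 v1", "lib.c": "lib v1"}
first = run_workflow(sources)
sources["lib.c"] = "lib v2"  # edit one input file
second = run_workflow(sources)
print(second)
```

In the second run the two unchanged compile steps hit the cache, while the lib.c compile produces a new result hash, which in turn changes the inputs of both link steps and forces them to rerun.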

Check out the examples for more example workflows and features of redun. Also, see the design notes for more information on redun's design.

Mixed compute backends

In the above example, each task ran in its own thread. However, more generally each task can run in its own process, Docker container, AWS Batch job, or Spark job. With minimal configuration, users can lightly annotate where they would like each task to run. redun will automatically handle the data and code movement as well as backend scheduling:

@task(executor="process")
def a_process_task(a):
    # This task runs in its own process.
    b = a_batch_task(a)
    c = a_spark_task(b)
    return c

@task(executor="batch", memory=4, vcpus=5)
def a_batch_task(a):
    # This task runs in its own AWS Batch job.
    ...  # task body elided

@task(executor="spark")
def a_spark_task(b):
    # This task runs in its own Spark job.
    sc = get_spark_context()
    # ...

See the executor documentation for more.
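The idea of annotating a task with the place it should run can be illustrated with a toy dispatcher. Everything here is an assumption made for illustration: toy_task, EXECUTORS, and the "inline" and "thread" backends are hypothetical names and do not reflect how redun's executors are implemented or configured.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict

_pool = ThreadPoolExecutor(max_workers=2)

# Hypothetical backends: each takes a function and arguments and returns the result.
EXECUTORS: Dict[str, Callable[..., Any]] = {
    "inline": lambda fn, *args: fn(*args),
    "thread": lambda fn, *args: _pool.submit(fn, *args).result(),
}


def toy_task(executor: str = "inline") -> Callable:
    """Decorator that routes each call through the named backend."""
    def decorate(fn: Callable) -> Callable:
        def wrapper(*args: Any) -> Any:
            return EXECUTORS[executor](fn, *args)
        wrapper.executor = executor  # remember the annotation for inspection
        return wrapper
    return decorate


@toy_task(executor="thread")
def double(x: int) -> int:
    return 2 * x


@toy_task(executor="inline")
def pipeline(x: int) -> int:
    # The caller does not need to know where double() actually runs.
    return double(x) + 1


print(pipeline(20), double.executor)  # 41 thread
```

The task body stays ordinary Python, and only the decorator argument decides which backend handles the call.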

What's the trick?

How did redun automatically perform parallel compute, caching, and data provenance in the example above? The trick is that redun builds up an expression graph representing the workflow and evaluates the expressions using graph reduction. In the workflow above, for example, each task call first produced a lazy expression, which the scheduler then reduced step by step to a concrete value.

For a more in-depth walk-through, see the scheduler tutorial.
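For a feel of what evaluating lazy expressions by reduction means, here is a deliberately small sketch. It is not redun's scheduler: Expr, lazy, and reduce_expr are hypothetical names, evaluation is sequential, and there is no caching or provenance recording. It only shows that calling a task can build an expression and that reduction continues until no expression remains.

```python
from typing import Any, Callable


class Expr:
    """A deferred call: a function plus arguments that may themselves be Exprs."""

    def __init__(self, fn: Callable, args: tuple):
        self.fn, self.args = fn, args


def lazy(fn: Callable) -> Callable:
    """Calling the decorated function builds an Expr instead of running it."""
    def build(*args: Any) -> Expr:
        return Expr(fn, args)
    return build


def reduce_expr(value: Any) -> Any:
    """Reduce until no Expr remains, descending into lists."""
    if isinstance(value, list):
        return [reduce_expr(v) for v in value]
    if isinstance(value, Expr):
        args = [reduce_expr(a) for a in value.args]
        return reduce_expr(value.fn(*args))  # the result may contain new Exprs
    return value


@lazy
def compile_c(c_file: str) -> str:
    return c_file.replace(".c", ".o")


@lazy
def link(prog: str, o_files: list) -> str:
    return f"{prog} <- {' + '.join(o_files)}"


@lazy
def make_prog(prog: str, c_files: list) -> Expr:
    return link(prog, [compile_c(c) for c in c_files])


expr = make_prog("prog", ["prog.c", "lib.c"])
print(type(expr).__name__)  # Expr
print(reduce_expr(expr))    # prog <- prog.o + lib.o
```

Since the two compile_c expressions do not depend on each other, an evaluator is free to reduce them in parallel, which is the kind of opportunity the text above attributes to redun's scheduler.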

Why not another workflow engine?

redun focuses on making multi-domain scientific pipelines easy to develop and deploy. The automatic parallelism, caching, code and data reactivity, and data provenance features make it a great fit for such work. However, redun does not attempt to solve all possible workflow problems, so it's perfectly reasonable to supplement it with other tools. For example, while redun provides a very expressive way to define task parallelism, it does not attempt to perform the kind of fine-grained data parallelism more commonly provided by Spark or Dask. Fortunately, redun does not perform any "dirty tricks" (e.g. complex static analysis or call stack manipulation), and so we have found it possible to safely combine redun with other frameworks (e.g. pyspark, pytorch, Dask, etc.) to achieve the benefits of each tool.

Lastly, redun does not provide its own compute cluster, but instead builds upon other systems that do, such as cloud provider services for batch Docker jobs or Spark jobs.

For more details on how redun compares to other related ideas, see the influences section.
