A benchmark of data-centric tasks from across the machine learning lifecycle.

Getting Started | What is dcbench? | Docs | Contributing | Website | About

⚡️ Quickstart

pip install dcbench

Optional: some parts of dcbench rely on optional dependencies. If you know which optional dependencies you'd like to install, you can do so using something like pip install dcbench[dev] instead. See setup.py for a full list of optional dependencies.

Installing from dev: pip install "dcbench[dev] @ git+https://github.com/data-centric-ai/dcbench@main"

Using a Jupyter notebook or some other interactive environment, you can import the library and explore the data-centric problems in the benchmark:

import dcbench
dcbench.tasks

To learn more, follow the walkthrough in the docs.
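
For example, here is a minimal sketch of exploring one task and pulling down a problem's artifacts. The task key "slice_discovery", the artifact name "test_predictions", and the local cache location are taken from elsewhere on this page; the lookup-by-id pattern, the problems attribute, and the problem id below are assumptions made for illustration, so treat the walkthrough in the docs as the authoritative reference.

import dcbench

# List the available data-centric tasks (slice discovery, minimal data
# cleaning, minimal data selection, ...).
dcbench.tasks

# Assumption: tasks can be looked up by id and expose their problems.
task = dcbench.tasks["slice_discovery"]
task.problems

# Assumption: indexing a problem by artifact name downloads the artifact
# to the local cache (e.g. ~/.dcbench) and loads it into memory.
problem = task.problems["p_118919"]  # hypothetical problem id
predictions = problem["test_predictions"]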

💡 What is dcbench?

This benchmark evaluates the steps in your machine learning workflow beyond model training and tuning, such as feature cleaning, slice discovery, and coreset selection. We call these “data-centric” tasks because they're focused on exploring and manipulating data, not training models. dcbench supports a growing list of them, currently including minimal data cleaning (budgetclean), slice discovery, and minimal data selection (minidata).

dcbench includes tasks that look very different from one another: the inputs and outputs of the slice discovery task are not the same as those of the minimal data cleaning task. However, we think it is important that researchers and practitioners be able to run evaluations on data-centric tasks across the ML lifecycle without having to learn a bunch of different APIs or rewrite evaluation scripts.

So, dcbench is designed to be a common home for these diverse but related tasks. In dcbench, all of these tasks are structured in a similar manner, and they are supported by a common Python API that makes it easy to download data, run evaluations, and compare methods.
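
To make that concrete, here is a hedged sketch of the kind of workflow this common structure is meant to support, continuing from the Quickstart snippet above. The solve and evaluate calls are assumptions used for illustration and may not match the released method names or solution formats; see the docs for the real interface.

# Every problem bundles named input artifacts with an evaluation, so the
# same pattern applies across tasks.
for artifact_id in problem.artifacts:      # assumed: mapping of artifact names
    print(artifact_id)

# Assumed workflow (method names are illustrative, not guaranteed):
# solution = problem.solve(...)            # wrap your method's output as a solution
# metrics = problem.evaluate(solution)     # score it with the task's standard metric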

✉️ About

dcbench is being developed alongside the data-centric-ai benchmark. Reach out to Bojan Karlaš (karlasb [at] inf [dot] ethz [dot] ch) and Sabri Eyuboglu (eyuboglu [at] stanford [dot] edu) if you would like to get involved or contribute!

Comments
  •  No module named 'dcbench.tasks.budgetclean.cpclean'

    After installing dcbench in a Google Colab environment, the above error was thrown on import dcbench. Full error traceback:

    ---------------------------------------------------------------------------
    ModuleNotFoundError                       Traceback (most recent call last)
    <ipython-input-8-a1030f6d7ef9> in <module>()
          1 
    ----> 2 import dcbench
          3 dcbench.tasks
    
    2 frames
    /usr/local/lib/python3.7/dist-packages/dcbench/__init__.py in <module>()
         13 )
         14 from .config import config
    ---> 15 from .tasks.budgetclean import BudgetcleanProblem
         16 from .tasks.minidata import MiniDataProblem
         17 from .tasks.slice_discovery import SliceDiscoveryProblem
    
    /usr/local/lib/python3.7/dist-packages/dcbench/tasks/budgetclean/__init__.py in <module>()
          3 from ...common import Task
          4 from ...common.table import Table
    ----> 5 from .baselines import cp_clean, random_clean
          6 from .common import Preprocessor
          7 from .problem import BudgetcleanProblem, BudgetcleanSolution
    
    /usr/local/lib/python3.7/dist-packages/dcbench/tasks/budgetclean/baselines.py in <module>()
          6 from ...common.baseline import baseline
          7 from .common import Preprocessor
    ----> 8 from .cpclean.algorithm.select import entropy_expected
          9 from .cpclean.algorithm.sort_count import sort_count_after_clean_multi
         10 from .cpclean.clean import CPClean, Querier
    
    ModuleNotFoundError: No module named 'dcbench.tasks.budgetclean.cpclean'
    

    !pip install dcbench gave the following log:

    ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts. 
    flask 1.1.4 requires click<8.0,>=5.1, but you have click 8.0.3 which is incompatible.
    datascience 0.10.6 requires coverage==3.7.1, but you have coverage 6.2 which is incompatible.
    datascience 0.10.6 requires folium==0.2.1, but you have folium 0.8.3 which is incompatible.
    coveralls 0.5 requires coverage<3.999,>=3.6, but you have coverage 6.2 which is incompatible.
    Successfully installed SecretStorage-3.3.1 aiohttp-3.8.1 aiosignal-1.2.0 antlr4-python3-runtime-4.8 async-timeout-4.0.2 asynctest-0.13.0 black-21.12b0 cfgv-3.3.1 click-8.0.3 colorama-0.4.4 commonmark-0.9.1 coverage-6.2 cryptography-36.0.1 cytoolz-0.11.2 dataclasses-0.6 datasets-1.17.0 dcbench-0.0.4 distlib-0.3.4 docformatter-1.4 flake8-4.0.1 frozenlist-1.2.0 fsspec-2021.11.1 future-0.18.2 fuzzywuzzy-0.18.0 fvcore-0.1.5.post20211023 huggingface-hub-0.2.1 identify-2.4.1 importlib-metadata-4.2.0 iopath-0.1.9 isort-5.10.1 jeepney-0.7.1 jsonlines-3.0.0 keyring-23.4.0 livereload-2.6.3 markdown-3.3.4 mccabe-0.6.1 meerkat-ml-0.2.3 multidict-5.2.0 mypy-extensions-0.4.3 nbsphinx-0.8.8 nodeenv-1.6.0 omegaconf-2.1.1 parameterized-0.8.1 pathspec-0.9.0 pkginfo-1.8.2 platformdirs-2.4.1 pluggy-1.0.0 portalocker-2.3.2 pre-commit-2.16.0 progressbar-2.5 pyDeprecate-0.3.1 pycodestyle-2.8.0 pyflakes-2.4.0 pytest-6.2.5 pytest-cov-3.0.0 pytorch-lightning-1.5.7 pyyaml-6.0 readme-renderer-32.0 recommonmark-0.7.1 requests-toolbelt-0.9.1 rfc3986-1.5.0 sphinx-autobuild-2021.3.14 sphinx-rtd-theme-1.0.0 torchmetrics-0.6.2 twine-3.7.1 typed-ast-1.5.1 ujson-5.1.0 untokenize-0.1.1 virtualenv-20.12.1 xxhash-2.0.2 yacs-0.1.8 yarl-1.7.2
    WARNING: The following packages were previously imported in this runtime:
      [pydevd_plugins]
    You must restart the runtime in order to use newly installed versions.
    

    Python version: 3.7.12, platform: Linux-5.4.144+-x86_64-with-Ubuntu-18.04-bionic

    opened by mathav95raj
  • Slice discovery problem p_72411 misses files

    Hi,

    Thanks for this great tool!

    I'm loading slice discovery problems; however, problem p_72411 is missing files. Can you fix this SD problem?

    FileNotFoundError: [Errno 2] No such file or directory: '/home/user/.dcbench/slice_discovery/problem/artifacts/p_72411/test_predictions.mk/meta.yaml'
    
    opened by duguyue100
Releases: v-0.0.1-beta