PClean: A Domain-Specific Probabilistic Programming Language for Bayesian Data Cleaning

Last update: Dec 27, 2022

Warning: This is a rapidly evolving research prototype.

PClean was created at the MIT Probabilistic Computing Project.

If you use PClean in your research, please cite our 2021 AISTATS paper:

PClean: Bayesian Data Cleaning at Scale with Domain-Specific Probabilistic Programming. Lew, A. K.; Agrawal, M.; Sontag, D.; and Mansinghka, V. K. (2021, March). In International Conference on Artificial Intelligence and Statistics (pp. 1927-1935). PMLR.

Using PClean

To use PClean, create a Julia file with the following structure:

using PClean
using DataFrames: DataFrame
import CSV

# Load data
data = CSV.File(filepath) |> DataFrame

# Define PClean model
PClean.@model MyModel begin
    @class ClassName1 begin
        ...
    end

    ...
    
    @class ClassNameN begin
        ...
    end
end

# Align the column names of the CSV with variables in the model.
# Each row has the form `ColumnName CleanVariable DirtyVariable`.
# If a column is not modeled as corrupted, omit the DirtyVariable
# and write `ColumnName ModelVariable`.
query = @query MyModel.ClassNameN [
  HospitalName hosp.name             observed_hosp_name
  Condition    metric.condition.desc observed_condition
  ...
]

# Configure observed dataset
observations = [ObservedDataset(query, data)]

# Configuration
config = PClean.InferenceConfig(1, 2; use_mh_instead_of_pg=true)

# SMC initialization
state = initialize_trace(observations, config)

# Rejuvenation sweeps
run_inference!(state, config)

# Evaluate accuracy, if ground truth is available.
# `clean_filepath` is a placeholder for the path to the ground-truth CSV.
ground_truth = CSV.File(clean_filepath) |> DataFrame
results = evaluate_accuracy(data, ground_truth, state, query)

# Can print results.f1, results.precision, results.accuracy, etc.
println(results)

# Even without ground truth, can save the entire latent database to CSV files:
PClean.save_results(dir, dataset_name, state, observations)

Then, from the root directory of the PClean repository (the directory containing this README), run the Julia file:

JULIA_PROJECT=. julia my_file.jl
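
For intuition about the precision and F1 scores mentioned in the listing above, the following is a minimal sketch of cell-level repair metrics under one common definition: a repair is any cell the cleaner changed, and an error is any cell where the dirty data disagrees with the ground truth. This is not PClean's `evaluate_accuracy`, whose details may differ; the function `repair_metrics` and the toy tables are hypothetical and use only Julia's standard library.

```julia
# Hypothetical helper, NOT part of PClean: cell-level repair metrics.
# Each table is a vector of rows; each row maps a column name to a cell value.
function repair_metrics(dirty, cleaned, truth)
    length(dirty) == length(cleaned) == length(truth) ||
        throw(ArgumentError("tables must have the same number of rows"))
    repairs = 0   # cells the cleaner changed
    errors = 0    # cells that were wrong in the dirty data
    correct = 0   # changed cells that now match the ground truth
    for (d, c, t) in zip(dirty, cleaned, truth)
        for col in keys(t)
            was_error = d[col] != t[col]
            was_changed = c[col] != d[col]
            errors += was_error
            repairs += was_changed
            correct += (was_changed && c[col] == t[col])
        end
    end
    # Conventions for empty denominators are an assumption of this sketch.
    prec = repairs == 0 ? 1.0 : correct / repairs
    rec = errors == 0 ? 1.0 : correct / errors
    f1 = prec + rec == 0 ? 0.0 : 2 * prec * rec / (prec + rec)
    return (precision = prec, recall = rec, f1 = f1)
end

# Toy data: two typos in the dirty table; the cleaner fixes one typo,
# misses the other, and wrongly changes one state.
dirty = [
    Dict("city" => "Bostn", "state" => "MA"),
    Dict("city" => "Cambridge", "state" => "MA"),
    Dict("city" => "Somervile", "state" => "MA"),
]
cleaned = [
    Dict("city" => "Boston", "state" => "MA"),
    Dict("city" => "Cambridge", "state" => "NY"),
    Dict("city" => "Somervile", "state" => "MA"),
]
truth = [
    Dict("city" => "Boston", "state" => "MA"),
    Dict("city" => "Cambridge", "state" => "MA"),
    Dict("city" => "Somerville", "state" => "MA"),
]

metrics = repair_metrics(dirty, cleaned, truth)
println(metrics)
```

On this toy data, one of the two repairs is correct and one of the two errors is fixed, so precision, recall, and F1 all come out to 0.5.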

To learn to write a PClean model, see our paper, but note the surface syntax changes described below.

Differences from the paper

Because PClean is implemented as a DSL embedded in Julia, its surface syntax differs in several ways from the stand-alone syntax presented in our paper:

(1) Instead of latent class C ... end, we write @class C begin ... end.

(2) Instead of subproblem begin ... end, inference hints are given using ordinary Julia begin ... end blocks.

(3) Instead of parameter x ~ d(...), we use @learned x :: D{...}. The set of distributions D for parameters is somewhat restricted.

(4) Instead of x ~ d(...) preferring E, we write x ~ d(..., E).

(5) Instead of observe x as y, ... from C, we write @query ModelName.C [x y; ...]. Clauses of the form x z y are also allowed; they tell PClean that the model variable C.z represents a clean version of column x, whose observed (dirty) version is modeled as C.y. This information is used when automatically reconstructing a clean, flat dataset.

The names of built-in distributions may also be different, e.g. AddTypos instead of typos, and ProportionsParameter instead of dirichlet.
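
To make the five differences concrete, the sketch below puts them together in one small model. It is illustrative only: the names `CityModel`, `City`, `Record`, `city_names`, `states`, and `observed_city` are hypothetical, and the argument orders assumed for `StringPrior` (minimum length, maximum length, preferred values), `ChooseProportionally` (options, learned proportions), and `AddTypos` (clean string, maximum number of typos) should be checked against the examples in the repository before use.

```julia
using PClean

# Hypothetical candidate values; in practice these would be collected from the data.
city_names = ["Boston", "Cambridge", "Somerville"]
states = ["MA", "NH"]

PClean.@model CityModel begin
    # (1) `@class C begin ... end` replaces `latent class C ... end`.
    @class City begin
        # (3) `@learned x :: D{...}` replaces `parameter x ~ d(...)`.
        @learned state_proportions::ProportionsParameter
        # (4) The trailing argument plays the role of `preferring E`
        #     (assumed signature: min length, max length, preferred values).
        name ~ StringPrior(3, 30, city_names)
        state ~ ChooseProportionally(states, state_proportions)
    end

    @class Record begin
        # (2) An ordinary `begin ... end` block replaces `subproblem begin ... end`.
        begin
            city ~ City
            # Assumed signature: clean string, maximum number of typos.
            observed_city ~ AddTypos(city.name, 2)
        end
    end
end

# (5) `@query` replaces `observe ... from C`. The first row gives a column,
#     its clean variable, and its dirty variable; the second row has no
#     dirty variable because the column is not modeled as corrupted.
query = @query CityModel.Record [
    CityName city.name  observed_city
    State    city.state
]
```

The resulting `query` would then be paired with a table via `ObservedDataset(query, data)`, as in the usage listing above.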
