BEAMetrics: Benchmark to Evaluate Automatic Metrics in Natural Language Generation

Overview

BEAMetrics: Benchmark to Evaluate Automatic Metrics in Natural Language Generation

Installing The Dependencies

$ conda create --name beametrics python>=3.8
$ conda activate beametrics

WARNING: You need to install, before any package, correct version of pytorch linked to your cuda version.

(beametrics) $ conda install pytorch cudatoolkit=10.1 -c pytorch

Install BEAMetrics:

(beametrics) $ cd BEAMetrics
(beametrics) $ pip install -e .

Install Nubia metric (not on PyPI, 16/08/2021):

(beametrics) git clone https://github.com/wl-research/nubia.git
(beametrics) pip install -r requirements.txt

Alternatively, you can remove nubia from _DEFAULT_METRIC_NAMES in metrics.metric_reporter.

Reproducing the results

First you need to get the processed files, which include the metric scores. You can do that either by simply downloading the processed data (see Section Download Data), or by re-computing the scores (see Section Compute Correlations).

Then, the first bloc in the notebook visualize.ipynb allows to get all the tables from the paper (and also to generate the latex code in data/correlation).

Download the data

All the dataset can be downloaded from this zip file. It needs to be unzipped into the path data before running the correlations.

unzip data.zip

The data folder contains:

  • a subfolder raw containing all the original dataset
  • a subfolder processed containing all the dataset processed in a unified format
  • a subfolder correlation containing all the final correlation results, and the main tables of the paper
  • a subfolder datacards containing all the data cards

Computing the correlations

Processing the files to a clean json with the metrics computed:

python beametrics/run_all.py

The optional argument --dataset allows to compute only on a specific dataset, e.g.:

python run_all.py --dataset SummarizationCNNDM.

The list of the datasets and their corresponding configuration can be found in configs/__init__.

When finished, you can print the final table as in the paper, see the notebook visualize.ipynb.

Data Cards:

For each dataset, a data card is available in the datacard folder. The cards are automatically generated when running run_all.py, by filling the template with the dataset configuration as detailed bellow, in Adding a new dataset.

Adding a new dataset:

In configs/, you need to create a new .py file that inherites from ConfigBase (in configs/co'nfig_base.py). You are expected to fill the mandatory fields that allow to run the code and fill the data card template:

  • file_name: the file name located in data/raw
  • file_name_processed: the file name once processed and formated
  • metric_names: you can pass _DEFAULT_METRIC_NAMES by default or customize it, e.g. metric_names = metric_names + ('sari',) where sari corresponds to a valid metric (see the next section)
  • name_dataset: the name of the dataset as it was published
  • short_name_dataset: few letters that will be used to name the dataset in the final table report
  • languages: the languages of the dataset (e.g. [en] or [en, fr])
  • task: e.g. 'simplification', 'data2text
  • number_examples: the total number of evaluated texts
  • nb_refs: the number of references available in the dataset
  • dimensions_definitions: the evaluated dimensions and their corresponding definition e.g. {'fluency: 'How fluent is the text?'}
  • scale: the scale used during the evaluation, as defined in the protocol
  • source_eval_sets: the dataset from which the source were collected to generate the evaluated examples
  • annotators: some information about who were the annotators
  • sampled_from: the URL where was released the evaluation dataset
  • citation: the citation of the paper where the dataset was released

Your class needs its custom method format_file. The function takes as input the dataset's file_name and return a dictionary d_data. The format for d_data has to be the same for all the datasets:

d_data = {
    key_1: {
        'source': "a_source", 
        'hypothesis': "an_hypothesis",
        'references': ["ref_1", "ref_2", ...],
        'dim_1': float(a_score),
        'dim_2': float(an_other_score),
    },
    ...
    key_n: {
        ...
    }
}

where 'key_1' and 'key_n' are the keys for the first and nth example, dim_1 and dim_2 dimensions corresponding to self.dimensions.

Finally, you need to add your dataset to the dictionary D_ALL_DATASETS located in config/__init__.

Adding a new metric:

First, create a class inheriting from metrics/metrics/MetricBase. Then, simply add it to the dictionary _D_METRICS in metrics/__init__.

For the metric to be computed by default, its name has to be added to either

  • _DEFAULT_METRIC_NAMES: metrics computed on each dataset
  • _DEFAULT_METRIC_NAMES_SRC: metrics computed on dataset that have a text format for their source (are excluded for now image captioning and data2text). These two tuples are located in metrics/metric_reported.

Alternatively, you can add the metric to a specific configuration by adding it to the attribute metric_names in the config.

This repository contains an implementation of the Permutohedral Attention Module in Pytorch

Permutohedral_attention_module This repository contains an implementation of the Permutohedral Attention Module

Samuel JOUTARD 26 Nov 27, 2022
Robust Video Matting in PyTorch, TensorFlow, TensorFlow.js, ONNX, CoreML!

Robust Video Matting in PyTorch, TensorFlow, TensorFlow.js, ONNX, CoreML!

Peter Lin 6.5k Jan 04, 2023
Benchmarks for Model-Based Optimization

Design-Bench Design-Bench is a benchmarking framework for solving automatic design problems that involve choosing an input that maximizes a black-box

Brandon Trabucco 43 Dec 20, 2022
Code for Talk-to-Edit (ICCV2021). Paper: Talk-to-Edit: Fine-Grained Facial Editing via Dialog.

Talk-to-Edit (ICCV2021) This repository contains the implementation of the following paper: Talk-to-Edit: Fine-Grained Facial Editing via Dialog Yumin

Yuming Jiang 221 Jan 07, 2023
Official repository for the paper, MidiBERT-Piano: Large-scale Pre-training for Symbolic Music Understanding.

MidiBERT-Piano Authors: Yi-Hui (Sophia) Chou, I-Chun (Bronwin) Chen Introduction This is the official repository for the paper, MidiBERT-Piano: Large-

137 Dec 15, 2022
Learning RAW-to-sRGB Mappings with Inaccurately Aligned Supervision (ICCV 2021)

Learning RAW-to-sRGB Mappings with Inaccurately Aligned Supervision (ICCV 2021) PyTorch implementation of Learning RAW-to-sRGB Mappings with Inaccurat

Zhilu Zhang 53 Dec 20, 2022
This package proposes simplified exporting pytorch models to ONNX and TensorRT, and also gives some base interface for model inference.

PyTorch Infer Utils This package proposes simplified exporting pytorch models to ONNX and TensorRT, and also gives some base interface for model infer

Alex Gorodnitskiy 11 Mar 20, 2022
Image-to-image regression with uncertainty quantification in PyTorch

Image-to-image regression with uncertainty quantification in PyTorch. Take any dataset and train a model to regress images to images with rigorous, distribution-free uncertainty quantification.

Anastasios Angelopoulos 25 Dec 26, 2022
Convolutional Neural Network for 3D meshes in PyTorch

MeshCNN in PyTorch SIGGRAPH 2019 [Paper] [Project Page] MeshCNN is a general-purpose deep neural network for 3D triangular meshes, which can be used f

Rana Hanocka 1.4k Jan 04, 2023
Surrogate- and Invariance-Boosted Contrastive Learning (SIB-CL)

Surrogate- and Invariance-Boosted Contrastive Learning (SIB-CL) This repository contains all source code used to generate the results in the article "

Charlotte Loh 3 Jul 23, 2022
This is the implementation of the paper LiST: Lite Self-training Makes Efficient Few-shot Learners.

LiST (Lite Self-Training) This is the implementation of the paper LiST: Lite Self-training Makes Efficient Few-shot Learners. LiST is short for Lite S

Microsoft 28 Dec 07, 2022
Definition of a business problem according to Wilson Lower Bound Score and Time Based Average Rating

Wilson Lower Bound Score, Time Based Rating Average In this study I tried to calculate the product rating and sorting reviews more accurately. I have

3 Sep 30, 2021
Code release for Convolutional Two-Stream Network Fusion for Video Action Recognition

Convolutional Two-Stream Network Fusion for Video Action Recognition

Christoph Feichtenhofer 676 Dec 31, 2022
Deep Markov Factor Analysis (NeurIPS2021)

Deep Markov Factor Analysis (DMFA) Codes and experiments for deep Markov factor analysis (DMFA) model accepted for publication at NeurIPS2021: A. Farn

Sarah Ostadabbas 2 Dec 16, 2022
Texture mapping with variational auto-encoders

vae-textures This is an experiment with using variational autoencoders (VAEs) to perform mesh parameterization. This was also my first project using J

Alex Nichol 41 May 24, 2022
Implementation of Ag-Grid component for Streamlit

streamlit-aggrid AgGrid is an awsome grid for web frontend. More information in https://www.ag-grid.com/. Consider purchasing a license from Ag-Grid i

Pablo Fonseca 556 Dec 31, 2022
TransVTSpotter: End-to-end Video Text Spotter with Transformer

TransVTSpotter: End-to-end Video Text Spotter with Transformer Introduction A Multilingual, Open World Video Text Dataset and End-to-end Video Text Sp

weijiawu 66 Dec 26, 2022
Individual Treatment Effect Estimation

CAPE Individual Treatment Effect Estimation Run CAPE python train_causal.py --loop 10 -m cape_cau -d NI --i_t 1 Run a baseline model python train_cau

S. Deng 4 Sep 02, 2022
PyTorch implementation of the paper Deep Networks from the Principle of Rate Reduction

Deep Networks from the Principle of Rate Reduction This repository is the official PyTorch implementation of the paper Deep Networks from the Principl

459 Dec 27, 2022