What are the best Systems? New Perspectives on NLP Benchmarking

Last update: Nov 03, 2022

In Machine Learning, a benchmark refers to an ensemble of datasets associated with one or multiple metrics, together with a way to aggregate the performance of different systems. Benchmarks are instrumental in (i) assessing the progress of new methods along different axes and (ii) selecting the best systems for practical use. This is particularly the case for NLP with the development of large pre-trained models (e.g. GPT, BERT) that are expected to generalize well on a variety of tasks. While the community has mainly focused on developing new datasets and metrics, there has been little interest in the aggregation procedure, which is often reduced to a simple average over various performance measures. However, this procedure can be problematic when the metrics are on different scales, which may lead to spurious conclusions. This paper proposes a new procedure to rank systems based on their performance across different tasks. Motivated by social choice theory, the final system ordering is obtained by aggregating the rankings induced by each task and is theoretically grounded. We conduct extensive numerical experiments (on over 270k scores) to assess the soundness of our approach on both synthetic and real scores (e.g. GLUE, EXTREM, SEVAL, TAC, FLICKR). In particular, we show that our method yields different conclusions on state-of-the-art systems than the mean-aggregation procedure while being both more reliable and robust.

Authors:

Goal:

This repository deals with automatic evaluation of NLG and addresses the special case of reference-based evaluation. The goal is to build a metric $m : \mathcal{S} \times \mathcal{S} \rightarrow \mathbb{R}$, where $\mathcal{S}$ is the space of sentences. An example is given below:

Overview

Limitations of Mean Aggregation

Counter Example
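As an illustration of how averaging can mislead when metrics live on different scales, here is a minimal sketch. The scores are made up for this example and the function names (`mean_ranking`, `borda_ranking`) are hypothetical, not part of this repository. The rank-based aggregation shown is a plain Borda count, used here only as a simple stand-in for ranking aggregation.

```python
# Toy scores (invented for illustration): task "A" is on a 0-100 scale,
# tasks "B" and "C" are on a 0-1 scale.
scores = {
    "sys1": {"A": 90.0, "B": 0.60, "C": 0.60},
    "sys2": {"A": 60.0, "B": 0.70, "C": 0.70},
    "sys3": {"A": 50.0, "B": 0.80, "C": 0.80},
}
tasks = ["A", "B", "C"]


def mean_ranking(scores):
    """Rank systems by the plain average of their raw task scores."""
    means = {s: sum(v.values()) / len(v) for s, v in scores.items()}
    return sorted(means, key=means.get, reverse=True)


def borda_ranking(scores, tasks):
    """Rank systems by Borda count: on each task, the best of n systems
    earns n - 1 points, the next n - 2, and so on; points are summed."""
    n = len(scores)
    points = {s: 0 for s in scores}
    for t in tasks:
        ordered = sorted(scores, key=lambda s: scores[s][t], reverse=True)
        for position, s in enumerate(ordered):
            points[s] += n - 1 - position
    return sorted(points, key=points.get, reverse=True)


print(mean_ranking(scores))          # ['sys1', 'sys2', 'sys3']
print(borda_ranking(scores, tasks))  # ['sys3', 'sys2', 'sys1']
```

The mean is dominated by task "A" because of its larger scale, so sys1 comes first even though it is last on two of the three tasks. The rank-based aggregation only uses the per-task orderings and is unaffected by the scale.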

Kemeny Consensus based Aggregation

Kemeny Consensus
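A Kemeny consensus is a ranking that minimizes the total Kendall tau distance (number of pairwise disagreements) to a set of input rankings. The sketch below computes it by brute force over all permutations, which is only feasible for a handful of systems. The function names and the toy rankings are assumptions made for illustration and do not reflect this repository's API.

```python
from itertools import combinations, permutations


def kendall_tau_distance(r1, r2):
    """Count the pairs of items that r1 and r2 order differently."""
    pos1 = {s: i for i, s in enumerate(r1)}
    pos2 = {s: i for i, s in enumerate(r2)}
    return sum(
        1
        for a, b in combinations(r1, 2)
        if (pos1[a] - pos1[b]) * (pos2[a] - pos2[b]) < 0
    )


def kemeny_consensus(rankings):
    """Return a ranking minimizing the summed Kendall tau distance to all
    input rankings. Exhaustive search: cost grows factorially with the
    number of systems."""
    systems = rankings[0]
    best = min(
        permutations(systems),
        key=lambda cand: sum(kendall_tau_distance(cand, r) for r in rankings),
    )
    return list(best)


# One ranking per task, best system first (toy data).
task_rankings = [
    ["sys1", "sys2", "sys3"],
    ["sys3", "sys2", "sys1"],
    ["sys3", "sys2", "sys1"],
]
print(kemeny_consensus(task_rankings))  # ['sys3', 'sys2', 'sys1']
```

Because the exhaustive search does not scale, practical implementations typically rely on an approximation such as a Borda count rather than the exact consensus.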

Aggregation when Task Level Information is available

Fig1. Production value and quantity of the 10 top commodities

SuperGLUE

XTREM

Toy Data

Toy Example

Aggregation when Instance Level Information is available

Reproducing the paper results

See notebooks.

References

If you find this repo useful, please cite our papers:

@article{,
  title={},
  author={},
  journal={},
  year={2022}
}

Usage

Python Function

Running our ranking requires only a CPU.

We provide example inputs under <>.py. For example for BaryScore

Command Line Interface (CLI)

We provide a command line interface (CLI) as well as a Python module. The CLI can be used as follows:

# TASK LEVEL INFORMATION
export PATH_TO_DF_TO_RANK=sample_df/glue.csv
export MODE=task_level

python ranking_cli.py --df_to_rank=$PATH_TO_DF_TO_RANK --mode=$MODE

# INSTANCE LEVEL INFORMATION
export PATH_TO_DF_TO_RANK=sample_df/TAC_08.csv
export MODE=instance_level
python ranking_cli.py --df_to_rank=$PATH_TO_DF_TO_RANK --mode=$MODE

See more options with python ranking_cli.py -h.

Acknowledgements

This work was granted access to the HPC resources of IDRIS under the allocation 2021-101838 made by GENCI. Nathan is funded by the ANR project LIMPID.


Owner

Pierre Colombo
