Bidimensional Leaderboards: Generate and Evaluate Language Hand in Hand

Last update: Dec 03, 2022

Related tags

Overview

Bidimensional Leaderboards: Generate and Evaluate Language Hand in Hand

Introduction

We propose a generalization of leaderboards, bidimensional leaderboards (Billboards), that simultaneously drives progress in language generation tasks and their evaluation. We accept two types of submissions:

Generator developers submit output text. A Billboard computes all metric scores.
Metric developers submit an executable program. A Billboard computes correlations with the human judgments, updates the ensemble metric, and measures how much it overrates machine over human generations.

Anonymous submissions are allowed!!

Submit

Submission guides and examples are available here.

Scoring Results

Scoring results for all past public submissions are available here. We have generator-name||metric-name.csv files from the Cartesian product between the generators and metrics: each contains instance-level scores.

Citations

Bidimesional Leaderboards

@misc{kasai2021billboard,
    title   = {Bidimensional Leaderboards: Generate and Evaluate Language Hand in Hand},
    author  = {Jungo Kasai and Keisuke Sakaguchi and Ronan Le Bras and Lavinia Dunagan and Jacob Morrison and Alexander R. Fabbri and Yejin Choi and Noah A. Smith},
    year    = {2021},
    url     = {https://arxiv.org/abs/2112.04139}, 
}

MSCOCO Captioning Evaluations and THumB 1.0 Protocol

@misc{kasai2021thumb,
    title   = {Transparent Human Evaluation for Image Captioning},
    author  = {Jungo Kasai and Keisuke Sakaguchi and Lavinia Dunagan and Jacob Morrison and Ronan Le Bras and Yejin Choi and Noah A. Smith},
    year    = {2021},
    url     = {https://arxiv.org/abs/2111.08940}, 
}

CNNDM Summarization Evaluations

@article{fabbri2021summeval,
    title   = {{SummEval}: Re-evaluating Summarization Evaluation},
    author  = {Fabbri, Alexander R and Kry{\'s}ci{\'n}ski, Wojciech and McCann, Bryan and Xiong, Caiming and Socher, Richard and Radev, Dragomir},
    journal = {TACL},
    year    = {2021},
    url     = {https://arxiv.org/abs/2007.12626},
}

WMT20 ZH-EN/EN-DE Machine Translation Evaluations

@misc{freitag2021experts,
      title={Experts, Errors, and Context: A Large-Scale Study of Human Evaluation for Machine Translation}, 
      author={Markus Freitag and George Foster and David Grangier and Viresh Ratnakar and Qijun Tan and Wolfgang Macherey},
      year={2021},
      url={https://arxiv.org/abs/2104.14478},
}

Bidimensional Leaderboards: Generate and Evaluate Language Hand in Hand

Related tags

Overview

Bidimensional Leaderboards: Generate and Evaluate Language Hand in Hand

Introduction

Submit

Scoring Results

Citations

Bidimesional Leaderboards

MSCOCO Captioning Evaluations and THumB 1.0 Protocol

CNNDM Summarization Evaluations

WMT20 ZH-EN/EN-DE Machine Translation Evaluations

Owner

CLEAR algorithm for multi-view data association

This repository contains the needed resources to build the HIRID-ICU-Benchmark dataset

Automatic meme generation model using Tensorflow Keras.

OpenDILab Multi-Agent Environment

DualGAN-tensorflow: tensorflow implementation of DualGAN

Equivariant Imaging: Learning Beyond the Range Space

A library for hidden semi-Markov models with explicit durations

Fully Convlutional Neural Networks for state-of-the-art time series classification

PyTorchCV: A PyTorch-Based Framework for Deep Learning in Computer Vision.

The Wearables Development Toolkit - a development environment for activity recognition applications with sensor signals

Pytorch Implementation for NeurIPS (oral) paper: Pixel Level Cycle Association: A New Perspective for Domain Adaptive Semantic Segmentation

This is a repository for a Semantic Segmentation inference API using the Gluoncv CV toolkit

Easy to use Audio Tagging in PyTorch

Light-SERNet: A lightweight fully convolutional neural network for speech emotion recognition

ByteTrack: Multi-Object Tracking by Associating Every Detection Box

Memory Defense: More Robust Classificationvia a Memory-Masking Autoencoder

Code for the paper "Reinforced Active Learning for Image Segmentation"

GDR-Net: Geometry-Guided Direct Regression Network for Monocular 6D Object Pose Estimation. (CVPR 2021)

Official implementation for TTT++: When Does Self-supervised Test-time Training Fail or Thrive

Improved Fitness Optimization Landscapes for Sequence Design