Implementation for paper BLEU: a Method for Automatic Evaluation of Machine Translation

Last update: Oct 07, 2021

Overview

BLEU Score

Implementation for paper:

BLEU: a Method for Automatic Evaluation of Machine Translation

Author: Ba Ngoc from ProtonX

BLEU is a popular metric for evaluating machine translation output. Check out the Transformer project we recently published.

I. Usage

from bleu_score import cal_corpus_bleu_score

candidates = ['eating chicken chicken is a eating a eating chicken',
              'eating chicken chicken is not good']
# One list of reference translations per candidate, in the same order
references_list = [
    ['a chicken is eating chicken', 'there is a chicken eating chicken'],
    ['a chicken is eating chicken', 'there is a chicken eating chicken'],
]

bleu_score = cal_corpus_bleu_score(candidates, references_list,
                                   weights=(0.25, 0.25, 0.25, 0.25), N=4)

print('Bleu Score: {}'.format(bleu_score))

II. BLEU Score Formula

1. Precision

For each n-gram that appears in the candidates, we count how many times it occurs in the candidates and how many times it occurs in the references. The precision is the ratio of the total clipped count to the total number of n-grams in the candidates.

Important: count clipping means that the count of an n-gram cannot exceed the maximum number of times that n-gram appears in any single reference.

For example, suppose the gram ('a', 'a') appears 3 times in a candidate, but at most 2 times in any single reference. We then use the value 2 in the calculation.
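The clipping rule can be sketched as follows. This is a minimal illustration and not the code from this repository: the function name clip_counts and the toy counts are assumptions made for the example.

```python
from collections import Counter


def clip_counts(candidate_counts, reference_counts_list):
    """Cap each candidate n-gram count at its maximum count in any single reference.

    Hypothetical helper for illustration; not part of bleu_score.
    """
    clipped = Counter()
    for gram, count in candidate_counts.items():
        # Counter returns 0 for n-grams a reference does not contain
        max_ref = max((ref[gram] for ref in reference_counts_list), default=0)
        clipped[gram] = min(count, max_ref)
    return clipped


# ('a', 'a') occurs 3 times in the candidate, at most 2 times in one reference
candidate_counts = Counter({('a', 'a'): 3})
reference_counts_list = [Counter({('a', 'a'): 2}), Counter({('a', 'a'): 1})]

clipped = clip_counts(candidate_counts, reference_counts_list)
print(clipped[('a', 'a')])  # 2
```

The clipped counts summed over all n-grams form the numerator of the precision; the unclipped candidate counts form the denominator.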

Never heard of n-grams? An n-gram is a contiguous sequence of n words in a sentence. For a given n, we count how many times each such sequence occurs.

Candidate 1: 'eating chicken chicken is a eating a eating chicken'

-------Unigrams------

eating	3
chicken	3
is	1
a	2

-------Bigrams------

eating chicken	2
chicken chicken	1
chicken is	1
is a	1
a eating	2
eating a	1

The same counting applies to trigrams and 4-grams.
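The counts above can be reproduced with a short sketch. It assumes whitespace tokenization, and the helper name count_ngrams is hypothetical; the repository may tokenize and count differently.

```python
from collections import Counter


def count_ngrams(sentence, n):
    """Count contiguous word n-grams in a whitespace-tokenized sentence."""
    tokens = sentence.split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


candidate = 'eating chicken chicken is a eating a eating chicken'

unigrams = count_ngrams(candidate, 1)
bigrams = count_ngrams(candidate, 2)
trigrams = count_ngrams(candidate, 3)

print(unigrams[('eating',)])            # 3
print(bigrams[('eating', 'chicken')])   # 2
print(bigrams[('a', 'eating')])         # 2
```

A sentence with 9 words has 9 unigrams, 8 bigrams, 7 trigrams and 6 4-grams, so the number of available n-grams shrinks as n grows.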

2. Sentence brevity penalty

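For each candidate, we pick the reference whose length is closest to the candidate's length. That length is the effective reference length for the candidate.

See the function get_eff_ref_length in utils.py.

c: the total length of all candidates

r: the sum of the effective reference lengths over all candidates

The brevity penalty BP is 1 if c > r, and exp(1 - r/c) otherwise.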
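A minimal sketch of the brevity penalty, following the definition in the BLEU paper: BP = 1 if c > r, otherwise exp(1 - r/c). The names closest_ref_length and brevity_penalty are hypothetical, and breaking ties toward the shorter reference is an assumption; get_eff_ref_length in utils.py may behave differently.

```python
import math


def closest_ref_length(candidate_len, reference_lens):
    """Pick the reference length closest to the candidate length.

    Assumption: ties are broken in favor of the shorter reference.
    """
    return min(reference_lens, key=lambda r: (abs(r - candidate_len), r))


def brevity_penalty(c, r):
    """BP = 1 if c > r, else exp(1 - r / c); 0 for an empty candidate corpus."""
    if c == 0:
        return 0.0
    if c > r:
        return 1.0
    return math.exp(1 - r / c)


# Word counts for the candidates and references from the usage example
candidate_lens = [9, 6]
reference_lens = [[5, 6], [5, 6]]

c = sum(candidate_lens)
r = sum(closest_ref_length(cl, rl)
        for cl, rl in zip(candidate_lens, reference_lens))

print(c, r)                   # 15 12
print(brevity_penalty(c, r))  # 1.0
```

Here the candidates are longer in total than their effective references, so no penalty is applied. A corpus of candidates shorter than the references would get a value below 1.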

3. BLEU Formula

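BLEU = BP * exp(sum over n = 1..N of w_n * log(p_n)), where p_n is the clipped n-gram precision and BP is the brevity penalty.

N: the maximum n-gram order (4 in the usage example)

w: the list of pre-set weights, one for each n-gram order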
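The final score combines the brevity penalty with a weighted geometric mean of the n-gram precisions, as defined in the BLEU paper: BLEU = BP * exp(sum of w_n * log(p_n)). The sketch below is illustrative only. The function name combine_bleu is hypothetical, and returning 0 when any precision is 0 is an assumption; cal_corpus_bleu_score may handle that case differently.

```python
import math


def combine_bleu(precisions, weights, bp):
    """Combine n-gram precisions p_1..p_N into a BLEU score.

    precisions: clipped precisions, one per n-gram order
    weights:    one weight per n-gram order, e.g. (0.25, 0.25, 0.25, 0.25)
    bp:         brevity penalty
    """
    # log(0) is undefined; assumption: the score is 0 if any precision is 0
    if any(p == 0 for p in precisions):
        return 0.0
    log_mean = sum(w * math.log(p) for w, p in zip(weights, precisions))
    return bp * math.exp(log_mean)


# Toy precisions chosen for illustration, not computed from the usage example
score = combine_bleu([0.5, 0.5, 0.5, 0.5], (0.25, 0.25, 0.25, 0.25), bp=1.0)
print(round(score, 6))  # 0.5
```

With uniform weights the exponent is the log of the geometric mean of the precisions, so equal precisions of 0.5 and no brevity penalty give a score of 0.5.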

Owner

Ngoc Nguyen Ba
