ACL22 paper: Imputing Out-of-Vocabulary Embeddings with LOVE Makes Language Models Robust with Little Cost

Overview

Imputing Out-of-Vocabulary Embeddings with LOVE Makes Language Models Robust with Little Cost

LOVE is accpeted by ACL22 main conference as a long paper (oral). This is a Pytorch implementation of our paper.

What is LOVE?

LOVE, Learning Out-of-Vocabulary Embeddings, is the name of our beautiful model given by Fabian Suchanek.

LOVE can produce word embeddings for arbitrary words, including out-of-vocabulary words like misspelled words, rare words, domain-specific words.....

Specifically, LOVE follows the principle of mimick-like models [2] to generate vectors for unseen words, by learning the behavior of pre-trained embeddings using only the surface form of words, as shown in the below figure.

mimic_model

To our best knowledge, LOVE is the first one to use contrastive learning for word-level representations. The framework is shown in the below figure, and it uses various data augmentations to generate positive samples. Another distinction is that LOVE adopts a novel fully attention-based encoder named PAM to mimic the vectors from pre-trained embeddings. You can find all details in our paper. mimic_model

The benefits of LOVE?

1. Impute vectors for unseen words

As we know, pre-trained embeddings like FastText use a fixed-size vocabulary, which means the performance decreases a lot when dealing with OOV words.

LOVE can mimic the behavior of pre-trained language models (including BERT) and impute vectors for any words.

For example, mispleling is a typo word, and LOVE can impute a reasonable vector for it:

from produce_emb import produce

oov_word = 'mispleling'
emb = produce(oov_word)
print(emb[oov_word][:10])

## output [-0.0582502  -0.11268596 -0.12599416  0.09926333  0.02513208  0.01140639
 -0.02326127 -0.007608    0.01973115  0.12448607]

2. Make LMs robust with little cost

LOVE can be used in a plug-and-play fashion with FastText and BERT, where it significantly improves their robustness. For example, LOVE with 6.5M can work with FastText (900+M) together and improve its robustness, as shown in the figure: mimic_model

The usage of LOVE

Clone the repository and set up the environment via "requirements.txt". Here we use python3.6.

pip install -r requirements.txt

Data preparation

In our experiments, we use the FastText as target vectors [1]. Downlaod. After downloading, put the embedding file in the path data/

Training

First you can use -help to show the arguments

python train.py -help

Once completing the data preparation and environment setup, we can train the model via train.py. We have also provided sample datasets, you can just run the mode without downloading.

python train.py -dataset data/wiki_100.vec

Evaulation

To show the intrinsic results of our model, you can use the following command and we have provided the trained model we used in our paper.

python evaluate.py

## expected output
model parameters:~6.5M
[RareWord]: [plugin], 42.6476207426462 
[MEN  ]: [plugin], 68.47815031602434 
[SimLex]: [plugin], 35.02258000865248 
[rel353]: [plugin], 55.8950046345804 
[simverb]: [plugin], 28.7233237185531 
[muturk]: [plugin], 63.77020916555088 

Reference

[1] Bojanowski, Piotr, et al. "Enriching word vectors with subword information." Transactions of the Association for Computational Linguistics 5 (2017): 135-146.

[2] Pinter, Yuval, Robert Guthrie, and Jacob Eisenstein. "Mimicking Word Embeddings using Subword RNNs." Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. 2017.

Owner
Lihu Chen
A PhD student of IP Paris! Enjoy Coding!
Lihu Chen
This converter will create the exact measure for your cappuccino recipe from the grandiose Rafaella Ballerini!

About CappuccinoJs This converter will create the exact measure for your cappuccino recipe from the grandiose Rafaella Ballerini! Este conversor criar

Arthur Ottoni Ribeiro 48 Nov 15, 2022
Transformers and related deep network architectures are summarized and implemented here.

Transformers: from NLP to CV This is a practical introduction to Transformers from Natural Language Processing (NLP) to Computer Vision (CV) Introduct

Ibrahim Sobh 138 Dec 27, 2022
PatrickStar enables Larger, Faster, Greener Pretrained Models for NLP. Democratize AI for everyone.

PatrickStar enables Larger, Faster, Greener Pretrained Models for NLP. Democratize AI for everyone.

Tencent 633 Dec 28, 2022
A very simple framework for state-of-the-art Natural Language Processing (NLP)

A very simple framework for state-of-the-art NLP. Developed by Humboldt University of Berlin and friends. Flair is: A powerful NLP library. Flair allo

flair 12.3k Jan 02, 2023
A list of NLP(Natural Language Processing) tutorials built on Tensorflow 2.0.

A list of NLP(Natural Language Processing) tutorials built on Tensorflow 2.0.

Won Joon Yoo 335 Jan 04, 2023
NLPretext packages in a unique library all the text preprocessing functions you need to ease your NLP project.

NLPretext packages in a unique library all the text preprocessing functions you need to ease your NLP project.

Artefact 114 Dec 15, 2022
An assignment on creating a minimalist neural network toolkit for CS11-747

minnn by Graham Neubig, Zhisong Zhang, and Divyansh Kaushik This is an exercise in developing a minimalist neural network toolkit for NLP, part of Car

Graham Neubig 63 Dec 29, 2022
A minimal Conformer ASR implementation adapted from ESPnet.

Conformer ASR A minimal Conformer ASR implementation adapted from ESPnet. Introduction I want to use the pre-trained English ASR model provided by ESP

Niu Zhe 3 Jan 24, 2022
Labelling platform for text using distant supervision

With DataQA, you can label unstructured text documents using rule-based distant supervision.

245 Aug 05, 2022
This repo is to provide a list of literature regarding Deep Learning on Graphs for NLP

This repo is to provide a list of literature regarding Deep Learning on Graphs for NLP

Graph4AI 230 Nov 22, 2022
Speech Recognition for Uyghur using Speech transformer

Speech Recognition for Uyghur using Speech transformer Training: this model using CTC loss and Cross Entropy loss for training. Download pretrained mo

Uyghur 11 Nov 17, 2022
Official implementation of MLP Singer: Towards Rapid Parallel Korean Singing Voice Synthesis

MLP Singer Official implementation of MLP Singer: Towards Rapid Parallel Korean Singing Voice Synthesis. Audio samples are available on our demo page.

Neosapience 103 Dec 23, 2022
PyTorch implementation of the NIPS-17 paper "Poincaré Embeddings for Learning Hierarchical Representations"

Poincaré Embeddings for Learning Hierarchical Representations PyTorch implementation of Poincaré Embeddings for Learning Hierarchical Representations

Facebook Research 1.6k Dec 29, 2022
This is Assignment1 code for the Web Data Processing System.

This is a Python program to Entity Linking by processing WARC files. We recognize entities from web pages and link them to a Knowledge Base(Wikidata).

3 Dec 04, 2022
translate using your voice

speech-to-text-translator Usage translate using your voice description this project makes translating a word easy, all you have to do is speak and...

1 Oct 18, 2021
Use AutoModelForSeq2SeqLM in Huggingface Transformers to train COMET

Training COMET using seq2seq setting Use AutoModelForSeq2SeqLM in Huggingface Transformers to train COMET. The codes are modified from run_summarizati

tqfang 9 Dec 17, 2022
A machine learning model for analyzing text for user sentiment and determine whether its a positive, neutral, or negative review.

Sentiment Analysis on Yelp's Dataset Author: Roberto Sanchez, Talent Path: D1 Group Docker Deployment: Deployment of this application can be found her

Roberto Sanchez 0 Aug 04, 2021
Summarization, translation, sentiment-analysis, text-generation and more at blazing speed using a T5 version implemented in ONNX.

Summarization, translation, Q&A, text generation and more at blazing speed using a T5 version implemented in ONNX. This package is still in alpha stag

Abel 211 Dec 28, 2022
Twewy-discord-chatbot - Build a Discord AI Chatbot that Speaks like Your Favorite Character

Build a Discord AI Chatbot that Speaks like Your Favorite Character! This is a Discord AI Chatbot that uses the Microsoft DialoGPT conversational mode

Lynn Zheng 231 Dec 30, 2022
A number of methods in order to perform Natural Language Processing on live data derived from Twitter

A number of methods in order to perform Natural Language Processing on live data derived from Twitter

1 Nov 24, 2021