InferSent sentence embeddings

Overview

InferSent

InferSent is a sentence embeddings method that provides semantic representations for English sentences. It is trained on natural language inference data and generalizes well to many different tasks.

We provide our pre-trained English sentence encoder from our paper and our SentEval evaluation toolkit.

Recent changes: Removed train_nli.py and only kept pretrained models for simplicity. Reason is I do not have time anymore to maintain the repo beyond simple scripts to get sentence embeddings.

Dependencies

This code is written in python. Dependencies include:

  • Python 2/3
  • Pytorch (recent version)
  • NLTK >= 3

Download word vectors

Download GloVe (V1) or fastText (V2) vectors:

mkdir GloVe
curl -Lo GloVe/glove.840B.300d.zip http://nlp.stanford.edu/data/glove.840B.300d.zip
unzip GloVe/glove.840B.300d.zip -d GloVe/
mkdir fastText
curl -Lo fastText/crawl-300d-2M.vec.zip https://dl.fbaipublicfiles.com/fasttext/vectors-english/crawl-300d-2M.vec.zip
unzip fastText/crawl-300d-2M.vec.zip -d fastText/

Use our sentence encoder

We provide a simple interface to encode English sentences. See demo.ipynb for a practical example. Get started with the following steps:

0.0) Download our InferSent models (V1 trained with GloVe, V2 trained with fastText)[147MB]:

mkdir encoder
curl -Lo encoder/infersent1.pkl https://dl.fbaipublicfiles.com/infersent/infersent1.pkl
curl -Lo encoder/infersent2.pkl https://dl.fbaipublicfiles.com/infersent/infersent2.pkl

Note that infersent1 is trained with GloVe (which have been trained on text preprocessed with the PTB tokenizer) and infersent2 is trained with fastText (which have been trained on text preprocessed with the MOSES tokenizer). The latter also removes the padding of zeros with max-pooling which was inconvenient when embedding sentences outside of their batches.

0.1) Make sure you have the NLTK tokenizer by running the following once:

import nltk
nltk.download('punkt')

1) Load our pre-trained model (in encoder/):

from models import InferSent
V = 2
MODEL_PATH = 'encoder/infersent%s.pkl' % V
params_model = {'bsize': 64, 'word_emb_dim': 300, 'enc_lstm_dim': 2048,
                'pool_type': 'max', 'dpout_model': 0.0, 'version': V}
infersent = InferSent(params_model)
infersent.load_state_dict(torch.load(MODEL_PATH))

2) Set word vector path for the model:

W2V_PATH = 'fastText/crawl-300d-2M.vec'
infersent.set_w2v_path(W2V_PATH)

3) Build the vocabulary of word vectors (i.e keep only those needed):

infersent.build_vocab(sentences, tokenize=True)

where sentences is your list of n sentences. You can update your vocabulary using infersent.update_vocab(sentences), or directly load the K most common English words with infersent.build_vocab_k_words(K=100000). If tokenize is True (by default), sentences will be tokenized using NTLK.

4) Encode your sentences (list of n sentences):

embeddings = infersent.encode(sentences, tokenize=True)

This outputs a numpy array with n vectors of dimension 4096. Speed is around 1000 sentences per second with batch size 128 on a single GPU.

5) Visualize the importance that our model attributes to each word:

We provide a function to visualize the importance of each word in the encoding of a sentence:

infersent.visualize('A man plays an instrument.', tokenize=True)

Model

Evaluate the encoder on transfer tasks

To evaluate the model on transfer tasks, see SentEval. Be mindful to choose the same tokenization used for training the encoder. You should obtain the following test results for the baselines and the InferSent models:

Model MR CR SUBJ MPQA STS14 STS Benchmark SICK Relatedness SICK Entailment SST TREC MRPC
InferSent1 81.1 86.3 92.4 90.2 .68/.65 75.8/75.5 0.884 86.1 84.6 88.2 76.2/83.1
InferSent2 79.7 84.2 92.7 89.4 .68/.66 78.4/78.4 0.888 86.3 84.3 90.8 76.0/83.8
SkipThought 79.4 83.1 93.7 89.3 .44/.45 72.1/70.2 0.858 79.5 82.9 88.4 -
fastText-BoV 78.2 80.2 91.8 88.0 .65/.63 70.2/68.3 0.823 78.9 82.3 83.4 74.4/82.4

Reference

Please consider citing [1] if you found this code useful.

Supervised Learning of Universal Sentence Representations from Natural Language Inference Data (EMNLP 2017)

[1] A. Conneau, D. Kiela, H. Schwenk, L. Barrault, A. Bordes, Supervised Learning of Universal Sentence Representations from Natural Language Inference Data

@InProceedings{conneau-EtAl:2017:EMNLP2017,
  author    = {Conneau, Alexis  and  Kiela, Douwe  and  Schwenk, Holger  and  Barrault, Lo\"{i}c  and  Bordes, Antoine},
  title     = {Supervised Learning of Universal Sentence Representations from Natural Language Inference Data},
  booktitle = {Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing},
  month     = {September},
  year      = {2017},
  address   = {Copenhagen, Denmark},
  publisher = {Association for Computational Linguistics},
  pages     = {670--680},
  url       = {https://www.aclweb.org/anthology/D17-1070}
}

Related work

Owner
Facebook Research
Facebook Research
Chinese Named Entity Recognization (BiLSTM with PyTorch)

BiLSTM-CRF for Name Entity Recognition PyTorch version A PyTorch implemention of Bi-LSTM-CRF model for Chinese Named Entity Recognition. 使用 PyTorch 实现

5 Jun 01, 2022
VoiceFixer VoiceFixer is a framework for general speech restoration.

VoiceFixer VoiceFixer is a framework for general speech restoration. We aim at the restoration of severly degraded speech and historical speech. Paper

Leo 174 Jan 06, 2023
Research Code for NeurIPS 2020 Spotlight paper "Large-Scale Adversarial Training for Vision-and-Language Representation Learning": UNITER adversarial training part

VILLA: Vision-and-Language Adversarial Training This is the official repository of VILLA (NeurIPS 2020 Spotlight). This repository currently supports

Zhe Gan 109 Dec 31, 2022
Unofficial Parallel WaveGAN (+ MelGAN & Multi-band MelGAN & HiFi-GAN & StyleMelGAN) with Pytorch

Parallel WaveGAN implementation with Pytorch This repository provides UNOFFICIAL pytorch implementations of the following models: Parallel WaveGAN Mel

Tomoki Hayashi 1.2k Dec 23, 2022
IMS-Toucan is a toolkit to train state-of-the-art Speech Synthesis models

IMS-Toucan is a toolkit to train state-of-the-art Speech Synthesis models. Everything is pure Python and PyTorch based to keep it as simple and beginner-friendly, yet powerful as possible.

Digital Phonetics at the University of Stuttgart 247 Jan 05, 2023
Yuqing Xie 2 Feb 17, 2022
Code for the paper "Are Sixteen Heads Really Better than One?"

Are Sixteen Heads Really Better than One? This repository contains code to reproduce the experiments in our paper Are Sixteen Heads Really Better than

Paul Michel 143 Dec 14, 2022
A workshop with several modules to help learn Feast, an open-source feature store

Workshop: Learning Feast This workshop aims to teach users about Feast, an open-source feature store. We explain concepts & best practices by example,

Feast 52 Jan 05, 2023
A python project made to generate code using either OpenAI's codex or GPT-J (Although not as good as codex)

CodeJ A python project made to generate code using either OpenAI's codex or GPT-J (Although not as good as codex) Install requirements pip install -r

TheProtagonist 1 Dec 06, 2021
Simple multilingual lemmatizer for Python, especially useful for speed and efficiency

Simplemma: a simple multilingual lemmatizer for Python Purpose Lemmatization is the process of grouping together the inflected forms of a word so they

Adrien Barbaresi 70 Dec 29, 2022
Constituency Tree Labeling Tool

Constituency Tree Labeling Tool The purpose of this package is to solve the constituency tree labeling problem. Look from the dataset labeled by NLTK,

张宇 6 Dec 20, 2022
🏆 • 5050 most frequent words in 109 languages

🏆 Most Common Words Multilingual 5000 most frequent words in 109 languages. Uses wordfrequency.info as a source. 🔗 License source code license data

14 Nov 24, 2022
CCQA A New Web-Scale Question Answering Dataset for Model Pre-Training

CCQA: A New Web-Scale Question Answering Dataset for Model Pre-Training This is the official repository for the code and models of the paper CCQA: A N

Meta Research 29 Nov 30, 2022
🗣️ NALP is a library that covers Natural Adversarial Language Processing.

NALP: Natural Adversarial Language Processing Welcome to NALP. Have you ever wanted to create natural text from raw sources? If yes, NALP is for you!

Gustavo Rosa 21 Aug 12, 2022
TalkNet: Audio-visual active speaker detection Model

Is someone talking? TalkNet: Audio-visual active speaker detection Model This repository contains the code for our ACM MM 2021 paper, TalkNet, an acti

142 Dec 14, 2022
This repository serves as a place to document a toy attempt on how to create a generative text model in Catalan, based on GPT-2

GPT-2 Catalan playground and scripts to train a GPT-2 model either from scrath or from another pretrained model.

Laura 1 Jan 28, 2022
👄 The most accurate natural language detection library for Python, suitable for long and short text alike

1. What does this library do? Its task is simple: It tells you which language some provided textual data is written in. This is very useful as a prepr

Peter M. Stahl 334 Dec 30, 2022
translate using your voice

speech-to-text-translator Usage translate using your voice description this project makes translating a word easy, all you have to do is speak and...

1 Oct 18, 2021
A large-scale (194k), Multiple-Choice Question Answering (MCQA) dataset designed to address realworld medical entrance exam questions.

MedMCQA MedMCQA : A Large-scale Multi-Subject Multi-Choice Dataset for Medical domain Question Answering A large-scale, Multiple-Choice Question Answe

MedMCQA 24 Nov 30, 2022
TruthfulQA: Measuring How Models Imitate Human Falsehoods

TruthfulQA: Measuring How Models Imitate Human Falsehoods

69 Dec 25, 2022