The official implementation of "BERT is to NLP what AlexNet is to CV: Can Pre-Trained Language Models Identify Analogies?, ACL 2021 main conference"

Last update: Nov 03, 2022

Related tags

Text Data & NLP analogy-language-model

Overview

BERT is to NLP what AlexNet is to CV

This is the official implementation of BERT is to NLP what AlexNet is to CV: Can Pre-Trained Language Models Identify Analogies? (the camera-ready version of the paper is here) which has been accepted by the ACL 2021 main conference. We evaluate pretrained language models (LM) on five analogy tests that follow SAT-style format as below.

QUERY word:language
OPTION
  (1) paint:portrait
  (2) poetry:rhythm 
  (3) note:music <-- the answer!
  (4) tale:story
  (5) week:year

We devise a new class of scoring functions, referred to as analogical proportion (AP) scores, to solve word analogies in an unsurpervised fashion and investigate the relational knowledge that LM learnt through pretraining.

Please see our paper for more information and discussion.

Get started

git clone https://github.com/asahi417/analogy-language-model
cd analogy-language-model
pip install -e .

Run Experiments

The following scripts reproduce our results in the paper.

# get result for our main AP score
python experiments/experiment_ppl_variants.py 
# get result for word embedding baseline
python experiments/experiment_word_embedding.py 
# get result for other scoring function such as vector difference, etc
python experiments/experiment_scoring_comparison.py

Here's the result summary that can be attained by running those scripts.

experimental results

Dataset

The datasets used in our experiments can be downloaded from the following link:

Analogy Datasets

Please see the Analogy Tool for more information about the dataset and baselines.

Citation

Please cite our reference paper if you use our data or code:

@inproceedings{ushio-etal-2021-bert,
    title = "{BERT} is to {NLP} what {A}lex{N}et is to {CV}: Can Pre-Trained Language Models Identify Analogies?",
    author = "Ushio, Asahi  and
      Espinosa Anke, Luis  and
      Schockaert, Steven  and
      Camacho-Collados, Jose",
    booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.acl-long.280",
    doi = "10.18653/v1/2021.acl-long.280",
    pages = "3609--3624",
    abstract = "Analogies play a central role in human commonsense reasoning. The ability to recognize analogies such as {``}eye is to seeing what ear is to hearing{''}, sometimes referred to as analogical proportions, shape how we structure knowledge and understand language. Surprisingly, however, the task of identifying such analogies has not yet received much attention in the language model era. In this paper, we analyze the capabilities of transformer-based language models on this unsupervised task, using benchmarks obtained from educational settings, as well as more commonly used datasets. We find that off-the-shelf language models can identify analogies to a certain extent, but struggle with abstract and complex relations, and results are highly sensitive to model architecture and hyperparameters. Overall the best results were obtained with GPT-2 and RoBERTa, while configurations using BERT were not able to outperform word embedding models. Our results raise important questions for future work about how, and to what extent, pre-trained language models capture knowledge about abstract semantic relations.",
}

Please also cite the relevant reference papers if using any of the analogy datasets.

Official code of our work, Unified Pre-training for Program Understanding and Generation [NAACL 2021].

PLBART Code pre-release of our work, Unified Pre-training for Program Understanding and Generation accepted at NAACL 2021. Note. A detailed documentat

138 Dec 30, 2022

Official source for spanish Language Models and resources made @ BSC-TEMU within the "Plan de las Tecnologías del Lenguaje" (Plan-TL).

Spanish Language Models 💃🏻 Corpora 📃 Corpora Number of documents Size (GB) BNE 201,080,084 570GB Models 🤖 RoBERTa-base BNE: https://huggingface.co

203 Dec 20, 2022

Implementing SimCSE(paper, official repository) using TensorFlow 2 and KR-BERT.

KR-BERT-SimCSE Implementing SimCSE(paper, official repository) using TensorFlow 2 and KR-BERT. Training Unsupervised python train_unsupervised.py --mi

27 Dec 12, 2022

This repository contains the official release of the model "BanglaBERT" and associated downstream finetuning code and datasets introduced in the paper titled "BanglaBERT: Combating Embedding Barrier in Multilingual Models for Low-Resource Language Understanding".

BanglaBERT This repository contains the official release of the model "BanglaBERT" and associated downstream finetuning code and datasets introduced i

197 Dec 25, 2022

An implementation of model parallel GPT-3-like models on GPUs, based on the DeepSpeed library. Designed to be able to train models in the hundreds of billions of parameters or larger.

GPT-NeoX An implementation of model parallel GPT-3-like models on GPUs, based on the DeepSpeed library. Designed to be able to train models in the hun

3.1k Jan 8, 2023

The official implementation of "BERT is to NLP what AlexNet is to CV: Can Pre-Trained Language Models Identify Analogies?, ACL 2021 main conference"

Related tags

Overview

BERT is to NLP what AlexNet is to CV

Get started

Run Experiments

Dataset

Citation

You might also like...

Official code of our work, Unified Pre-training for Program Understanding and Generation [NAACL 2021].

Official source for spanish Language Models and resources made @ BSC-TEMU within the "Plan de las Tecnologías del Lenguaje" (Plan-TL).

Implementing SimCSE(paper, official repository) using TensorFlow 2 and KR-BERT.

This repository contains the official release of the model "BanglaBERT" and associated downstream finetuning code and datasets introduced in the paper titled "BanglaBERT: Combating Embedding Barrier in Multilingual Models for Low-Resource Language Understanding".

The official repository of the ISBI 2022 KNIGHT Challenge

official ( API ) for the zAmericanEnglish app in [ Google play ] and [ App store ]

Official codebase for Can Wikipedia Help Offline Reinforcement Learning?

SAINT PyTorch implementation

An implementation of model parallel GPT-3-like models on GPUs, based on the DeepSpeed library. Designed to be able to train models in the hundreds of billions of parameters or larger.

Releases(0.0.0)

0.0.0(Apr 29, 2021)

Owner

Asahi Ushio

Applying "Load What You Need: Smaller Versions of Multilingual BERT" to LaBSE

CoSENT 比Sentence-BERT更有效的句向量方案

Sequence Modeling with Structured State Spaces

A list of NLP(Natural Language Processing) tutorials

A simple word search made in python

PeCo: Perceptual Codebook for BERT Pre-training of Vision Transformers

This is the Alpha of Nutte language, she is not complete yet / Essa é a Alpha da Nutte language, não está completa ainda

Based on 125GB of data leaked from Twitch, you can see their monthly revenues from 2019-2021

Analyse japanese ebooks using MeCab to determine the difficulty level for japanese learners

A telegram bot to translate 100+ Languages

✨Fast Coreference Resolution in spaCy with Neural Networks

Convolutional Neural Networks for Sentence Classification

This repository collects together basic linguistic processing data for using dataset dumps from the Common Voice project

Random-Word-Generator - Generates meaningful words from dictionary with given no. of letters and words.

This repository contains the official release of the model "BanglaBERT" and associated downstream finetuning code and datasets introduced in the paper titled "BanglaBERT: Combating Embedding Barrier in Multilingual Models for Low-Resource Language Understanding".

A simple implementation of N-gram language model.

Code for the paper "A Simple but Tough-to-Beat Baseline for Sentence Embeddings".

This codebase facilitates fast experimentation of differentially private training of Hugging Face transformers.

Sentello is python script that simulates the anti-evasion and anti-analysis techniques used by malware.

[ICCV 2021] Instance-level Image Retrieval using Reranking Transformers