BERT Attention Analysis

Overview

BERT Attention Analysis

This repository contains code for What Does BERT Look At? An Analysis of BERT's Attention. It includes code for getting attention maps from BERT and writing them to disk, analyzing BERT's attention in general (sections 3 and 6 of the paper), and comparing its attention to dependency syntax (sections 4.2 and 5). We will add the code for the coreference resolution analysis (section 4.3 of the paper) soon!

Requirements

For extracting attention maps from text:

Additional requirements for the attention analysis:

Attention Analysis

Syntax_Analysis.ipynb and General_Analysis.ipynb contain code for analyzing BERT's attention, including reproducing the figures and tables in the paper.

You can download the data needed to run the notebooks (including BERT attention maps on Wikipedia and the Penn Treebank) from here. However, note that the Penn Treebank annotations are not freely available, so the Penn Treebank data only includes dummy labels. If you want to run the analysis on your own data, you can use the scripts described below to extract BERT attention maps.

Extracting BERT Attention Maps

We provide a script for running BERT over text and writing the resulting attention maps to disk. The input data should be a JSON file containing a list of dicts, each one corresponding to a single example to be passed in to BERT. Each dict must contain exactly one of the following fields:

  • "text": A string.
  • "words": A list of strings. Needed if you want word-level rather than token-level attention.
  • "tokens": A list of strings corresponding to BERT wordpiece tokenization.

If the present field is "tokens," the script expects [CLS]/[SEP] tokens to be already added; otherwise it adds these tokens to the beginning/end of the text automatically. Note that if an example is longer than max_sequence_length tokens after BERT wordpiece tokenization, attention maps will not be extracted for it. Attention extraction adds two additional fields to each dict:

  • "attns": A numpy array of size [num_layers, heads_per_layer, sequence_length, sequence_length] containing attention weights.
  • "tokens": If "tokens" was not already provided for the example, the BERT-wordpiece-tokenized text (list of strings).

Other fields already in the feature dicts will be preserved. For example if each dict has a tags key containing POS tags, they will stay in the data after attention extraction so they can be used when analyzing the data.

Attention extraction is run with

python extract_attention.py --preprocessed_data_file 
   
     --bert_dir 
    

    
   

The following optional arguments can also be added:

  • --max_sequence_length: Maximum input sequence length after tokenization (default is 128).
  • --batch_size: Batch size when running BERT over examples (default is 16).
  • --debug: Use a tiny BERT model for fast debugging.
  • --cased: Do not lowercase the input text.
  • --word_level: Compute word-level instead of token-level attention (see Section 4.1 of the paper).

The feature dicts with added attention maps (numpy arrays with shape [n_layers, n_heads_per_layer, n_tokens, n_tokens]) are written to _attn.pkl

Pre-processing Scripts

We include two pre-processing scripts for going from a raw data file to JSON that can be supplied to attention_extractor.py.

preprocess_unlabeled.py does BERT-pre-training-style preprocessing for unlabeled text (i.e, taking two consecutive text spans, truncating them so they are at most max_sequence_length tokens, and adding [CLS]/[SEP] tokens). Each line of the input data file should be one sentence. Documents should be separated by empty lines. Example usage:

python preprocess_unlabeled.py --data-file $ATTN_DATA_DIR/unlabeled.txt --bert-dir $ATTN_DATA_DIR/uncased_L-12_H-768_A-12

will create the file $ATTN_DATA_DIR/unlabeled.json containing pre-processed data. After pre-processing, you can run extract_attention.py to get attention maps, e.g.,

python extract_attention.py --preprocessed-data-file $ATTN_DATA_DIR/unlabeled.json --bert-dir $ATTN_DATA_DIR/uncased_L-12_H-768_A-12

preprocess_depparse.py pre-processes dependency parsing data. Dependency parsing data should consist of two files train.txt and dev.txt under a common directory. Each line in the files should contain a word followed by a space followed by - (e.g., 0-root). Examples should be separated by empty lines. Example usage:

python preprocess_depparse.py --data-dir $ATTN_DATA_DIR/depparse

After pre-processing, you can run extract_attention.py to get attention maps, e.g.,

python extract_attention.py --preprocessed-data-file $ATTN_DATA_DIR/depparse/dev.json --bert-dir $ATTN_DATA_DIR/uncased_L-12_H-768_A-12 --word_level

Computing Distances Between Attention Heads

head_distances.py computes the average Jenson-Shannon divergence between the attention weights of all pairs of attention heads and writes the results to disk as a numpy array of shape [n_heads, n_heads]. These distances can be used to cluster BERT's attention heads (see Section 6 and Figure 6 of the paper; code for doing this clustering is in General_Analysis.ipynb). Example usage (requires that attention maps have already been extracted):

python head_distances.py --attn-data-file $ATTN_DATA_DIR/unlabeled_attn.pkl --outfile $ATTN_DATA_DIR/head_distances.pkl

Citation

If you find the code or data helpful, please cite the original paper:

@inproceedings{clark2019what,
  title = {What Does BERT Look At? An Analysis of BERT's Attention},
  author = {Kevin Clark and Urvashi Khandelwal and Omer Levy and Christopher D. Manning},
  booktitle = {[email protected]},
  year = {2019}
}

Contact

Kevin Clark (@clarkkev).

Owner
Kevin Clark
Kevin Clark
TLA - Twitter Linguistic Analysis

TLA - Twitter Linguistic Analysis Tool for linguistic analysis of communities TLA is built using PyTorch, Transformers and several other State-of-the-

Tushar Sarkar 47 Aug 14, 2022
Crie tokens de autenticação íntegros e seguros com UToken.

UToken - Tokens seguros. UToken (ou Unhandleable Token) é uma bilioteca criada para ser utilizada na geração de tokens seguros e íntegros, ou seja, nã

Jaedson Silva 0 Nov 29, 2022
Athena is an open-source implementation of end-to-end speech processing engine.

Athena is an open-source implementation of end-to-end speech processing engine. Our vision is to empower both industrial application and academic research on end-to-end models for speech processing.

Ke Technologies 34 Sep 08, 2022
Simple NLP based project without any use of AI

Simple NLP based project without any use of AI

Shripad Rao 1 Apr 26, 2022
Uses Google's gTTS module to easily create robo text readin' on command.

Tool to convert text to speech, creating files for later use. TTRS uses Google's gTTS module to easily create robo text readin' on command.

0 Jun 20, 2021
This is the code for the EMNLP 2021 paper AEDA: An Easier Data Augmentation Technique for Text Classification

The baseline code is for EDA: Easy Data Augmentation techniques for boosting performance on text classification tasks

Akbar Karimi 81 Dec 09, 2022
This is a project built for FALLABOUT2021 event under SRMMIC, This project deals with NLP poetry generation.

FALLABOUT-SRMMIC 21 POETRY-GENERATION HINGLISH DESCRIPTION We have developed a NLP(natural language processing) model which automatically generates a

7 Sep 28, 2021
Pytorch-Named-Entity-Recognition-with-BERT

BERT NER Use google BERT to do CoNLL-2003 NER ! Train model using Python and Inference using C++ ALBERT-TF2.0 BERT-NER-TENSORFLOW-2.0 BERT-SQuAD Requi

Kamal Raj 1.1k Dec 25, 2022
Pytorch NLP library based on FastAI

Quick NLP Quick NLP is a deep learning nlp library inspired by the fast.ai library It follows the same api as fastai and extends it allowing for quick

Agis pof 283 Nov 21, 2022
Simple, Fast, Powerful and Easily extensible python package for extracting patterns from text, with over than 60 predefined Regular Expressions.

patterns-finder Simple, Fast, Powerful and Easily extensible python package for extracting patterns from text, with over than 60 predefined Regular Ex

22 Dec 19, 2022
Get list of common stop words in various languages in Python

Python Stop Words Table of contents Overview Available languages Installation Basic usage Python compatibility Overview Get list of common stop words

Alireza Savand 142 Dec 21, 2022
Reproduction process of BERT on SST2 dataset

BERT-SST2-Prod Reproduction process of BERT on SST2 dataset 安装说明 下载代码库 git clone https://github.com/JunnYu/BERT-SST2-Prod 进入文件夹,安装requirements pip ins

yujun 1 Nov 18, 2021
This github repo is for Neurips 2021 paper, NORESQA A Framework for Speech Quality Assessment using Non-Matching References.

NORESQA: Speech Quality Assessment using Non-Matching References This is a Pytorch implementation for using NORESQA. It contains minimal code to predi

Meta Research 36 Dec 08, 2022
A PyTorch implementation of the WaveGlow: A Flow-based Generative Network for Speech Synthesis

WaveGlow A PyTorch implementation of the WaveGlow: A Flow-based Generative Network for Speech Synthesis Quick Start: Install requirements: pip install

Yuchao Zhang 204 Jul 14, 2022
An A-SOUL Text Generator Based on CPM-Distill.

ASOUL-Generator-Backend 本项目为 https://asoul.infedg.xyz/ 的后端。 模型为基于 CPM-Distill 的 transformers 转化版本 CPM-Generate-distill 训练而成。

infinityedge 46 Dec 11, 2022
This repository describes our reproducible framework for assessing self-supervised representation learning from speech

LeBenchmark: a reproducible framework for assessing SSL from speech Self-Supervised Learning (SSL) using huge unlabeled data has been successfully exp

49 Aug 24, 2022
A retro text-to-speech bot for Discord

hawking A retro text-to-speech bot for Discord, designed to work with all of the stuff you might've seen in Moonbase Alpha, using the existing command

Nick Schorr 23 Dec 25, 2022
This is Assignment1 code for the Web Data Processing System.

This is a Python program to Entity Linking by processing WARC files. We recognize entities from web pages and link them to a Knowledge Base(Wikidata).

3 Dec 04, 2022
Pretrained Japanese BERT models

Pretrained Japanese BERT models This is a repository of pretrained Japanese BERT models. The models are available in Transformers by Hugging Face. Mod

Inui Laboratory 387 Dec 30, 2022
SummerTime - Text Summarization Toolkit for Non-experts

A library to help users choose appropriate summarization tools based on their specific tasks or needs. Includes models, evaluation metrics, and datasets.

Yale-LILY 213 Jan 04, 2023