source code for paper: WhiteningBERT: An Easy Unsupervised Sentence Embedding Approach.

Last update: Dec 17, 2022

WhiteningBERT

Source code and data for the paper WhiteningBERT: An Easy Unsupervised Sentence Embedding Approach.

Preparation

git clone https://github.com/Jun-jie-Huang/WhiteningBERT.git
cd WhiteningBERT
pip install -r requirements.txt
cd examples/evaluation

Usage

Datasets

We use seven STS datasets: STSBenchmark, SICK-Relatedness, STS12, STS13, STS14, STS15, and STS16.

The processed data can be found in ./examples/datasets/.
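
STS datasets pair sentences with human similarity ratings, and an embedding method is usually scored by how well the cosine similarity of each sentence pair ranks against those ratings (Spearman correlation). The sketch below illustrates that scoring on made-up vectors. The function names and toy data are hypothetical, the rank computation ignores ties, and nothing here is taken from the evaluation scripts in this repository.

```python
import numpy as np

def cosine_similarity(a, b):
    """Row-wise cosine similarity between two (n, dim) matrices."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return np.sum(a * b, axis=1)

def spearman(x, y):
    """Spearman correlation via Pearson correlation of ranks (no tie handling)."""
    rank_x = np.argsort(np.argsort(x)).astype(float)
    rank_y = np.argsort(np.argsort(y)).astype(float)
    return float(np.corrcoef(rank_x, rank_y)[0, 1])

# Hypothetical embeddings for four sentence pairs and their gold ratings.
emb_a = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [1.0, 0.0]])
emb_b = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [-1.0, 0.0]])
gold = np.array([5.0, 3.2, 1.5, 0.0])

sims = cosine_similarity(emb_a, emb_b)
rho = spearman(sims, gold)
print(round(rho, 6))  # 1.0: the predicted ordering matches the gold ordering
```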

Run

To run a quick demo:

python evaluation_stsbenchmark.py \
			--pooling aver \
			--layer_num 1,12 \
			--whitening \
			--encoder_name bert-base-cased

Set --pooling to cls to use the [CLS] token embedding, or to aver to average all token embeddings. Use --layer_num to choose which layers to combine, given as a comma-separated list (for example, 1,12).
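
To make these options concrete, here is a minimal NumPy sketch of the three steps the demo flags name: combining layers, pooling tokens, and whitening. It assumes the selected layers are averaged element-wise, that index 0 is the embedding output (as in Hugging Face hidden_states), and that whitening is plain PCA whitening over the set of sentence vectors. The function names and toy data are hypothetical, so treat this as an illustration rather than the repository's implementation.

```python
import numpy as np

def sentence_embedding(hidden_states, attention_mask, layers, pooling="aver"):
    """Combine the selected layers, then pool tokens into one vector.

    hidden_states: (num_layers + 1, seq_len, dim) array for one sentence
    attention_mask: (seq_len,) array, 1 for real tokens and 0 for padding
    layers: indices into the first axis (assumption: 0 = embedding output)
    """
    combined = hidden_states[list(layers)].mean(axis=0)  # (seq_len, dim)
    if pooling == "cls":
        return combined[0]  # the first position holds the [CLS] token
    if pooling == "aver":
        mask = attention_mask[:, None].astype(combined.dtype)
        return (combined * mask).sum(axis=0) / mask.sum()  # skip padding
    raise ValueError(f"unknown pooling: {pooling}")

def whiten(embeddings, eps=1e-9):
    """PCA-whiten a (num_sentences, dim) matrix: zero mean, identity covariance."""
    mu = embeddings.mean(axis=0, keepdims=True)
    centered = embeddings - mu
    cov = centered.T @ centered / embeddings.shape[0]
    u, s, _ = np.linalg.svd(cov)
    w = u @ np.diag(1.0 / np.sqrt(s + eps))  # rotate, then rescale each axis
    return centered @ w

# Toy sentence: 13 "layers", 3 token positions (the last is padding), dim 2.
hidden = np.zeros((13, 3, 2))
hidden[1] = [[1, 1], [3, 3], [100, 100]]
hidden[12] = [[3, 3], [5, 5], [100, 100]]
mask = np.array([1, 1, 0])

aver_vec = sentence_embedding(hidden, mask, layers=(1, 12), pooling="aver")
cls_vec = sentence_embedding(hidden, mask, layers=(1, 12), pooling="cls")
print(aver_vec, cls_vec)  # [3. 3.] [2. 2.]

# Whitening works on a whole set of sentence vectors at once.
rng = np.random.default_rng(0)
mixing = np.array([[2.0, 0.5, 0.0], [0.0, 1.0, 0.3], [0.0, 0.0, 1.5]])
sentences = rng.normal(size=(200, 3)) @ mixing  # correlated toy vectors
whitened = whiten(sentences)
cov_after = whitened.T @ whitened / whitened.shape[0]
print(np.allclose(cov_after, np.eye(3), atol=1e-6))  # True
```

Note that whitening needs at least as many sentences as embedding dimensions for the covariance to be full rank; the eps term only guards against division by zero.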

To enumerate all possible combinations of two layers and evaluate each one in turn:

python evaluation_stsbenchmark_layer2.py \
			--pooling aver \
			--whitening \
			--encoder_name bert-base-cased

To enumerate all possible combinations of N layers:

python evaluation_stsbenchmark_layerN.py \
			--pooling aver \
			--whitening \
			--encoder_name bert-base-cased \
			--combination_num 4
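
For a rough sense of how large these sweeps get, the sketch below enumerates layer subsets with itertools.combinations. It assumes 12 layers indexed 1 to 12 (as for a base-size model) and formats each subset like a --layer_num value. The helper name is hypothetical, and the scripts may index or include layers differently (for example, an embedding layer 0).

```python
from itertools import combinations

def layer_combinations(num_layers, k):
    """Yield every k-subset of layers 1..num_layers as a comma-separated string."""
    for combo in combinations(range(1, num_layers + 1), k):
        yield ",".join(str(i) for i in combo)

pairs = list(layer_combinations(12, 2))
quads = list(layer_combinations(12, 4))

print(len(pairs), pairs[:3])  # 66 ['1,2', '1,3', '1,4']
print(len(quads))             # 495
```

Since each combination is evaluated separately, the run time grows with the number of subsets, which rises quickly with the combination size.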

You can also save the sentence embeddings:

python evaluation_stsbenchmark_save_embed.py \
			--pooling aver \
			--layer_num 1,12 \
			--whitening \
			--encoder_name bert-base-cased \
			--summary_dir ./save_embeddings

Pretrained language models (PLMs) you can select with --encoder_name include:

bert-base-uncased, bert-large-uncased
roberta-base, roberta-large
bert-base-multilingual-uncased
sentence-transformers/LaBSE
albert-base-v1, albert-large-v1
microsoft/layoutlm-base-uncased, microsoft/layoutlm-large-uncased
SpanBERT/spanbert-base-cased, SpanBERT/spanbert-large-cased
microsoft/deberta-base, microsoft/deberta-large
google/electra-base-discriminator
google/mobilebert-uncased
microsoft/DialogRPT-human-vs-rand
distilbert-base-uncased
......

Acknowledgements

The code is adapted from the repositories of the EMNLP 2019 paper Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks and the EMNLP 2020 paper An Unsupervised Sentence Embedding Method by Mutual Information Maximization.
