NAACL2021 - COIL Contextualized Lexical Retriever

Overview

COIL

Repo for our NAACL paper, COIL: Revisit Exact Lexical Match in Information Retrieval with Contextualized Inverted List. The code covers learning COIL models well as encoding and retrieving with COIL index.

The code was refactored from our original experiment version to use the huggingface Trainer interface for future compatibility.

Contextualized Exact Lexical Match

COIL systems are based on the idea of contextualized exact lexical match. It replaces term frequency based term matching in classical systems like BM25 with contextualized word representation similarities. It thereby gains the ability to model matching of context. Meanwhile COIL confines itself to comparing exact lexical matched tokens and therefore can retrieve efficiently with inverted list form data structure. Details can be found in our paper.

Dependencies

The code has been tested with,

pytorch==1.8.1
transformers==4.2.1
datasets==1.1.3

To use the retriever, you need in addition,

torch_scatter==2.0.6
faiss==1.7.0

Resource

MSMARCO Passage Ranking

Here we present two systems: one uses hard negatives (HN) and the other does not. COIL w/o HN is trained with BM25 negatives, and COIL w/ HN is trained in addition with hard negatives mined with another trained COIL.

Configuration MARCO DEV [email protected] TREC DL19 [email protected] TREC DL19 [email protected] Chekpoint MARCO Train Ranking MARCO Dev Ranking
COIL w/o HN 0.353 0.7285 0.7136 model-checkpoint.tar.gz train-ranking.tar.gz dev-ranking.tsv
COIL w/ HN 0.373 0.7453 0.7055 hn-checkpoint.tar.gz train-ranking.tar.gz dev-ranking.tsv
  • Right Click to Download.
  • The COIL w/o HN model was a rerun as we lost the original checkpoint. There's a slight difference in dev performance, about 0.5% and also some improvement on the DL2019 test.

Tokenized data and model checkpoint link

Hard negative data and model checkpoint link

more to be added soon

Usage

The following sections will work through how to use this code base to train and retrieve over the MSMARCO passage ranking data set.

Training

You can download the train file psg-train.tar.gz for BERT from our resource link. Alternatively, you can run pre-processing by yourself following the pre-processing instructions.

Extract the training set from the tar ball and run the following code to launch training for msmarco passage.

python run_marco.py \  
  --output_dir $OUTDIR \  
  --model_name_or_path bert-base-uncased \  
  --do_train \  
  --save_steps 4000 \  
  --train_dir /path/to/psg-train \  
  --q_max_len 16 \  
  --p_max_len 128 \  
  --fp16 \  
  --per_device_train_batch_size 8 \  
  --train_group_size 8 \  
  --cls_dim 768 \  
  --token_dim 32 \  
  --warmup_ratio 0.1 \  
  --learning_rate 5e-6 \  
  --num_train_epochs 5 \  
  --overwrite_output_dir \  
  --dataloader_num_workers 16 \  
  --no_sep \  
  --pooling max 

Encoding

After training, you can encode the corpus splits and queries.

You can download pre-processed data for BERT, corpus.tar.gz, queries.{dev, eval}.small.json here.

for i in $(seq -f "%02g" 0 99)  
do  
  mkdir ${ENCODE_OUT_DIR}/split${i}  
  python run_marco.py \  
    --output_dir $ENCODE_OUT_DIR \  
    --model_name_or_path $CKPT_DIR \  
    --tokenizer_name bert-base-uncased \  
    --cls_dim 768 \  
    --token_dim 32 \  
    --do_encode \  
    --no_sep \  
    --p_max_len 128 \  
    --pooling max \  
    --fp16 \  
    --per_device_eval_batch_size 128 \  
    --dataloader_num_workers 12 \  
    --encode_in_path ${TOKENIZED_DIR}/split${i} \  
    --encoded_save_path ${ENCODE_OUT_DIR}/split${i}
done

If on a cluster, the encoding loop can be paralellized. For example, say if you are on a SLURM cluster, use srun,

for i in $(seq -f "%02g" 0 99)  
do  
  mkdir ${ENCODE_OUT_DIR}/split${i}  
  srun --ntasks=1 -c4 --mem=16000 -t0 --gres=gpu:1 python run_marco.py \  
    --output_dir $ENCODE_OUT_DIR \  
    --model_name_or_path $CKPT_DIR \  
    --tokenizer_name bert-base-uncased \  
    --cls_dim 768 \  
    --token_dim 32 \  
    --do_encode \  
    --no_sep \  
    --p_max_len 128 \  
    --pooling max \  
    --fp16 \  
    --per_device_eval_batch_size 128 \  
    --dataloader_num_workers 12 \  
    --encode_in_path ${TOKENIZED_DIR}/split${i} \  
    --encoded_save_path ${ENCODE_OUT_DIR}/split${i}&
done

Then encode the queries,

python run_marco.py \  
  --output_dir $ENCODE_QRY_OUT_DIR \  
  --model_name_or_path $CKPT_DIR \  
  --tokenizer_name bert-base-uncased \  
  --cls_dim 768 \  
  --token_dim 32 \  
  --do_encode \  
  --p_max_len 16 \  
  --fp16 \  
  --no_sep \  
  --pooling max \  
  --per_device_eval_batch_size 128 \  
  --dataloader_num_workers 12 \  
  --encode_in_path $TOKENIZED_QRY_PATH \  
  --encoded_save_path $ENCODE_QRY_OUT_DIR

Note that here p_max_len always controls the maximum length of the encoded text, regardless of the input type.

Retrieval

To do retrieval, run the following steps,

(Note that there is no dependency in the for loop within each step, meaning that if you are on a cluster, you can distribute the jobs across nodes using srun or qsub.)

  1. build document index shards
for i in $(seq 0 9)  
do  
 python retriever/sharding.py \  
   --n_shards 10 \  
   --shard_id $i \  
   --dir $ENCODE_OUT_DIR \  
   --save_to $INDEX_DIR \  
   --use_torch
done  
  1. reformat encoded query
python retriever/format_query.py \  
  --dir $ENCODE_QRY_OUT_DIR \  
  --save_to $QUERY_DIR \  
  --as_torch
  1. retrieve from each shard
for i in $(seq -f "%02g" 0 9)  
do  
  python retriever/retriever-compat.py \  
      --query $QUERY_DIR \  
      --doc_shard $INDEX_DIR/shard_${i} \  
      --top 1000 \  
      --save_to ${SCORE_DIR}/intermediate/shard_${i}.pt
done 
  1. merge scores from all shards
python retriever/merger.py \  
  --score_dir ${SCORE_DIR}/intermediate/ \  
  --query_lookup  ${QUERY_DIR}/cls_ex_ids.pt \  
  --depth 1000 \  
  --save_ranking_to ${SCORE_DIR}/rank.txt

python data_helpers/msmarco-passage/score_to_marco.py \  
  --score_file ${SCORE_DIR}/rank.txt

Note that this compat(ible) version of retriever differs from our internal retriever. It relies on torch_scatter package for scatter operation so that we can have a pure python code that can easily work across platforms. We do notice that on our system torch_scatter does not scale very well with number of cores. We may in the future release another faster version of retriever that requires some compiling work.

Data Format

For both training and encoding, the core code expects pre-tokenized data.

Training Data

Training data is grouped by query into one or several json files where each line has a query, its corresponding positives and negatives.

{
    "qry": {
        "qid": str,
        "query": List[int],
    },
    "pos": List[
        {
            "pid": str,
            "passage": List[int],
        }
    ],
    "neg": List[
        {
            "pid": str,
            "passage": List[int]
        }
    ]
}

Encoding Data

Encoding data is also formatted into one or several json files. Each line corresponds to an entry item.

{"pid": str, "psg": List[int]}

Note that for code simplicity, we share this format for query/passage/document encoding.

Owner
Luyu Gao
NLP Research [email protected], CMU
Luyu Gao
NeurIPS 2021, "Fine Samples for Learning with Noisy Labels"

[Official] FINE Samples for Learning with Noisy Labels This repository is the official implementation of "FINE Samples for Learning with Noisy Labels"

mythbuster 27 Dec 23, 2022
PyTorch/TorchScript compiler for NVIDIA GPUs using TensorRT

PyTorch/TorchScript compiler for NVIDIA GPUs using TensorRT

NVIDIA Corporation 1.8k Dec 30, 2022
本项目是一个带有前端界面的垃圾分类项目,加载了训练好的模型参数,模型为efficientnetb4,暂时为40分类问题。

说明 本项目是一个带有前端界面的垃圾分类项目,加载了训练好的模型参数,模型为efficientnetb4,暂时为40分类问题。 python依赖 tf2.3 、cv2、numpy、pyqt5 pyqt5安装 pip install PyQt5 pip install PyQt5-tools 使用 程

4 May 04, 2022
VQGAN+CLIP Colab Notebook with user-friendly interface.

VQGAN+CLIP and other image generation system VQGAN+CLIP Colab Notebook with user-friendly interface. Latest Notebook: Mse regulized zquantize Notebook

Justin John 227 Jan 05, 2023
Monocular Depth Estimation - Weighted-average prediction from multiple pre-trained depth estimation models

merged_depth runs (1) AdaBins, (2) DiverseDepth, (3) MiDaS, (4) SGDepth, and (5) Monodepth2, and calculates a weighted-average per-pixel absolute dept

Pranav 39 Nov 21, 2022
PyTorch implementation for the ICLR 2020 paper "Understanding the Limitations of Variational Mutual Information Estimators"

Smoothed Mutual Information ``Lower Bound'' Estimator PyTorch implementation for the ICLR 2020 paper Understanding the Limitations of Variational Mutu

50 Nov 09, 2022
"SOLQ: Segmenting Objects by Learning Queries", SOLQ is an end-to-end instance segmentation framework with Transformer.

SOLQ: Segmenting Objects by Learning Queries This repository is an official implementation of the paper SOLQ: Segmenting Objects by Learning Queries.

MEGVII Research 179 Jan 02, 2023
Using machine learning to predict undergrad college admissions.

College-Prediction Project- Overview: Many have tried, many have failed. Few trailblazers are ambitious enought to chase acceptance into the top 15 un

John H Klinges 1 Jan 05, 2022
Implementation of FitVid video prediction model in JAX/Flax.

FitVid Video Prediction Model Implementation of FitVid video prediction model in JAX/Flax. If you find this code useful, please cite it in your paper:

Google Research 62 Nov 25, 2022
Data and Code for ACL 2021 Paper "Inter-GPS: Interpretable Geometry Problem Solving with Formal Language and Symbolic Reasoning"

Introduction Code and data for ACL 2021 Paper "Inter-GPS: Interpretable Geometry Problem Solving with Formal Language and Symbolic Reasoning". We cons

Pan Lu 81 Dec 27, 2022
PyTorch reimplementation of the Smooth ReLU activation function proposed in the paper "Real World Large Scale Recommendation Systems Reproducibility and Smooth Activations" [arXiv 2022].

Smooth ReLU in PyTorch Unofficial PyTorch reimplementation of the Smooth ReLU (SmeLU) activation function proposed in the paper Real World Large Scale

Christoph Reich 10 Jan 02, 2023
Cross-modal Retrieval using Transformer Encoder Reasoning Networks (TERN). With use of Metric Learning and FAISS for fast similarity search on GPU

Cross-modal Retrieval using Transformer Encoder Reasoning Networks This project reimplements the idea from "Transformer Reasoning Network for Image-Te

Minh-Khoi Pham 5 Nov 05, 2022
Dynamic hair modeling from monocular videos using deep neural networks

Dynamic Hair Modeling The source code of the networks for our paper "Dynamic hair modeling from monocular videos using deep neural networks" (SIGGRAPH

53 Oct 18, 2022
REGTR: End-to-end Point Cloud Correspondences with Transformers

REGTR: End-to-end Point Cloud Correspondences with Transformers This repository contains the source code for REGTR. REGTR utilizes multiple transforme

Zi Jian Yew 108 Dec 17, 2022
Code for 'Self-Guided and Cross-Guided Learning for Few-shot segmentation. (CVPR' 2021)'

SCL Introduction Code for 'Self-Guided and Cross-Guided Learning for Few-shot segmentation. (CVPR' 2021)' We evaluated our approach using two baseline

34 Oct 08, 2022
Exploring Simple 3D Multi-Object Tracking for Autonomous Driving (ICCV 2021)

Exploring Simple 3D Multi-Object Tracking for Autonomous Driving Chenxu Luo, Xiaodong Yang, Alan Yuille Exploring Simple 3D Multi-Object Tracking for

QCraft 141 Nov 21, 2022
Eff video representation - Efficient video representation through neural fields

Neural Residual Flow Fields for Efficient Video Representations 1. Download MPI

41 Jan 06, 2023
Learn about Spice.ai with in-depth samples

Samples Learn about Spice.ai with in-depth samples ServerOps - Learn when to run server maintainance during periods of low load Gardener - Intelligent

Spice.ai 16 Mar 23, 2022
CapsuleVOS: Semi-Supervised Video Object Segmentation Using Capsule Routing

CapsuleVOS This is the code for the ICCV 2019 paper CapsuleVOS: Semi-Supervised Video Object Segmentation Using Capsule Routing. Arxiv Link: https://a

53 Oct 27, 2022
Optimizing DR with hard negatives and achieving SOTA first-stage retrieval performance on TREC DL Track (SIGIR 2021 Full Paper).

Optimizing Dense Retrieval Model Training with Hard Negatives Jingtao Zhan, Jiaxin Mao, Yiqun Liu, Jiafeng Guo, Min Zhang, Shaoping Ma This repo provi

Jingtao Zhan 99 Dec 27, 2022