SIGIR'22 paper: Axiomatically Regularized Pre-training for Ad hoc Search

Overview

img

THUIR License made-with-python code-size

Introduction

This codebase contains source-code of the Python-based implementation (ARES) of our SIGIR 2022 paper.

Requirements

  • python 3.7
  • torch==1.9.0
  • transformers==4.9.2
  • tqdm, nltk, numpy, boto3
  • trec_eval for evaluation on TREC DL 2019
  • anserini for generating "RANK" axiom scores

Why this repo?

In this repo, you can pre-train ARESsimple and TransformerICT models, and fine-tune all pre-trained models with the same architecture as BERT. The papers are listed as follows:

You can download the pre-trained ARES checkpoint ARESsimple from Google drive and extract it.

Pre-training Data

Download data

Download the MS MARCO corpus from the official website.
Download the ADORE+STAR Top100 Candidates files from this repo.

Pre-process data

To save memory, we store most files using the numpy memmap or jsonl format in the ./preprocess directory.

Document files:

  • doc_token_ids.memmap: each line is the token ids for a document
  • docid2idx.json: {docid: memmap_line_id}

Query files:

  • queries.doctrain.jsonl: MS MARCO training queries {"id" qid, "ids": token_ids} for each line
  • queries.docdev.jsonl: MS MARCO validating queries {"id" qid, "ids": token_ids} for each line
  • queries.dl2019.jsonl: TREC DL 2019 queries {"id" qid, "ids": token_ids} for each line

Human label files:

  • msmarco-doctrain-qrels.tsv: qid 0 docid 1 for training set
  • dev-qrels.txt: qid relevant_docid for validating set
  • 2019qrels-docs.txt: qid relevant_docid for TREC DL 2019 set

Top 100 candidate files:

  • train.rank.tsv, dev.rank.tsv, test.rank.tsv: qid docid rank for each line

Pseudo queries and axiomatic features:

  • doc2qs.jsonl: {"docid": docid, "queries": [qids]} for each line
  • sample_qs_token_ids.memmap: each line is the token ids for a pseudo query
  • sample_qid2id.json: {qid: memmap_line_id}
  • axiom.memmap: axiom can be one of the ['rank', 'prox-1', 'prox-2', 'rep-ql', 'rep-tfidf', 'reg', 'stm-1', 'stm-2', 'stm-3'], each line is an axiomatic score for a query

Quick Start

Note that to accelerate the training process, we adopt the parallel training technique. The scripts for pre-training and fine-tuning are as follow:

Pre-training

export BERT_DIR=/path/to/bert-base/
export XGB_DIR=/path/to/xgboost.model

cd pretrain

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5 NCCL_BLOCKING_WAIT=1 \
python  -m torch.distributed.launch --nproc_per_node=6 --nnodes=1 train.py \
        --model_type ARES \
        --PRE_TRAINED_MODEL_NAME BERT_DIR \
        --gpu_num 6 --world_size 6 \
        --MLM --axiom REP RANK REG PROX STM \
        --clf_model XGB_DIR

Here model type can be ARES or ICT.

Zero-shot evaluation (based on AS top100)

export MODEL_DIR=/path/to/ares-simple/
export CKPT_NAME=ares.ckpt

cd finetune

CUDA_VISIBLE_DEVICES=0 python train.py \
        --test \
        --PRE_TRAINED_MODEL_NAME MODEL_DIR \
        --model_type ARES \
        --model_name ARES_simple \
        --load_ckpt \
        --model_path CKPT_NAME

You can get:

#####################
<----- MS Dev ----->
MRR @10: 0.2991
MRR @100: 0.3130
QueriesRanked: 5193
#####################

on MS MARCO dev set and:

#############################
<--------- DL 2019 --------->
QueriesRanked: 43
nDCG @10: 0.5955
nDCG @100: 0.4863
#############################

on DL 2019 set.

Fine-tuning

export MODEL_DIR=/path/to/ares-simple/

cd finetune

CUDA_VISIBLE_DEVICES=0,1,2,3 NCCL_BLOCKING_WAIT=1 \
python -m torch.distributed.launch --nproc_per_node=4 --nnodes=1 train.py \
        --model_type ARES \
        --distributed_train \
        --PRE_TRAINED_MODEL_NAME MODEL_DIR \
        --gpu_num 4 --world_size 4 \
        --model_name ARES_simple

Visualization

export MODEL_DIR=/path/to/ares-simple/
export SAVE_DIR=/path/to/output/
export CKPT_NAME=ares.ckpt

cd visualization

CUDA_VISIBLE_DEVICES=0 python visual.py \
    --PRE_TRAINED_MODEL_NAME MODEL_DIR \
    --model_name ARES_simple \
    --visual_q_num 1 \
    --visual_d_num 5 \
    --save_path SAVE_DIR \
    --model_path CKPT_NAME

Results

Zero-shot performance:

Model Name MS MARCO [email protected] MS MARCO [email protected] DL [email protected] DL [email protected] COVID EQ
BM25 0.2962 0.3107 0.5776 0.4795 0.4857 0.6690
BERT 0.1820 0.2012 0.4059 0.4198 0.4314 0.6055
PROPwiki 0.2429 0.2596 0.5088 0.4525 0.4857 0.5991
PROPmarco 0.2763 0.2914 0.5317 0.4623 0.4829 0.6454
ARESstrict 0.2630 0.2785 0.4942 0.4504 0.4786 0.6923
AREShard 0.2627 0.2780 0.5189 0.4613 0.4943 0.6822
ARESsimple 0.2991 0.3130 0.5955 0.4863 0.4957 0.6916

Few-shot performance: img

Visualization (attribution values have been normalized within a document): img

Citation

If you find our work useful, please do not save your star and cite our work:

@inproceedings{chen2022axiomatically,
  title={Axiomatically Regularized Pre-training for Ad hoc Search},
  author={Chen, Jia and Liu, Yiqun and Fang, Yan and Mao, Jiaxin and Fang, Hui and Yang, Shenghao and Xie, Xiaohui and Zhang, Min and Ma, Shaoping},
  booktitle={Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval},
  year={2022}
}

Notice

  • Please make sure that all the pre-trained model parameters have been loaded correctly, or the zero-shot and the fine-tuning performance will be greatly impacted.
  • We welcome anyone who would like to contribute to this repo. 🤗
  • If you have any other questions, please feel free to contact me via [email protected] or open an issue.
  • Code for data preprocessing will come soon. Please stay tuned~
Owner
Jia Chen
My life is a beauty. 🦋
Jia Chen
Line as a Visual Sentence: Context-aware Line Descriptor for Visual Localization

Line as a Visual Sentence with LineTR This repository contains the inference code, pretrained model, and demo scripts of the following paper. It suppo

SungHo Yoon 158 Dec 27, 2022
BeautyNet is an AI powered model which can tell you whether you're beautiful or not.

BeautyNet BeautyNet is an AI powered model which can tell you whether you're beautiful or not. Download Dataset from here:https://www.kaggle.com/gpios

Ansh Gupta 0 May 06, 2022
PyTorch implementation of convolutional neural networks-based text-to-speech synthesis models

Deepvoice3_pytorch PyTorch implementation of convolutional networks-based text-to-speech synthesis models: arXiv:1710.07654: Deep Voice 3: Scaling Tex

Ryuichi Yamamoto 1.8k Dec 30, 2022
KLUE-baseline contains the baseline code for the Korean Language Understanding Evaluation (KLUE) benchmark.

KLUE Baseline Korean(한국어) KLUE-baseline contains the baseline code for the Korean Language Understanding Evaluation (KLUE) benchmark. See our paper fo

74 Dec 13, 2022
This project deals with a simplified version of a more general problem of Aspect Based Sentiment Analysis.

Aspect_Based_Sentiment_Extraction Created on: 5th Jan, 2022. This project deals with an important field of Natural Lnaguage Processing - Aspect Based

Naman Rastogi 4 Jan 01, 2023
Code release for NeX: Real-time View Synthesis with Neural Basis Expansion

NeX: Real-time View Synthesis with Neural Basis Expansion Project Page | Video | Paper | COLAB | Shiny Dataset We present NeX, a new approach to novel

537 Jan 05, 2023
Beta Distribution Guided Aspect-aware Graph for Aspect Category Sentiment Analysis with Affective Knowledge. Proceedings of EMNLP 2021

AAGCN-ACSA EMNLP 2021 Introduction This repository was used in our paper: Beta Distribution Guided Aspect-aware Graph for Aspect Category Sentiment An

Akuchi 36 Dec 18, 2022
🤗 The largest hub of ready-to-use NLP datasets for ML models with fast, easy-to-use and efficient data manipulation tools

🤗 The largest hub of ready-to-use NLP datasets for ML models with fast, easy-to-use and efficient data manipulation tools

Hugging Face 15k Jan 02, 2023
Kerberoast with ACL abuse capabilities

targetedKerberoast targetedKerberoast is a Python script that can, like many others (e.g. GetUserSPNs.py), print "kerberoast" hashes for user accounts

Shutdown 213 Dec 22, 2022
Simple bots or Simbots is a library designed to create simple bots using the power of python. This library utilises Intent, Entity, Relation and Context model to create bots .

Simple bots or Simbots is a library designed to create simple chat bots using the power of python. This library utilises Intent, Entity, Relation and

14 Dec 15, 2021
A model library for exploring state-of-the-art deep learning topologies and techniques for optimizing Natural Language Processing neural networks

A Deep Learning NLP/NLU library by Intel® AI Lab Overview | Models | Installation | Examples | Documentation | Tutorials | Contributing NLP Architect

Intel Labs 2.9k Jan 02, 2023
Contains links to publicly available datasets for modeling health outcomes using speech and language.

speech-nlp-datasets Contains links to publicly available datasets for modeling various health outcomes using speech and language. Speech-based Corpora

Tuka Alhanai 77 Dec 07, 2022
A programming language with logic of Python, and syntax of all languages.

Pytov The idea was to take all well known syntaxes, and combine them into one programming language with many posabilities. Installation Install using

Yuval Rosen 14 Dec 07, 2022
Deep Learning Topics with Computer Vision & NLP

Deep learning Udacity Course Deep Learning Topics with Computer Vision & NLP for the AWS Machine Learning Engineer Nanodegree Program Tasks are mostly

Simona Mircheva 1 Jan 20, 2022
spaCy-wrap: For Wrapping fine-tuned transformers in spaCy pipelines

spaCy-wrap: For Wrapping fine-tuned transformers in spaCy pipelines spaCy-wrap is minimal library intended for wrapping fine-tuned transformers from t

Kenneth Enevoldsen 32 Dec 29, 2022
skweak: A software toolkit for weak supervision applied to NLP tasks

Labelled data remains a scarce resource in many practical NLP scenarios. This is especially the case when working with resource-poor languages (or text domains), or when using task-specific labels wi

Norsk Regnesentral (Norwegian Computing Center) 850 Dec 28, 2022
Deep learning for NLP crash course at ABBYY.

Deep NLP Course at ABBYY Deep learning for NLP crash course at ABBYY. Suggested textbook: Neural Network Methods in Natural Language Processing by Yoa

Dan Anastasyev 597 Dec 18, 2022
Comprehensive-E2E-TTS - PyTorch Implementation

A Non-Autoregressive End-to-End Text-to-Speech (text-to-wav), supporting a family of SOTA unsupervised duration modelings. This project grows with the research community, aiming to achieve the ultima

Keon Lee 114 Nov 13, 2022
An implementation of the Pay Attention when Required transformer

Pay Attention when Required (PAR) Transformer-XL An implementation of the Pay Attention when Required transformer from the paper: https://arxiv.org/pd

7 Aug 11, 2022
Code for ACL 2020 paper "Rigid Formats Controlled Text Generation"

SongNet SongNet: SongCi + Song (Lyrics) + Sonnet + etc. @inproceedings{li-etal-2020-rigid, title = "Rigid Formats Controlled Text Generation",

Piji Li 212 Dec 17, 2022