This repository contains the code for the paper 'PARM: Paragraph Aggregation Retrieval Model for Dense Document-to-Document Retrieval' published at ECIR'22.


Paragraph Aggregation Retrieval Model (PARM) for Dense Document-to-Document Retrieval

This repository contains the code for the paper PARM: A Paragraph Aggregation Retrieval Model for Dense Document-to-Document Retrieval and is partly based on the DPR GitHub repository. PARM is a Paragraph Aggregation Retrieval Model for dense document-to-document retrieval tasks. It liberates dense passage retrieval models from their limited input length by performing retrieval at the paragraph level.

We focus on the task of legal case retrieval: we train and evaluate our models on the COLIEE 2021 data and additionally evaluate them on the CaseLaw collection.

The dense retrieval models are trained on the COLIEE data and can be found here. To train the dense retrieval models we use the DPR GitHub repository.

PARM Workflow

If you use our models or code, please cite our work:

@inproceedings{althammer2022parm,
      title={Paragraph Aggregation Retrieval Model (PARM) for Dense Document-to-Document Retrieval}, 
      author={Althammer, Sophia and Hofstätter, Sebastian and Sertkan, Mete and Verberne, Suzan and Hanbury, Allan},
      year={2022},
      booktitle={Advances in Information Retrieval, 44th European Conference on IR Research, ECIR 2022},
}

Training the dense retrieval model

The dense retrieval models need to be trained, either on the paragraph-level data of COLIEE Task 2 alone or additionally on the document-level data of COLIEE Task 1.

  • ./DPR/train_dense_encoder.py: trains the dense bi-encoder (Step1)
python -m torch.distributed.launch --nproc_per_node=2 train_dense_encoder.py \
--max_grad_norm 2.0 \
--encoder_model_type hf_bert \
--checkpoint_file_name --insert path to pretrained encoder checkpoint here if available-- \
--model_file --insert path to pretrained checkpoint here if available-- \
--seed 12345 \
--sequence_length 256 \
--warmup_steps 1237 \
--batch_size 22 \
--do_lower_case \
--train_file --path to json train file-- \
--dev_file --path to json val file-- \
--output_dir --path to output directory-- \
--learning_rate 1e-05 \
--num_train_epochs 70 \
--dev_batch_size 22 \
--val_av_rank_start_epoch 60 \
--eval_per_epoch 1 \
--global_loss_buf_sz 250000
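
To make the training objective behind Step1 more concrete, here is a minimal NumPy sketch of the in-batch negatives loss that DPR-style bi-encoders are commonly trained with. It is an illustration under that assumption and is not taken from train_dense_encoder.py; the function name and the toy vectors are hypothetical.

```python
import numpy as np


def in_batch_negatives_loss(q_vecs: np.ndarray, p_vecs: np.ndarray) -> float:
    """Mean negative log-likelihood where passage i is the positive for query i.

    All other passages in the batch act as negatives for query i.
    """
    scores = q_vecs @ p_vecs.T  # (batch, batch) dot-product similarities
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    log_softmax = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_softmax)))


if __name__ == "__main__":
    rng = np.random.default_rng(12345)
    queries = rng.normal(size=(4, 8))
    # Hypothetical positives: each passage vector lies close to its query vector.
    passages = queries + 0.01 * rng.normal(size=(4, 8))
    print(in_batch_negatives_loss(queries, passages))
```

In such a setup a larger batch size supplies more negatives per query, which is one reason the batch size matters for this kind of training.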

Generate dense embeddings index with trained DPR model

  • ./DPR/generate_dense_embeddings.py: encodes the corpus into the dense index (Step2)
python generate_dense_embeddings.py \
--model_file --insert path to pretrained checkpoint here from Step1-- \
--pretrained_file --insert path to pretrained checkpoint here from Step1-- \
--ctx_file --insert path to tsv file with documents in the corpus-- \
--out_file --insert path to output index-- \
--batch_size 750
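
The corpus passed via --ctx_file is a tsv file. The sketch below shows one way such a file could be assembled from paragraphs, assuming the id / text / title column layout used by the original DPR code; please check the expected layout against the scripts in this repository. The function name and the "docid_index" paragraph id scheme are hypothetical.

```python
import csv
import io
from typing import Dict, List


def paragraphs_to_tsv(docs: Dict[str, List[str]]) -> str:
    """Serialize {doc_id: [paragraph, ...]} into tab-separated rows."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["id", "text", "title"])  # assumed header row
    for doc_id, paragraphs in docs.items():
        for idx, text in enumerate(paragraphs):
            # Tabs and newlines inside a paragraph would break the row layout.
            clean = " ".join(text.split())
            writer.writerow([f"{doc_id}_{idx}", clean, doc_id])
    return buffer.getvalue()


if __name__ == "__main__":
    print(paragraphs_to_tsv({"doc1": ["First paragraph.", "Second paragraph."]}))
```

Keeping the document id recoverable from each paragraph row is what later allows paragraph-level results to be aggregated per document.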

Search in the dense index

  • ./DPR/dense_retriever.py: searches the dense index for the top n-docs (Step3)
python dense_retriever.py \
--model_file --insert path to pretrained checkpoint here from Step1-- \
--ctx_file --insert path to tsv file with documents in the corpus-- \
--qa_file --insert path to csv file with the queries-- \
--encoded_ctx_file --path to the dense index (.pkl format) from Step2-- \
--out_file --path to .json output file for search results-- \
--n-docs 1000
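
Conceptually, the search in Step3 scores every paragraph in the index against a query paragraph and keeps the n best. The following sketch shows this with plain NumPy, assuming exhaustive inner-product scoring; it is a stand-in for illustration and not the index implementation used by dense_retriever.py. The function name and the toy vectors are hypothetical.

```python
from typing import List, Tuple

import numpy as np


def top_n_by_inner_product(
    query_vec: np.ndarray, index_vecs: np.ndarray, n: int
) -> List[Tuple[int, float]]:
    """Return (row index, score) pairs of the n highest-scoring index vectors."""
    scores = index_vecs @ query_vec  # one inner product per indexed paragraph
    n = min(n, scores.shape[0])  # the index may hold fewer than n vectors
    top = np.argsort(-scores, kind="stable")[:n]
    return [(int(i), float(scores[i])) for i in top]


if __name__ == "__main__":
    index = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
    print(top_n_by_inner_product(np.array([1.0, 0.5]), index, n=2))
```

Each query paragraph produces its own ranked list this way, so a query document with several paragraphs yields several lists that still have to be combined.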

Pool out dense vectors for the aggregation step

First you need to get the dense embeddings for the query paragraphs:

  • ./DPR/get_question_tensors.py: encodes the query paragraphs with the dense encoder checkpoint and stores the embeddings in the output file (Step4)
python get_question_tensors.py \
--model_file --insert path to pretrained checkpoint here from Step1-- \
--qa_file --insert path to csv file with the queries-- \
--out_file --path to output file for output index--

Once you have the dense embeddings of the paragraphs in the index and of the query paragraphs, you can run the vector-based aggregation step in PARM with VRRF (alternatively with Min, Max, Avg, Sum, VScores, or VRanks) and evaluate the aggregated results.

  • ./representation_aggregation.py: aggregates the run, stores and evaluates the aggregated run (Step5)
python representation_aggregation.py \
--encoded_ctx_file --path to the encoded index (.pkl format) from Step2-- \
--encoded_qa_file --path to the encoded queries (.pkl format) from Step4-- \
--output_top1000s --path to the top-n file (.json format) from Step3-- \
--label_file --path to the label file (.json format)-- \
--aggregation_mode --choose from vrrf/vscores/vranks/sum/max/min/avg-- \
--candidate_mode p_from_retrieved_list \
--output_dir --path to output directory-- \
--output_file_name --output file name--
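
As a rough illustration of the vrrf mode, the sketch below shows one possible reading of vector-based reciprocal rank fusion: the query paragraph vectors are summed into one query vector, each candidate document vector is a sum of its retrieved paragraph vectors weighted by 1 / (k + rank), and documents are scored by the dot product of the two. This is an assumption-laden sketch; the exact formula in representation_aggregation.py may differ. The function name, the argument layout, and k=60 (a common RRF default) are assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

import numpy as np


def vrrf_scores(
    query_vecs: np.ndarray,
    ranked_lists: List[List[str]],
    para_vecs: Dict[str, np.ndarray],
    para_to_doc: Dict[str, str],
    k: int = 60,
) -> List[Tuple[str, float]]:
    """Aggregate per-paragraph rankings into one document ranking."""
    q_agg = np.sum(query_vecs, axis=0)  # one vector for the whole query document
    doc_vecs = defaultdict(lambda: np.zeros(q_agg.shape, dtype=float))
    for ranking in ranked_lists:  # one ranked list per query paragraph
        for rank, pid in enumerate(ranking, start=1):
            weight = 1.0 / (k + rank)  # reciprocal rank fusion weight
            doc_vecs[para_to_doc[pid]] += weight * para_vecs[pid]
    scores = {doc: float(q_agg @ vec) for doc, vec in doc_vecs.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    paras = {"a1": np.array([1.0, 0.0]), "a2": np.array([1.0, 1.0]),
             "b1": np.array([0.0, 1.0])}
    owner = {"a1": "A", "a2": "A", "b1": "B"}
    lists = [["a1", "a2", "b1"], ["a2", "b1", "a1"]]
    print(vrrf_scores(np.array([[1.0, 0.0], [0.0, 1.0]]), lists, paras, owner))
```

The design idea is that a document benefits both from having many retrieved paragraphs and from those paragraphs being ranked highly for several query paragraphs.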

Preprocessing

Preprocess COLIEE Task 1 data for dense retrieval

  • ./preprocessing/preprocess_coliee_2021_task1.py: preprocesses the COLIEE Task 1 dataset by removing non-English text, non-informative summaries, tabs, etc.

Preprocess CaseLaw collection

  • ./preprocessing/caselaw_stat_corpus.py: preprocesses the CaseLaw collection

Preprocess data for training the dense retrieval model

In order to train the dense retrieval models, the data needs to be preprocessed. For training and retrieval we split up the documents into their paragraphs.

  • ./preprocessing/preprocess_finetune_data_dpr_task1.py: preprocesses the COLIEE Task 1 document-level labels for training the DPR model

  • ./preprocessing/preprocess_finetune_data_dpr.py: preprocesses the COLIEE Task 2 paragraph-level labels for training the DPR model
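
The splitting of documents into paragraphs mentioned above can be pictured with the following minimal sketch. It assumes that paragraphs are separated by blank lines, which may not match how the COLIEE or CaseLaw files are actually segmented by the preprocessing scripts; the function name is hypothetical.

```python
import re
from typing import List


def split_into_paragraphs(text: str) -> List[str]:
    """Split a document on blank lines and normalize whitespace per paragraph."""
    parts = re.split(r"\n\s*\n", text)
    # Collapse tabs and line breaks inside a paragraph and drop empty pieces.
    return [" ".join(part.split()) for part in parts if part.strip()]


if __name__ == "__main__":
    document = "First paragraph\ncontinues here.\n\nSecond paragraph."
    for number, paragraph in enumerate(split_into_paragraphs(document), start=1):
        print(number, paragraph)
```

Each resulting paragraph is then short enough to be encoded on its own, which is the property the paragraph-level retrieval relies on.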

Owner
Sophia Althammer
PhD student @TuVienna. Interested in IR and NLP. https://sophiaalthammer.github.io/ Currently working on the dossier project: https://dossier-project.eu/