This repository contains the code for the paper 'PARM: Paragraph Aggregation Retrieval Model for Dense Document-to-Document Retrieval' published at ECIR'22.

Overview

Paragraph Aggregation Retrieval Model (PARM) for Dense Document-to-Document Retrieval

This repository contains the code for the paper PARM: A Paragraph Aggregation Retrieval Model for Dense Document-to-Document Retrieval and is partly based on the DPR Github repository. PARM is a paragraph aggregation retrieval model for dense document-to-document retrieval tasks: it liberates dense passage retrieval models from their limited input length and performs retrieval on the paragraph level.

We focus on the task of legal case retrieval: we train and evaluate our models on the COLIEE 2021 data and additionally evaluate them on the CaseLaw collection.

The dense retrieval models are trained on the COLIEE data and can be found here. For training the dense retrieval model we utilize the DPR Github repository.

PARM Workflow

If you use our models or code, please cite our work:

@inproceedings{althammer2022parm,
      title={Paragraph Aggregation Retrieval Model (PARM) for Dense Document-to-Document Retrieval}, 
      author={Althammer, Sophia and Hofstätter, Sebastian and Sertkan, Mete and Verberne, Suzan and Hanbury, Allan},
      year={2022},
      booktitle={Advances in Information Retrieval, 44th European Conference on IR Research, ECIR 2022},
}

Training the dense retrieval model

The dense retrieval models need to be trained, either on the paragraph-level data of COLIEE Task 2 or additionally on the document-level data of COLIEE Task 1.

  • ./DPR/train_dense_encoder.py: trains the dense bi-encoder (Step 1)
python -m torch.distributed.launch --nproc_per_node=2 train_dense_encoder.py \
    --max_grad_norm 2.0 \
    --encoder_model_type hf_bert \
    --checkpoint_file_name --insert path to pretrained encoder checkpoint here if available-- \
    --model_file --insert path to pretrained checkpoint here if available-- \
    --seed 12345 \
    --sequence_length 256 \
    --warmup_steps 1237 \
    --batch_size 22 \
    --do_lower_case \
    --train_file --path to json train file-- \
    --dev_file --path to json val file-- \
    --output_dir --path to output directory-- \
    --learning_rate 1e-05 \
    --num_train_epochs 70 \
    --dev_batch_size 22 \
    --val_av_rank_start_epoch 60 \
    --eval_per_epoch 1 \
    --global_loss_buf_sz 250000
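
The --train_file and --dev_file follow the DPR bi-encoder JSON format. A minimal sketch of writing one training instance (the field names follow the DPR repository; the texts and the file name are placeholders, not the actual COLIEE data):

import json

# Minimal sketch of the DPR bi-encoder training format; field names follow
# the DPR repository, texts and file name are placeholders.
train_instances = [
    {
        "question": "text of one query paragraph",
        "answers": [],
        "positive_ctxs": [{"title": "", "text": "text of a relevant paragraph"}],
        "negative_ctxs": [],
        "hard_negative_ctxs": [{"title": "", "text": "text of a hard negative paragraph"}],
    }
]

with open("train_coliee.json", "w") as f:
    json.dump(train_instances, f, indent=2)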

Generate dense embeddings index with trained DPR model

  • ./DPR/generate_dense_embeddings.py: encodes the corpus into the dense index (Step 2)
python generate_dense_embeddings.py \
    --model_file --insert path to pretrained checkpoint here from Step 1-- \
    --pretrained_file --insert path to pretrained checkpoint here from Step 1-- \
    --ctx_file --insert path to tsv file with documents in the corpus-- \
    --out_file --insert path to output index-- \
    --batch_size 750
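
The --ctx_file uses the standard DPR corpus layout: a tab-separated file with the columns id, text and title. A minimal sketch for producing it (the paragraph ids and texts are placeholders):

import csv

# Minimal sketch of the DPR --ctx_file layout (columns: id, text, title);
# paragraph ids and texts here are placeholders.
paragraphs = [
    ("doc1_para1", "text of the first paragraph of document 1", ""),
    ("doc1_para2", "text of the second paragraph of document 1", ""),
]

with open("corpus.tsv", "w", newline="") as f:
    writer = csv.writer(f, delimiter="\t")
    writer.writerow(["id", "text", "title"])
    writer.writerows(paragraphs)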

Search in the dense index

  • ./DPR/dense_retriever.py: searches the dense index for the top n documents (Step 3)
python dense_retriever.py \
    --model_file --insert path to pretrained checkpoint here from Step 1-- \
    --ctx_file --insert path to tsv file with documents in the corpus-- \
    --qa_file --insert path to csv file with the queries-- \
    --encoded_ctx_file --path to the dense index (.pkl format) from Step 2-- \
    --out_file --path to .json output file for search results-- \
    --n-docs 1000
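
The output follows the DPR result format: one entry per query, each with a ranked "ctxs" list. A minimal sketch for inspecting the run (the file name is a placeholder):

import json

# Inspect the dense_retriever output; the structure follows the DPR
# repository (one entry per query with a ranked "ctxs" list).
with open("parm_run.json") as f:  # placeholder path for the Step 3 output
    results = json.load(f)

for entry in results[:3]:
    print(entry["question"][:80])
    for ctx in entry["ctxs"][:5]:  # top-5 retrieved paragraphs
        print("  ", ctx["id"], ctx["score"])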

Pool out dense vectors for the aggregation step

First you need to get the dense embeddings for the query paragraphs:

  • ./DPR/get_question_tensors.py: encodes the query paragraphs with the dense encoder checkpoint and stores the embeddings in the output file (Step 4)
python get_question_tensors.py \
    --model_file --insert path to pretrained checkpoint here from Step 1-- \
    --qa_file --insert path to csv file with the queries-- \
    --out_file --path to the output file for the query embeddings--
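
Assuming the output mirrors the index shards of Step 2, i.e. a pickled list of (id, embedding) pairs, loading it could look like this (the path is a placeholder):

import pickle

# Assumption: the file holds a pickled list of (question_id, embedding)
# pairs, mirroring the (id, vector) layout of the Step 2 index shards.
with open("question_tensors.pkl", "rb") as f:  # placeholder path
    question_embeddings = pickle.load(f)

qid, vector = question_embeddings[0]
print(qid, vector.shape)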

Once you have the dense embeddings of the paragraphs in the index and of the queries, you can run the vector-based aggregation step of PARM with VRRF (or alternatively with Min, Max, Avg, Sum, VScores, or VRanks) and evaluate the aggregated results.

  • ./representation_aggregation.py: aggregates the run, stores and evaluates the aggregated run (Step 5)
python representation_aggregation.py \
    --encoded_ctx_file --path to the encoded index (.pkl format) from Step 2-- \
    --encoded_qa_file --path to the encoded queries (.pkl format) from Step 4-- \
    --output_top1000s --path to the top-n file (.json format) from Step 3-- \
    --label_file --path to the label file (.json format)-- \
    --aggregation_mode --choose from vrrf/vscores/vranks/sum/max/min/avg-- \
    --candidate_mode p_from_retrieved_list \
    --output_dir --path to output directory-- \
    --output_file_name --output file name--
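
As a rough illustration of the VRRF idea (vector-based reciprocal rank fusion), not the repository implementation in representation_aggregation.py: each candidate document is represented by the sum of its retrieved paragraph embeddings, each weighted by its reciprocal-rank score across the per-query-paragraph runs, and is then scored against the aggregated query representation. A toy sketch with hypothetical inputs:

import numpy as np

# Toy sketch of vector-based reciprocal rank fusion (VRRF); an illustration
# of the idea, not the code in representation_aggregation.py.
# query_para_embs: embeddings of the query paragraphs (n_paragraphs x dim)
# ranked_lists: one ranked list of paragraph ids per query paragraph
# para_emb: paragraph id -> embedding; doc_of: paragraph id -> document id
def vrrf_scores(query_para_embs, ranked_lists, para_emb, doc_of, k=60):
    q_vec = np.sum(query_para_embs, axis=0)  # aggregated query representation
    doc_vecs = {}
    for ranked in ranked_lists:
        for rank, pid in enumerate(ranked, start=1):
            weight = 1.0 / (k + rank)  # reciprocal-rank weight of this hit
            doc = doc_of[pid]
            doc_vecs[doc] = doc_vecs.get(doc, 0.0) + weight * para_emb[pid]
    # Score each candidate document against the aggregated query vector.
    return {doc: float(np.dot(q_vec, vec)) for doc, vec in doc_vecs.items()}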

Preprocessing

Preprocess COLIEE Task 1 data for dense retrieval

  • ./preprocessing/preprocess_coliee_2021_task1.py: preprocesses the COLIEE Task 1 dataset by removing non-English text, removing non-informative summaries, removing tabs, etc.

Preprocess CaseLaw collection

  • ./preprocessing/caselaw_stat_corpus.py: preprocesses the CaseLaw collection

Preprocess data for training the dense retrieval model

In order to train the dense retrieval models, the data needs to be preprocessed. For training and retrieval we split the documents up into their paragraphs (a toy illustration of such a split follows the script list below).

  • ./preprocessing/preprocess_finetune_data_dpr_task1.py: preprocesses the COLIEE Task 1 document-level labels for training the DPR model

  • ./preprocessing/preprocess_finetune_data_dpr.py: preprocesses the COLIEE Task 2 paragraph-level labels for training the DPR model
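
The exact splitting rules live in the scripts above; as a toy illustration, assuming paragraphs are separated by blank lines:

def split_into_paragraphs(document: str):
    # Toy rule: blank lines separate paragraphs; the preprocessing scripts
    # above implement the actual COLIEE-specific splitting.
    return [p.strip() for p in document.split("\n\n") if p.strip()]

print(split_into_paragraphs("First paragraph.\n\nSecond paragraph."))
# ['First paragraph.', 'Second paragraph.']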
