The repo for reproducing Seed-driven Document Ranking for Systematic Reviews: A Reproducibility Study

ECIR Reproducibility Paper: Seed-driven Document Ranking for Systematic Reviews: A Reproducibility Study

This code accompanies the reproducibility paper "Seed-driven Document Ranking for Systematic Reviews: A Reproducibility Study". All results reported in the paper were generated with this code.

Environment setup:

  • This project is implemented and tested only with Python 3.6.12. Other Python versions are untested, so a full reproduction of the results cannot be guaranteed with them.

First please install the required packages:

pip3 install -r requirements.txt

Query&Eval generation:

First, clone the CLEF TAR repository:

git clone https://github.com/CLEF-TAR/tar.git

The following files are used:

For 2017:
tar/tree/master/2017-TAR/training/qrels/qrel_content_train
tar/tree/master/2017-TAR/testing/qrels/qrel_content_test.txt
Concatenate these two files to create 2017_full.txt.

For 2018:
tar/tree/master/2018-TAR/Task2/Training/qrels/full.train.content.2018.qrels
tar/tree/master/2018-TAR/Task2/Testing/qrels/full.test.content.2018.qrels
Concatenate these two files to create 2018_full.txt.

For 2019:
tar/tree/master/2019-TAR/Task2/Training/Intervention/qrels/full.train.int.content.2019.qrels
tar/tree/master/2019-TAR/Task2/Testing/Intervention/qrels/full.test.int.content.2019.qrels
Concatenate these two files to create 2019_full.txt, and save a copy as 2019_test.txt (for 2019, the two files are identical).
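
The concatenation step can also be scripted. The following is a minimal sketch, not part of the repository: the helper name `concat_files` is hypothetical, and it assumes the qrel files are plain UTF-8 text. It appends a newline to any part that lacks one, so the last line of the first file does not merge with the first line of the second.

```python
from pathlib import Path


def concat_files(parts, out_path):
    """Concatenate text files in the given order into out_path.

    Hypothetical helper (not from the repository). Each part is
    terminated with a newline so records from different files
    never end up on the same line.
    """
    with open(out_path, "w", encoding="utf-8") as out:
        for part in parts:
            text = Path(part).read_text(encoding="utf-8")
            if text and not text.endswith("\n"):
                text += "\n"
            out.write(text)
    return out_path
```

For example, calling `concat_files([train_qrels, test_qrels], "2017_full.txt")` with the two 2017 qrel paths from your clone would produce the combined file.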

Then you can generate query and evaluation files by:

For single:
python3 topic_query_generation.py --input_qrel qrel_file_for_training+testing --input_test_qrel qrel_file_for_testing --DATA_DIR output_dir

For multiple:
python3 topic_query_generation_multiple.py --input_qrel qrel_file_for_training+testing --input_test_qrel qrel_file_for_testing --DATA_DIR output_dir

Please note: generate the files separately for each year, and write each year's output to its own folder rather than to one shared folder.
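
To keep the per-year invocations consistent, the command line can be assembled programmatically. This is only a sketch: `build_generation_command` is a hypothetical helper, the script names and flags are copied from the commands above, and using `2019_single_data_dir` as the output folder is an assumption based on the directory names used in the search commands of this README.

```python
import shlex


def build_generation_command(full_qrel, test_qrel, data_dir, multiple=False):
    """Build the argument list for one year's query/eval generation.

    Hypothetical helper; it only assembles the command and does not
    run it. Flags are taken verbatim from the README commands.
    """
    script = (
        "topic_query_generation_multiple.py"
        if multiple
        else "topic_query_generation.py"
    )
    return [
        "python3", script,
        "--input_qrel", full_qrel,
        "--input_test_qrel", test_qrel,
        "--DATA_DIR", data_dir,
    ]


# 2019 example: one output folder per year (folder name is an assumption).
cmd = build_generation_command(
    "2019_full.txt", "2019_test.txt", "2019_single_data_dir"
)
print(" ".join(shlex.quote(arg) for arg in cmd))
```

Passing the resulting list to `subprocess.run(cmd, check=True)` would execute it from the repository root.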

Collection generation:

To generate the BOW collection, run the following commands:

python3 gather_all_pids.py --filenames 2017_full.txt+2018_full.txt+2019_full.txt --output_dir collection/pid_dir --chunks n
python3 collection_gathering.py --filename yourpidsfile --email your_email_address --output output_collection
python3 collection_processing.py --input_collection acquired_collection_file --output_collection processed_file(default is weighted1_bow.jsonl)
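
For orientation, the first step conceptually collects the distinct document IDs (PIDs) across the qrel files and splits them into chunks for download. The sketch below illustrates that idea only; it is not the code in gather_all_pids.py. It assumes the standard four-column TREC qrel layout (topic, iteration, document ID, relevance) and assumes that `--chunks n` means "n chunks", which the README does not confirm.

```python
def collect_pids(qrel_lines):
    """Return the sorted unique document IDs from TREC-style qrel lines.

    Assumes whitespace-separated columns: topic, iteration, docid, relevance.
    Lines with fewer than three fields (e.g. blank lines) are skipped.
    """
    pids = set()
    for line in qrel_lines:
        fields = line.split()
        if len(fields) >= 3:
            pids.add(fields[2])
    return sorted(pids)


def split_into_chunks(items, n):
    """Split items into n contiguous chunks whose sizes differ by at most one."""
    size, extra = divmod(len(items), n)
    chunks, start = [], 0
    for i in range(n):
        # The first `extra` chunks take one additional item.
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks
```

Deduplicating before chunking matters here because the same study can appear under several topics and in more than one year.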

Then for BOC collection generation:

  • First, follow the QuickUMLS instructions to obtain the UMLS data.
  • Second, register on NCBO to get API keys, and enter these keys in ncbo_request_word.py.
  • Then run the following commands to generate the BOC collection:
python3 ncbo_request_word.py --input_collection your_generated_bow_collection --num_workers for_multi_processing --generated_collection output_dir_ncbo
cat output_dir_ncbo/* > ncbo.tsv
python3 processing_uml.py --input_collection your_bow_collection --input_umls_dir your_output_umls_dir --num_workers for_multi_processing
python3 processing_umls_word.py --input_collection your_generated_bow_collection --input_umls_dir your_output_umls_dir_from_last_step --output_file umls.tsv
python3 boc_extraction.py --input_collection bow_collection --input_ncbo_collection ncbo.tsv --input_umls_collection umls.tsv --output_collection processed_file(default is weighted1_boc.jsonl)

RQ1: Does the effectiveness of SDR generalise beyond the CLEF TAR 2017 dataset?

For RQ1, single-seed results are obtained for CLEF TAR 2017, 2018, and 2019. Run the following commands:

bash search.sh 2017_single_data_dir all
bash search.sh 2018_single_data_dir test
bash search.sh 2019_single_data_dir test

This produces the single-seed run files for all three years, covering all methods.

Then evaluate with:

bash evaluation_full.sh 2017_single_data_dir all
bash evaluation_full.sh 2018_single_data_dir test
bash evaluation_full.sh 2019_single_data_dir test

This prints the evaluation measures and saves the evaluation files in the corresponding eval folder.

RQ2: What is the impact of using multiple seed studies collectively on the effectiveness of SDR?

For RQ2, multiple-seed results are obtained for CLEF TAR 2017, 2018, and 2019. Run the following commands:

bash search_multiple.sh 2017_multiple_data_dir all
bash search_multiple.sh 2018_multiple_data_dir test
bash search_multiple.sh 2019_multiple_data_dir test

This produces the multiple-seed run files for all three years, covering all methods.

Then evaluate with:

bash evaluation_full.sh 2017_multiple_data_dir all
bash evaluation_full.sh 2018_multiple_data_dir test
bash evaluation_full.sh 2019_multiple_data_dir test

This prints the evaluation measures and saves the evaluation files in the corresponding eval folder.

RQ3: To what extent do seed studies impact the ranking stability of single- and multi-SDR?

This question uses the results obtained in the previous two steps. Generate the variability graphs with the following commands:

python3 graph_making/distribution_graph.py --year 2017 --type oracle 
python3 graph_making/distribution_graph.py --year 2018 --type oracle 
python3 graph_making/distribution_graph.py --year 2019 --type oracle 

This produces the distribution graphs for the three years.

Generated run files:

The generated run files are stored under run_files. Feel free to download them for verification or further research.

Example:
run_files/2017/all: 2017 single seed results file
run_files/2017/multiple: 2017 multiple seed results file
