NLPIR tutorial: pretrain for IR. pre-train on raw textual corpus, fine-tune on MS MARCO Document Ranking

Overview

pretrain4ir_tutorial

NLPIR tutorial: pretrain for IR. pre-train on raw textual corpus, fine-tune on MS MARCO Document Ranking

用作NLPIR实验室, Pre-training for IR方向入门.

代码包括了如下部分:

  • tasks/ : 生成预训练数据
  • pretrain/: 在生成的数据上Pre-training (MLM + NSP)
  • finetune/: Fine-tuning on MS MARCO

Preinstallation

First, prepare a Python3 environment, and run the following commands:

  git clone [email protected]:zhengyima/pretrain4ir_tutorial.git pretrain4ir_tutorial
  cd pretrain4ir_tutorial
  pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple

Besides, you should download the BERT model checkpoint in format of huggingface transformers, and save them in a directory BERT_MODEL_PATH. In our paper, we use the version of bert-base-uncased. you can download it from the huggingface official model zoo, or Tsinghua mirror.

生成预训练数据

代码库提供了最简单易懂的预训练任务 rand。该任务随机从文档中选取1~5个词作为query, 用来demo面向IR的预训练。

生成rand预训练任务数据命令: cd tasks/rand && bash gen.sh

你可以自己编写脚本, 仿照rand任务, 生成你自己认为合理的预训练任务的数据。

Notes: 运行rand任务的shell之前, 你需要先将 gen.sh 脚本中的 msmarco_docs_path 参数改为MSMARCO数据集的 文档tsv 路径; 将bert_model参数改为下载好的bert模型目录;

模型预训练

代码库提供了模型预训练的相关代码, 见pretrain。该代码完成了MLM+NSP两个任务的预训练。

模型预训练命令: cd pretrain && bash train_bert.sh

Notes: 注意要修改train_bert中的相应参数:将bert_model参数改为下载好的bert模型目录; train_file改为你上一步生成好的预训练数据文件路径。

模型Fine-tune

代码库提供了在MSMARCO Document Ranking任务上进行Fine-tune的相关代码。见finetune。该代码完成了在MSMARCO上通过point-wise进行fine-tune的流程。

模型fine-tune命令: cd finetune && bash train_bert.sh

Leaderboard

Tasks [email protected] on dev set
PROP-MARCO 0.4201
PROP-WIKI 0.4188
BERT-Base 0.4184
rand 0.4123

Homework

设计一个你认为合理的预训练任务, 并对BERT模型进行预训练, 并在MSMARCO上完成fine-tune, 在Leaderboard上更新你在dev set上的结果。

你需要做的是:

  • 编写你自己的预训练数据生成脚本, 放到 tasks/yourtask 目录下。
  • 使用以上脚本, 生成自己的预训练数据。
  • 运行代码库提供的pre-train与fine-tune脚本, 跑出结果, 更新Leaderboard。

Links

Owner
ZYMa
Master candidate. IR and NLP.
ZYMa
Python package for Turkish Language.

PyTurkce Python package for Turkish Language. Documentation: https://pyturkce.readthedocs.io. Installation pip install pyturkce Usage from pyturkce im

Mert Cobanov 14 Oct 09, 2022
Super easy library for BERT based NLP models

Fast-Bert New - Learning Rate Finder for Text Classification Training (borrowed with thanks from https://github.com/davidtvs/pytorch-lr-finder) Suppor

Utterworks 1.8k Dec 27, 2022
Blender addon - Scrub timeline from viewport with a shortcut

Viewport scrub timeline Move in the timeline directly in viewport and snap to nearest keyframe Note : This standalone feature will be added in the nat

Samuel Bernou 40 Nov 07, 2022
This repository has a implementations of data augmentation for NLP for Japanese.

daaja This repository has a implementations of data augmentation for NLP for Japanese: EDA: Easy Data Augmentation Techniques for Boosting Performance

Koga Kobayashi 60 Nov 11, 2022
A workshop with several modules to help learn Feast, an open-source feature store

Workshop: Learning Feast This workshop aims to teach users about Feast, an open-source feature store. We explain concepts & best practices by example,

Feast 52 Jan 05, 2023
This repository contains the code for "Generating Datasets with Pretrained Language Models".

Datasets from Instructions (DINO 🦕 ) This repository contains the code for Generating Datasets with Pretrained Language Models. The paper introduces

Timo Schick 154 Jan 01, 2023
The Classical Language Toolkit

Notice: This Git branch (dev) contains the CLTK's upcoming major release (v. 1.0.0). See https://github.com/cltk/cltk/tree/master and https://docs.clt

Classical Language Toolkit 754 Jan 09, 2023
Saptak Bhoumik 14 May 24, 2022
Multilingual text (NLP) processing toolkit

polyglot Polyglot is a natural language pipeline that supports massive multilingual applications. Free software: GPLv3 license Documentation: http://p

RAMI ALRFOU 2.1k Jan 07, 2023
A minimal Conformer ASR implementation adapted from ESPnet.

Conformer ASR A minimal Conformer ASR implementation adapted from ESPnet. Introduction I want to use the pre-trained English ASR model provided by ESP

Niu Zhe 3 Jan 24, 2022
SciBERT is a BERT model trained on scientific text.

SciBERT is a BERT model trained on scientific text.

AI2 1.2k Dec 24, 2022
Use fastai-v2 with HuggingFace's pretrained transformers

FastHugs Use fastai v2 with HuggingFace's pretrained transformers, see the notebooks below depending on your task: Text classification: fasthugs_seq_c

Morgan McGuire 111 Nov 16, 2022
Intent parsing and slot filling in PyTorch with seq2seq + attention

PyTorch Seq2Seq Intent Parsing Reframing intent parsing as a human - machine translation task. Work in progress successor to torch-seq2seq-intent-pars

Sean Robertson 159 Apr 04, 2022
Kestrel Threat Hunting Language

Kestrel Threat Hunting Language What is Kestrel? Why we need it? How to hunt with XDR support? What is the science behind it? You can find all the ans

Open Cybersecurity Alliance 201 Dec 16, 2022
Natural Language Processing at EDHEC, 2022

Natural Language Processing Here you will find the teaching materials for the "Natural Language Processing" course at EDHEC Business School, 2022 What

1 Feb 04, 2022
Persian-lexicon - A lexicon of 70K unique Persian (Farsi) words

Persian Lexicon This repo uses Uppsala Persian Corpus (UPC) to construct a lexic

Saman Vaisipour 7 Apr 01, 2022
Easy, fast, effective, and automatic g-code compression!

Getting to the meat of g-code. Easy, fast, effective, and automatic g-code compression! MeatPack nearly doubles the effective data rate of a standard

Scott Mudge 97 Nov 21, 2022
Using BERT-based models for toxic span detection

SemEval 2021 Task 5: Toxic Spans Detection: Task: Link to SemEval-2021: Task 5 Toxic Span Detection is https://competitions.codalab.org/competitions/2

Ravika Nagpal 1 Jan 04, 2022
Finetune gpt-2 in google colab

gpt-2-colab finetune gpt-2 in google colab sample result (117M) from retraining on A Tale of Two Cities by Charles Di

212 Jan 02, 2023
A fast, efficient universal vector embedding utility package.

Magnitude: a fast, simple vector embedding utility library A feature-packed Python package and vector storage file format for utilizing vector embeddi

Plasticity 1.5k Jan 02, 2023