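Pretrain a BERT model with the Mask LM pretraining task. The model learns representations from a vertical-domain corpus in order to improve downstream task performance.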

Overview

Pretrain_Bert_with_MaskLM

Info

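Pretrain a BERT model with the Mask LM pretraining task.

Built on the PyTorch framework, this project trains a pretrained language model on a vertical-domain corpus, with the goal of improving downstream task performance.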

Pretraining Task

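Mask Language Model (Mask LM for short) is a language-model pretraining task based on a masking mechanism.

Both the original MaskLM task and the Whole Words Masking task are supported. Whole Words Masking is used by default.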

MaskLM

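This task uses the masking mechanism from BERT. For each token in a sentence:

  • With 85% probability, the token is left unchanged.
  • With 15% probability, the token is selected and handled as follows:
    • With 80% probability, the token is replaced with the [MASK] token.
    • With 10% probability, the token is replaced with a token drawn at random from the vocabulary.
    • With 10% probability, the token is left unchanged.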
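The selection rule above can be sketched in plain Python. This is a minimal illustration rather than the project's code: the function name mask_tokens, the labels convention (None for positions that are not predicted), and the toy vocabulary are assumptions made for the example.

```python
import random

MASK_TOKEN = "[MASK]"


def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Apply BERT-style masking to a list of tokens.

    Returns (masked_tokens, labels). labels[i] holds the original token
    at positions the model must predict, and None everywhere else.
    All names here are hypothetical, chosen for illustration.
    """
    rng = rng or random.Random()
    masked, labels = [], []
    for token in tokens:
        if rng.random() >= mask_prob:
            # 85% of tokens: not selected, kept as-is, no prediction target.
            masked.append(token)
            labels.append(None)
            continue
        # 15% of tokens: selected, the original token becomes the label.
        labels.append(token)
        roll = rng.random()
        if roll < 0.8:
            masked.append(MASK_TOKEN)         # 80%: replace with [MASK]
        elif roll < 0.9:
            masked.append(rng.choice(vocab))  # 10%: random vocabulary token
        else:
            masked.append(token)              # 10%: keep the original token
    return masked, labels


if __name__ == "__main__":
    toy_vocab = ["the", "cat", "sat", "on", "mat"]
    sentence = ["the", "cat", "sat", "on", "the", "mat"]
    print(mask_tokens(sentence, toy_vocab, rng=random.Random(0)))
```

Keeping 10% of the selected tokens unchanged means the model cannot assume that every unmasked input token is irrelevant to the prediction.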

Whole Words Masking

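This task is similar to MaskLM, but the masking step is slightly different.

In BERT-style models, using whole words as vocabulary entries would make the vocabulary too large and would make it harder for the model to learn features shared by different variants of the same word, so WordPiece tokenization is used to split words into subwords.

With Whole Words Masking, the unit being masked is the whole word as it was before tokenization, rather than an individual subword.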
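A minimal sketch of the idea, assuming the usual BERT WordPiece convention that a subword continuing the previous word starts with "##". The function names are hypothetical, and for brevity every selected word is replaced with [MASK] (the 80/10/10 replacement split is omitted).

```python
import random


def group_wordpieces(tokens):
    """Group token indices into whole words.

    A piece that starts with '##' is attached to the word before it.
    """
    groups = []
    for index, token in enumerate(tokens):
        if token.startswith("##") and groups:
            groups[-1].append(index)
        else:
            groups.append([index])
    return groups


def whole_word_mask(tokens, mask_prob=0.15, rng=None):
    """Select whole words with probability mask_prob and mask all their pieces."""
    rng = rng or random.Random()
    masked = list(tokens)
    labels = [None] * len(tokens)
    for group in group_wordpieces(tokens):
        if rng.random() < mask_prob:
            for index in group:
                labels[index] = tokens[index]  # original piece is the target
                masked[index] = "[MASK]"       # simplification: always [MASK]
    return masked, labels


if __name__ == "__main__":
    pieces = ["the", "phil", "##har", "##monic", "played"]
    print(group_wordpieces(pieces))
    print(whole_word_mask(pieces, mask_prob=0.5, rng=random.Random(1)))
```

The point of the grouping step is that the subwords of one word are either all masked or all left alone, so the model cannot recover a masked piece from its visible neighbours within the same word.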

Model

The original BERT model is used as the base model.

Datasets

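The dataset in this project comes from wikitext and is split into two files: a training set (train.txt) and a test set (test.txt).

The data is stored line by line.

To train on your own data, replace these files with your own dataset. (Note: to pretrain a Chinese model, change self.initial_pretrain_model and self.initial_pretrain_tokenizer in the configuration file Config.py to bert-base-chinese.)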
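For a Chinese model, the two settings named in the note might look like the excerpt below. Only the two attribute names and the value bert-base-chinese come from this README; the surrounding class layout is an assumption made for illustration.

```python
class Config:
    """Hypothetical excerpt of Config.py; the real file has more settings."""

    def __init__(self):
        # Both values point at the same checkpoint so that the tokenizer
        # vocabulary matches the pretrained weights.
        self.initial_pretrain_model = "bert-base-chinese"
        self.initial_pretrain_tokenizer = "bert-base-chinese"
```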

Your own dataset does not need to be masked in advance; the code applies the masking.

Training Target

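The goal of this project is to start from existing pretrained model parameters, such as Google's open-source bert-base-uncased and bert-base-chinese, and to continue pretraining on a vertical-domain corpus. This improves BERT's representations and, in turn, downstream task performance.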

Environment

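The project mainly uses Huggingface's datasets and transformers modules and supports three modes: CPU, single machine with a single GPU, and single machine with multiple GPUs.

Install the dependencies with the following command: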

    pip install -r requirement.txt

The main dependencies are:

    python3.6
    torch==1.3.0
    tqdm==4.61.2
    transformers==4.6.1
    datasets==1.10.2
    numpy==1.19.5
    pandas==1.1.3

Getting Started

Single-GPU mode

Run the following command directly:

    python train.py

Alternatively, set the variable self.cuda_visible_devices in Config.py to a single GPU and then run:

    chmod 755 run.sh
    ./run.sh

Multi-GPU mode

If you are lucky enough to have several GPUs, congratulations: you can switch to takeoff mode. 🚀 🚀

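(1) Multi-GPU training uses torch's nn.parallel.DistributedDataParallel module. The relevant parameters in config.py are listed below; the defaults usually do not need to be changed.

  • self.cuda_visible_devices: the GPU IDs visible to the program. For example, 1,2 runs on GPUs 1 and 2; more GPUs can be listed, e.g. 0,1,2,3.
  • self.device: in single-GPU mode, the GPU the program runs on; in multi-GPU mode, the master GPU, which defaults to the first of the GPUs you specified. If only a CPU is available, set it to cpu.
  • self.port: the port used for inter-process communication in multi-GPU mode (no change needed).
  • self.init_method: the address used for inter-process communication in multi-GPU mode (no change needed).
  • self.world_size: the number of processes to launch (no change needed). With torch==1.3.0, only one process needs to be specified; with 1.9.0 and above, it must equal the number of GPUs.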
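As a rough illustration of how a value such as 1,2 for cuda_visible_devices can be interpreted, the sketch below parses the string and exports it through the standard CUDA_VISIBLE_DEVICES environment variable. The helper names are hypothetical and this is not the project's own startup code.

```python
import os
from typing import List


def visible_gpu_ids(cuda_visible_devices: str) -> List[int]:
    """Parse a comma-separated GPU list such as '1,2' into [1, 2]."""
    return [int(part) for part in cuda_visible_devices.split(",") if part.strip()]


def apply_visible_devices(cuda_visible_devices: str) -> int:
    """Export CUDA_VISIBLE_DEVICES and return how many GPUs were listed.

    The variable has to be set before the deep learning framework
    initializes CUDA, otherwise it has no effect on that process.
    """
    os.environ["CUDA_VISIBLE_DEVICES"] = cuda_visible_devices
    return len(visible_gpu_ids(cuda_visible_devices))


if __name__ == "__main__":
    gpu_count = apply_visible_devices("1,2")
    print(gpu_count, os.environ["CUDA_VISIBLE_DEVICES"])
```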

(2) Run the launch command:

    chmod 755 run.sh
    ./run.sh

Experiment

Cross-entropy is used as the loss function, and perplexity and loss are used as the evaluation metrics during training.
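The two metrics are directly related: perplexity is the exponential of the mean cross-entropy loss when the loss is computed with natural logarithms. A small sketch of that relationship, with hypothetical function names and made-up probabilities:

```python
import math


def cross_entropy(target_probabilities):
    """Mean negative log-likelihood of the correct tokens.

    target_probabilities holds, for each masked position, the probability
    the model assigned to the correct token.
    """
    total = -sum(math.log(p) for p in target_probabilities)
    return total / len(target_probabilities)


def perplexity(mean_loss):
    """Perplexity is exp(mean cross-entropy) for natural-log losses."""
    return math.exp(mean_loss)


if __name__ == "__main__":
    # A model that gives each correct token probability 0.25 is as
    # uncertain as a uniform choice among 4 tokens.
    loss = cross_entropy([0.25, 0.25, 0.25, 0.25])
    print(loss, perplexity(loss))
```

A lower loss therefore always means a lower perplexity, so the two curves move together during training.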

Reference

【Bert】https://arxiv.org/pdf/1810.04805.pdf

【transformers】https://github.com/huggingface/transformers

【datasets】https://huggingface.co/docs/datasets/quicktour.html

Owner
Desmond Ng
NLP Engineer