PyTorch Implementation of "Bridging Pre-trained Language Models and Hand-crafted Features for Unsupervised POS Tagging" (Findings of ACL 2022)

Overview

Feature_CRF_AE

Feature_CRF_AE provides an implementation of "Bridging Pre-trained Language Models and Hand-crafted Features for Unsupervised POS Tagging":

@inproceedings{zhou-etal-2022-Bridging,
  title     = {Bridging Pre-trained Language Models and Hand-crafted Features for Unsupervised POS Tagging},
  author    = {Zhou, Houquan and Li, Yang and Li, Zhenghua and Zhang, Min},
  booktitle = {Findings of ACL},
  year      = {2022},
  url       = {?},
  pages     = {?--?}
}

Please contact Jacob_Zhou \at outlook.com if you have any questions.

Installation

Feature_CRF_AE can be installed from source:

$ git clone https://github.com/Jacob-Zhou/FeatureCRFAE && cd FeatureCRFAE
$ bash scripts/setup.sh

The following requirements are installed by scripts/setup.sh:

  • python: 3.7
  • allennlp: 1.2.2
  • pytorch: 1.6.0
  • transformers: 3.5.1
  • h5py: 3.1.0
  • matplotlib: 3.3.1
  • nltk: 3.5
  • numpy: 1.19.1
  • overrides: 3.1.0
  • scikit_learn: 1.0.2
  • seaborn: 0.11.0
  • tqdm: 4.49.0

For WSJ data, we use the ELMo representations of elmo_2x4096_512_2048cnn_2xhighway_5.5B from AllenNLP. For UD data, we use the ELMo representations released by HIT-SCIR.

The corresponding data and ELMo models can be downloaded as follows:

# 1) UD data and ELMo models:
$ bash scripts/prepare_data.sh
# 2) UD data, ELMo models as well as WSJ data 
#    [please replace ~/treebank3/parsed/mrg/wsj/ with your path to LDC99T42]
$ bash scripts/prepare_data.sh ~/treebank3/parsed/mrg/wsj/
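
After the download finishes, a quick check that the files referenced later in this README are in place can save a failed training run. The sketch below is not part of the repository: the helper name missing_paths is hypothetical, and the path list is simply copied from the training commands in this README. Because the README does not say whether --plm points at a file, a directory, or a filename prefix, each entry is treated as a prefix.

```python
from pathlib import Path

# Paths copied from the training commands in this README.
EXPECTED = [
    "data/wsj/total.conll",
    "data/ud/de/total.conll",
    "elmo_models/allennlp/elmo_2x4096_512_2048cnn_2xhighway_5.5B",
    "elmo_models/de",
]


def missing_paths(root, expected=EXPECTED):
    """Return the expected entries that match nothing under root.

    Each entry is treated as a prefix (assumption), so a model stored as
    several files sharing that prefix still counts as present.
    """
    root = Path(root)
    return [p for p in expected if not any(root.glob(p + "*"))]
```

Calling missing_paths(".") from the repository root returns an empty list when everything is present. The WSJ entry is expected to be reported as missing if only the UD data was prepared.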

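Performance

M-1 is many-to-one accuracy, 1-1 is one-to-one accuracy, and VM is V-measure. All scores are on a 0-100 scale.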

WSJ-All

Seed    M-1     1-1     VM
0       84.29   70.03   78.43
1       82.34   64.42   77.27
2       84.68   62.78   77.83
3       82.55   65.00   77.35
4       82.20   66.69   77.33
Avg.    83.21   65.78   77.64
Std.     1.18    2.75    0.49
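
The Avg. and Std. rows can be re-derived from the per-seed scores. The sketch below uses the WSJ-All numbers from the table above and assumes that Std. is the sample standard deviation (n - 1 denominator), which is the variant that reproduces the reported values. Variable names are illustrative only.

```python
from statistics import mean, stdev

# Per-seed WSJ-All scores copied from the table (seeds 0-4).
wsj_all = {
    "M-1": [84.29, 82.34, 84.68, 82.55, 82.20],
    "1-1": [70.03, 64.42, 62.78, 65.00, 66.69],
    "VM": [78.43, 77.27, 77.83, 77.35, 77.33],
}

# statistics.stdev is the sample standard deviation (divides by n - 1).
summary = {
    metric: (round(mean(scores), 2), round(stdev(scores), 2))
    for metric, scores in wsj_all.items()
}

for metric, (avg, std) in summary.items():
    print(f"{metric}: avg={avg:.2f} std={std:.2f}")
```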

WSJ-Test

Seed    M-1     1-1     VM
0       81.99   64.84   76.86
1       82.52   61.46   76.13
2       82.33   61.15   75.13
3       78.11   58.80   72.94
4       82.05   61.68   76.21
Avg.    81.40   61.59   75.45
Std.     1.85    2.15    1.54
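
For readers unfamiliar with the M-1 column: many-to-one accuracy maps every induced cluster to the gold tag it most often co-occurs with, then scores the result as ordinary tagging accuracy. The sketch below follows that textbook definition on a toy example. It is not the repository's evaluation code, and the function name is hypothetical.

```python
from collections import Counter, defaultdict


def many_to_one_accuracy(gold, pred):
    """Map each induced cluster to its most frequent gold tag, then score."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if not gold:
        raise ValueError("need at least one token")
    # Count how often each induced cluster co-occurs with each gold tag.
    cooccurrence = defaultdict(Counter)
    for g, p in zip(gold, pred):
        cooccurrence[p][g] += 1
    # Each cluster contributes the count of its best-matching gold tag.
    correct = sum(max(counts.values()) for counts in cooccurrence.values())
    return correct / len(gold)


# Toy data: cluster 2 mixes VB and NN, so one of its two tokens is wrong.
gold = ["DT", "NN", "VB", "DT", "NN", "NN"]
pred = [0, 1, 2, 0, 1, 2]
score = many_to_one_accuracy(gold, pred)
print(f"M-1 = {100 * score:.2f}")
```

Several clusters may map to the same gold tag under this metric, which is why M-1 is never lower than 1-1 in the tables.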

Usage

Some examples are given in scripts/examples.sh. Before running the code, activate the virtual environment:

$ . scripts/set_environment.sh

Training

To train a model from scratch, we recommend the command-line interface, which is more flexible and customizable. Two training examples follow: the first trains on WSJ, the second on the German UD treebank.

$ python -u -m tagger.cmds.crf_ae train \
    --conf configs/crf_ae.ini \
    --encoder elmo \
    --plm elmo_models/allennlp/elmo_2x4096_512_2048cnn_2xhighway_5.5B \
    --train data/wsj/total.conll \
    --evaluate data/wsj/total.conll \
    --path save/crf_ae_wsj

$ python -u -m tagger.cmds.crf_ae train \
    --conf configs/crf_ae.ini \
    --ud-mode \
    --ud-feature \
    --ignore-capitalized \
    --language-specific-strip \
    --feat-min-freq 14 \
    --language de \
    --encoder elmo \
    --plm elmo_models/de \
    --train data/ud/de/total.conll \
    --evaluate data/ud/de/total.conll \
    --path save/crf_ae_de
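
When training on several UD languages, the second command can be generated rather than retyped. The sketch below only reuses flags that appear in the command above. It assumes that other languages follow the same data/ud/<language> and elmo_models/<language> layout as German, which this README does not confirm, and the helper name is hypothetical. The README gives --feat-min-freq only for German (14), so the value must be supplied by the caller.

```python
import shlex


def build_ud_train_command(language, feat_min_freq):
    """Return the argv list for training on one UD language."""
    return [
        "python", "-u", "-m", "tagger.cmds.crf_ae", "train",
        "--conf", "configs/crf_ae.ini",
        "--ud-mode",
        "--ud-feature",
        "--ignore-capitalized",
        "--language-specific-strip",
        "--feat-min-freq", str(feat_min_freq),
        "--language", language,
        "--encoder", "elmo",
        # Assumed layout: one ELMo directory and one data directory per language.
        "--plm", f"elmo_models/{language}",
        "--train", f"data/ud/{language}/total.conll",
        "--evaluate", f"data/ud/{language}/total.conll",
        "--path", f"save/crf_ae_{language}",
    ]


command = build_ud_train_command("de", 14)
# Quote each argument so the printed line can be pasted into a shell.
print(" ".join(shlex.quote(arg) for arg in command))
```

The returned list can be passed directly to subprocess.run(command, check=True) from the repository root.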

For more training options, run python -m tagger.cmds.[crf_ae|feature_hmm] train -h.

Alternatively, we provide equivalent command entry points registered in setup.py: crf-ae and feature-hmm.

$ crf-ae train \
    --conf configs/crf_ae.ini \
    --encoder elmo \
    --plm elmo_models/allennlp/elmo_2x4096_512_2048cnn_2xhighway_5.5B \
    --train data/wsj/total.conll \
    --evaluate data/wsj/total.conll \
    --path save/crf_ae

Evaluation

$ python -u -m tagger.cmds.crf_ae evaluate \
    --conf configs/crf_ae.ini \
    --encoder elmo \
    --plm elmo_models/allennlp/elmo_2x4096_512_2048cnn_2xhighway_5.5B \
    --data data/wsj/total.conll \
    --path save/crf_ae

Prediction

$ python -u -m tagger.cmds.crf_ae predict \
    --conf configs/crf_ae.ini \
    --encoder elmo \
    --plm elmo_models/allennlp/elmo_2x4096_512_2048cnn_2xhighway_5.5B \
    --data data/wsj/total.conll \
    --path save/crf_ae \
    --pred save/crf_ae/pred.conll
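
This README does not describe the layout of pred.conll. Assuming it is a tab-separated CoNLL-style file with one token per line and a blank line between sentences, the predicted tags could be loaded as sketched below. The function name is hypothetical, and the column index holding the tag is a guess that must be checked against the actual file.

```python
from collections import Counter


def read_conll_column(lines, column):
    """Yield one list per sentence with the chosen column of each token line."""
    sentence = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
        elif not line.startswith("#"):
            # Skip comment lines; keep only the requested field.
            sentence.append(line.split("\t")[column])
    if sentence:
        yield sentence


# Tiny in-memory sample standing in for a real prediction file.
sample = [
    "1\tThe\t_\t3\n",
    "2\tdog\t_\t7\n",
    "\n",
    "1\tBarks\t_\t7\n",
]
sentences = list(read_conll_column(sample, column=3))
tag_counts = Counter(tag for sentence in sentences for tag in sentence)
```

For a real file, pass an open file handle instead of the sample list, for example open("save/crf_ae/pred.conll", encoding="utf-8").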