Chinese named entity recognization (bert/roberta/macbert/bert_wwm with Keras)

Overview

Chinese named entity recognization (bert/roberta/macbert/bert_wwm with Keras)

Project Structure

./
├── DataProcess
│   ├── __pycache__
│   ├── convert2bio.py
│   ├── convert_jsonl.py
│   ├── handle_numbers.py
│   ├── load_data.py
│   └── statistic.py
├── README.md
├── __pycache__
├── chinese_L-12_H-768_A-12                                    BERT权重
│   ├── bert_config.json
│   ├── bert_model.ckpt.data-00000-of-00001
│   ├── bert_model.ckpt.index
│   ├── bert_model.ckpt.meta
│   └── vocab.txt
├── chinese_bert_wwm                                           BERT_wwm权重
│   ├── bert_config.json
│   ├── bert_model.ckpt.data-00000-of-00001
│   ├── bert_model.ckpt.index
│   ├── bert_model.ckpt.meta
│   └── vocab.txt
├── chinese_macbert_base                                       macBERT权重
│   ├── chinese_macbert_base.ckpt.data-00000-of-00001
│   ├── chinese_macbert_base.ckpt.index
│   ├── chinese_macbert_base.ckpt.meta
│   ├── macbert_base_config.json
│   └── vocab.txt
├── chinese_roberta_wwm_ext_L-12_H-768_A-12                    roberta权重
│   ├── bert_config.json
│   ├── bert_model.ckpt.data-00000-of-00001
│   ├── bert_model.ckpt.index
│   ├── bert_model.ckpt.meta
│   └── vocab.txt
├── config                                                     
│   ├── __pycache__
│   ├── config.py                                              配置文件
│   └── pulmonary_label2id.json                                label id
├── data                                                       数据集
│   ├── pulmonary.test
│   ├── pulmonary.train
│   └── sict_train.txt
├── environment.yaml                                           conda环境配置文件
├── evaluate.py
├── generator_train.py
├── keras_bert                                                 keras_bert(可pip下)
├── keras_contrib                                              keras_contrib(可pip下)
├── log                                                        训练nohup日志
│   ├── chinese_L-12_H-768_A-12.out
│   ├── chinese_macbert_base.out
│   ├── chinese_roberta_wwm_ext_L-12_H-768_A-12.out
│   └── electra_180g_base.out
├── model.py                                                   模型构建文件
├── models                                                     保存的模型权重
│   ├── pulmonary_chinese_L-12_H-768_A-12_ner.h5
│   ├── pulmonary_chinese_bert_wwm_ner.h5
│   ├── pulmonary_chinese_macbert_base_ner.h5
│   └── pulmonary_chinese_roberta_wwm_ext_L-12_H-768_A-12_ner.h5
├── predict.py                                                 预测
├── report                                                     模型实体F1评估报告
│   ├── pulmonary_chinese_L-12_H-768_A-12_evaluate.txt
│   ├── pulmonary_chinese_L-12_H-768_A-12_predict.json
│   ├── pulmonary_chinese_bert_wwm_evaluate.txt
│   ├── pulmonary_chinese_bert_wwm_predict.json
│   ├── pulmonary_chinese_macbert_base_evaluate.txt
│   ├── pulmonary_chinese_macbert_base_predict.json
│   ├── pulmonary_chinese_roberta_wwm_ext_L-12_H-768_A-12_evaluate.txt
│   └── pulmonary_chinese_roberta_wwm_ext_L-12_H-768_A-12_predict.json
├── requirements.txt                                           pip环境
├── test.py                                                    
├── train.py                                                   训练
└── utils                                                      
    ├── FGM.py                                                 FGM对抗
    ├── __pycache__
    └── path.py                                                所有路径

56 directories, 193 files

Dataset

三甲医院肺结节数据集,20000+字,BIO格式,形如:

中	B-ORG
共	I-ORG
中	I-ORG
央	I-ORG
致	O
中	B-ORG
国	I-ORG
致	I-ORG
公	I-ORG
党	I-ORG
十	I-ORG
一	I-ORG
大	I-ORG
的	O
贺	O
词	O

ATTENTION: 在处理自己数据集的时候需要注意:

  • 字与标签之间用空格("\ ")隔开
  • 其中句子与句子之间使用空行隔开

Steps

  1. 替换数据集
  2. 使用DataProcess/load_data.py生成label2id.txt文件
  3. 修改config/config.py中的MAX_SEQ_LEN(超过截断,少于填充,最好设置训练集、测试集中最长句子作为MAX_SEQ_LEN)
  4. 下载权重,放到项目中
  5. 修改public/path.py中的地址
  6. 根据需要修改model.py模型结构
  7. 修改config/config.py的参数
  8. 训练前debug看下input_train_labels,result_train对不对,input_train_types全是0
  9. 训练

Model

BERT

roberta

macBERT

BERT_wwm

Train

运行train.py

Evaluate

运行evaluate/f1_score.py

BERT

           precision    recall  f1-score   support

     SIGN     0.6651    0.7354    0.6985       189
  ANATOMY     0.8333    0.8409    0.8371       220
 DIAMETER     1.0000    1.0000    1.0000        16
  DISEASE     0.4915    0.6744    0.5686        43
 QUANTITY     0.8837    0.9157    0.8994        83
TREATMENT     0.3571    0.5556    0.4348         9
  DENSITY     1.0000    1.0000    1.0000         8
    ORGAN     0.4500    0.6923    0.5455        13
LUNGFIELD     1.0000    0.5000    0.6667         6
    SHAPE     0.5714    0.5714    0.5714         7
   NATURE     1.0000    1.0000    1.0000         6
 BOUNDARY     1.0000    0.6250    0.7692         8
   MARGIN     0.8333    0.8333    0.8333         6
  TEXTURE     1.0000    0.8571    0.9231         7

micro avg     0.7436    0.7987    0.7702       621
macro avg     0.7610    0.7987    0.7760       621

roberta

           precision    recall  f1-score   support

  ANATOMY     0.8624    0.8545    0.8584       220
  DENSITY     0.8000    1.0000    0.8889         8
     SIGN     0.7347    0.7619    0.7481       189
 QUANTITY     0.8977    0.9518    0.9240        83
  DISEASE     0.5690    0.7674    0.6535        43
 DIAMETER     1.0000    1.0000    1.0000        16
TREATMENT     0.3333    0.5556    0.4167         9
 BOUNDARY     1.0000    0.6250    0.7692         8
LUNGFIELD     1.0000    0.6667    0.8000         6
   MARGIN     0.8333    0.8333    0.8333         6
  TEXTURE     1.0000    0.8571    0.9231         7
    SHAPE     0.5714    0.5714    0.5714         7
   NATURE     1.0000    1.0000    1.0000         6
    ORGAN     0.6250    0.7692    0.6897        13

micro avg     0.7880    0.8261    0.8066       621
macro avg     0.8005    0.8261    0.8104       621

macBERT

           precision    recall  f1-score   support

  ANATOMY     0.8773    0.8773    0.8773       220
     SIGN     0.6538    0.7196    0.6851       189
  DISEASE     0.5893    0.7674    0.6667        43
 QUANTITY     0.9070    0.9398    0.9231        83
    ORGAN     0.5882    0.7692    0.6667        13
  TEXTURE     1.0000    0.8571    0.9231         7
 DIAMETER     1.0000    1.0000    1.0000        16
TREATMENT     0.3750    0.6667    0.4800         9
LUNGFIELD     1.0000    0.5000    0.6667         6
    SHAPE     0.4286    0.4286    0.4286         7
   NATURE     1.0000    1.0000    1.0000         6
  DENSITY     1.0000    1.0000    1.0000         8
 BOUNDARY     1.0000    0.6250    0.7692         8
   MARGIN     0.8333    0.8333    0.8333         6

micro avg     0.7697    0.8180    0.7931       621
macro avg     0.7846    0.8180    0.7977       621

BERT_wwm

           precision    recall  f1-score   support

  DISEASE     0.5667    0.7907    0.6602        43
  ANATOMY     0.8676    0.8636    0.8656       220
 QUANTITY     0.8966    0.9398    0.9176        83
     SIGN     0.7358    0.7513    0.7435       189
LUNGFIELD     1.0000    0.6667    0.8000         6
TREATMENT     0.3571    0.5556    0.4348         9
 DIAMETER     0.9375    0.9375    0.9375        16
 BOUNDARY     1.0000    0.6250    0.7692         8
  TEXTURE     1.0000    0.8571    0.9231         7
   MARGIN     0.8333    0.8333    0.8333         6
    ORGAN     0.5882    0.7692    0.6667        13
  DENSITY     1.0000    1.0000    1.0000         8
   NATURE     1.0000    1.0000    1.0000         6
    SHAPE     0.5000    0.5714    0.5333         7

micro avg     0.7889    0.8245    0.8063       621
macro avg     0.8020    0.8245    0.8104       621

Predict

运行predict/predict_bio.py

MASS: Masked Sequence to Sequence Pre-training for Language Generation

MASS: Masked Sequence to Sequence Pre-training for Language Generation

Microsoft 1.1k Dec 17, 2022
Text-to-Speech for Belarusian language

title emoji colorFrom colorTo sdk app_file pinned Belarusian TTS 🐸 green green gradio app.py false Belarusian TTS 📢 🤖 Belarusian TTS (text-to-speec

Yurii Paniv 1 Nov 27, 2021
Syntax-aware Multi-spans Generation for Reading Comprehension (TASLP 2022)

SyntaxGen Syntax-aware Multi-spans Generation for Reading Comprehension (TASLP 2022) In this repo, we upload all the scripts for this work. Due to siz

Zhuosheng Zhang 3 Jun 13, 2022
Sentello is python script that simulates the anti-evasion and anti-analysis techniques used by malware.

sentello Sentello is a python script that simulates the anti-evasion and anti-analysis techniques used by malware. For techniques that are difficult t

Malwation 62 Oct 02, 2022
Python Implementation of ``Modeling the Influence of Verb Aspect on the Activation of Typical Event Locations with BERT'' (Findings of ACL: ACL 2021)

BERT-for-Surprisal Python Implementation of ``Modeling the Influence of Verb Aspect on the Activation of Typical Event Locations with BERT'' (Findings

7 Dec 05, 2022
Code for Findings of ACL 2022 Paper "Sentiment Word Aware Multimodal Refinement for Multimodal Sentiment Analysis with ASR Errors"

SWRM Code for Findings of ACL 2022 Paper "Sentiment Word Aware Multimodal Refinement for Multimodal Sentiment Analysis with ASR Errors" Clone Clone th

14 Jan 03, 2023
Crie tokens de autenticação íntegros e seguros com UToken.

UToken - Tokens seguros. UToken (ou Unhandleable Token) é uma bilioteca criada para ser utilizada na geração de tokens seguros e íntegros, ou seja, nã

Jaedson Silva 0 Nov 29, 2022
Extract rooms type, door, neibour rooms, rooms corners nad bounding boxes, and generate graph from rplan dataset

Housegan-data-reader House-GAN++ (data-reader) Code and instructions for converting rplan dataset (raster images) to housegan++ data format. House-GAN

Sepid Hosseini 13 Nov 24, 2022
Question answering app is used to answer for a user given question from user given text.

Question answering app is used to answer for a user given question from user given text.It is created using HuggingFace's transformer pipeline and streamlit python packages.

Siva Prakash 3 Apr 05, 2022
100+ Chinese Word Vectors 上百种预训练中文词向量

Chinese Word Vectors 中文词向量 中文 This project provides 100+ Chinese Word Vectors (embeddings) trained with different representations (dense and sparse),

embedding 10.4k Jan 09, 2023
Code for the paper: Sequence-to-Sequence Learning with Latent Neural Grammars

Code for the paper: Sequence-to-Sequence Learning with Latent Neural Grammars

Yoon Kim 43 Dec 23, 2022
FactSumm: Factual Consistency Scorer for Abstractive Summarization

FactSumm: Factual Consistency Scorer for Abstractive Summarization FactSumm is a toolkit that scores Factualy Consistency for Abstract Summarization W

devfon 83 Jan 09, 2023
Higher quality textures for the Metal Gear Solid series.

Metal Gear Solid: HD Textures Higher quality textures for the Metal Gear Solid series. The goal is to maximize the quality of assets that the engine w

Samantha 6 Dec 06, 2022
A Telegram bot to add notes to Flomo.

flomo bot 使用 Telegram 机器人发送笔记到你的 Flomo. 你需要有一台可访问 Telegram 的服务器。 Steps @BotFather 新建机器人,获取 token Flomo 官网获取 API,链接 https://flomoapp.com/mine?source=in

Zhen 44 Dec 30, 2022
Dense Passage Retriever - is a set of tools and models for open domain Q&A task.

Dense Passage Retrieval Dense Passage Retrieval (DPR) - is a set of tools and models for state-of-the-art open-domain Q&A research. It is based on the

Meta Research 1.1k Jan 07, 2023
This repository has a implementations of data augmentation for NLP for Japanese.

daaja This repository has a implementations of data augmentation for NLP for Japanese: EDA: Easy Data Augmentation Techniques for Boosting Performance

Koga Kobayashi 60 Nov 11, 2022
Unsupervised Abstract Reasoning for Raven’s Problem Matrices

Unsupervised Abstract Reasoning for Raven’s Problem Matrices This code is the implementation of our TIP paper. This is the first unsupervised abstract

Tao Zhuo 9 Dec 17, 2022
Snips Python library to extract meaning from text

Snips NLU Snips NLU (Natural Language Understanding) is a Python library that allows to extract structured information from sentences written in natur

Snips 3.7k Dec 30, 2022
Galois is an auto code completer for code editors (or any text editor) based on OpenAI GPT-2.

Galois is an auto code completer for code editors (or any text editor) based on OpenAI GPT-2. It is trained (finetuned) on a curated list of approximately 45K Python (~470MB) files gathered from the

Galois Autocompleter 91 Sep 23, 2022