Nested Named Entity Recognition for Chinese Biomedical Text

Overview

CBio-NAMER

CBioNAMER (Nested nAMed Entity Recognition for Chinese Biomedical Text) is our method used in CBLUE (Chinese Biomedical Language Understanding Evaluation), a benchmark of Nested Named Entity Recognition. We got the 2nd price of the benchmark by 2021/12/07. Single model CBioNAMER also achieves top20 in CBLUE. The score of CBioNAMER has surpassed human(67.0 in F1-score​).

Result

Results of our method:

ensemble

Results of our single model CBioNAMER:

single

Approach

CBioNAMER is a sub-model in our result, which is based on GlobalPointer (a powerful open-source model, thanks for author, we rewrite it with Pytorch) and MacBert.

Usage

First, install PyTorch>=1.7.0. There's no restriction on GPU or CUDA.

Then, install this repo as a Python package:

$ pip install CBioNAMER

Python package transformers==4.6.1 would be automatically installed as well.

API

The CBioNAMER package provides the following methods:

CBioNAMER.load_NER(model_save_path='./checkpoint/macbert-large_dict.pth', maxlen=512, c_size=9, id2c=_id2c, c2c=_c2c)

Returns the pretrained model. It will download the model as necessary. The model would use the first CUDA device if there's any, otherwise using CPU instead.

The model_save_path argument specifies the path of the pretrained model weight.

The maxlen argument specifies the max length of input sentences. The sentences longer than maxlen would be cut off.

The c_size argument specifies the number of entity class. Here is 9 for CBLUE.

The id2c argument specifies the mapping between id and entity class. By default, the id2c argument for CBLUE is:

_id2c = {0: 'dis', 1: 'sym', 2: 'pro', 3: 'equ', 4: 'dru', 5: 'ite', 6: 'bod', 7: 'dep', 8: 'mic'}

The c2c argument specifies the mapping between entity class and its Chinese meaning. By default, the c2c argument for CBLUE is:

_c2c = {'dis': "疾病", 'sym': "临床表现", 'pro': "医疗程序", 'equ': "医疗设备", 'dru': "药物", 'ite': "医学检验项目", 'bod': "身体", 'dep': "科室", 'mic': "微生物类"}


The model returned by CBioNAMER.load_NER() supports the following methods:

model.recognize(text: str, threshold=0)

Given a sentence, returns a list of dictionaries with recognized entity, the format of the dictionary is {'start_idx': entity's starting index, 'end_idx': entity's ending index, 'type': entity class, 'Chinese_type': Chinese meaning of entity class, 'entity': recognized entity}. The threshold argument specifies that the returned list only contains the recognized entity with confidence score higher than threshold.

model.predict_to_file(in_file: str, out_file: str)

Given input and output .json file path, the model would do inference according in_file, and the recognized entity would be saved in out_file. The output file can be submitted to CBLUE. The format of input file is like:

[
  {
    "text": "该技术的应用使某些遗传病的诊治水平得到显著提高。"
  },
    ...
  {
    "text": "There is a sentence."
  }
]

Examples

import CBioNAMER

NER = CBioNAMER.load_NER()
in_file = './CMeEE_test.json'
out_file = './CMeEE_test_answer.json'
NER.predict_to_file(in_file, out_file)
import CBioNAMER

NER = CBioNAMER.load_NER()
text = "该技术的应用使某些遗传病的诊治水平得到显著提高。"
recognized_entity = NER.recognize(text)
print(recognized_entity)
# output:[{'start_idx': 9, 'end_idx': 11, 'type': 'dis', 'Chinese_type': '疾病', 'entity': '遗传病'}]
You might also like...
Bidirectional LSTM-CRF and ELMo for Named-Entity Recognition, Part-of-Speech Tagging and so on.
Bidirectional LSTM-CRF and ELMo for Named-Entity Recognition, Part-of-Speech Tagging and so on.

anaGo anaGo is a Python library for sequence labeling(NER, PoS Tagging,...), implemented in Keras. anaGo can solve sequence labeling tasks such as nam

Named-entity recognition using neural networks. Easy-to-use and state-of-the-art results.

NeuroNER NeuroNER is a program that performs named-entity recognition (NER). Website: neuroner.com. This page gives step-by-step instructions to insta

Pytorch-Named-Entity-Recognition-with-BERT
Pytorch-Named-Entity-Recognition-with-BERT

BERT NER Use google BERT to do CoNLL-2003 NER ! Train model using Python and Inference using C++ ALBERT-TF2.0 BERT-NER-TENSORFLOW-2.0 BERT-SQuAD Requi

Tool to add main subject to items on Wikidata using a WMFs CirrusSearch for named entity recognition or a manually supplied list of QIDs
Tool to add main subject to items on Wikidata using a WMFs CirrusSearch for named entity recognition or a manually supplied list of QIDs

ItemSubjector Tool made to add main subject statements to items based on the title using a home-brewed CirrusSearch-based Named Entity Recognition alg

Implemented shortest-circuit disambiguation, maximum probability disambiguation, HMM-based lexical annotation and BiLSTM+CRF-based named entity recognition
Implemented shortest-circuit disambiguation, maximum probability disambiguation, HMM-based lexical annotation and BiLSTM+CRF-based named entity recognition

Implemented shortest-circuit disambiguation, maximum probability disambiguation, HMM-based lexical annotation and BiLSTM+CRF-based named entity recognition

Use Google's BERT for named entity recognition (CoNLL-2003 as the dataset).

For better performance, you can try NLPGNN, see NLPGNN for more details. BERT-NER Version 2 Use Google's BERT for named entity recognition (CoNLL-2003

Named Entity Recognition API used by TEI Publisher

TEI Publisher Named Entity Recognition API This repository contains the API used by TEI Publisher's web-annotation editor to detect entities in the in

RoNER is a Named Entity Recognition model based on a pre-trained BERT transformer model trained on RONECv2

RoNER RoNER is a Named Entity Recognition model based on a pre-trained BERT transformer model trained on RONECv2. It is meant to be an easy to use, hi

Spacy-ginza-ner-webapi - Named Entity Recognition API with spaCy and GiNZA
Spacy-ginza-ner-webapi - Named Entity Recognition API with spaCy and GiNZA

Named Entity Recognition API with spaCy and GiNZA I wrote a blog post about this

Releases(v0.0.1)
A python framework to transform natural language questions to queries in a database query language.

__ _ _ _ ___ _ __ _ _ / _` | | | |/ _ \ '_ \| | | | | (_| | |_| | __/ |_) | |_| | \__, |\__,_|\___| .__/ \__, | |_| |_| |___/

Machinalis 1.2k Dec 18, 2022
RIDE automatically creates the package and boilerplate OOP Python node scripts as per your needs

RIDE: ROS IDE RIDE automatically creates the package and boilerplate OOP Python code for nodes as per your needs (RIDE is not an IDE, but even ROS isn

Jash Mota 20 Jul 14, 2022
Simple Text-Generator with OpenAI gpt-2 Pytorch Implementation

GPT2-Pytorch with Text-Generator Better Language Models and Their Implications Our model, called GPT-2 (a successor to GPT), was trained simply to pre

Tae-Hwan Jung 775 Jan 08, 2023
Repository for Project Insight: NLP as a Service

Project Insight NLP as a Service Contents Introduction Features Installation Setup and Documentation Project Details Demonstration Directory Details H

Abhishek Kumar Mishra 286 Dec 06, 2022
Simple, Fast, Powerful and Easily extensible python package for extracting patterns from text, with over than 60 predefined Regular Expressions.

patterns-finder Simple, Fast, Powerful and Easily extensible python package for extracting patterns from text, with over than 60 predefined Regular Ex

22 Dec 19, 2022
Open-source offline translation library written in Python. Uses OpenNMT for translations

Open source neural machine translation in Python. Designed to be used either as a Python library or desktop application. Uses OpenNMT for translations and PyQt for GUI.

Argos Open Tech 1.6k Jan 01, 2023
Image2pcl - Enter the metaverse with 2D image to 3D projections

Image2PCL Enter the metaverse with 2D image to 3D projections! This is an implem

Benjamin Ho 0 Feb 05, 2022
Winner system (DAMO-NLP) of SemEval 2022 MultiCoNER shared task over 10 out of 13 tracks.

KB-NER: a Knowledge-based System for Multilingual Complex Named Entity Recognition The code is for the winner system (DAMO-NLP) of SemEval 2022 MultiC

116 Dec 27, 2022
Chinese Pre-Trained Language Models (CPM-LM) Version-I

CPM-Generate 为了促进中文自然语言处理研究的发展,本项目提供了 CPM-LM (2.6B) 模型的文本生成代码,可用于文本生成的本地测试,并以此为基础进一步研究零次学习/少次学习等场景。[项目首页] [模型下载] [技术报告] 若您想使用CPM-1进行推理,我们建议使用高效推理工具BMI

Tsinghua AI 1.4k Jan 03, 2023
An easy to use, user-friendly and efficient code for extracting OpenAI CLIP (Global/Grid) features from image and text respectively.

Extracting OpenAI CLIP (Global/Grid) Features from Image and Text This repo aims at providing an easy to use and efficient code for extracting image &

Jianjie(JJ) Luo 13 Jan 06, 2023
Input english text, then translate it between languages n times using the Deep Translator Python Library.

mass-translator About Input english text, then translate it between languages n times using the Deep Translator Python Library. How to Use Install dep

2 Mar 04, 2022
SHAS: Approaching optimal Segmentation for End-to-End Speech Translation

SHAS: Approaching optimal Segmentation for End-to-End Speech Translation In this repo you can find the code of the Supervised Hybrid Audio Segmentatio

Machine Translation @ UPC 21 Dec 20, 2022
open-information-extraction-system, build open-knowledge-graph(SPO, subject-predicate-object) by pyltp(version==3.4.0)

中文开放信息抽取系统, open-information-extraction-system, build open-knowledge-graph(SPO, subject-predicate-object) by pyltp(version==3.4.0)

7 Nov 02, 2022
Large-scale Knowledge Graph Construction with Prompting

Large-scale Knowledge Graph Construction with Prompting across tasks (predictive and generative), and modalities (language, image, vision + language, etc.)

ZJUNLP 161 Dec 28, 2022
Pipeline for training LSA models using Scikit-Learn.

Latent Semantic Analysis Pipeline for training LSA models using Scikit-Learn. Usage Instead of writing custom code for latent semantic analysis, you j

Dani El-Ayyass 23 Sep 05, 2022
ChatBotProyect - This is an unfinished project about a simple chatbot.

chatBotProyect This is an unfinished project about a simple chatbot. (union_todo.ipynb) Reminders for the project: Find why one of the vectorizers fai

Tomás 0 Jul 24, 2022
Training code for Korean multi-class sentiment analysis

KoSentimentAnalysis Bert implementation for the Korean multi-class sentiment analysis 왜 한국어 감정 다중분류 모델은 거의 없는 것일까?에서 시작된 프로젝트 Environment: Pytorch, Da

Donghoon Shin 3 Dec 02, 2022
Natural Language Processing Tasks and Examples.

Natural Language Processing Tasks and Examples With the advancement of A.I. technology in recent years, natural language processing technology has bee

Soohwan Kim 53 Dec 20, 2022
The ability of computer software to identify words and phrases in spoken language and convert them to human-readable text

speech-recognition-py Speech recognition is the ability of computer software to identify words and phrases in spoken language and convert them to huma

Deepangshi 1 Apr 03, 2022
Natural Language Processing library built with AllenNLP 🌲🌱

Custom Natural Language Processing with big and small models 🌲🌱

Recognai 65 Sep 13, 2022