SimpleChinese2 集成了许多基本的中文NLP功能,使基于 Python 的中文文字处理和信息提取变得简单方便。

Overview

SimpleChinese2

SimpleChinese2 集成了许多基本的中文NLP功能,使基于 Python 的中文文字处理和信息提取变得简单方便。

声明

本项目是为方便个人工作所创建的,仅有部分代码原创。包括分词、词云在内的诸多功能来自于其他项目,并非本人所写,如遇问题,请至原项目链接下提问,谢谢!

安装

pip install -U simplechinese==0.2.8

如从 git 上 clone,需要从以下地址下载词向量文件:

https://drive.google.com/file/d/1ltyiTHZk8kIBYQGbZS9GoO_DwDOEWnL9/view?usp=sharing

并拷贝至"./simplechinese/data/"文件夹下

使用方法

import simplechinese as sc

1. 文字预处理

>> print(sc.only_digits(x)) # 仅保留数字 01234 >>> print(sc.only_zh(x)) # 仅保留中文 测试测试测试测试 >>> print(sc.only_en(x)) # 仅保留英文 TestING >>> print(sc.remove_space(x)) # 去除空格 测试测试,TestING;¥%&01234测试测试 >>> print(sc.remove_digits(x)) # 去除数字 测试测试,TestING ;¥%& 测试测试 >>> print(sc.remove_zh(x)) # 去除中文 ,TestING ;¥%& 01234 >>> print(sc.remove_en(x)) # 去除英文 测试测试, ;¥%& 01234测试测试 >>> print(sc.remove_punctuations(x)) # 去除标点符号 测试测试TestING 01234测试测试 >>> print(sc.toLower(x)) # 修改为全小写字母 测试测试,testing ;¥%& 01234测试测试 >>> print(sc.toUpper(x)) # 修改为全大写字母 测试测试,TESTING ;¥%& 01234测试测试 >>> x = "测试,TestING:12345@#【】+=-()。." >>> print(sc.punc_norm(x)) # 将中文标点符号转换成英文标点符号 测试,TestING:12345@#[]+=-().. >>> # y = fillna(df) # 将pandas.DataFrame中的N/A单元格填充为长度为0的str ">
>>> x = "测试测试,TestING    ;¥%& 01234测试测试"

>>> print(sc.only_digits(x))         # 仅保留数字
01234

>>> print(sc.only_zh(x))             # 仅保留中文
测试测试测试测试

>>> print(sc.only_en(x))             # 仅保留英文
TestING

>>> print(sc.remove_space(x))        # 去除空格
测试测试,TestING;¥%&01234测试测试

>>> print(sc.remove_digits(x))       # 去除数字
测试测试,TestING    ;¥%& 测试测试

>>> print(sc.remove_zh(x))           # 去除中文
,TestING    ;¥%& 01234

>>> print(sc.remove_en(x))           # 去除英文
测试测试,    ;¥%& 01234测试测试

>>> print(sc.remove_punctuations(x)) # 去除标点符号
测试测试TestING     01234测试测试

>>> print(sc.toLower(x))             # 修改为全小写字母
测试测试,testing    ;¥%& 01234测试测试

>>> print(sc.toUpper(x))             # 修改为全大写字母
测试测试,TESTING    ;¥%& 01234测试测试

>>> x = "测试,TestING:12345@#【】+=-()。."
>>> print(sc.punc_norm(x))           # 将中文标点符号转换成英文标点符号
测试,TestING:12345@#[]+=-()..

>>> # y = fillna(df) # 将pandas.DataFrame中的N/A单元格填充为长度为0的str

2. 基础NLP信息提取功能

该部分中,分词功能使用 jieba 实现,源码请参考:https://github.com/fxsjy/jieba

同/近义词查找功能复用了 synonyms 中的词向量数据文件,源码请参考:https://github.com/chatopera/Synonyms 但有所改动,改动如下

  1. 由于 pip 上传文件限制,synonyms 需要用户在完成 pip 安装后再下载词向量文件,国内下载需要设置镜像地址或使用特殊手段,有所不便。因此此处将词向量用 float16 表示,并使用 pca 降维至 64 维。总体效果差别不大,如果在意,请直接安装 synonyms 处理同/近义词查找任务。

  2. 原项目通过构建 KDTree 实现快速查找,但比较相似度是使用 cosine similarity,而 KDTree (sklearn) 本身不支持通过 cosine similarity 构建。因此原项目使用欧式距离构建树,导致输出结果有部分顺序混乱。为修复该问题,本项目将词向量归一化后再构建 KDTree,使得向量间的 cosine similarity 与欧式距离(即割线距离)正相关。具体推导可参考下文:https://stackoverflow.com/questions/34144632/using-cosine-distance-with-scikit-learn-kneighborsclassifier

  3. 原项目中未设置缓存上限,本项目中仅保留最近10000次查找记录。

x = "今天是我参加工作的第1天,我花了23.33元买了写零食犒劳一下自己。"
print(sc.extract_nums(x))              # 提取数字信息
[1.0, 23.33]

# mode: 0: No single character words. The words may be overlapped.
#       1: Have single character words. The words may be overlapped.
#       2: No single character words. The words are not overlapped.
#       3: Have single character words. The words are not overlapped.
#       4: Only single characters.
print(sc.extract_words(x, mode=0))      # 分词
['今天', '参加', '工作', '我花', '23.33', '零食', '犒劳', '一下', '自己']

a = "做人真的好难"
b = "做人实在太难了"
print(sc.string_distance(a,b))  # 编辑距离
0.46153846153846156

x = "种族歧视"
print(sc.find_synonyms(x, n=3))  # 同/近义词
[('种族歧视', 1.0), ('种族主义', 0.84619140625), ('歧视', 0.76416015625)]

3. 繁体简体转换

该部分使用 chinese_converter 实现,源码请参考:https://github.com/zachary822/chinese-converter

>> print(sc.to_traditional(x)) # 转换为繁体 烏龜測試123 >>> x = "烏龜測試123" >>> print(sc.to_simplified(x)) # 转换为简体 乌龟测试123 ">
>>> x = "乌龟测试123"
>>> print(sc.to_traditional(x))  # 转换为繁体
烏龜測試123

>>> x = "烏龜測試123"
>>> print(sc.to_simplified(x))   # 转换为简体
乌龟测试123

4. 特征提取和向量化

5. 词云和可视化

TODO:

  1. 句子向量化及句子相似度
  2. 其他特征提取相关工具
Owner
Ming
惊了
Ming
Code for our ACL 2021 paper - ConSERT: A Contrastive Framework for Self-Supervised Sentence Representation Transfer

ConSERT Code for our ACL 2021 paper - ConSERT: A Contrastive Framework for Self-Supervised Sentence Representation Transfer Requirements torch==1.6.0

Yan Yuanmeng 478 Dec 25, 2022
Prithivida 690 Jan 04, 2023
This is the Alpha of Nutte language, she is not complete yet / Essa é a Alpha da Nutte language, não está completa ainda

nutte-language This is the Alpha of Nutte language, it is not complete yet / Essa é a Alpha da Nutte language, não está completa ainda My language was

catdochrome 2 Dec 18, 2021
GPT-3: Language Models are Few-Shot Learners

GPT-3: Language Models are Few-Shot Learners arXiv link Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-trainin

OpenAI 12.5k Jan 05, 2023
CCQA A New Web-Scale Question Answering Dataset for Model Pre-Training

CCQA: A New Web-Scale Question Answering Dataset for Model Pre-Training This is the official repository for the code and models of the paper CCQA: A N

Meta Research 29 Nov 30, 2022
Code for CodeT5: a new code-aware pre-trained encoder-decoder model.

CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation This is the official PyTorch implementation

Salesforce 564 Jan 08, 2023
Machine learning models from Singapore's NLP research community

SG-NLP Machine learning models from Singapore's natural language processing (NLP) research community. sgnlp is a Python package that allows you to eas

AI Singapore | AI Makerspace 21 Dec 17, 2022
Baseline code for Korean open domain question answering(ODQA)

Open-Domain Question Answering(ODQA)는 다양한 주제에 대한 문서 집합으로부터 자연어 질의에 대한 답변을 찾아오는 task입니다. 이때 사용자 질의에 답변하기 위해 주어지는 지문이 따로 존재하지 않습니다. 따라서 사전에 구축되어있는 Knowl

VUMBLEB 69 Nov 04, 2022
Creating a chess engine using GPT-3

GPT3Chess Creating a chess engine using GPT-3 Code for my article : https://towardsdatascience.com/gpt-3-play-chess-d123a96096a9 My game (white) vs GP

19 Dec 17, 2022
VampiresVsWerewolves - Our Implementation of a MiniMax algorithm with alpha beta pruning in the context of an in-class competition

VampiresVsWerewolves Our Implementation of a MiniMax algorithm with alpha beta pruning in the context of an in-class competition. Our Algorithm finish

Shawn 1 Jan 21, 2022
내부 작업용 django + vue(vuetify) boilerplate. 짠 하면 돌아감.

Pocket Galaxy 아주 간단한 개인용, 혹은 내부용 툴을 만들어야하는데 이왕이면 웹이 편하죠? 그럴때를 위해 만들어둔 django와 vue(vuetify)로 이뤄진 boilerplate 입니다. 각 폴더에 있는 설명서대로 실행을 시키면 일단 당장 뭔가가 돌아갑니

Jamie J. Seol 16 Dec 03, 2021
Easy to use, state-of-the-art Neural Machine Translation for 100+ languages

EasyNMT - Easy to use, state-of-the-art Neural Machine Translation This package provides easy to use, state-of-the-art machine translation for more th

Ubiquitous Knowledge Processing Lab 748 Jan 06, 2023
Package for controllable summarization

summarizers summarizers is package for controllable summarization based CTRLsum. currently, we only supports English. It doesn't work in other languag

Hyunwoong Ko 72 Dec 07, 2022
Yodatranslator is a simple translator English to Yoda-language

yodatranslator Overview yodatranslator is a simple translator English to Yoda-language. Project is created for educational purposes. It is intended to

1 Nov 11, 2021
🦅 Pretrained BigBird Model for Korean (up to 4096 tokens)

Pretrained BigBird Model for Korean What is BigBird • How to Use • Pretraining • Evaluation Result • Docs • Citation 한국어 | English What is BigBird? Bi

Jangwon Park 183 Dec 14, 2022
中文空间语义理解评测

中文空间语义理解评测 最新消息 2021-04-10 🚩 排行榜发布: Leaderboard 2021-04-05 基线系统发布: SpaCE2021-Baseline 2021-04-05 开放数据提交: 提交结果 2021-04-01 开放报名: 我要报名 2021-04-01 数据集 pa

40 Jan 04, 2023
A tool helps build a talk preview image by combining the given background image and talk event description

talk-preview-img-builder A tool helps build a talk preview image by combining the given background image and talk event description Installation and U

PyCon Taiwan 4 Aug 20, 2022
SummerTime - Text Summarization Toolkit for Non-experts

A library to help users choose appropriate summarization tools based on their specific tasks or needs. Includes models, evaluation metrics, and datasets.

Yale-LILY 213 Jan 04, 2023
Pattern Matching in Python

Pattern Matching finalmente chega no Python 3.10. E daí? "Pattern matching", ou "correspondência de padrões" como é conhecido no Brasil. Algumas pesso

Fabricio Werneck 6 Feb 16, 2022
Code for ACL 2022 main conference paper "STEMM: Self-learning with Speech-text Manifold Mixup for Speech Translation".

STEMM: Self-learning with Speech-Text Manifold Mixup for Speech Translation This is a PyTorch implementation for the ACL 2022 main conference paper ST

ICTNLP 29 Oct 16, 2022