100+ Chinese Word Vectors 上百种预训练中文词向量

Overview

Chinese Word Vectors 中文词向量

This project provides 100+ Chinese Word Vectors (embeddings) trained with different representations (dense and sparse), context features (word, ngram, character, and more), and corpora. One can easily obtain pre-trained vectors with different properties and use them for downstream tasks.

Moreover, we provide a Chinese analogical reasoning dataset CA8 and an evaluation toolkit for users to evaluate the quality of their word vectors.

Reference

Please cite the following paper if you use these embeddings or the CA8 dataset.

Shen Li, Zhe Zhao, Renfen Hu, Wensi Li, Tao Liu, Xiaoyong Du, Analogical Reasoning on Chinese Morphological and Semantic Relations, ACL 2018.

@InProceedings{P18-2023,
  author =  "Li, Shen
    and Zhao, Zhe
    and Hu, Renfen
    and Li, Wensi
    and Liu, Tao
    and Du, Xiaoyong",
  title =   "Analogical Reasoning on Chinese Morphological and Semantic Relations",
  booktitle =   "Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)",
  year =  "2018",
  publisher =   "Association for Computational Linguistics",
  pages =   "138--143",
  location =  "Melbourne, Australia",
  url =   "http://aclweb.org/anthology/P18-2023"
}

 

A detailed analysis of the relation between the intrinsic and extrinsic evaluations of Chinese word embeddings is shown in the paper:

Yuanyuan Qiu, Hongzheng Li, Shen Li, Yingdi Jiang, Renfen Hu, Lijiao Yang. Revisiting Correlations between Intrinsic and Extrinsic Evaluations of Word Embeddings. Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data. Springer, Cham, 2018. 209-221. (CCL & NLP-NABD 2018 Best Paper)

@incollection{qiu2018revisiting,
  title={Revisiting Correlations between Intrinsic and Extrinsic Evaluations of Word Embeddings},
  author={Qiu, Yuanyuan and Li, Hongzheng and Li, Shen and Jiang, Yingdi and Hu, Renfen and Yang, Lijiao},
  booktitle={Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data},
  pages={209--221},
  year={2018},
  publisher={Springer}
}

Format

The pre-trained vector files are in text format. Each line contains a word followed by its vector, with values separated by spaces. The first line records the meta information: the first number is the number of words in the file and the second is the vector dimension.
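
A minimal sketch of reading this format in plain Python is given below. The function name `load_dense_vectors` and the in-memory sample are illustrative assumptions, not part of the project; for a real file you would pass a handle opened with `open(path, encoding="utf-8")`.

```python
import io

def load_dense_vectors(handle, limit=None):
    """Read word vectors from a text handle in the format described above."""
    # Header line: "<number of words> <dimension>"
    header = handle.readline().split()
    vocab_size, dim = int(header[0]), int(header[1])
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) <= dim:
            continue  # skip empty or malformed lines
        # The last `dim` fields are the values; everything before is the token.
        word = " ".join(parts[:-dim])
        vectors[word] = [float(x) for x in parts[-dim:]]
        if limit is not None and len(vectors) >= limit:
            break
    return vocab_size, dim, vectors

# Hypothetical two-word, three-dimensional sample standing in for a real file.
sample = io.StringIO("2 3\n中国 0.1 0.2 0.3\n北京 -0.4 0.5 0.6\n")
vocab_size, dim, vectors = load_dense_vectors(sample)
```

The `limit` argument is a convenience for loading only the first few entries of a large file.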

Besides dense word vectors (trained with SGNS), we also provide sparse vectors (trained with PPMI). They use the same format as liblinear, where the number before ":" denotes the dimension index and the number after ":" denotes the value.
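
The sketch below parses one sparse line into a dictionary. It assumes that the first field of each line is the word and that the remaining fields are `index:value` pairs; the function name `parse_sparse_line` and the sample line are hypothetical, so check a real file before relying on this layout.

```python
def parse_sparse_line(line):
    """Split one sparse-vector line into (word, {dimension index: value})."""
    fields = line.split()
    # Assumption: the leading field is the word, as the label is in liblinear.
    word, pairs = fields[0], fields[1:]
    vector = {}
    for pair in pairs:
        index, value = pair.split(":")
        vector[int(index)] = float(value)
    return word, vector

# Hypothetical sample line, not taken from a released file.
word, vector = parse_sparse_line("中国 3:0.5 17:1.25 42:2.0")
```

Storing only the non-zero dimensions keeps memory use proportional to the number of observed features rather than to the full feature space.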

Pre-trained Chinese Word Vectors

Basic Settings

                                       
| Window Size | Dynamic Window | Sub-sampling | Low-Frequency Word | Iteration | Negative Sampling* |
| --- | --- | --- | --- | --- | --- |
| 5 | Yes | 1e-5 | 10 | 5 | 5 |

*Only for SGNS.

Various Domains

Chinese Word Vectors trained with different representations, context features, and corpora.

Word2vec / Skip-Gram with Negative Sampling (SGNS)

| Corpus \ Context Features | Word | Word + Ngram | Word + Character | Word + Character + Ngram |
| --- | --- | --- | --- | --- |
| Baidu Encyclopedia 百度百科 | 300d | 300d | 300d | 300d / PWD: 5555 |
| Wikipedia_zh 中文维基百科 | 300d | 300d | 300d | 300d |
| People's Daily News 人民日报 | 300d | 300d | 300d | 300d |
| Sogou News 搜狗新闻 | 300d | 300d | 300d | 300d |
| Financial News 金融新闻 | 300d | 300d | 300d | 300d |
| Zhihu_QA 知乎问答 | 300d | 300d | 300d | 300d |
| Weibo 微博 | 300d | 300d | 300d | 300d |
| Literature 文学作品 | 300d | 300d / PWD: z5b4 | 300d | 300d / PWD: yenb |
| Complete Library in Four Sections 四库全书* | 300d | 300d | NAN | NAN |
| Mixed-large 综合 (Baidu Netdisk / Google Drive) | 300d / 300d | 300d / 300d | 300d / 300d | 300d / 300d |

Positive Pointwise Mutual Information (PPMI)

| Corpus \ Context Features | Word | Word + Ngram | Word + Character | Word + Character + Ngram |
| --- | --- | --- | --- | --- |
| Baidu Encyclopedia 百度百科 | Sparse | Sparse | Sparse | Sparse |
| Wikipedia_zh 中文维基百科 | Sparse | Sparse | Sparse | Sparse |
| People's Daily News 人民日报 | Sparse | Sparse | Sparse | Sparse |
| Sogou News 搜狗新闻 | Sparse | Sparse | Sparse | Sparse |
| Financial News 金融新闻 | Sparse | Sparse | Sparse | Sparse |
| Zhihu_QA 知乎问答 | Sparse | Sparse | Sparse | Sparse |
| Weibo 微博 | Sparse | Sparse | Sparse | Sparse |
| Literature 文学作品 | Sparse | Sparse | Sparse | Sparse |
| Complete Library in Four Sections 四库全书* | Sparse | Sparse | NAN | NAN |
| Mixed-large 综合 | Sparse | Sparse | Sparse | Sparse |

*Character embeddings are provided, since most Hanzi are words in archaic Chinese.

Various Co-occurrence Information

We release word vectors trained on different co-occurrence statistics. Target and context vectors are often called input and output vectors in related papers.

In this part, one can obtain vectors of arbitrary linguistic units beyond words. For example, character vectors are found in the context vectors of the word-character setting.

All vectors are trained by SGNS on Baidu Encyclopedia.

                                                       
| Feature | Co-occurrence Type | Target Word Vectors | Context Word Vectors |
| --- | --- | --- | --- |
| Word | Word → Word | 300d | 300d |
| Ngram | Word → Ngram (1-2) | 300d | 300d |
| | Word → Ngram (1-3) | 300d | 300d |
| | Ngram (1-2) → Ngram (1-2) | 300d | 300d |
| Character | Word → Character (1) | 300d | 300d |
| | Word → Character (1-2) | 300d | 300d |
| | Word → Character (1-4) | 300d | 300d |
| Radical | Radical | 300d | 300d |
| Position | Word → Word (left/right) | 300d | 300d |
| | Word → Word (distance) | 300d | 300d |
| Global | Word → Text | 300d | 300d |
| Syntactic Feature | Word → POS | 300d | 300d |
| | Word → Dependency | 300d | 300d |

Representations

Existing word representation methods fall into one of two classes: dense and sparse representations. The SGNS model (a model in the word2vec toolkit) and the PPMI model are typical methods of these two classes, respectively. The SGNS model trains low-dimensional real-valued (dense) vectors through a shallow neural network; it is also called a neural embedding method. The PPMI model is a sparse bag-of-features representation weighted by the positive pointwise mutual information (PPMI) weighting scheme.
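
To make the PPMI weighting concrete, the sketch below computes PPMI(w, c) = max(0, log(P(w, c) / (P(w) P(c)))) from raw co-occurrence counts. It shows the textbook definition only; the toy counts and the function name `ppmi` are illustrative assumptions, and the released vectors may have been built with additional options (such as smoothing) that are not shown here.

```python
import math
from collections import Counter

def ppmi(pair_counts):
    """Map {(word, context): count} to {(word, context): PPMI weight}."""
    total = sum(pair_counts.values())
    word_totals, context_totals = Counter(), Counter()
    for (w, c), n in pair_counts.items():
        word_totals[w] += n
        context_totals[c] += n
    weights = {}
    for (w, c), n in pair_counts.items():
        # PMI = log( P(w, c) / (P(w) * P(c)) ), written with raw counts.
        pmi = math.log(n * total / (word_totals[w] * context_totals[c]))
        if pmi > 0:
            weights[(w, c)] = pmi  # non-positive values are dropped (sparse)
    return weights

# Toy co-occurrence counts, invented for illustration.
counts = {("猫", "鱼"): 4, ("猫", "车"): 1, ("司机", "车"): 4, ("司机", "鱼"): 1}
weights = ppmi(counts)
```

Only pairs that co-occur more often than chance keep a weight, which is why the resulting vectors are sparse.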

Context Features

Three context features are commonly used in the word embedding literature: word, ngram, and character. Most word representation methods essentially exploit word-word co-occurrence statistics, that is, they use words as context features (word feature). Inspired by the language modeling problem, we introduce ngram features into the context: both word-word and word-ngram co-occurrence statistics are used for training (ngram feature). In Chinese, characters (Hanzi) often convey strong semantics, so we also use word-word and word-character co-occurrence statistics for learning word vectors. The length of character-level ngrams ranges from 1 to 4 (character feature).
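
As a small illustration of the character feature, the sketch below enumerates the character-level ngrams of lengths 1 to 4 for a single word. The helper name `char_ngrams` is hypothetical, and how the toolkit pairs these ngrams with target words is not shown.

```python
def char_ngrams(word, min_n=1, max_n=4):
    """Return all character ngrams of `word` with length min_n..max_n."""
    return [
        word[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(word) - n + 1)
    ]

# A four-character word yields 4 + 3 + 2 + 1 = 10 ngrams.
features = char_ngrams("中华人民")
```

Words shorter than `max_n` simply produce fewer ngrams, so no padding is needed.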

Besides word, ngram, and character, other features also have a substantial influence on the properties of word vectors. For example, using the entire text as a context feature can introduce more topic information into word vectors, and using dependency parses as context features can add syntactic constraints to word vectors. 17 co-occurrence types are considered in this project.

Corpus

We made great efforts to collect corpora across various domains. All text data are preprocessed by removing HTML and XML tags. Only the plain text is kept, and HanLP (v_1.5.3) is used for word segmentation. In addition, traditional Chinese characters are converted into simplified characters with Open Chinese Convert (OpenCC). Detailed corpus information is listed below:

| Corpus | Size | Tokens | Vocabulary Size | Description |
| --- | --- | --- | --- | --- |
| Baidu Encyclopedia 百度百科 | 4.1G | 745M | 5422K | Chinese Encyclopedia data from https://baike.baidu.com/ |
| Wikipedia_zh 中文维基百科 | 1.3G | 223M | 2129K | Chinese Wikipedia data from https://dumps.wikimedia.org/ |
| People's Daily News 人民日报 | 3.9G | 668M | 1664K | News data from People's Daily (1946-2017) http://data.people.com.cn/ |
| Sogou News 搜狗新闻 | 3.7G | 649M | 1226K | News data provided by Sogou labs http://www.sogou.com/labs/ |
| Financial News 金融新闻 | 6.2G | 1055M | 2785K | Financial news collected from multiple news websites |
| Zhihu_QA 知乎问答 | 2.1G | 384M | 1117K | Chinese QA data from https://www.zhihu.com/ |
| Weibo 微博 | 0.73G | 136M | 850K | Chinese microblog data provided by NLPIR Lab http://www.nlpir.org/wordpress/download/weibo.7z |
| Literature 文学作品 | 0.93G | 177M | 702K | 8599 modern Chinese literature works |
| Mixed-large 综合 | 22.6G | 4037M | 10653K | We build the large corpus by merging the above corpora. |
| Complete Library in Four Sections 四库全书 | 1.5G | 714M | 21.8K | The largest collection of texts in pre-modern China. |

All words are counted, including low-frequency words.

Toolkits

All word vectors are trained with the ngram2vec toolkit. The ngram2vec toolkit is a superset of the word2vec and fasttext toolkits, in which arbitrary context features and models are supported.

Chinese Word Analogy Benchmarks

The quality of word vectors is often evaluated with analogy question tasks. In this project, two benchmarks are used for evaluation. The first is CA-translated, where most analogy questions are directly translated from an English benchmark. Although CA-translated has been widely used in many Chinese word embedding papers, it contains questions for only three semantic relations and covers 134 Chinese words. In contrast, CA8 is specifically designed for the Chinese language. It contains 17813 analogy questions and covers comprehensive morphological and semantic relations. CA-translated, CA8, and their detailed descriptions are provided in the testsets folder.
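
For readers new to analogy questions ("a is to b as c is to ?"), the sketch below answers one with the common vector-offset method: it picks the word whose vector is closest in cosine similarity to b - a + c. This illustrates the task only and is not necessarily the scoring used by the evaluation toolkit; the toy vectors and the names `cosine` and `solve_analogy` are assumptions made for this example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def solve_analogy(vectors, a, b, c):
    """Return the word closest to vec(b) - vec(a) + vec(c), excluding a, b, c."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# Toy 3-dimensional vectors, invented for illustration.
toy = {
    "男人": [1.0, 0.0, 0.0],
    "女人": [1.0, 1.0, 0.0],
    "国王": [1.0, 0.0, 1.0],
    "王后": [1.0, 1.0, 1.0],
    "苹果": [0.0, 0.2, -1.0],
}
answer = solve_analogy(toy, "男人", "女人", "国王")
```

Excluding the three question words from the candidates is a standard convention, since the offset vector is often closest to one of its own inputs.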

Evaluation Toolkit

We provide an evaluation toolkit in the evaluation folder.

Run the following commands to evaluate dense vectors.

$ python ana_eval_dense.py -v <vector.txt> -a CA8/morphological.txt
$ python ana_eval_dense.py -v <vector.txt> -a CA8/semantic.txt

Run the following commands to evaluate sparse vectors.

$ python ana_eval_sparse.py -v <vector.txt> -a CA8/morphological.txt
$ python ana_eval_sparse.py -v <vector.txt> -a CA8/semantic.txt