Analyse Japanese ebooks using MeCab to determine the difficulty level for Japanese learners

Last update: Jul 23, 2022

japanese-ebook-analysis

The aim of this project is to make analysing the contents of a Japanese ebook easy and to streamline the process for non-technical users. You can analyse an ebook and see the following information:

The length of the book in words
The length of the book in characters
The number of unique words used in the book
The number of unique words that are only used once in the book
The percentage of unique words that are only used once
The number of unique characters used
The number of unique characters that are only used once
The percentage of unique characters that are only used once
A list of all the words used in the book as well as how often they are used
A list of all the characters used in the book as well as how often they are used

For text processing, we use MeCab.
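The statistics listed above can be sketched with a frequency counter. This is a minimal illustration and not the project's actual code: the function name `frequency_stats` and the sample word list are hypothetical, and the input is assumed to be already tokenised (in the project, the words would come from MeCab). Whether the project filters punctuation or other symbols before counting is not stated in this README, so the sample simply omits them.

```python
from collections import Counter


def frequency_stats(items):
    """Summarise a sequence of words or characters (hypothetical helper)."""
    counts = Counter(items)
    total = sum(counts.values())      # length in words or characters
    unique = len(counts)              # number of distinct items
    used_once = sum(1 for n in counts.values() if n == 1)
    # Guard against an empty input to avoid dividing by zero.
    percent_used_once = 100 * used_once / unique if unique else 0.0
    return {
        "total": total,
        "unique": unique,
        "used_once": used_once,
        "percent_used_once": percent_used_once,
        # (item, count) pairs, most frequent first
        "frequencies": counts.most_common(),
    }


# Hypothetical pre-tokenised input standing in for MeCab output.
words = ["猫", "が", "好き", "犬", "も", "好き"]

word_stats = frequency_stats(words)
# Iterating over a string yields its characters one at a time.
char_stats = frequency_stats("".join(words))
```

The same helper serves both the word-level and the character-level figures, because a Python string iterates character by character.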

Usage

Currently, the project is not deployed anywhere, so to use the service you will need to follow the steps in the Development section below to get the server running.

Upload a .epub file containing Japanese text to the server
The server will redirect you to a page showing information about the ebook. You can then click the 'See more details' button to see all the generated data, including a list of all the words used together with the number of occurrences of each word, and the same for the characters.

Development

Clone repository: git clone https://github.com/christofferaakre/japanese-ebook-analysis.git
Make sure you have MeCab set up on your system (only required if you will actually upload ebooks or run the analyse_epub.py script, which you will not need to do to contribute to other parts of the app). See http://www.robfahey.co.uk/blog/japanese-text-analysis-in-python/ for a good guide on how to set it up.
Install python dependencies: pip install -r requirements.txt
Install other dependencies (these all need to be in your system path):
- pandoc
Run ./app.py to start the Flask dev server
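Before starting the server, it can help to confirm that the prerequisites from the steps above are in place. The following is a rough sketch and not part of the repository: the function name `missing_prerequisites` is hypothetical, and it assumes the MeCab Python binding is importable under the module name `MeCab` (as with the mecab-python3 package), which this README does not confirm.

```python
import importlib.util
import shutil


def missing_prerequisites(commands=("pandoc",), modules=("MeCab",)):
    """Return a list describing prerequisites that could not be found.

    Hypothetical helper: the default module name "MeCab" is an assumption.
    """
    missing = []
    for command in commands:
        # shutil.which returns None when the executable is not on the PATH.
        if shutil.which(command) is None:
            missing.append(f"command: {command}")
    for module in modules:
        # find_spec returns None when a top-level module cannot be located.
        if importlib.util.find_spec(module) is None:
            missing.append(f"module: {module}")
    return missing


if __name__ == "__main__":
    problems = missing_prerequisites()
    if problems:
        print("Missing prerequisites:", ", ".join(problems))
    else:
        print("All prerequisites found.")
```

An empty result only shows that the tools can be located; it does not verify that MeCab has a working dictionary installed.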

Contributing

I'm very happy to receive any contributions! Before contributing, please have a look at CONTRIBUTING.md.

To see what needs work, have a look at the repo's Issues and Pull requests.

Feel free to submit your own issue or pull request about a new feature or anything else. When submitting a pull request, don't be afraid to modify any of the files; I'm not very attached to the coding style used in the repo.

Owner

Christoffer Aakre
