A Japanese tokenizer based on recurrent neural networks

Overview


Nagisa is a Python module for Japanese word segmentation and POS-tagging. It is designed to be a simple, easy-to-use tool.

This tool has the following features.

  • Based on recurrent neural networks.
  • The word segmentation model uses character- and word-level features [池田+].
  • The POS-tagging model uses tag dictionary information [Inoue+].

For more details refer to the following links.

  • The slides from PyCon JP 2019 are available here.
  • The article in Japanese is available here.
  • The documentation is available here.

Installation

Python 2.7.x or 3.5+ is required. This tool uses DyNet (the Dynamic Neural Network Toolkit) to run its neural networks. You can install nagisa with the following command.

pip install nagisa

On Windows, please use Python 3.6 or 3.7 (64-bit).

Basic usage

A sample of word segmentation and POS-tagging for Japanese:

import nagisa

text = 'Pythonで簡単に使えるツールです'
words = nagisa.tagging(text)
print(words)
#=> Python/名詞 で/助詞 簡単/形状詞 に/助動詞 使える/動詞 ツール/名詞 です/助動詞

# Get a list of words
print(words.words)
#=> ['Python', 'で', '簡単', 'に', '使える', 'ツール', 'です']

# Get a list of POS-tags
print(words.postags)
#=> ['名詞', '助詞', '形状詞', '助動詞', '動詞', '名詞', '助動詞']
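Because `words.words` and `words.postags` line up one to one, they can be zipped into word/tag pairs. A minimal sketch, with the two lists from the output above hardcoded so it runs without nagisa:

```python
# Word and POS-tag lists, copied from the tagging output above.
words = ['Python', 'で', '簡単', 'に', '使える', 'ツール', 'です']
postags = ['名詞', '助詞', '形状詞', '助動詞', '動詞', '名詞', '助動詞']

# Pair each word with its tag, mirroring the word/POS display format.
pairs = [f'{word}/{tag}' for word, tag in zip(words, postags)]
print(' '.join(pairs))
#=> Python/名詞 で/助詞 簡単/形状詞 に/助動詞 使える/動詞 ツール/名詞 です/助動詞
```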

Post-processing functions

Filter and extract words by specific POS tags.

# Filter out words with the specified POS tags.
words = nagisa.filter(text, filter_postags=['助詞', '助動詞'])
print(words)
#=> Python/名詞 簡単/形状詞 使える/動詞 ツール/名詞

# Extract only nouns.
words = nagisa.extract(text, extract_postags=['名詞'])
print(words)
#=> Python/名詞 ツール/名詞

# This is a list of available POS-tags in nagisa.
print(nagisa.tagger.postags)
#=> ['補助記号', '名詞', ... , 'URL']
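Conceptually, `filter` drops the (word, tag) pairs whose tag is listed in `filter_postags`, while `extract` keeps only the pairs whose tag is listed in `extract_postags`. A plain-Python sketch of that behavior, with the tagged pairs hardcoded (the real functions run the tagger internally):

```python
# Tagged pairs, copied from the tagging output above.
tagged = [('Python', '名詞'), ('で', '助詞'), ('簡単', '形状詞'),
          ('に', '助動詞'), ('使える', '動詞'), ('ツール', '名詞'), ('です', '助動詞')]

def filter_words(pairs, filter_postags):
    # Drop pairs whose tag appears in filter_postags.
    return [(w, t) for w, t in pairs if t not in filter_postags]

def extract_words(pairs, extract_postags):
    # Keep only pairs whose tag appears in extract_postags.
    return [(w, t) for w, t in pairs if t in extract_postags]

print(filter_words(tagged, ['助詞', '助動詞']))
#=> [('Python', '名詞'), ('簡単', '形状詞'), ('使える', '動詞'), ('ツール', '名詞')]
print(extract_words(tagged, ['名詞']))
#=> [('Python', '名詞'), ('ツール', '名詞')]
```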

Add a user dictionary easily.

# default
text = "3月に見た「3月のライオン」"
print(nagisa.tagging(text))
#=> 3/名詞 月/名詞 に/助詞 見/動詞 た/助動詞 「/補助記号 3/名詞 月/名詞 の/助詞 ライオン/名詞 」/補助記号

# If a word ("3月のライオン") is included in the single_word_list, it is recognized as a single word.
new_tagger = nagisa.Tagger(single_word_list=['3月のライオン'])
print(new_tagger.tagging(text))
#=> 3/名詞 月/名詞 に/助詞 見/動詞 た/助動詞 「/補助記号 3月のライオン/名詞 」/補助記号

Train a model

Nagisa (v0.2.0+) provides a simple training method for a joint word segmentation and sequence labeling (e.g., POS-tagging, NER) model.

The train/dev/test files are in TSV format: each line holds one word and its tag, separated by a tab (word \t tag), and sentences are separated by an EOS line. Refer to the sample datasets and the tutorial (Train a model for Universal Dependencies).

$ cat sample.train
唯一	NOUN
の	ADP
趣味	NOUN
は	ADP
料理	NOUN
EOS
とても	ADV
おいしかっ	ADJ
た	AUX
です	AUX
。	PUNCT
EOS
ドル	NOUN
は	ADP
主要	ADJ
通貨	NOUN
EOS
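This EOS-delimited TSV format is also easy to work with outside nagisa. A minimal parsing sketch (the helper name `read_sentences` is my own) that groups lines from a file like sample.train into sentences of (word, tag) pairs:

```python
def read_sentences(lines):
    """Split EOS-delimited 'word\\ttag' lines into sentences of (word, tag) pairs."""
    sentences, current = [], []
    for line in lines:
        line = line.rstrip('\n')
        if line == 'EOS':
            if current:
                sentences.append(current)
            current = []
        else:
            word, tag = line.split('\t')
            current.append((word, tag))
    if current:  # handle a trailing sentence without EOS
        sentences.append(current)
    return sentences

# The first two sentences of sample.train, hardcoded for illustration.
sample = ['唯一\tNOUN', 'の\tADP', '趣味\tNOUN', 'は\tADP', '料理\tNOUN', 'EOS',
          'とても\tADV', 'おいしかっ\tADJ', 'た\tAUX', 'です\tAUX', '。\tPUNCT', 'EOS']
sentences = read_sentences(sample)
print(len(sentences))   #=> 2
print(sentences[0][0])  #=> ('唯一', 'NOUN')
```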

# After training finishes, the three model files (*.vocabs, *.params, *.hp) are saved.
nagisa.fit(train_file="sample.train", dev_file="sample.dev", test_file="sample.test", model_name="sample")

# Build the tagger by loading the trained model files.
sample_tagger = nagisa.Tagger(vocabs='sample.vocabs', params='sample.params', hp='sample.hp')

text = "福岡・博多の観光情報"
words = sample_tagger.tagging(text)
print(words)
#=> 福岡/PROPN ・/SYM 博多/PROPN の/ADP 観光/NOUN 情報/NOUN

Comments
  • Heroku deployment of NLP model Nagisa Tokenizer showing error

    Hi, I deployed my Flask app (an NLP model) on Heroku. It was basically a price prediction model where some columns were in Japanese, to which I applied NLP with the Nagisa library for tokenization, and some columns were numerical data. I pickled the vectorizers and the model, and finally added them to my Flask API. But after deployment, when I add the values in the frontend and click the Predict button, the result is not displayed. The exact code of the tokenizer is: def tokenize_jp(doc): doc = nagisa.tagging(doc); return doc.words

    I am not able to figure out how to fix this. Does Nagisa work in a Heroku deployment? PS: I am not sure whether the problem is with Heroku or Nagisa; please help me with this.

    opened by Pranjal-bisht 22
  • AttributeError: module 'utils' has no attribute 'OOV'

    Hi, I got an error on import nagisa, as below:

    OOV = utils.OOV AttributeError: module 'utils' has no attribute 'OOV'

    I did pip install nagisa in a conda environment with Python 3.7 and 3.6. I ran it on my Mac.

    opened by RonenHong 15
  • Pip/pip3 install nagisa Error

    Hello @taishi-i, when I try to pip install nagisa I get the error below. I also tried installing through conda.

    Windows7 C:\Users\SAIKIRAN>python --version Python 3.8.3

    Error: ERROR: Command errored out with exit status 1: 'c:\users\saikiran\appdata\local\programs\python\python38\python.exe' -u -c 'import sys, setuptools, tokenize; sys.argv[0 = '"'"'C:\Users\SAIKIRAN\AppData\Local\Temp\pip-install-a31d0hp1\DyNet\setup.py'"'"'; file='"'"'C:\Users\SAIKIRAN\AppData\Local\Temp\pip-install-a31d0 1\DyNet\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, ' "'exec'"'"'))' install --record 'C:\Users\SAIKIRAN\AppData\Local\Temp\pip-record-mg2btvbb\install-record.txt' --single-version-externally-managed --compile Check the lo for full command output.

    opened by ssaikiran123 14
  • Wheel request for Python 3.8

    Hello, thank you for maintaining the awesome toolkit!

    I think we cannot install nagisa by pip install nagisa on Python>=3.8. This is because:

    • (a) dynet uses the old URL for eigen (https://github.com/clab/dynet/issues/1616). Some commits (e.g. https://github.com/clab/dynet/commit/b800ed0f4c48f234bceaf9fa3d61974cef3e0029) were pushed for this problem, but no release including them is available.
    • (b) nagisa doesn't provide wheel for the latest versions of Python. If someone wants to install nagisa on Python<=3.7, it works well as wheels are uploaded to https://pypi.org/project/nagisa/#files. However, for Python>=3.8, pip will install nagisa from the source. This may not work well because of the problem (a).

    The full output of pip install nagisa on Python3.8: https://gist.github.com/himkt/1bc75b83f1735535c4df0b952f352bf6

    opened by himkt 10
  • Improving the handling of numerals of nagisa's word tokenizer

    I'm using nagisa v0.1.1. There are some problems with the tokenizer's handling of numerals: numbers and decimals are split into single characters and tagged as 名詞.

    357 -> 3_名詞 5_名詞 7_名詞 # Numbers
    1.48 -> 1_名詞 ._名詞 4_名詞 8_名詞 # Decimals
    $5.5 -> $_補助記号 5_名詞 ._補助記号 5_名詞 # Numbers with currency symbols (and other symbols)
    133-1111-2222 -> 1_名詞 3_名詞 3_名詞 -_補助記号 1_名詞 1_名詞 1_名詞 1_名詞 -_補助記号 2_名詞 2_名詞 2_名詞 2_名詞 # Phone numbers

    Is it possible to improve this?

    opened by BLKSerene 4
  • request: comparison to other tokenizers/PoS taggers

    Could you include some notes briefly comparing this to other parsers like MeCab? MeCab includes a comparison to other tokenizers/parsers. I think users would benefit greatly from knowing things like parsing speed, accuracy, and other differences/nuances/use cases.

    opened by SpongebobSquamirez 4
  • error: command 'cl.exe' failed: No such file or directory

    When I use pip install nagisa to install, the error message is:

    Collecting nagisa Using cached https://files.pythonhosted.org/packages/a1/40/a94f7944ee5d6a4d44eadcc966fe0d46b5155fb139d7b4d708e439617df1/nagisa-0.1.1.tar.gz Requirement already satisfied: six in e:\anaconda3\lib\site-packages (from nagisa) (1.11.0) Requirement already satisfied: numpy in e:\anaconda3\lib\site-packages (from nagisa) (1.14.0) Requirement already satisfied: DyNet in e:\anaconda3\lib\site-packages (from nagisa) (2.1) Requirement already satisfied: cython in e:\anaconda3\lib\site-packages (from DyNet->nagisa) (0.27.3) Building wheels for collected packages: nagisa Running setup.py bdist_wheel for nagisa ... error Complete output from command e:\anaconda3\python.exe -u -c "import setuptools, tokenize;file='C:\Users\test\AppData\Local\Temp\pip-install-t_9_vdzk\nagisa\setup.py';f=getattr(tokenize, 'open', open)(file);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, file, 'exec'))" bdist_wheel -d C:\Users\test\AppData\Local\Temp\pip-wheel-dmgx_3eh --python-tag cp36: running bdist_wheel Warning: Extension name 'utils' does not match fully qualified name 'nagisa.utils' of 'nagisa/utils.pyx' running build running build_py creating build creating build\lib.win-amd64-3.6 creating build\lib.win-amd64-3.6\nagisa copying nagisa\mecab_system_eval.py -> build\lib.win-amd64-3.6\nagisa copying nagisa\model.py -> build\lib.win-amd64-3.6\nagisa copying nagisa\prepro.py -> build\lib.win-amd64-3.6\nagisa copying nagisa\tagger.py -> build\lib.win-amd64-3.6\nagisa copying nagisa\train.py -> build\lib.win-amd64-3.6\nagisa copying nagisa_init_.py -> build\lib.win-amd64-3.6\nagisa running egg_info writing nagisa.egg-info\PKG-INFO writing dependency_links to nagisa.egg-info\dependency_links.txt writing requirements to nagisa.egg-info\requires.txt writing top-level names to nagisa.egg-info\top_level.txt reading manifest file 'nagisa.egg-info\SOURCES.txt' reading manifest template 'MANIFEST.in' writing manifest file 'nagisa.egg-info\SOURCES.txt' copying 
nagisa\utils.c -> build\lib.win-amd64-3.6\nagisa copying nagisa\utils.pyx -> build\lib.win-amd64-3.6\nagisa creating build\lib.win-amd64-3.6\nagisa\data copying nagisa\data\models.jpg -> build\lib.win-amd64-3.6\nagisa\data copying nagisa\data\nagisa_image.jpg -> build\lib.win-amd64-3.6\nagisa\data copying nagisa\data\nagisa_v001.dict -> build\lib.win-amd64-3.6\nagisa\data copying nagisa\data\nagisa_v001.hp -> build\lib.win-amd64-3.6\nagisa\data copying nagisa\data\nagisa_v001.model -> build\lib.win-amd64-3.6\nagisa\data running build_ext building 'utils' extension creating build\temp.win-amd64-3.6 creating build\temp.win-amd64-3.6\Release creating build\temp.win-amd64-3.6\Release\nagisa cl.exe /c /nologo /Ox /W3 /GL /DNDEBUG /MD -Ie:\anaconda3\lib\site-packages\numpy\core\include -Ie:\anaconda3\include -Ie:\anaconda3\include /Tcnagisa/utils.c /Fobuild\temp.win-amd64-3.6\Release\nagisa/utils.obj error: command 'cl.exe' failed: No such file or directory


    Failed building wheel for nagisa Running setup.py clean for nagisa Failed to build nagisa Installing collected packages: nagisa Running setup.py install for nagisa ... error Complete output from command e:\anaconda3\python.exe -u -c "import setuptools, tokenize;file='C:\Users\test\AppData\Local\Temp\pip-install-t_9_vdzk\nagisa\setup.py';f=getattr(tokenize, 'open', open)(file);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, file, 'exec'))" install --record C:\Users\test\AppData\Local\Temp\pip-record-p2d6rr5x\install-record.txt --single-version-externally-managed --compile: running install Warning: Extension name 'utils' does not match fully qualified name 'nagisa.utils' of 'nagisa/utils.pyx' running build running build_py creating build creating build\lib.win-amd64-3.6 creating build\lib.win-amd64-3.6\nagisa copying nagisa\mecab_system_eval.py -> build\lib.win-amd64-3.6\nagisa copying nagisa\model.py -> build\lib.win-amd64-3.6\nagisa copying nagisa\prepro.py -> build\lib.win-amd64-3.6\nagisa copying nagisa\tagger.py -> build\lib.win-amd64-3.6\nagisa copying nagisa\train.py -> build\lib.win-amd64-3.6\nagisa copying nagisa_init_.py -> build\lib.win-amd64-3.6\nagisa running egg_info writing nagisa.egg-info\PKG-INFO writing dependency_links to nagisa.egg-info\dependency_links.txt writing requirements to nagisa.egg-info\requires.txt writing top-level names to nagisa.egg-info\top_level.txt reading manifest file 'nagisa.egg-info\SOURCES.txt' reading manifest template 'MANIFEST.in' writing manifest file 'nagisa.egg-info\SOURCES.txt' copying nagisa\utils.c -> build\lib.win-amd64-3.6\nagisa copying nagisa\utils.pyx -> build\lib.win-amd64-3.6\nagisa creating build\lib.win-amd64-3.6\nagisa\data copying nagisa\data\models.jpg -> build\lib.win-amd64-3.6\nagisa\data copying nagisa\data\nagisa_image.jpg -> build\lib.win-amd64-3.6\nagisa\data copying nagisa\data\nagisa_v001.dict -> build\lib.win-amd64-3.6\nagisa\data copying nagisa\data\nagisa_v001.hp -> 
build\lib.win-amd64-3.6\nagisa\data copying nagisa\data\nagisa_v001.model -> build\lib.win-amd64-3.6\nagisa\data running build_ext building 'utils' extension creating build\temp.win-amd64-3.6 creating build\temp.win-amd64-3.6\Release creating build\temp.win-amd64-3.6\Release\nagisa cl.exe /c /nologo /Ox /W3 /GL /DNDEBUG /MD -Ie:\anaconda3\lib\site-packages\numpy\core\include -Ie:\anaconda3\include -Ie:\anaconda3\include /Tcnagisa/utils.c /Fobuild\temp.win-amd64-3.6\Release\nagisa/utils.obj error: command 'cl.exe' failed: No such file or directory

    ----------------------------------------
    

    Command "e:\anaconda3\python.exe -u -c "import setuptools, tokenize;file='C:\Users\test\AppData\Local\Temp\pip-install-t_9_vdzk\nagisa\setup.py';f=getattr(tokenize, 'open', open)(file);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, file, 'exec'))" install --record C:\Users\test\AppData\Local\Temp\pip-record-p2d6rr5x\install-record.txt --single-version-externally-managed --compile" failed with error code 1 in C:\Users\test\AppData\Local\Temp\pip-install-t_9_vdzk\nagisa\

    How to fix it?

    opened by dapsjj 4
  • Drop support for Python2.7?

    The EOL of Python 2.7 is January 1, 2020. As with many other major open-source projects, is there any plan for a new version of nagisa that drops support for Python 2.7 and supports only Python 3?

    The Python 3-only version could remove the dependency on six and lighten the maintenance burden in the future.

    opened by BLKSerene 3
  • Returning a generator instead of a list in nagisa.postagging

    Hi, I'm trying to figure out how to POS-tag a list of tokens that have already been tokenized, and I found #8, which works fine.

    I think that returning a generator instead of a list would be better for users, since a list materializes all POS tags in memory for a large input text, and in most cases the returned POS tags are iterated over (usually only once) to be zipped with the tokens.

    Or, you could provide two functions, like postagging and lpostagging, the former returning a generator and the latter a plain list.

    opened by BLKSerene 3
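    A lazy variant along the lines proposed in this issue can be sketched on top of any list-returning tagging function. The names below (`postag_list`, `postag_iter`) are hypothetical stand-ins for illustration, not part of nagisa's API:

    ```python
    def postag_list(tokens):
        # Stand-in for a list-returning POS-tagger such as nagisa.postagging.
        return ['名詞' for _ in tokens]

    def postag_iter(tokens, chunk_size=1000):
        # Yield tags lazily, tagging fixed-size chunks so that only one
        # chunk's worth of tags is held in memory at a time.
        for start in range(0, len(tokens), chunk_size):
            for tag in postag_list(tokens[start:start + chunk_size]):
                yield tag

    tokens = ['Python', 'ツール'] * 5000
    for token, tag in zip(tokens, postag_iter(tokens)):
        pass  # consume lazily, e.g. stream token/tag pairs to disk
    ```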
  • Illegal instruction (core dumped)

    Thanks for building this. I've been trying MeCab and not getting the exact results I need, so I thought I'd give this a try.

    For now, I have this working on a centos box, but I'm wanting to get this working on ubuntu as it's my main dev machine.

    I keep getting:

    [dynet] random seed: 1234
    [dynet] allocating memory: 32MB
    Illegal instruction (core dumped)
    

    Distributor ID: Ubuntu
    Description: Ubuntu 20.04 LTS
    Release: 20.04
    Codename: focal

    • Python 3.8.5
    • 8GB laptop.

    Is there any more information you need? Thanks

    opened by paulm17 2
  • Why do you have 6 dim outputs for word segmentation?

    From https://github.com/taishi-i/nagisa/blob/master/nagisa/model.py#L59: why does encode_ws have 6-dim outputs for word segmentation? I understand you are using BMES (the first 4 dims). What are the last two used for? Could you explain, please?

    Thank you.

    opened by wannaphong 2
  • building nagisa on m1

    I am facing this issue:

    [notice] To update, run: pip install --upgrade pip
    (venv) [email protected] vocab % pip install nagisa
    Collecting nagisa
      Using cached nagisa-0.2.8.tar.gz (20.9 MB)
      Preparing metadata (setup.py) ... done
    Collecting six
      Using cached six-1.16.0-py2.py3-none-any.whl (11 kB)
    Collecting numpy
      Using cached numpy-1.23.4-cp310-cp310-macosx_11_0_arm64.whl (13.3 MB)
    Collecting nagisa
      Using cached nagisa-0.2.7.tar.gz (20.9 MB)
      Preparing metadata (setup.py) ... done
    Collecting DyNet
      Using cached dyNET-2.1.2.tar.gz (509 kB)
      Installing build dependencies ... done
      Getting requirements to build wheel ... done
      Preparing metadata (pyproject.toml) ... done
    Collecting cython
      Using cached Cython-0.29.32-py2.py3-none-any.whl (986 kB)
    Building wheels for collected packages: nagisa, DyNet
      Building wheel for nagisa (setup.py) ... done
      Created wheel for nagisa: filename=nagisa-0.2.7-cp310-cp310-macosx_11_0_arm64.whl size=21306402 sha256=c559ab30293dffc0d1ae36d215725dec08da0910ed1c3331728c398397258d2f
      Stored in directory: /Users/b/Library/Caches/pip/wheels/cf/38/0b/463d99fdf6d3c736cfcb4124124496513831eeefdc7f896391
      Building wheel for DyNet (pyproject.toml) ... error
      error: subprocess-exited-with-error
    
      × Building wheel for DyNet (pyproject.toml) did not run successfully.
      │ exit code: 1
      ╰─> [101 lines of output]
          /private/var/folders/yv/lystpk8n2015cf8vmqd2yj_c0000gp/T/pip-build-env-rvxcggqa/overlay/lib/python3.10/site-packages/setuptools/dist.py:530: UserWarning: Normalizing 'v2.1.2' to '2.1.2'
            warnings.warn(tmpl.format(**locals()))
          /private/var/folders/yv/lystpk8n2015cf8vmqd2yj_c0000gp/T/pip-build-env-rvxcggqa/overlay/lib/python3.10/site-packages/setuptools/dist.py:771: UserWarning: Usage of dash-separated 'description-file' will not be supported in future versions. Please use the underscore name 'description_file' instead
            warnings.warn(
          running bdist_wheel
          running build
          INFO:root:CMAKE_PATH='/opt/homebrew/bin/cmake'
          INFO:root:MAKE_PATH='/usr/bin/make'
          INFO:root:MAKE_FLAGS='-j 8'
          INFO:root:EIGEN3_INCLUDE_DIR='/private/var/folders/yv/lystpk8n2015cf8vmqd2yj_c0000gp/T/pip-install-v2h7cwoe/dynet_f6727a54d6ce4c5d83d9578e2d0a272a/build/py3.10-64bit/eigen'
          INFO:root:EIGEN3_DOWNLOAD_URL='https://github.com/clab/dynet/releases/download/2.1/eigen-b2e267dc99d4.zip'
          INFO:root:CC_PATH='/usr/bin/gcc'
          INFO:root:CXX_PATH='/usr/bin/g++'
          INFO:root:SCRIPT_DIR='/private/var/folders/yv/lystpk8n2015cf8vmqd2yj_c0000gp/T/pip-install-v2h7cwoe/dynet_f6727a54d6ce4c5d83d9578e2d0a272a'
          INFO:root:BUILD_DIR='/private/var/folders/yv/lystpk8n2015cf8vmqd2yj_c0000gp/T/pip-install-v2h7cwoe/dynet_f6727a54d6ce4c5d83d9578e2d0a272a/build/py3.10-64bit'
          INFO:root:INSTALL_PREFIX='/Users/b/study/jap/vocab/venv/lib/python3.10/site-packages/../../..'
          INFO:root:PYTHON='/Users/b/study/jap/vocab/venv/bin/python3.10'
          cmake version 3.24.1
    
          CMake suite maintained and supported by Kitware (kitware.com/cmake).
          Apple clang version 13.1.6 (clang-1316.0.21.2.5)
          Target: arm64-apple-darwin21.6.0
          Thread model: posix
          InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin
          INFO:root:Creating build directory /private/var/folders/yv/lystpk8n2015cf8vmqd2yj_c0000gp/T/pip-install-v2h7cwoe/dynet_f6727a54d6ce4c5d83d9578e2d0a272a/build/py3.10-64bit
          INFO:root:Fetching Eigen...
          INFO:root:Unpacking Eigen...
          INFO:root:Configuring...
          -- The C compiler identification is AppleClang 13.1.6.13160021
          -- The CXX compiler identification is AppleClang 13.1.6.13160021
          -- Detecting C compiler ABI info
          -- Detecting C compiler ABI info - done
          -- Check for working C compiler: /Applications/Xcode.app/Contents/Developer/usr/bin/gcc - skipped
          -- Detecting C compile features
          -- Detecting C compile features - done
          -- Detecting CXX compiler ABI info
          -- Detecting CXX compiler ABI info - done
          -- Check for working CXX compiler: /Applications/Xcode.app/Contents/Developer/usr/bin/g++ - skipped
          -- Detecting CXX compile features
          -- Detecting CXX compile features - done
          CMake Deprecation Warning at CMakeLists.txt:2 (cmake_minimum_required):
            Compatibility with CMake < 2.8.12 will be removed from a future version of
            CMake.
    
            Update the VERSION argument <min> value or use a ...<max> suffix to tell
            CMake that the project does not need compatibility with older versions.
    
    
          -- Optimization level: fast
          -- BACKEND not specified, defaulting to eigen.
          -- Eigen dir is /private/var/folders/yv/lystpk8n2015cf8vmqd2yj_c0000gp/T/pip-install-v2h7cwoe/dynet_f6727a54d6ce4c5d83d9578e2d0a272a/build/py3.10-64bit/eigen
          -- Performing Test CMAKE_HAVE_LIBC_PTHREAD
          -- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
          -- Found Threads: TRUE
          -- Found Cython version 0.29.32
    
          CMAKE_INSTALL_PREFIX="/Users/b/study/jap/vocab/venv"
          PROJECT_SOURCE_DIR="/private/var/folders/yv/lystpk8n2015cf8vmqd2yj_c0000gp/T/pip-install-v2h7cwoe/dynet_f6727a54d6ce4c5d83d9578e2d0a272a"
          PROJECT_BINARY_DIR="/private/var/folders/yv/lystpk8n2015cf8vmqd2yj_c0000gp/T/pip-install-v2h7cwoe/dynet_f6727a54d6ce4c5d83d9578e2d0a272a/build/py3.10-64bit"
          LIBS=""
          EIGEN3_INCLUDE_DIR="/private/var/folders/yv/lystpk8n2015cf8vmqd2yj_c0000gp/T/pip-install-v2h7cwoe/dynet_f6727a54d6ce4c5d83d9578e2d0a272a/build/py3.10-64bit/eigen"
          MKL_LINK_DIRS=""
          WITH_CUDA_BACKEND=""
          CUDA_RT_FILES=""
          CUDA_RT_DIRS=""
          CUDA_CUBLAS_FILES=""
          CUDA_CUBLAS_DIRS=""
          MSVC=""
          fatal: not a git repository (or any of the parent directories): .git
          -- Configuring done
          -- Generating done
          -- Build files have been written to: /private/var/folders/yv/lystpk8n2015cf8vmqd2yj_c0000gp/T/pip-install-v2h7cwoe/dynet_f6727a54d6ce4c5d83d9578e2d0a272a/build/py3.10-64bit
          INFO:root:Compiling...
          [  4%] Building CXX object dynet/CMakeFiles/dynet.dir/deep-lstm.cc.o
          [  4%] Building CXX object dynet/CMakeFiles/dynet.dir/exec.cc.o
          [  4%] Building CXX object dynet/CMakeFiles/dynet.dir/aligned-mem-pool.cc.o
          [  5%] Building CXX object dynet/CMakeFiles/dynet.dir/cfsm-builder.cc.o
          [  8%] Building CXX object dynet/CMakeFiles/dynet.dir/dynet.cc.o
          [  8%] Building CXX object dynet/CMakeFiles/dynet.dir/dict.cc.o
          [ 10%] Building CXX object dynet/CMakeFiles/dynet.dir/devices.cc.o
          [ 11%] Building CXX object dynet/CMakeFiles/dynet.dir/dim.cc.o
          clang: error: the clang compiler does not support '-march=native'
          clang: error: the clang compiler does not support '-march=native'
          clang: error: the clang compiler does not support '-march=native'
          clang: error: the clang compiler does not support '-march=native'
          clang: error: the clang compiler does not support '-march=native'
          clang: error: the clang compiler does not support '-march=native'
          make[2]: *** [dynet/CMakeFiles/dynet.dir/devices.cc.o] Error 1
          make[2]: *** Waiting for unfinished jobs....
          make[2]: *** [dynet/CMakeFiles/dynet.dir/aligned-mem-pool.cc.o] Error 1
          make[2]: *** [dynet/CMakeFiles/dynet.dir/dynet.cc.o] Error 1
          make[2]: *** [dynet/CMakeFiles/dynet.dir/cfsm-builder.cc.o] Error 1
          clang: error: the clang compiler does not support '-march=native'
          clang: error: the clang compiler does not support '-march=native'
          make[2]: *** [dynet/CMakeFiles/dynet.dir/dim.cc.o] Error 1
          make[2]: *** [dynet/CMakeFiles/dynet.dir/deep-lstm.cc.o] Error 1
          make[2]: *** [dynet/CMakeFiles/dynet.dir/dict.cc.o] Error 1
          make[2]: *** [dynet/CMakeFiles/dynet.dir/exec.cc.o] Error 1
          make[1]: *** [dynet/CMakeFiles/dynet.dir/all] Error 2
          make: *** [all] Error 2
          error: /usr/bin/make -j 8
          [end of output]
    
      note: This error originates from a subprocess, and is likely not a problem with pip.
      ERROR: Failed building wheel for DyNet
    Successfully built nagisa
    

    any ideas?

    opened by dataf3l 1
  • core dumped

    I am running Manjaro Linux on a ThinkPad X230, using Python 3.9.7 and the version of nagisa from pip. When I run import nagisa I get Illegal instruction (core dumped).

    opened by ryanswilson59 4
  • add cache layer to Tagger

    If Tagger is instantiated at function level, it loads the dictionary every time; if instantiated at module level, it loads the dictionary even when it may never actually be used. Refer to https://github.com/fxsjy/jieba/blob/master/jieba/init.py

    opened by bung87 4

Releases
  • 0.2.8(Sep 9, 2022)

    nagisa 0.2.8 incorporates the following changes:

    When tokenizing text containing 'İ', an AttributeError occurred. As the following example shows, lowercasing 'İ' changes its length to 2, so features were not extracted correctly.

    >>> text = "İ" # [U+0130]
    >>> print(len(text))
    1
    >>> text = text.lower() # [U+0069] [U+0307]
    >>> print(text)
    'i̇'
    >>> print(len(text))
    2
    

    To avoid this error, the following preprocessing step was added to the source code (modification 1, modification 2).

    text = text.replace('İ', 'I')
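
    The effect of this preprocessing can be verified in plain Python (the sample string below is my own):

    ```python
    # U+0130 lowercases to U+0069 U+0307, so the string grows by one code point.
    assert len('İ'.lower()) == 2

    # The preprocessing added in this release keeps the length stable.
    text = 'İstanbul'
    safe = text.replace('İ', 'I')
    assert len(safe.lower()) == len(safe)
    print(safe.lower())
    #=> istanbul
    ```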
    
    • Add Python wheels (3.6, 3.7, 3.8, 3.9, 3.10, 3.11) to PyPI for Linux
    • Add Python wheels (3.6, 3.7, 3.8, 3.9, 3.10) to PyPI for macOS
    • Add Python wheels (3.6, 3.7, 3.8) to PyPI for Windows
  • 0.2.7(Jul 6, 2020)

    nagisa 0.2.7 incorporates the following changes:

    • Fix AttributeError: module 'utils' by renaming utils.pyx to nagisa_utils.pyx #14
    • Add wheels to PyPI for Linux and Windows users
    • Increase test coverage from 92% to 96%
    • Fix a problem where the min_count (threshold=hp['THRESHOLD']) parameter was not used in train.py
  • 0.2.6(Jun 11, 2020)

    nagisa 0.2.6 incorporates the following changes:

    • Increase test coverage from 88% to 92%
    • Fix readFile(filename) in mecab_system_eval.py for Windows users
    • Add python3.7 to .travis.yml
    • Add a DOI with the data archiving tool Zenodo to README.md
    • Add nagisa-0.2.6-cp36-cp36m-win_amd64.whl and nagisa-0.2.6-cp37-cp37m-win_amd64.whl to PyPI to install nagisa without Build Tools for Windows users #23
    • Add nagisa-0.2.6-*-manylinux1_i686.whl and nagisa-0.2.6-*-manylinux1_x86_64.whl to PyPI to install nagisa for Linux users
  • 0.2.5(Dec 31, 2019)

    nagisa 0.2.5 incorporates the following changes:

    • Fix a whitespace bug in nagisa.decode. This resolves an error that occurred when decoding words that contain whitespace.
    • Add __version__ to __init__.py
    • Add slides link at PyCon JP 2019 to README.md
  • 0.2.4(Aug 5, 2019)

    nagisa 0.2.4 incorporates the following changes:

    • Add the new tutorial to the document (train a model for Japanese NER).
    • Add load_file function to nagisa.utils.
    • Fix 'single_word_list' compiler in nagisa.Tagger and support word segmentation using a regular expression.
  • 0.2.3(May 19, 2019)

    nagisa 0.2.3 incorporates the following changes:

    • Fix #11. By separating tagging into word segmentation and POS-tagging in tagger.py, nagisa.tagging reduces wasted memory and improves word segmentation speed.
    • Fix typo in README.md
  • 0.2.2(May 3, 2019)

    nagisa 0.2.2 incorporates the following changes:

    • Update the documentation (e.g., add a tutorial on training a model for Japanese Universal Dependencies).
    • Fix the log output of the nagisa.fit function.
    • Fix issues from Codacy (e.g., delete unused code in train.py).
    • Add appveyor.yml for Windows users.
  • 0.2.0(Jan 15, 2019)

    nagisa 0.2.0 incorporates the following changes:

    • Provide a simple training method for a joint word segmentation and sequence labeling (e.g., POS-tagging, NER) model.
    • Fix ZeroDivisionError in mecab_system_eval.py.
  • 0.1.2(Dec 25, 2018)

    nagisa 0.1.2 incorporates the following changes:

    • Provide the postagging method #8
    • Adopt the longest match to extract a word in nagisa.Tagger(single_word_list)