Get list of common stop words in various languages in Python

Last update: Dec 21, 2022

Python Stop Words

Table of contents

Overview
Available languages
Installation
Basic usage
Python compatibility

Overview

Get list of common stop words in various languages in Python.

Available languages

Arabic
Bulgarian
Catalan
Czech
Danish
Dutch
English
Finnish
French
German
Hungarian
Indonesian
Italian
Norwegian
Polish
Portuguese
Romanian
Russian
Spanish
Swedish
Turkish
Ukrainian

Installation

stop-words is available on PyPI:

http://pypi.python.org/pypi/stop-words

Install it with pip:

$ pip install stop-words

Alternatively, clone the stop-words git repository. The --recursive flag is required because the word lists live in a git submodule:

$ git clone --recursive https://github.com/Alir3z4/python-stop-words.git

Then install it by running:

$ python setup.py install

Basic usage

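from stop_words import get_stop_words

# Look up a list by language code or by full language name.
stop_words = get_stop_words('en')
stop_words = get_stop_words('english')

from stop_words import safe_get_stop_words

# Returns an empty list instead of raising for an unsupported language.
stop_words = safe_get_stop_words('unsupported language')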
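
A common next step is removing stop words from a piece of text. The sketch below is one minimal way to do that. The `remove_stop_words` helper and its regex tokenizer are illustrative assumptions, not part of the stop-words package. The word list is passed in as a parameter, so in practice you would hand it the result of `get_stop_words('en')`.

```python
import re


def remove_stop_words(text, stop_words):
    """Return the tokens of `text` that are not in `stop_words`.

    Hypothetical helper for illustration: tokenization is a simple
    regex match and the comparison is case-insensitive.
    """
    stop_set = {word.lower() for word in stop_words}
    # Match runs of word characters; the apostrophe keeps "don't" in one token.
    tokens = re.findall(r"[\w']+", text)
    return [token for token in tokens if token.lower() not in stop_set]


# A tiny hand-written list stands in for get_stop_words('en') here.
sample_stop_words = ["the", "is", "in", "of", "a"]
filtered = remove_stop_words("The campus is located in Stratford", sample_stop_words)
print(" ".join(filtered))  # campus located Stratford
```

Converting the list to a set makes each membership check constant-time, which matters for long texts.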

Python compatibility

Python Stop Words is compatible with:

Python 2.7
Python 3.4
Python 3.5
Python 3.6
Python 3.7

Comments

Enforces packaging of eggs into folders.

We had an error in our CI pipeline where a package build would fail since the .egg of stop-words is downloaded as a zip.

This leads to the following error where the initializer tries to open a directory when it is actually a zip archive.

Not a directory: '/opt/project/.eggs/stop_words-2015.2.23.1-py3.6.egg/stop_words/stop-words/languages.json'

opened by hfjn 10
Add Indonesian stop word list

Add a stop word list for the Indonesian language and add its mapping to the JSON file. Source: https://www.illc.uva.nl/Research/Publications/Reports/MoL-2003-02.text.pdf

opened by frankdevans 4
Can you handle a text?

Hello, there is no description of how to use the package. I have this text:

The University of Waterloo Stratford Campus is located in Stratford Ontario Canada. It is one of the three satellite campuses of the University of Waterloo a member of the U15 Group of Canadian Research Universities. Established in June 2009 the University of Waterloo Stratford Campus is part of the Faculty of Arts at the University of Waterloo.

How can I use python-stop-words to filter out the stop words and get a text without them?

Thank you very much!
question

opened by PapaMadeleine2022 2
Python 3 support
List of improvements:

Tests

Python 3 support

Dev installation via zc.buildout

Continuous integration via Travis

Can you make a new release once the branch is merged?

Regards
enhancement
opened by Fantomas42 2
languages.json is missing, if you don't git clone with `--recursive`

languages.json is still missing, if you don't clone with --recursive

$ git clone git://github.com/Alir3z4/python-stop-words.git
$ cd python-stop-words
$ python3 setup.py install
Traceback (most recent call last):
  File "setup.py", line 5, in <module>
    version=__import__("stop_words").get_version(),
  File "./stop_words/__init__.py", line 9, in <module>
    with open(os.path.join(STOP_WORDS_DIR, 'languages.json'), 'rb') as map_file:
FileNotFoundError: [Errno 2] No such file or directory: './stop_words/stop-words/languages.json'

opened by marcindulak 1
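
For a checkout that was cloned without `--recursive`, the standard git command `git submodule update --init --recursive` fetches the missing submodule. The sketch below checks for the data file before installing. The `has_language_data` helper is a hypothetical name, and the path is taken from the traceback in the report above.

```python
import os


def has_language_data(repo_root):
    """Return True if languages.json is present in a python-stop-words checkout.

    Hypothetical helper; the relative path comes from the traceback
    quoted in the report.
    """
    mapping = os.path.join(repo_root, "stop_words", "stop-words", "languages.json")
    return os.path.isfile(mapping)


if not has_language_data("."):
    print("languages.json not found; run: git submodule update --init --recursive")
```
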
Update submodule to the latest

Include the stops for newly added languages

https://github.com/Alir3z4/stop-words/pull/4
https://github.com/Alir3z4/stop-words/pull/5
https://github.com/Alir3z4/stop-words/pull/6
https://github.com/Alir3z4/stop-words/pull/7
enhancement

opened by norkans7 1
Decode error AND Add catalan language to LANGUAGE_MAPPING
1. Add Catalan language to LANGUAGE_MAPPING. I previously added the file with stop words to the "stop-words" project.

2. Decode error

stop_words = [line.strip().decode('utf-8') for line in language_file.readlines()]

strip() returns a copy of the string with leading and trailing whitespace characters removed. But if the string contains non-ASCII characters, calling strip() before decoding causes a UnicodeDecodeError (e.g. UnicodeDecodeError: 'utf8' codec can not decode byte 0xc3 in position 34: unexpected end of data).

The workaround is to reorder the call:

stop_words = [line.decode('utf-8').strip() for line in language_file.readlines()]
opened by dmiro 1
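
The report above does not show the input that triggered the error, so the following is only a sketch of one plausible mechanism. It assumes a byte-level strip that treats byte 0xA0 as whitespace, which cuts the two-byte UTF-8 encoding of "à" (0xC3 0xA0) in half. The `LOCALE_WHITESPACE` constant and both helper names are illustrative assumptions.

```python
# One UTF-8 encoded line, as read from a file opened in binary mode.
raw_line = "allà\n".encode("utf-8")  # b'all\xc3\xa0\n'

# Assumption: simulate a byte-level strip that counts 0xA0 as whitespace.
LOCALE_WHITESPACE = b" \t\r\n\xa0"


def strip_then_decode(line):
    """Original order: stripping bytes can split a multi-byte character."""
    return line.strip(LOCALE_WHITESPACE).decode("utf-8")


def decode_then_strip(line):
    """Reordered: decode to text first, then strip whitespace."""
    return line.decode("utf-8").strip()


try:
    strip_then_decode(raw_line)
    failed = False
except UnicodeDecodeError:
    failed = True

print(failed)                       # True
print(decode_then_strip(raw_line))  # allà
```

Decoding first means strip() operates on whole characters, so it can never remove part of an encoded character.
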
Defining custom stop words in NLTK

Hi, I would like to know how to define our own custom stop words. I'm currently developing a sentiment analysis project in my local language, using a Naive Bayes classifier to classify the text. I'm quite new to this type of NLP project, so sorry if there's a method that I missed.

Hope you can help me, thanks.

opened by AllikDaniel 0
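
The question mentions NLTK, but the same idea applies to any word list, including the one returned by `get_stop_words`. A minimal sketch, assuming custom stop words are simply merged into a base list; `build_stop_words` and its parameters are hypothetical names, not an API of this package.

```python
def build_stop_words(base_words, extra_words=(), keep_words=()):
    """Combine a base list with custom additions and exclusions.

    Hypothetical helper for illustration; all words are lowercased.
    """
    words = {w.lower() for w in base_words}
    words |= {w.lower() for w in extra_words}  # add domain-specific words
    words -= {w.lower() for w in keep_words}   # keep words that carry meaning
    return words


base = ["the", "is", "not"]  # stand-in for get_stop_words('en')
custom = build_stop_words(base, extra_words=["rt", "via"], keep_words=["not"])
print(sorted(custom))  # ['is', 'rt', 'the', 'via']
```

For sentiment analysis, excluding negations such as "not" from the stop list is often useful, since removing them can flip the meaning of a sentence.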

Example does not work on Python 3.7.0

It returns an empty list []

from stop_words import get_stop_words

stop_words = get_stop_words('en')
stop_words = get_stop_words('english')

from stop_words import safe_get_stop_words

stop_words = safe_get_stop_words('unsupported language')
print(stop_words)

opened by nadavvin 2
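
The empty list in this report is consistent with the snippet itself: `stop_words` is reassigned three times, and only the last value, from `safe_get_stop_words('unsupported language')`, is printed. The sketch below keeps the results in separate variables. So that it also runs without the package installed, it falls back to small stand-in functions. Those stand-ins and their three-word list are assumptions for illustration, not the library's implementation.

```python
try:
    from stop_words import get_stop_words, safe_get_stop_words
except ImportError:
    # Hypothetical stand-ins, used only when the package is not installed.
    _DEMO_WORDS = {"en": ["a", "the", "of"]}
    _DEMO_WORDS["english"] = _DEMO_WORDS["en"]

    def get_stop_words(language):
        return list(_DEMO_WORDS[language])

    def safe_get_stop_words(language):
        try:
            return get_stop_words(language)
        except KeyError:
            return []

# Keep each result in its own variable instead of overwriting one name.
by_code = get_stop_words("en")
by_name = get_stop_words("english")
unsupported = safe_get_stop_words("unsupported language")

print(len(by_code) > 0, by_code == by_name)
print(unsupported)  # []
```
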

Releases(2018.7.23)

2018.7.23(Jul 23, 2018)
2018.7.23

Fixed #14: languages.json is missing, if you don't git clone with --recursive.

Feature: Support latest version of Python (3.7+).

Feature #22: Enforces packaging of eggs into folders.

Update the stop-words repository to get the latest languages.

Fixed failing Travis builds and tests caused by bootstrap.

PyPI: https://pypi.org/project/stop-words/2018.7.23/

To install:

$ pip install stop-words==2018.7.23
2015.2.23.1(Feb 23, 2015)
2015.2.23.1

Fix #9: Missing languages.json file that breaks the installation.

PyPi: https://pypi.python.org/pypi/stop-words/2015.2.23
2015.2.23(Feb 23, 2015)
2015.2.23

Feature: Using the cache is optional

Feature: Filtering stopwords

Special thanks to Taras Labiak @kissarat

PyPi: https://pypi.python.org/pypi/stop-words/2015.2.21
2015.2.21(Feb 21, 2015)
2015.2.21

Feature: LANGUAGE_MAPPING is loaded from stop-words/languages.json

Fix: Made paths OS-independent

PyPi: https://pypi.python.org/pypi/stop-words/2015.2.21

Special thanks to Taras Labiak @kissarat
2015.1.31(Feb 1, 2015)
2015.1.31

Feature #5: Decode error AND Add catalan language to LANGUAGE_MAPPING.

Feature: Update stop-words dictionary.

2015.1.22(Jan 22, 2015)
2015.1.22

Feature: Tests

Feature: Python 3 support

Feature: Dev installation via zc.buildout

Feature: Continuous integration via Travis

pypi: https://pypi.python.org/pypi/stop-words/2015.1.22
2015.1.19(Jan 19, 2015)
2015.1.19

Feature #3: Handle language code, cache and custom errors


Owner

Alireza Savand

I am Alireza Savand, a Software Architect.

GitHub Repository https://pypi.org/project/stop-words/

