Spam filtering made easy for you

spammy

Author: Tasdik Rahman
Latest version: 1.0.3

1   Overview

spammy : Spam filtering at your service

spammy powers the web app https://plino.herokuapp.com

2   Features

  • Train the classifier on your own dataset to classify your emails into spam or ham
  • Dead simple to use. See usage
  • Blazingly fast once the classifier is trained. (See benchmarks)
  • Custom exceptions are raised, so when you miss something, spammy gracefully tells you where you went wrong (see the sketch after this list)
  • Written in uncomplicated Python
  • Built on top of the giant shoulders of nltk
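
To see the exception behaviour, point spammy at a directory that does not exist; a minimal sketch (the path is hypothetical, and a broad except is used only for illustration, since the concrete exception class names are not shown here):

>>> from spammy import Spammy
>>> try:
...     cl = Spammy('/path/that/does/not/exist', limit=100)
...     cl.train()
... except Exception as err:  # spammy's custom exceptions carry a descriptive message
...     print(err)
...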

3   Example

[back to top]

  • Your data directory structure should look something like this:
$ tree /home/tasdik/Dropbox/projects/spammy/examples/test_dataset
/home/tasdik/Dropbox/projects/spammy/examples/test_dataset
├── ham
│   ├── 5458.2001-04-25.kaminski.ham.txt
│   ├── 5459.2001-04-25.kaminski.ham.txt
│   ...
│   ...
│   └── 5851.2001-05-22.kaminski.ham.txt
└── spam
    ├── 4136.2005-07-05.SA_and_HP.spam.txt
    ├── 4137.2005-07-05.SA_and_HP.spam.txt
    ...
    ...
    └── 5269.2005-07-19.SA_and_HP.spam.txt
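
That is, one sub-directory named ham and one named spam, each holding the corresponding emails as plain-text files.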

Example

>>> import os
>>> from spammy import Spammy
>>>
>>> directory = '/home/tasdik/Dropbox/projects/spamfilter/data/corpus3'
>>>
>>> # directory structure
>>> os.listdir(directory)
['spam', 'Summary.txt', 'ham']
>>> os.listdir(os.path.join(directory, 'spam'))[:3]
['4257.2005-04-06.BG.spam.txt', '0724.2004-09-21.BG.spam.txt', '2835.2005-01-19.BG.spam.txt']
>>>
>>> # Spammy object created
>>> cl = Spammy(directory, limit=100)
>>> cl.train()
>>>
>>> SPAM_TEXT = \
... """
... My Dear Friend,
...
... How are you and your family? I hope you all are fine.
...
... My dear I know that this mail will come to you as a surprise, but it's for my
... urgent need for a foreign partner that made me to contact you for your sincere
... genuine assistance My name is Mr.Herman Hirdiramani, I am a banker by
... profession currently holding the post of Director Auditing Department in
... the Islamic Development Bank(IsDB)here in Ouagadougou, Burkina Faso.
...
... I got your email information through the Burkina's Chamber of Commerce
... and industry on foreign business relations here in Ouagadougou Burkina Faso
... I haven'disclose this deal to any body I hope that you will not expose or
... betray this trust and confident that I am about to repose on you for the
... mutual benefit of our both families.
...
... I need your urgent assistance in transferring the sum of Eight Million,
... Four Hundred and Fifty Thousand United States Dollars ($8,450,000:00) into
... your account within 14 working banking days This money has been dormant for
... years in our bank without claim due to the owner of this fund died along with
... his entire family and his supposed next of kin in an underground train crash
... since years ago. For your further informations please visit
... (http://news.bbc.co.uk/2/hi/5141542.stm)
... """
>>> cl.classify(SPAM_TEXT)
'spam'
>>>
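
Continuing the session above, the same classify call can be reused to sort a whole mailbox; a minimal sketch, assuming a hypothetical incoming directory of plain-text emails:

>>> import os
>>>
>>> incoming = '/home/tasdik/incoming'  # hypothetical directory of incoming emails
>>> results = {}
>>> for name in os.listdir(incoming):
...     with open(os.path.join(incoming, name)) as mail:
...         results[name] = cl.classify(mail.read())  # each value is 'spam' or 'ham'
...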

3.1   Accuracy of the classifier

>>> from spammy import Spammy
>>> directory = '/home/tasdik/Dropbox/projects/spammy/examples/training_dataset'
>>> cl = Spammy(directory, limit=300)  # training on only 300 spam and ham files
>>> cl.train()
>>> data_dir = '/home/tasdik/Dropbox/projects/spammy/examples/test_dataset'
>>>
>>> cl.accuracy(directory=data_dir, label='spam', limit=300)
0.9554794520547946
>>> cl.accuracy(directory=data_dir, label='ham', limit=300)
0.9033333333333333
>>>
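
Each accuracy call classifies up to limit test files for the given label and returns the fraction it gets right: here, roughly 96% of the spam and 90% of the ham test emails are classified correctly.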

4   Installation

[back to top]

NOTE: spammy currently supports only Python 2

Install the dependencies first:

$ pip install nltk==3.2.1 beautifulsoup4==4.4.1
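
Depending on how NLTK is set up on your machine, you may also need its tokenizer and stopword data; this is an assumption about the local NLTK setup rather than a documented spammy requirement:

$ python -m nltk.downloader punkt stopwords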

To install, use pip:

$ pip install spammy

or, if you don't have pip, use easy_install

$ easy_install spammy

Or build it yourself (only if you must):

$ git clone https://github.com/tasdikrahman/spammy.git
$ python setup.py install

4.1   Upgrading

To upgrade the package,

$ pip install -U spammy

4.2   Installation behind a proxy

If you are behind a proxy, this should work:

$ pip --proxy [username:password@]domain_name:port install spammy

5   Benchmarks

[back to top]

Spammy is blazingly fast once trained.

Don't believe me? Have a look:

>>> import timeit
>>> from spammy import Spammy
>>>
>>> directory = '/home/tasdik/Dropbox/projects/spamfilter/data/corpus3'
>>> cl = Spammy(directory, limit=100)
>>> cl.train()
>>> SPAM_TEXT_2 = \
... """
... INTERNATIONAL MONETARY FUND (IMF)
... DEPT: WORLD DEBT RECONCILIATION AGENCIES.
... ADVISE: YOUR OUTSTANDING PAYMENT NOTIFICATION
...
... Attention
... A power of attorney was forwarded to our office this morning by two gentle men,
... one of them is an American national and he is MR DAVID DEANE by name while the
... other person is MR... JACK MORGAN by name a CANADIAN national.
... This gentleman claimed to be your representative, and this power of attorney
... stated that you are dead; they brought an account to replace your information
... in other to claim your fund of (US$9.7M) which is now lying DORMANT and UNCLAIMED,
...  below is the new account they have submitted:
...                     BANK.-HSBC CANADA
...                     Vancouver, CANADA
...                     ACCOUNT NO. 2984-0008-66
...
... Be further informed that this power of attorney also stated that you suffered.
... """
>>>
>>> def classify_timeit():
...    result = cl.classify(SPAM_TEXT_2)
...
>>> timeit.repeat(classify_timeit, number=5)
[0.1810469627380371, 0.16121697425842285, 0.16121196746826172]
>>>
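
Each repeat runs five classifications, so the best run above (~0.16 s) works out to roughly 32 ms per classification on this machine.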

6   Contributing

[back to top]

Refer to the CONTRIBUTING page for details.

6.1   Roadmap

  • Include more algorithms for increased accuracy
  • Python 3 support

7   Licensing

[back to top]

Spammy is built by Tasdik Rahman and licensed under GPLv3.

spammy Copyright (C) 2016 Tasdik Rahman ([email protected])

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see <http://www.gnu.org/licenses/>.

You can find a full copy of the LICENSE file here

8   Credits

[back to top]

If you'd like to give me credit somewhere on your blog or tweet a shout-out to @tasdikrahman, well hey, I'll take it.

9   Donation

If you have found my little bits of software of any use to you, you can help me pay my internet bills :)

  • PayPal
  • Instamojo
  • Gratipay
  • Patreon
