Last update: Jan 01, 2023

TheFuzz

Fuzzy string matching like a boss. It uses Levenshtein Distance to calculate the differences between sequences in a simple-to-use package.
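For readers unfamiliar with the metric, here is a minimal pure-Python sketch of Levenshtein distance (the minimum number of single-character insertions, deletions and substitutions needed to turn one string into another). The function name `levenshtein` is our own for illustration; it is not part of thefuzz, and the library's internals may differ.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits that turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string, keep rows short
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


print(levenshtein("this is a test", "this is a test!"))  # 1
```

The scores shown under Usage are normalized to a 0-100 scale rather than reported as raw edit counts.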

Requirements

Python 2.7 or higher
difflib
python-Levenshtein (optional; provides a 4-10x speedup in string matching, though results may differ in certain cases)
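Since python-Levenshtein is optional, it can be handy to check whether it is available in the current environment. A small sketch, assuming the package is importable under the module name `Levenshtein`; the helper name `has_levenshtein_speedup` is hypothetical and not part of thefuzz.

```python
import importlib.util


def has_levenshtein_speedup() -> bool:
    """Return True if the optional Levenshtein module can be imported."""
    return importlib.util.find_spec("Levenshtein") is not None


print("speedup available:", has_levenshtein_speedup())
```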

For testing

pycodestyle
hypothesis
pytest

Installation

Using PIP via PyPI

pip install thefuzz

or the following to install python-Levenshtein too

pip install thefuzz[speedup]

Using PIP via GitHub

pip install git+git://github.com/seatgeek/thefuzz.git@<version>#egg=thefuzz

Adding to your requirements.txt file (run pip install -r requirements.txt afterwards)

git+ssh://git@github.com/seatgeek/thefuzz.git@<version>#egg=thefuzz

Manually via GIT

git clone git://github.com/seatgeek/thefuzz.git thefuzz
cd thefuzz
python setup.py install

Usage

>>> from thefuzz import fuzz
>>> from thefuzz import process

Simple Ratio

>>> fuzz.ratio("this is a test", "this is a test!")
    97
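To give a feel for where a score like 97 comes from, here is a rough sketch built on difflib, which is listed under Requirements. The name `simple_ratio_sketch` is ours; this is an approximation, not the library's implementation, and results may differ when python-Levenshtein is installed.

```python
from difflib import SequenceMatcher


def simple_ratio_sketch(a: str, b: str) -> int:
    """Similarity of two strings on a 0-100 scale, via difflib."""
    # ratio() is 2 * matched_characters / (len(a) + len(b))
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


print(simple_ratio_sketch("this is a test", "this is a test!"))  # 97
```

Here 14 characters match out of 14 + 15, so the ratio is 28 / 29, which rounds to 97.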

Partial Ratio

>>> fuzz.partial_ratio("this is a test", "this is a test!")
    100
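One way to picture a partial ratio: slide the shorter string along the longer one and keep the best-scoring window. The sketch below does exactly that by brute force; `partial_ratio_sketch` is a hypothetical name, and the library may choose candidate windows differently.

```python
from difflib import SequenceMatcher


def partial_ratio_sketch(a: str, b: str) -> int:
    """Best 0-100 score of the shorter string against any same-length
    window of the longer string."""
    shorter, longer = (a, b) if len(a) <= len(b) else (b, a)
    best = 0.0
    for start in range(len(longer) - len(shorter) + 1):
        window = longer[start:start + len(shorter)]
        best = max(best, SequenceMatcher(None, shorter, window).ratio())
    return int(round(100 * best))


print(partial_ratio_sketch("this is a test", "this is a test!"))  # 100
```

The trailing "!" no longer costs anything because the first 14-character window of the longer string is an exact match.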

Token Sort Ratio

>>> fuzz.ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")
    91
>>> fuzz.token_sort_ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")
    100
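The idea behind a token sort ratio can be sketched as: split each string into tokens, sort them, rejoin, then compare. The version below is a simplification under our own assumptions (lowercasing and whitespace splitting only; `token_sort_ratio_sketch` is a hypothetical name), so it will not reproduce the library's preprocessing in every case.

```python
from difflib import SequenceMatcher


def token_sort_ratio_sketch(a: str, b: str) -> int:
    """Compare two strings after sorting their whitespace-separated tokens."""
    def normalize(text: str) -> str:
        # Assumption: lowercase and split on whitespace only.
        return " ".join(sorted(text.lower().split()))

    matcher = SequenceMatcher(None, normalize(a), normalize(b))
    return int(round(100 * matcher.ratio()))


print(token_sort_ratio_sketch("fuzzy wuzzy was a bear",
                              "wuzzy fuzzy was a bear"))  # 100
```

Both inputs normalize to "a bear fuzzy was wuzzy", so word order stops mattering.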

Token Set Ratio

>>> fuzz.token_sort_ratio("fuzzy was a bear", "fuzzy fuzzy was a bear")
    84
>>> fuzz.token_set_ratio("fuzzy was a bear", "fuzzy fuzzy was a bear")
    100
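A token set ratio additionally ignores repeated tokens. One common way to sketch it: compare the sorted shared tokens against each string's full sorted token set and keep the best score. The code below is an illustrative approximation (the names `_ratio` and `token_set_ratio_sketch` are ours, and only lowercasing and whitespace splitting are applied), not the library's exact algorithm.

```python
from difflib import SequenceMatcher


def _ratio(a: str, b: str) -> int:
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


def token_set_ratio_sketch(a: str, b: str) -> int:
    """Compare two strings as sets of tokens, ignoring order and repeats."""
    tokens_a = set(a.lower().split())
    tokens_b = set(b.lower().split())
    common = " ".join(sorted(tokens_a & tokens_b))
    only_a = " ".join(sorted(tokens_a - tokens_b))
    only_b = " ".join(sorted(tokens_b - tokens_a))
    combined_a = (common + " " + only_a).strip()
    combined_b = (common + " " + only_b).strip()
    # Best of: shared vs. each side, and the two sides against each other.
    return max(
        _ratio(common, combined_a),
        _ratio(common, combined_b),
        _ratio(combined_a, combined_b),
    )


print(token_set_ratio_sketch("fuzzy was a bear",
                             "fuzzy fuzzy was a bear"))  # 100
```

The duplicated "fuzzy" disappears once each string becomes a set, so both sides reduce to the same four tokens.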

Process

>>> choices = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]
>>> process.extract("new york jets", choices, limit=2)
    [('New York Jets', 100), ('New York Giants', 78)]
>>> process.extractOne("cowboys", choices)
    ("Dallas Cowboys", 90)
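Conceptually, extraction scores every choice against the query and returns the best few. A minimal sketch of that idea using only the standard library follows; `score` and `extract_sketch` are hypothetical names, and because the library's own default scorer differs from this plain ratio, scores such as the 78 above will not be reproduced exactly.

```python
import heapq
from difflib import SequenceMatcher


def score(query: str, choice: str) -> int:
    """Case-insensitive 0-100 similarity (a stand-in scorer)."""
    matcher = SequenceMatcher(None, query.lower(), choice.lower())
    return int(round(100 * matcher.ratio()))


def extract_sketch(query, choices, limit=5):
    """Return up to `limit` (choice, score) pairs, best first."""
    scored = ((choice, score(query, choice)) for choice in choices)
    return heapq.nlargest(limit, scored, key=lambda pair: pair[1])


choices = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]
top = extract_sketch("new york jets", choices, limit=2)
print(top[0])     # ('New York Jets', 100)
print(top[1][0])  # New York Giants
```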

You can also pass additional parameters to the extractOne method to make it use a specific scorer. A typical use case is matching file paths:

>>> process.extractOne("System of a down - Hypnotize - Heroin", songs)
    ('/music/library/good/System of a Down/2005 - Hypnotize/01 - Attack.mp3', 86)
>>> process.extractOne("System of a down - Hypnotize - Heroin", songs, scorer=fuzz.token_sort_ratio)
    ("/music/library/good/System of a Down/2005 - Hypnotize/10 - She's Like Heroin.mp3", 61)
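To illustrate why the choice of scorer can change the winner, here is a small self-contained sketch with a pluggable `scorer` argument. All names (`plain_ratio`, `sorted_token_ratio`, `extract_one_sketch`) and the sample choices are made up for illustration; this mirrors the idea of the scorer parameter, not the library's code.

```python
from difflib import SequenceMatcher


def plain_ratio(a: str, b: str) -> int:
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


def sorted_token_ratio(a: str, b: str) -> int:
    # Sort whitespace-separated tokens so word order is ignored.
    def normalize(text: str) -> str:
        return " ".join(sorted(text.lower().split()))
    return plain_ratio(normalize(a), normalize(b))


def extract_one_sketch(query, choices, scorer=plain_ratio):
    """Return the (choice, score) pair with the highest score."""
    return max(((c, scorer(query, c)) for c in choices), key=lambda p: p[1])


choices = ["bear fur", "fuzzy bear"]
print(extract_one_sketch("bear fuzzy", choices))
# ('bear fur', 78)
print(extract_one_sketch("bear fuzzy", choices, scorer=sorted_token_ratio))
# ('fuzzy bear', 100)
```

With the order-sensitive scorer, the choice that merely shares a prefix wins; once tokens are sorted, the reordered exact match comes out on top.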

Owner

SeatGeek
