Fixes mojibake and other glitches in Unicode text, after the fact.

Overview

ftfy: fixes text for you

>>> print(fix_encoding("(à¸‡'âŒ£')à¸‡"))
(ง'⌣')ง

Full documentation: https://ftfy.readthedocs.org

Testimonials

  • “My life is livable again!” — @planarrowspace
  • “A handy piece of magic” — @simonw
  • “Saved me a large amount of frustrating dev work” — @iancal
  • “ftfy did the right thing right away, with no faffing about. Excellent work, solving a very tricky real-world (whole-world!) problem.” — Brennan Young
  • “Hat mir die Tage geholfen. Im Übrigen bin ich der Meinung, dass wir keine komplexen Maschinen mit Computern bauen sollten solange wir nicht einmal Umlaute sicher verarbeiten können. :D” — Bruno Ranieri
  • “I have no idea when I’m gonna need this, but I’m definitely bookmarking it.” — /u/ocrow
  • “9.2/10” — pylint

Developed at Luminoso

Luminoso makes groundbreaking software for text analytics that really understands what words mean, in many languages. Our software is used by enterprise customers such as Sony, Intel, Mars, and Scotts, and it's built on Python and open-source technologies.

We use ftfy every day at Luminoso, because the first step in understanding text is making sure it has the correct characters in it!

Luminoso is growing fast and hiring. If you're interested in joining us, take a look at our careers page.

What it does

ftfy fixes Unicode that's broken in various ways.

The goal of ftfy is to take in bad Unicode and output good Unicode, for use in your Unicode-aware code. This is different from taking in non-Unicode and outputting Unicode, which is not a goal of ftfy. It also isn't designed to protect you from having to write Unicode-aware code. ftfy helps those who help themselves.

Of course you're better off if your input is decoded properly and has no glitches. But you often don't have any control over your input; it's someone else's mistake, but it's your problem now.

ftfy will do everything it can to fix the problem.

Mojibake

The most interesting kind of brokenness that ftfy will fix is when someone has encoded Unicode with one standard and decoded it with a different one. This often shows up as characters that turn into nonsense sequences (called "mojibake"):

  • The word schön might appear as schÃ¶n.
  • An em dash (—) might appear as â€”.
  • Text that was meant to be enclosed in quotation marks might end up instead enclosed in â€œ and â€<9d>, where <9d> represents an unprintable character.

ftfy uses heuristics to detect and undo this kind of mojibake, with a very low rate of false positives.
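
To make the mechanism concrete, the following sketch reproduces the kind of mix-up described above using only Python's built-in codecs. It is not how ftfy detects mojibake (ftfy applies heuristics to decide when a repair is warranted). The helper names make_mojibake and undo_mojibake are hypothetical, and the sketch assumes the wrong decoder was Windows-1252.

```python
def make_mojibake(text: str, wrong_encoding: str = "windows-1252") -> str:
    """Encode text as UTF-8, then decode the bytes with the wrong codec."""
    return text.encode("utf-8").decode(wrong_encoding)


def undo_mojibake(garbled: str, wrong_encoding: str = "windows-1252") -> str:
    """Reverse the mix-up: recover the original bytes, then decode as UTF-8."""
    return garbled.encode(wrong_encoding).decode("utf-8")


garbled_word = make_mojibake("schön")
garbled_dash = make_mojibake("\u2014")  # U+2014 EM DASH

print(garbled_word)                 # schÃ¶n
print(garbled_dash)                 # â€”
print(undo_mojibake(garbled_word))  # schön
```

Applying undo_mojibake blindly to text that was never garbled can raise a UnicodeDecodeError or corrupt the text, which is why a detection step matters. Byte 0x9D is also unassigned in Python's windows-1252 codec, so the closing-quote case from the list above needs more lenient handling than this sketch provides.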

This part of ftfy now has an unofficial Web implementation by simonw: https://ftfy.now.sh/

Examples

fix_text is the main function of ftfy. This section is meant to give you a taste of the things it can do. fix_encoding is the more specific function that only fixes mojibake.

Please read the documentation for more information on what ftfy does, and how to configure it for your needs.

>>> print(fix_text('This text should be in â€œquotesâ€\x9d.'))
This text should be in "quotes".

>>> print(fix_text('Ã¼nicode'))
ünicode

>>> print(fix_text('Broken text&hellip; it&#x2019;s flubberific!',
...                normalization='NFKC'))
Broken text... it's flubberific!

>>> print(fix_text('HTML entities &lt;3'))
HTML entities <3

>>> print(fix_text('<em>HTML entities in HTML &lt;3</em>'))
<em>HTML entities in HTML &lt;3</em>

>>> print(fix_text('\001\033[36;44mI&#x92;m blue, da ba dee da ba '
...               'doo&#133;\033[0m', normalization='NFKC'))
I'm blue, da ba dee da ba doo...

>>> print(fix_text('ＬＯＵＤ　ＮＯＩＳＥＳ'))
LOUD NOISES

>>> print(fix_text('ＬＯＵＤ　ＮＯＩＳＥＳ', fix_character_width=False))
ＬＯＵＤ　ＮＯＩＳＥＳ
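
The character-width fix in the last two examples can be approximated with the standard library. The sketch below is a stand-in under stated assumptions, not ftfy's implementation: it uses NFKC normalization, which folds fullwidth characters to their ordinary forms but also rewrites other compatibility characters. The helper to_fullwidth is hypothetical and exists only to build the input.

```python
import unicodedata

# Fullwidth forms U+FF01..U+FF5E mirror ASCII 0x21..0x7E at a fixed offset.
FULLWIDTH_OFFSET = 0xFEE0


def to_fullwidth(text: str) -> str:
    """Hypothetical helper: widen printable ASCII; map space to U+3000."""
    widened = []
    for char in text:
        if char == " ":
            widened.append("\u3000")  # IDEOGRAPHIC SPACE
        elif "!" <= char <= "~":
            widened.append(chr(ord(char) + FULLWIDTH_OFFSET))
        else:
            widened.append(char)
    return "".join(widened)


wide = to_fullwidth("LOUD NOISES")
narrow = unicodedata.normalize("NFKC", wide)

print(wide)
print(narrow)  # LOUD NOISES
```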

Installing

ftfy is a Python 3 package that can be installed using pip:

pip install ftfy

(Or use pip3 install ftfy on systems where Python 2 and 3 are both globally installed and pip refers to Python 2.)

If you're on Python 2.7, you can install an older version:

pip install 'ftfy<5'

You can also clone this Git repository and install it with python setup.py install.

Who maintains ftfy?

I'm Robyn Speer. I develop this tool as part of my text-understanding company, Luminoso, where it has proven essential.

Luminoso provides ftfy as free, open source software under the extremely permissive MIT license.

You can report bugs regarding ftfy on GitHub and we'll handle them.

Citing ftfy

ftfy has been used as a crucial data processing step in major NLP research.

It's important to give credit appropriately to everyone whose work you build on in research. This includes software, not just high-status contributions such as mathematical models. All I ask when you use ftfy for research is that you cite it.

ftfy has a citable record on Zenodo. A citation of ftfy may look like this:

Robyn Speer. (2019). ftfy (Version 5.5). Zenodo.
http://doi.org/10.5281/zenodo.2591652

In BibTeX format, the citation is:

@misc{speer-2019-ftfy,
  author       = {Robyn Speer},
  title        = {ftfy},
  note         = {Version 5.5},
  year         = 2019,
  howpublished = {Zenodo},
  doi          = {10.5281/zenodo.2591652},
  url          = {https://doi.org/10.5281/zenodo.2591652}
}
Comments
  • Bump certifi from 2021.10.8 to 2022.12.7

    opened by dependabot[bot] 0
  • Performance improvements using google-re2. 2 times faster to run fix_text()

    Hi, thanks for the great lib!

    In our real-time inference server, we use ftfy to clean inputs coming from users. We noticed that processing time can be huge with a lot of text. So I ran this little experiment using google-re2, a regex engine optimized for performance. On my test file of 10000 lines, I was able to clean the text 2 times faster. On a run of 10, I'm getting 16.15 seconds with vanilla ftfy and 8.71 seconds with the optimizations made in this PR.

    As is, this PR is not mergeable, since it implies a big change for the library. I think it would be better to offer a way of choosing the regex engine. If you are interested in merging it, I can make the necessary changes. I'm publishing it so that you and the community know it's possible and what the expected outcomes can be. Of course, I made sure that all the tests are green.

    Anyone can test it by installing this branch pip install git+https://@github.com/ablanchard/[email protected]

    Notes on the PR :

    • re.VERBOSE is not supported by google-re2. To keep comments and line breaks, I process the patterns by hand using a regex. It's a bit of a hack, but it works.
    • Lookahead and lookbehind are not supported by google-re2, so I split the UTF-8 detector and the a-grave regex into 2 separate regexes in order to keep the same behavior. This means that UTF8_DETECTOR_RE.search() doesn't return the same results as before, so you have to call the method utf8_detector() instead. The same idea goes for the sub method.
    • By default google-re2 uses UTF-8 for encoding regexes, so to use binary strings you have to pass options=LATIN_OPTIONS.
    • I didn't migrate the surrogates for UTF-16. As I understand it, they're not supported by google-re2, so I left that part as it was.

    PS: Code used for the benchmark:

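    import sys
    import time

    import ftfy
    import pandas as pd

    # Load the CSV given on the command line; it needs an 'input_text' column.
    df = pd.read_csv(sys.argv[1])
    texts = df['input_text'].tolist()

    # Time one full pass of fix_text over every row.
    start_time = time.perf_counter()
    res = [ftfy.fix_text(text) for text in texts]
    print(time.perf_counter() - start_time)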
    
    opened by ablanchard 0
  • Restore Python 36 support

    Hi! Not much prevents ftfy from still supporting Python 3.6, which is still widely supported on Linux distros. This PR re-enables Python 3.6 support. I also removed some upper bounds on dependencies to avoid some issues, as highlighted in https://iscinumpy.dev/post/bound-version-constraints/. Thanks for your kind consideration!

    opened by pombredanne 0
  • Ä° and Ä« not detected as mojibake

    Hi @rspeer. Many thanks for creating and maintaining FTFY! We're using it at Sectigo to help prevent mojibake from finding its way into string fields in the digital certificates that we issue. We've noticed a couple of mojibake sequences that FTFY doesn't currently detect and fix:

    Desired behaviour:

    $ echo "Ä°stanbul" | iconv -t WINDOWS-1252
    İstanbul
    $ echo "RÄ«ga" | iconv -t WINDOWS-1252
    Rīga
    

    Current FTFY behaviour:

    $ echo "Ä°stanbul" | ftfy
    Ä°stanbul
    $ echo "RÄ«ga" | ftfy
    RÄ«ga
    

    Would it be possible to make FTFY handle these cases?

    opened by robstradling 0
  • On the wish list: "Pyreneeu00ebn" being explained as "Pyreneeën 71"

    A while ago I blogged about "Pyreneeën 71" on a web-site being incorrectly represented as "Pyreneeu00ebn".

    Basically the Unicode code point U+00EB : LATIN SMALL LETTER E WITH DIAERESIS is being represented as u00eb.

    Is this something that ftfy could potentially recognise?

    Right now it does not:

    >>> ftfy.fix_and_explain("Pyreneeu00ebn")
    ExplainedText(text='Pyreneeu00ebn', explanation=[])
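
As a rough illustration of what such a repair could look like outside ftfy, the sketch below rewrites uXXXX sequences whose backslash was lost. Everything here is an assumption made for illustration: the helper name restore_lost_escapes is hypothetical, it is not part of ftfy, and it only handles code points U+0080 to U+00FF to limit false positives on ordinary words.

```python
import re

# Matches "u00" followed by two hex digits in the range 80..FF,
# i.e. a \u00XX escape that has lost its leading backslash.
LOST_ESCAPE_RE = re.compile(r"u00([89a-fA-F][0-9a-fA-F])")


def restore_lost_escapes(text: str) -> str:
    """Replace each matched sequence with the character it encodes."""
    return LOST_ESCAPE_RE.sub(lambda m: chr(int(m.group(1), 16)), text)


print(restore_lost_escapes("Pyreneeu00ebn 71"))  # Pyreneeën 71
```

A pattern this simple can still misfire on text that legitimately contains such a sequence, so it is only suitable where the input is known to be damaged in this way.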
    
    opened by jpluimers 2
  • Any idea which encoding failure could cause "beëindiging" to be printed in a letter as "beᅵindiging"?

    opened by jpluimers 0
Releases(v6.0.3)
  • v6.0.3(Aug 23, 2021)

    Updates in 6.0.x:

    • New function: ftfy.fix_and_explain() can describe all the transformations that happen when fixing a string. This is similar to what ftfy.fixes.fix_encoding_and_explain() did in previous versions, but it can fix more than the encoding.
    • fix_and_explain() and fix_encoding_and_explain() are now in the top-level ftfy module.
    • Changed the heuristic entirely. ftfy no longer needs to categorize every Unicode character, but only characters that are expected to appear in mojibake.
    • Because of the new heuristic, ftfy will no longer have to release a new version for every new version of Unicode. It should also run faster and use less RAM when imported.
    • The heuristic ftfy.badness.is_bad(text) can be used to determine whether there appears to be mojibake in a string. Some users were already using the old function sequence_weirdness() for that, but this one is actually designed for that purpose.
    • Instead of a pile of named keyword arguments, ftfy functions now take in a TextFixerConfig object. The keyword arguments still work, and become settings that override the defaults in TextFixerConfig.
    • Added support for UTF-8 mixups with Windows-1253 and Windows-1254.
    • Overhauled the documentation: https://ftfy.readthedocs.org
    • Requires Python 3.6 or later.
  • v5.5.1(Mar 12, 2019)

Owner
Luminoso Technologies, Inc.