A collection of pre-commit hooks for handling text files.

Last update: Oct 28, 2022

texthooks

A collection of pre-commit hooks for handling text files, in particular hooks for handling Unicode characters which may be undesirable in a repository.

Usage with pre-commit

To use with pre-commit, include this repo and the desired hooks in .pre-commit-config.yaml:

- repo: https://github.com/sirosen/texthooks
  rev: 0.1.0
  hooks:
    - id: fix-smartquotes
    - id: fix-ligatures

Standalone Usage

Each hook is usable as a CLI script. Simply

pip install texthooks

and then invoke, e.g.

fix-smartquotes FILENAME

Supported Hooks

`fix-smartquotes`

This fixes text copy-pasted from applications which replace double-quotes with curly quotes. It does not convert corner brackets, braille quotation marks, or angle quotation marks. Those characters are not typically the result of copy-paste errors, so they are allowed.

Low quotation marks vary in usage and meaning by language, and some languages use quotation marks which face "outwards" (the opposite of English usage). For the most part, these and exotic characters such as double-prime quotes are ignored.

When a file contains any of the offending marks, the marks are replaced and the run is marked as failed.
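
The replace-and-fail behavior described above can be sketched in a few lines of Python. This is only an illustration, not the hook's actual implementation: the function name `fix_smartquotes_text` and the two character sets are assumptions chosen for the example, and the hook's real defaults may include more characters.

```python
from typing import Tuple

# Illustrative character sets only (assumed, not the hook's real defaults).
CURLY_DOUBLE = "\u201c\u201d"        # left and right double quotation marks
CURLY_SINGLE = "\u2018\u2019\u201b"  # left, right, and high-reversed-9 single quotes


def fix_smartquotes_text(text: str) -> Tuple[str, bool]:
    """Return the text with curly quotes made straight, plus a 'changed' flag."""
    table = {ord(ch): '"' for ch in CURLY_DOUBLE}
    table.update({ord(ch): "'" for ch in CURLY_SINGLE})
    fixed = text.translate(table)
    # A hook built on this would report failure whenever changed is True.
    return fixed, fixed != text


fixed, changed = fix_smartquotes_text("She said \u201cit\u2019s fine\u201d")
print(fixed, changed)
```

Returning a changed flag alongside the new text mirrors the usual pre-commit convention: a fixer rewrites the file and exits non-zero so the user can review and re-stage the result.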

Overriding Quotation Characters

Two options are available for specifying exactly which characters will be replaced. For ease of use, they are specified as comma-separated, hex-encoded Unicode codepoints.

Suppose you wanted to avoid replacing the "Heavy single comma quotation mark ornament" (275C) and the "Heavy single turned comma quotation mark ornament" (275B) characters. You could override the single quote codepoints as follows:

- repo: https://github.com/sirosen/texthooks
  rev: 0.1.0
  hooks:
    - id: fix-smartquotes
      # replace default single quote chars with this set:
      # apostrophe, fullwidth apostrophe, left single quote, single high
      # reversed-9 quote, right single quote
      args: ["--single-quote-codepoints", "0027,FF07,2018,201B,2019"]
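
To see which characters a codepoint list like the one above selects, the string can be decoded with the Python standard library. This is a minimal sketch; `parse_codepoints` is a hypothetical helper written for this example, not a function exposed by texthooks.

```python
import unicodedata
from typing import List


def parse_codepoints(arg: str) -> List[str]:
    """Turn a comma-separated list of hex codepoints into characters."""
    return [chr(int(part.strip(), 16)) for part in arg.split(",")]


chars = parse_codepoints("0027,FF07,2018,201B,2019")
for ch in chars:
    # Print each codepoint next to its official Unicode name.
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

# The two ornament characters are absent, so they would be left alone.
print("\u275b" in chars, "\u275c" in chars)
```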

`fix-ligatures`

Automatically find and replace ligature characters with their ASCII equivalents.

This replaces ligatures which may be created by programs like LaTeX for presentation with their strictly-equivalent ASCII counterparts. For example, fi and ff may be ligature-ized.

This hook converts these back into ASCII so that tools like grep will behave as expected.
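
As a rough illustration of why this matters for tools like grep, the sketch below maps the five Latin f-ligatures from the Unicode Alphabetic Presentation Forms block back to ASCII. The table and the name `fix_ligatures_text` are assumptions for this example; the set of ligatures the hook actually handles may differ.

```python
# Assumed, illustrative mapping: the five Latin f-ligatures (U+FB00 to U+FB04).
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}


def fix_ligatures_text(text: str) -> str:
    """Replace each ligature character with its ASCII letter sequence."""
    return text.translate(str.maketrans(LIGATURES))


sample = "an e\ufb03cient \ufb01le format"
print("efficient" in sample)                      # plain substring search misses it
print("efficient" in fix_ligatures_text(sample))  # found after conversion
```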

`forbid-bidi-controls`

This is a checker which forbids the use of Unicode bidirectional text control characters.

These are directional formatting characters which can be used to construct text with unexpected or unclear semantics. For example, in programming languages which allow bidirectional text in statements, "X" = ייִדיש can be written with right-to-left reversal to mean that the variable ייִדיש is assigned a value of "X".
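
A checker of this kind can be approximated by scanning for the explicit directional formatting characters. The sketch below assumes the forbidden set is the embedding, override, and isolate controls (U+202A to U+202E and U+2066 to U+2069); the exact set the hook rejects is not stated here, and `find_bidi_controls` is a hypothetical name.

```python
import unicodedata
from typing import List, Tuple

# Assumed set: explicit embeddings, overrides, isolates, and their terminators.
BIDI_CONTROLS = set("\u202a\u202b\u202c\u202d\u202e\u2066\u2067\u2068\u2069")


def find_bidi_controls(text: str) -> List[Tuple[int, int, str]]:
    """Return (line, column, character name) for each control found, 1-based."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in BIDI_CONTROLS:
                hits.append((lineno, col, unicodedata.name(ch)))
    return hits


for hit in find_bidi_controls("ok = 1\nx = '\u202eabc'\n"):
    print(hit)
```

Because it only reports and never rewrites, a check like this would fail the run whenever the returned list is non-empty and leave the fix to the author.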

CHANGELOG

0.2.2

Fix a bug in CLI argument handling for all hooks

0.2.1

Fix a typo in forbid-bidi-controls entrypoint

0.2.0

Add the forbid-bidi-controls hook
Adjust the handling of file encodings. Files will be read with UTF-8 encoding by default in most cases.

0.1.0

Initial release with fix-ligatures and fix-smartquotes hooks

Owner

Stephen Rosen
