These scripts look for non-printable unicode characters in all text files in a source tree

Overview

find-unicode-control

These scripts look for non-printable unicode characters in all text files in a source tree. find_unicode_control.py should work with python2 as well as python3. It uses python-magic if available to determine file type, or simply spawns the file --mime-type command. They should be functionally the same and find_unicode_control.py could eventually get disposed.

usage: find_unicode_control.py [-h] [-p {all,bidi}] [-v] [-c CONFIG] path [path ...]

Look for Unicode control characters

positional arguments:
  path                  Sources to analyze

optional arguments:
  -h, --help            show this help message and exit
  -p {all,bidi}, --nonprint {all,bidi}
                        Look for either all non-printable unicode characters or bidirectional control characters.
  -v, --verbose         Verbose mode.
  -d, --detailed        Print line numbers where characters occur.
  -t, --notests         Exclude tests (basically test.* as a component of path).
  -c CONFIG, --config CONFIG
                        Configuration file to read settings from.

If unicode BIDI control characters or non-printable characters are found in a file, it will print output as follows:

$ python3 find_unicode_control.py -p bidi *.c
commenting-out.c: bidirectional control characters: {'\u202e', '\u2066', '\u2069'}
early-return.c: bidirectional control characters: {'\u2067'}
stretched-string.c: bidirectional control characters: {'\u202e', '\u2066', '\u2069'}

Using the -d flag, the output is more detailed, showing line numbers in files, but this mode is also slower:

find_unicode_control.py -p bidi -d .
./commenting-out.c:4 bidirectional control characters: ['\u202e', '\u2066', '\u2069', '\u2066']
./commenting-out.c:6 bidirectional control characters: ['\u202e', '\u2066']
./early-return.c:4 bidirectional control characters: ['\u2067']
./stretched-string.c:6 bidirectional control characters: ['\u202e', '\u2066', '\u2069', '\u2066']

The optimal workflow would be to do a quick scan through a source tree and if any issues are found, do a detailed scan on only those files.

Configuration file

If files need to be excluded from the scan, make a configuration file and define a scan_exclude variable to a list of regular expressions that match the files or paths to exclude. Alternatively, add a scan_exclude_mime list with the list of mime types to ignore; this can again be a regular expression. Here is an example configuration that glibc uses:

scan_exclude = [
        # Iconv test data
        r'/iconvdata/testdata/',
        # Test case data
        r'libio/tst-widetext.input$',
        # Test script.  This is to silence the warning:
        # 'utf-8' codec can't decode byte 0xe9 in position 2118: invalid continuation byte
        # since the script tests mixed encoding characters.
        r'localedata/tst-langinfo.sh$']

Notes

This script was quickly hacked together to scan repositories with mostly LTR, unicode content. If you have RTL content (either in comments, literals or even identifiers in code), it will give false warnings that you need to weed out. For now you need to exclude such RTL code using scan_exclude but a long term wish list (if this remains relevant, hopefully more sophisticated RTL diagnostics will make it obsolete!) is to handle RTL a bit more intelligently.

Owner
Siddhesh Poyarekar
Toolchain hacker and all round nice guy. My openhub profile will probably tell you more about my work: https://www.openhub.net/accounts/siddhesh
Siddhesh Poyarekar
Homebase Name Changer for Fortnite: Save the World.

Homebase Name Changer This program allows you to change the Homebase name in Fortnite: Save the World. How to use it? After starting the HomebaseNameC

PRO100KatYT 7 May 21, 2022
Python USD rate in RUB parser

Python EUR and USD rate parser. Python USD and EUR rate in RUB parser. Parsing i

Andrew 2 Feb 17, 2022
Nmap script to guess* a GitLab version.

gitlab-version-nse Nmap script to guess* a GitLab version. Usage https://github.com/righel/gitlab-version-nse cd gitlab-version-nse nmap target --s

Luciano Righetti 120 Dec 05, 2022
✨ Un chois aléatoire d'un article sur Wikipedia totalement fait en Python par moi, et en français.

Wikipedia Random Article ❗ Un chois aléatoire d'un article sur Wikipedia totalement fait en Python par moi, et en français. 🔮 Grâce a une requète a w

MrGabin 4 Jul 18, 2021
Pampy: The Pattern Matching for Python you always dreamed of.

Pampy: Pattern Matching for Python Pampy is pretty small (150 lines), reasonably fast, and often makes your code more readable and hence easier to rea

Claudio Santini 3.5k Jan 06, 2023
Install, run, and update apps without root and only in your home directory

Qube Apps Install, run, and update apps in the private storage of a Qube Building instrutions

Micah Lee 26 Dec 27, 2022
This utility synchronises spelling dictionaries from various tools with each other.

This utility synchronises spelling dictionaries from various tools with each other. This way the words that have been trained on MS Office are also correctly checked in vim or Firefox. And vice versa

Patrice Neff 2 Feb 11, 2022
Python code to generate and store certificates automatically , using names from a csv file

WOC-certificate-generator Python code to generate and store certificates automatically , using names from a csv file IMPORTANT In order to make the co

Google Developer Student Club - IIIT Kalyani 10 May 26, 2022
NFT-Generator is the best way to generate thousands of NFTs quick and easily with Python.

NFT-Generator is the best way to generate thousands of NFTs quick and easily with Python. Just add your files, set your configuration and run the scri

78 Dec 27, 2022
Factoral Methods using two different method

Factoral-Methods-using-two-different-method Here, I am finding the factorial of a number by using two different method. The first method is by using f

Sachin Vinayak Dabhade 4 Sep 24, 2021
JavaScript-style async programming for Python.

promisio JavaScript-style async programming for Python. Examples Create a promise-based async function using the promisify decorator. It works on both

Miguel Grinberg 191 Dec 30, 2022
A hashtag from string extract python module

A hashtag from string extract python module

Fayas Noushad 3 Aug 10, 2022
Local backup made easy, with Python and shutil

KTBackup BETA Local backup made easy, with Python and shutil Features One-command backup and restore Minimalistic (only using stdlib) Convenient direc

kelptaken 1 Dec 27, 2021
Macro recording and metaprogramming in Python

macro-kit is a package for efficient macro recording and metaprogramming in Python using abstract syntax tree (AST).

8 Aug 31, 2022
✨ Un DNS Resolver totalement fait en Python par moi, et en français

DNS Resolver ❗ Un DNS Resolver totalement fait en Python par moi, et en français. 🔮 Grâce a une adresse (url) vous pourrez avoir l'ip ainsi que le DN

MrGabin 3 Jun 06, 2021
A simple package for handling variables in string.

A simple package for handling string variables. Welcome! This is a simple package for handling variables in string, You can add or remove variables wi

1 Dec 31, 2021
A library for interacting with Path of Exile game and economy data, and a unique loot filter generation framework.

wraeblast A library for interfacing with Path of Exile game and economy data, and a set of item filters geared towards trade league players. Filter Ge

David Gidwani 29 Aug 28, 2022
async parser for JET

This project is mainly aims to provide an async parsing option for NTDS.dit database file for obtaining user secrets.

15 Mar 08, 2022
Use generator for range function

Use the generator for the range function! installation method: pip install yrange How to use: First import yrange in your application. You can then wo

1 Oct 28, 2021
A collection of common regular expressions bundled with an easy to use interface.

CommonRegex Find all times, dates, links, phone numbers, emails, ip addresses, prices, hex colors, and credit card numbers in a string. We did the har

Madison May 1.5k Dec 31, 2022