find-unicode-control

These scripts look for non-printable Unicode characters in all text files in a source tree. find_unicode_control.py should work with Python 2 as well as Python 3. It uses python-magic, if available, to determine the file type; otherwise it simply spawns the file --mime-type command. The scripts should be functionally the same, and find_unicode_control.py could eventually be retired.
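The file-type detection described above could look roughly like the sketch below. It is only an illustration, not the script's actual code: the function names mime_type and is_text are hypothetical, it assumes the python-magic package (which provides magic.from_file), and treating only text/* MIME types as text is an assumption as well.

```python
import subprocess

try:
    import magic  # python-magic; optional dependency
except ImportError:
    magic = None


def mime_type(path):
    """Return the MIME type of path, e.g. 'text/x-c' (hypothetical helper)."""
    if magic is not None:
        return magic.from_file(path, mime=True)
    # Fallback: spawn `file --mime-type`; -b omits the file name from the output.
    out = subprocess.check_output(["file", "--mime-type", "-b", path])
    return out.decode("utf-8", "replace").strip()


def is_text(mime):
    """Assumption for this sketch: only text/* types count as text files."""
    return mime.startswith("text/")
```

Only files for which is_text(mime_type(path)) is true would then be scanned.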

usage: find_unicode_control.py [-h] [-p {all,bidi}] [-v] [-d] [-t] [-c CONFIG] path [path ...]

Look for Unicode control characters

positional arguments:
  path                  Sources to analyze

optional arguments:
  -h, --help            show this help message and exit
  -p {all,bidi}, --nonprint {all,bidi}
                        Look for either all non-printable unicode characters or bidirectional control characters.
  -v, --verbose         Verbose mode.
  -d, --detailed        Print line numbers where characters occur.
  -t, --notests         Exclude tests (basically test.* as a component of path).
  -c CONFIG, --config CONFIG
                        Configuration file to read settings from.

If Unicode bidirectional (BIDI) control characters or non-printable characters are found in a file, the script prints output such as the following:

$ python3 find_unicode_control.py -p bidi *.c
commenting-out.c: bidirectional control characters: {'\u202e', '\u2066', '\u2069'}
early-return.c: bidirectional control characters: {'\u2067'}
stretched-string.c: bidirectional control characters: {'\u202e', '\u2066', '\u2069'}

With the -d flag, the output is more detailed and shows line numbers within files, but this mode is also slower:

$ find_unicode_control.py -p bidi -d .
./commenting-out.c:4 bidirectional control characters: ['\u202e', '\u2066', '\u2069', '\u2066']
./commenting-out.c:6 bidirectional control characters: ['\u202e', '\u2066']
./early-return.c:4 bidirectional control characters: ['\u2067']
./stretched-string.c:6 bidirectional control characters: ['\u202e', '\u2066', '\u2069', '\u2066']

The optimal workflow is to do a quick scan through a source tree first and, if any issues are found, run a detailed scan on only the affected files.
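The quick-then-detailed workflow can be sketched in a few lines of Python. This is a minimal illustration rather than the script's implementation: the names BIDI_CONTROLS, quick_scan, detailed_scan and scan_sources are hypothetical, the character set (the explicit embedding, override and isolate characters) is an assumption that may differ from the script's own list, and sources are passed in as strings instead of being read from files.

```python
# Bidirectional formatting characters: embeddings/overrides (U+202A-U+202E)
# and isolates (U+2066-U+2069). Assumed set, for illustration only.
BIDI_CONTROLS = frozenset(
    chr(c) for c in list(range(0x202A, 0x202F)) + list(range(0x2066, 0x206A))
)


def quick_scan(text):
    """Return the set of bidi control characters present in text."""
    return BIDI_CONTROLS.intersection(text)


def detailed_scan(text):
    """Return (line_number, [characters]) for each line containing bidi controls."""
    hits = []
    for lineno, line in enumerate(text.split("\n"), start=1):
        found = [ch for ch in line if ch in BIDI_CONTROLS]
        if found:
            hits.append((lineno, found))
    return hits


def scan_sources(sources):
    """Quick-scan every source, then detail-scan only the ones that were flagged.

    sources maps a name to its text content (a stand-in for reading files).
    """
    flagged = [name for name, text in sources.items() if quick_scan(text)]
    return {name: detailed_scan(sources[name]) for name in flagged}
```

The quick pass only needs a set intersection per file, while the detailed pass walks every line, which is why restricting it to flagged files keeps the overall scan fast.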

Configuration file

If files need to be excluded from the scan, create a configuration file and set a scan_exclude variable to a list of regular expressions that match the files or paths to exclude. Alternatively, add a scan_exclude_mime list of MIME types to ignore; each entry can likewise be a regular expression. Here is an example configuration that glibc uses:

scan_exclude = [
        # Iconv test data
        r'/iconvdata/testdata/',
        # Test case data
        r'libio/tst-widetext.input$',
        # Test script.  This is to silence the warning:
        # 'utf-8' codec can't decode byte 0xe9 in position 2118: invalid continuation byte
        # since the script tests mixed encoding characters.
        r'localedata/tst-langinfo.sh$']
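To show how such patterns select paths, here is a small sketch that applies the glibc patterns above with re.search. The helper is_excluded is hypothetical, and matching with re.search anywhere in the path is an assumption about how the script interprets the list; the scan_exclude_mime value is likewise only an illustrative guess at a plausible setting.

```python
import re

# Path patterns from the glibc example configuration.
scan_exclude = [
    r'/iconvdata/testdata/',
    r'libio/tst-widetext.input$',
    r'localedata/tst-langinfo.sh$',
]

# Hypothetical MIME patterns; not taken from any real configuration.
scan_exclude_mime = [r'application/.*']


def is_excluded(value, patterns):
    """Return True if any regular expression in patterns matches value."""
    return any(re.search(pattern, value) for pattern in patterns)
```

Under these assumptions, is_excluded("glibc/libio/tst-widetext.input", scan_exclude) is true because the pattern is anchored at the end of the path, while a path such as "glibc/stdlib/qsort.c" is not excluded.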

Notes

This script was quickly hacked together to scan repositories with mostly LTR Unicode content. If you have RTL content (in comments, literals, or even identifiers in code), it will report false positives that you need to weed out. For now, you need to exclude such RTL code using scan_exclude. A long-term wish-list item is to handle RTL a bit more intelligently, if this remains relevant; hopefully more sophisticated RTL diagnostics will make it obsolete!

Owner
Siddhesh Poyarekar
Toolchain hacker and all round nice guy. My openhub profile will probably tell you more about my work: https://www.openhub.net/accounts/siddhesh