Simple, hackable offline speech to text - using the VOSK-API.

Nerd Dictation

Offline Speech to Text for Desktop Linux.

This is a utility that provides simple access to speech to text on Linux without being tied to a desktop environment.

Simple
This is a single file Python script with minimal dependencies.
Hackable
User configuration lets you manipulate text using Python string operations.
Zero Overhead
As this relies on manual activation there are no background processes.

Dictation is accessed manually with begin/end commands.

This uses the excellent vosk-api.

Usage

It is suggested to bind begin/end/cancel to shortcut keys.

nerd-dictation begin
nerd-dictation end

For details on how this can be used, see: nerd-dictation --help and nerd-dictation begin --help.

Features

Specific features include:

Numbers as Digits

Optional conversion from numbers to digits.

So Three million five hundred and sixty second becomes 3,000,562nd.

A series of numbers (such as reciting a phone number) is also supported.

So Two four six eight becomes 2,468.
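As a rough illustration of the series case only, here is a minimal sketch. It is not nerd-dictation's actual implementation; the names `DIGIT_WORDS` and `digits_from_series` are hypothetical, and ordinals and compound numbers such as "three million" are not handled.

```python
# Hypothetical sketch, NOT the code nerd-dictation uses internally.
DIGIT_WORDS = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
}


def digits_from_series(text):
    """Replace each run of single-digit words with one formatted number."""
    out = []
    run = []

    def flush():
        # Join the collected digits and add thousands separators.
        # Note: int() drops leading zeros, so "zero one" becomes "1".
        if run:
            out.append(f"{int(''.join(run)):,}")
            run.clear()

    for word in text.split():
        digit = DIGIT_WORDS.get(word.lower())
        if digit is not None:
            run.append(digit)
        else:
            flush()
            out.append(word)
    flush()
    return " ".join(out)
```

With this sketch, `digits_from_series("Two four six eight")` returns `"2,468"`.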

Time Out
Optionally end speech to text early when no speech is detected for a given number of seconds (without the explicit call to end which is otherwise required).
Output Type
Output can simulate keystroke events (default) or simply print to the standard output.
User Configuration Script
User configuration is just a Python script which can be used to manipulate text using Python's full feature set.

See nerd-dictation begin --help for details on how to access these options.

Dependencies

  • Python 3.
  • The VOSK-API.
  • parec command (for recording from pulse-audio).
  • xdotool command to simulate keyboard input.
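Before installing, it can help to confirm these dependencies are present. The following is a minimal sketch using only the Python standard library; the function names are hypothetical and not part of nerd-dictation.

```python
import importlib.util
import shutil

# Command-line tools listed in the dependencies above.
REQUIRED_COMMANDS = ("parec", "xdotool")


def missing_commands(commands=REQUIRED_COMMANDS, which=shutil.which):
    """Return the commands that cannot be found on PATH."""
    return [name for name in commands if which(name) is None]


def vosk_installed():
    """Return True when the vosk Python module can be imported."""
    return importlib.util.find_spec("vosk") is not None
```

Calling `missing_commands()` returns an empty list when both tools are available.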

Install

pip3 install vosk
git clone https://github.com/ideasman42/nerd-dictation.git
cd nerd-dictation
wget https://alphacephei.com/kaldi/models/vosk-model-small-en-us-0.15.zip
unzip vosk-model-small-en-us-0.15.zip
mv vosk-model-small-en-us-0.15 model

To test dictation:

./nerd-dictation begin --vosk-model-dir=./model &
# Start speaking.
./nerd-dictation end
  • Reminder that it's up to you to bind begin/end/cancel to actions you can easily access (typically key shortcuts).

  • To avoid having to pass the --vosk-model-dir argument, copy the model to the default path:

    mkdir -p ~/.config/nerd-dictation
    mv ./model ~/.config/nerd-dictation

Hint

Once this is working properly you may wish to download one of the larger language models for more accurate dictation. They are listed on the VOSK models page.

Configuration

This is an example of a trivial configuration file which simply makes the input text uppercase.

# ~/.config/nerd-dictation/nerd-dictation.py
def nerd_dictation_process(text):
    return text.upper()

A more comprehensive configuration is included in the examples/ directory.
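As a slightly larger illustration, the sketch below replaces spoken punctuation names with symbols. The `PUNCTUATION` table and its phrases are assumptions chosen for this example; only the `nerd_dictation_process(text)` entry point comes from the configuration shown above.

```python
# ~/.config/nerd-dictation/nerd-dictation.py
import re

# Hypothetical spoken-name -> symbol table; adjust to taste.
PUNCTUATION = {
    "comma": ",",
    "full stop": ".",
    "question mark": "?",
}

# Match any spoken name as whole words, together with the whitespace
# before it, so the symbol attaches to the preceding word.
_PATTERN = re.compile(
    r"\s*\b(" + "|".join(re.escape(name) for name in PUNCTUATION) + r")\b"
)


def nerd_dictation_process(text):
    return _PATTERN.sub(lambda match: PUNCTUATION[match.group(1)], text)
```

For example, "hello comma how are you question mark" becomes "hello, how are you?".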

Hints

  • The processing function can be used to implement your own actions using keywords of your choice. Simply return a blank string if you have implemented your own text handling.
  • Context sensitive actions can be implemented using command line utilities to access the active window.
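Both hints can be combined in one configuration. The sketch below is an assumption-laden example, not part of nerd-dictation: the `KEY_ACTIONS` table, the `process` helper and the "terminal" title check are hypothetical, and it relies on the `xdotool getactivewindow getwindowname` and `xdotool key` commands being available.

```python
# ~/.config/nerd-dictation/nerd-dictation.py
import subprocess

# Hypothetical keyword -> xdotool key combination table.
KEY_ACTIONS = {
    "copy that": "ctrl+c",
    "paste that": "ctrl+v",
}


def active_window_title():
    """Return the focused window's title via xdotool, or "" on failure."""
    try:
        result = subprocess.run(
            ["xdotool", "getactivewindow", "getwindowname"],
            capture_output=True, text=True, check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return ""
    return result.stdout.strip()


def process(text, window_title, press_key):
    """Decide what to type, given the text and the active window title."""
    keys = KEY_ACTIONS.get(text.strip())
    if keys is not None:
        press_key(keys)
        return ""  # Handled here, so nothing is typed.
    if "terminal" in window_title.lower():
        return text  # Keep commands lower-case in a terminal.
    return text[:1].upper() + text[1:]


def nerd_dictation_process(text):
    return process(
        text,
        active_window_title(),
        lambda keys: subprocess.run(["xdotool", "key", keys], check=False),
    )
```

Keeping the decision logic in `process` makes it possible to test without a running X session.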

Paths

Local Configuration
~/.config/nerd-dictation/nerd-dictation.py
Language Model

~/.config/nerd-dictation/model

Note that --vosk-model-dir=PATH can be used to override the default.

Details

  • Typing in results will never press enter/return.
  • PulseAudio is used for recording.
  • Recording and speech to text are performed in parallel.

Examples

Store the result of speech to text as a variable in the shell:

SPEECH="$(nerd-dictation begin --timeout=1.0 --output=STDOUT)"
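The same capture can be done from Python. This is a minimal sketch that assumes `nerd-dictation` is on your PATH; the helper names `build_command` and `dictate` are hypothetical, while the `--timeout` and `--output=STDOUT` options are the ones used in the shell example above.

```python
import subprocess


def build_command(timeout=1.0, executable="nerd-dictation"):
    """Build the argument list matching the shell example."""
    return [executable, "begin", f"--timeout={timeout}", "--output=STDOUT"]


def dictate(timeout=1.0):
    """Run dictation until the timeout elapses and return the text."""
    result = subprocess.run(
        build_command(timeout), capture_output=True, text=True, check=True
    )
    return result.stdout.strip()
```

Calling `dictate()` blocks until no speech has been detected for the given number of seconds.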

Limitations

  • Text from VOSK is all lower-case. While the user configuration can be used to set the case of common words like "I", this isn't very convenient (see the example configuration for details).

  • For some users the delay in start up may be noticeable on systems with slower hard disks, especially when running for the first time (a cold start).

    This is a consequence of the choice not to use a service that runs in the background. Recording begins before any of the speech-to-text components are loaded to mitigate this problem.
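For the lower-case limitation above, a user configuration can restore the pronoun "I" with a regular expression. This is a minimal sketch rather than the bundled example configuration, and it will also capitalize a standalone "i" in abbreviations such as "i.e.".

```python
# ~/.config/nerd-dictation/nerd-dictation.py
import re

# A standalone "i", including contractions such as "i'm" and "i'll".
_PRONOUN_I = re.compile(r"\bi\b")


def nerd_dictation_process(text):
    return _PRONOUN_I.sub("I", text)
```

For example, "i think i'm late" becomes "I think I'm late", while words that merely contain the letter, such as "hi", are left alone.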

Further Work

  • Add a general solution to capitalize words (proper nouns for example).
  • Preview output while dictating.
  • Wayland support (this should be quite simple to support and mainly relies on a replacement for xdotool).
  • Add a setup.py for easy installation on users' systems.
  • Possibly other speech to text engines (only if they provide some significant benefits).
  • Possibly support Windows & macOS.
Owner
Campbell Barton