automatic speech recognition (ASR) the real-time processing and analysis of live speech through its various stages of decomposition into phonemes and on to lexical and semantic analysis

voiceprint a set of measurable characteristics of a human voice that uniquely identifies an individual. These characteristics, which are based on the physical configuration of a speaker's mouth and throat, can be expressed as a mathematical formula. The term applies to a vocal sample recorded for that purpose, the derived mathematical formula, and its graphical representation. Mainly used in voice ID systems for user authentication.
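The verification step can be sketched as a similarity test between a stored feature vector and a fresh sample. Everything below is hypothetical (the 4-dimensional vectors, the 0.8 threshold, the function names); real systems derive far higher-dimensional features from the recorded sample itself:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors, in [-1, 1]."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def verify(enrolled, sample, threshold=0.8):
    """Accept the sample only if it is close enough to the enrolled voiceprint."""
    return cosine_similarity(enrolled, sample) >= threshold

# Toy 4-dimensional "voiceprints" (real systems use far more dimensions)
enrolled = [0.9, 0.1, 0.4, 0.2]
genuine = [0.85, 0.15, 0.38, 0.22]   # same speaker, slightly varied features
impostor = [0.1, 0.9, 0.1, 0.8]      # different speaker
```

Here `verify(enrolled, genuine)` accepts and `verify(enrolled, impostor)` rejects; the threshold trades off false accepts against false rejects.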

interactive voice response (IVR) / voice response unit (VRU) a technological system for human interaction with a computer via voice and via DTMF tones entered on a keypad. Used most widely in automated purchasing and banking systems, clinical trials in medical and pharmaceutical research, and televoting on programmes like Big Brother. Contrast with an automated attendant, a system whose role is to route calls.

phoneme (phone) one unit of spoken sound, described as being part of a set or equivalence class with other similarly articulated sounds in a given language. Phonemic notation should be considered an abstract representational system for segmenting speech utterances, informed by both written representation and sonic information. The sounds of actual speech in lived practice make up a corresponding phonetic realization: the surface form or actualization of the phoneme's potential. Phonemic taxonomies are by no means universally agreed upon, even within a single language.

allophones different sounds in lived speech that are realizations of the same phoneme

diphone the stretch of speech spanning parts of two consecutive phones, useful in speech recognition contexts because transitions between phones are more informative than stable regions in word identification

triphone another useful analysis concept: a phone taken as three sub-phonetic parts, a first part that depends on the preceding phone, a stable middle part, and a last part that depends on the subsequent phone.

free variation the phenomenon of two or more sounds appearing in the same phonetic environment without changing the meaning of the utterance

orthographic depth describes the degree to which a specific alphabetic orthographic system deviates from a direct one-to-one mapping of letters to phonemes. Shallow (or transparent) orthographies, often called phonemic orthographies, include languages such as Spanish, Italian and Finnish. A very interesting theoretical subject; for more information see the Wikipedia entry on orthographic depth.


behavioral biometric / behaviometric a biometric that is based on a learned/behavioral trait of an individual, e.g. speech patterns, signatures and keystrokes. Contrast with physical biometrics such as fingerprints and retinal scans.


 {function: definition}

Fluxus on Speech

from Old French flus "a flowing, a rolling; a bleeding" from Latin fluxus (adj.) "flowing, loose, slack"

Originally "excessive flow" (of blood or excrement), it also was an early name for "dysentery;" sense of "continuous succession of changes" is first recorded 1620s.

Speech is a continuous stream where rather stable states mix with dynamically changed states. 

{kind: quote}

Speech Analysis Sketchpad

Enghouse Interactive’s Real-Time Speech Analytics is the first software solution offering fully automated quality assurance and call optimization for every call. Innovative speech analysis technology allows organizations to monitor and improve conversations in real time, as well as evaluate call recordings.


The solution analyzes agent and customer speech to provide live feedback to agents, team leaders and quality assurance teams about what is being said and how it is being said. It monitors stress levels, speech clarity and script adherence, all whilst the call is in progress.


"Your Voiceprint Will be Your Key", Speech Technology (1998)

"A History of Voiceprint", Language Log, 2018

"Voice Analysis Should be Used with Caution in Court", Scientific American, 2017

Legal History of Voice Forensics in the United States

Survey of audio forensic history, Sound on Sound




Simple phonemic classification algorithm, Wikimedia Commons

In his essay, "The Storyteller", Walter Benjamin reflects on the social uses of memory. He contrasts storytelling, a communicative form relying solely on memory, to information, or, the "communications of modernity". He describes memory as having a mediating function between different generations and individual experiences.


glottal stop



C / Java / Python - BSD License


designed for low-resource platforms; the focus is on application development rather than research. Supports US/UK English, French, Mandarin, German, Dutch and Russian, with training utilities for building new models.


Offers two libraries:

* pocketsphinx (C, portable and embeddable) - depends on another library Sphinxbase.

   * pocketsphinx-python bindings https://github.com/bambocher/pocketsphinx-python (a bit clunky - but works)


* Sphinx4 (Pure Java, more flexible)




uses MFCC (mel-frequency cepstral coefficient) features with noise tracking and spectral subtraction for noise reduction. Phones are detected using 4000+ distinct short sound detectors called 'senones', which can have a complex relationship to context beyond simple triphone analysis (e.g. a senone's context can be a complex function defined by a decision tree or other method).


uses Hidden Markov Models for discrete time-series analysis at various levels:

1. subword units (from phones)

2. words (from subword units) - words are important because they greatly restrict allowable combinations of phones - plus fillers, non-linguistic sounds such as breaths, ums and coughs

3. utterances (from words and fillers) - separate chunks of audio between pauses. They do not necessarily match sentences, which are a semantic concept.


models used in recognition:

* acoustic models, containing acoustic properties of senones. both context-dependent and context-independent

* a phonetic dictionary for mapping words to phones

* a language model used to restrict the word search, defining which words could potentially follow previously recognized words. The most common language models are n-gram language models (statistics of word sequences) and finite state language models (weighted finite-state automata).
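The n-gram idea can be sketched with a toy bigram model. The three-utterance corpus below is invented, and real language models smooth their counts to handle unseen word pairs:

```python
from collections import defaultdict

# Invented toy corpus of transcribed utterances
corpus = [
    "call my office".split(),
    "call my bank".split(),
    "check my balance".split(),
]

# Count word pairs and the words they start from
bigram_counts = defaultdict(int)
unigram_counts = defaultdict(int)
for sentence in corpus:
    for w1, w2 in zip(sentence, sentence[1:]):
        bigram_counts[(w1, w2)] += 1
        unigram_counts[w1] += 1

def bigram_prob(w1, w2):
    """P(w2 | w1) from raw counts (no smoothing, illustration only)."""
    if not unigram_counts[w1]:
        return 0.0
    return bigram_counts[(w1, w2)] / unigram_counts[w1]
```

During decoding such probabilities weight the word search: after "call", only "my" has any probability mass, so competing word hypotheses are pruned away.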


common recognition process:

1. split the waveform into utterances (at silences)

2. divide speech into frames (typically every 10 ms), extracting a 39-dimensional feature vector from each frame.
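The framing and feature-extraction step can be sketched in pure NumPy. This is a deliberately simplified stand-in (it omits pre-emphasis and the mel filterbank, and uses a synthetic tone instead of speech), but it shows how 10 ms steps over the waveform become 39-dimensional vectors: 13 cepstral coefficients plus first and second differences:

```python
import numpy as np

def frames(signal, rate=16000, frame_ms=25, step_ms=10):
    """Slice a waveform into overlapping frames (25 ms window every 10 ms)."""
    frame_len = int(rate * frame_ms / 1000)
    step = int(rate * step_ms / 1000)
    n = 1 + (len(signal) - frame_len) // step
    return np.stack([signal[i * step:i * step + frame_len] for i in range(n)])

def cepstra(signal, rate=16000, n_ceps=13):
    """Very simplified cepstral features: window -> log power spectrum -> DCT.
    (Real MFCCs apply a mel filterbank before the log; this sketch omits it.)"""
    f = frames(signal, rate)
    f = f * np.hamming(f.shape[1])
    logspec = np.log(np.abs(np.fft.rfft(f, axis=1)) ** 2 + 1e-10)
    # DCT-II basis over the log spectrum; keep the first n_ceps coefficients
    n_bins = logspec.shape[1]
    k = (np.arange(n_bins) + 0.5)[None, :] * np.arange(n_ceps)[:, None]
    return logspec @ np.cos(np.pi * k / n_bins).T

def add_deltas(ceps):
    """Stack features with first and second differences: 13 * 3 = 39 dims."""
    d1 = np.diff(ceps, axis=0, prepend=ceps[:1])
    d2 = np.diff(d1, axis=0, prepend=d1[:1])
    return np.hstack([ceps, d1, d2])

rate = 16000
t = np.arange(rate) / rate                  # one second of "audio"
wave = np.sin(2 * np.pi * 220 * t)          # a 220 Hz tone standing in for speech
feats = add_deltas(cepstra(wave, rate))     # one 39-dim vector per frame
```

For production feature extraction, libraries such as python_speech_features or librosa provide full MFCC implementations.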


senone detection matches frames against a hidden Markov model whose (typically three) states are each modelled by a Gaussian mixture.
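A minimal sketch of that matching step: a three-state left-to-right HMM with a single Gaussian per state (a simplification of the Gaussian mixtures real senones use) and Viterbi decoding over invented one-dimensional features:

```python
import numpy as np

def log_gauss(x, mean, var):
    """Log-likelihood of frames x under a diagonal Gaussian."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var, axis=-1)

def viterbi(obs, means, variances, log_trans):
    """Best state path through a left-to-right HMM, one Gaussian per state."""
    n_frames, n_states = len(obs), len(means)
    emit = np.stack([log_gauss(obs, means[s], variances[s])
                     for s in range(n_states)], axis=1)
    delta = np.full((n_frames, n_states), -np.inf)
    back = np.zeros((n_frames, n_states), dtype=int)
    delta[0, 0] = emit[0, 0]                    # paths must start in state 0
    for t in range(1, n_frames):
        for s in range(n_states):
            scores = delta[t - 1] + log_trans[:, s]
            back[t, s] = int(np.argmax(scores))
            delta[t, s] = scores[back[t, s]] + emit[t, s]
    path = [int(np.argmax(delta[-1]))]
    for t in range(n_frames - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Three states drifting from feature value 0 to 2, as a stand-in for the
# beginning / middle / end sub-phonetic parts of a phone
means = np.array([[0.0], [1.0], [2.0]])
variances = np.full((3, 1), 0.1)
trans = np.array([[0.6, 0.4, 0.0],   # left-to-right: no jumping back
                  [0.0, 0.6, 0.4],
                  [0.0, 0.0, 1.0]])
with np.errstate(divide="ignore"):
    log_trans = np.log(trans)
obs = np.array([[0.0], [0.1], [1.0], [0.9], [2.1], [2.0]])
path = viterbi(obs, means, variances, log_trans)   # -> [0, 0, 1, 1, 2, 2]
```

The decoded path passes through the three states in order, segmenting the frames into the beginning, middle and end of the modelled phone.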




the use of 12 or 13 MFCC coefficients seems to be due to historical reasons in many of the reported cases. The choice of the number of MFCCs to include in an ASR system is largely empirical. To understand why any specific number of cepstral coefficients is used, you could do worse than look at very early (pre-HMM) papers. When using DTW with Euclidean or even Mahalanobis distances, it quickly became apparent that the very high cepstral coefficients were not helpful for recognition, and to a lesser extent, neither were the very low ones. The most common solution was to "lifter" the MFCCs - i.e. apply a weighting function to them to emphasise the mid-range coefficients. These liftering functions were "optimised" by a number of researchers, but they almost always ended up being close to zero by the time you got to the 12th coefficient.
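One widely used liftering function is the HTK-style sinusoidal lifter, which leaves c0 untouched and gives its largest boost to the mid-range coefficients; a small sketch:

```python
import numpy as np

def sinusoidal_lifter(n_ceps=13, L=22):
    """HTK-style cepstral lifter: w[n] = 1 + (L/2) * sin(pi * n / L)."""
    n = np.arange(n_ceps)
    return 1.0 + (L / 2.0) * np.sin(np.pi * n / L)

weights = sinusoidal_lifter()
# weights[0] is 1.0; the peak emphasis of 1 + L/2 = 12.0 falls near n = L/2
liftered = weights * np.ones(13)    # apply to a dummy cepstral vector
```

With the default L = 22 the weighting rises smoothly from the low coefficients to a mid-range peak, matching the empirical finding quoted above that the extremes of the cepstrum contribute little to recognition.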

In practice, the optimal number of coefficients depends on the quantity of training data, the details of the training algorithm (in particular how well the PDFs can be modelled as the dimensionality of the feature space increases), the number of Gaussian mixtures in the HMMs, the speaker and background noise characteristics, and sometimes the available computing resources.

In semi-continuous models CMUSphinx uses a specific packing of derivatives to optimize vector quantization and thus compress the model better. Over the years various features have been used, mostly selected by experiment.

Spectral subtraction of noise is one thing that distinguishes the CMUSphinx MFCC from other popular MFCC implementations; it is a simple extension that provides robustness to noise by tracking and subtracting the stable noise component in the mel filter energy domain.
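A minimal sketch of the idea on an invented toy input. The noise estimate here is a simple average over leading frames assumed to be speech-free, rather than the adaptive tracking CMUSphinx performs:

```python
import numpy as np

def spectral_subtract(energies, noise_frames=10, floor=0.01):
    """Subtract an estimated stationary noise spectrum from per-frame
    filterbank energies, flooring the result to avoid negative energies.

    energies: (n_frames, n_bands) array of mel filterbank energies."""
    noise = energies[:noise_frames].mean(axis=0)
    return np.maximum(energies - noise, floor * energies)

# Toy input: constant noise in 5 mel bands, with "speech" in later frames
e = np.full((30, 5), 2.0)
e[15:] += 10.0
clean = spectral_subtract(e)
# noise-only frames collapse to the floor; speech frames keep their energy
```

The spectral floor prevents the subtraction from producing zero or negative energies, which would break the subsequent log step of the MFCC pipeline.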


 {function: survey}



High level phonetic research software.

Includes command line tools and scripting environment.



 {function: survey}



Hidden Markov models work only if you know not just the phoneme context probabilities, but the word context probabilities. The list of viable three-word combinations is very very long. With neural networks, the system will assign a probability to something in context without having to know every possible context. It does it by trial and error. The others know the probability, because they have a list of all the hits and misses and divide one by the other. Now, almost all speech recognition systems use neural nets in some way.

~ Matthew Karas

from Deep Learning with Dr Tony Robinson (2017)

 {kind: quote}

Contour spectrogram from a 1962 study by Lawrence Kersta.
Nature, 1962, and Police Law Quarterly, 1974

The prize for developing a successful speech recognition technology is enormous. Speech is the quickest and most efficient way for humans to communicate. Speech recognition has the potential of replacing writing, typing, keyboard entry, and the electronic control provided by switches and knobs. It just needs to work a little better to become accepted by the commercial marketplace. Progress in speech recognition will likely come from the areas of artificial intelligence and neural networks as much as through DSP itself. Don't think of this as a technical difficulty; think of it as a technical opportunity.

~ Steven W. Smith, The Scientist and Engineer's Guide to Digital Signal Processing

The theory [using RNNs in speech recognition] goes back to a lot of IBM work way before the 1980s. I could see the potential in the very late 1980s and early 1990s of neural nets and speech recognition. Recurrent neural nets had the wonderful advantage that they feed back on themselves. So much of how you say the next bit depends on how you said anything else. For example, I can tell you're a male speaker just from the sound of your voice. Hidden Markov models have this really weird assumption in them that all of that history didn't matter. The next sample could come from a male speaker or a female speaker, they lost all that consistency. It was the very first time that continuous recognition was done. We used some DSP chip we had lying around.

~ Tony Robinson

from Deep Learning with Dr Tony Robinson (2017)

 {kind: quote}


Memory and Forgetting...

Survey of Speech Analysis Toolkits


Free / Open Source / Offline

CMUSphinx https://cmusphinx.github.io/wiki/

OpenEars https://www.politepix.com/openears/

(mobile offline speech recognition, built using CMUSphinx)

Kaldi http://kaldi-asr.org/doc/

Julius https://julius.osdn.jp/en_index.php


Eesen https://github.com/srvk/eesen

End-to-End ASR using deep neural networks (RNNs) built upon Tensorflow.


Wolfram Deep Speech 2, trained RNN model using Wolfram Language



Tensorflow implementation of Deep Speech by Mozilla, with Python and Javascript bindings





VoxForge, free/open acoustic models and transcriptions



HTK, Hidden Markov Model Toolkit (command line tools)

Technically proprietary (the license is owned by Microsoft), but the source is available and open to modification.



RASR, Aachen University Speech Recognition System

C++ libraries, source provided



CSLU Toolkit, Windows only, no longer actively supported



UIUC (University of Illinois Urbana-Champaign) Statistical Speech Technology GitHub Repository




CNTK, Microsoft Cognitive Toolkit, (Python, Windows/Linux)

General purpose statistical graph-based toolkit.


Tensorflow, (Python, C, Java, Javascript) - Mac/Windows/Linux (no GPU acceleration for Mac). General purpose data-flow graph computation framework.


GMTK, Graphical Models Toolkit. Mac/Linux only. Another general purpose graphical model computing framework. Developed at the University of Washington and supported by MARCO, DARPA and the NSF.


Tensorflow/Wavenet Speech-to-Text Implementation, Python, github repository


LibriSpeech, free/open ASR corpus



CSTR VCTK Corpus, English multispeaker corpus for CSTR voice cloning toolkit.








Python, Raspberry Pi front-end & toolkit for CMUSphinx


" AIY" DIY / Embedded Speech Recognition Kit

Raspberry Pi, uses Google Assistant




Free / Open Source / Cloud-based


CloudASR, built on top of Kaldi



Voice Command / Voice Coding


Nuance, Dragon

Leading proprietary speech recognition & command system






Proprietary / Cloud-based


Speech to Text


DialogFlow https://dialogflow.com/

Vocapia https://www.vocapia.com/speech-to-text-api.html

Google Speech-to-Text https://cloud.google.com/speech-to-text/

IBM Watson API https://www.ibm.com/watson/services/natural-language-understanding/

Microsoft Cognitive Services https://azure.microsoft.com/en-us/services/cognitive-services/speech/





Voice Biometrics Group http://www.voicebiogroup.com/are-you-a-developer.html



  {function: survey}

IBM Shoebox (1961)

A single realisation of three-dimensional Brownian motion for times 0 ≤ t ≤ 2. Brownian motion has the Markov property of "memorylessness", as the displacement of the particle does not depend on its past displacements.

 {function: caption}

The Voder (1939)

Vocoder-based synthetic speech instrument developed by Homer Dudley at Bell Labs

Automatic Writing, Robert Ashley 1979

(from Wikipedia)

"Automatic Writing" is a piece that took five years to complete and was released by Lovely Music Ltd. in 1979. Ashley used his own involuntary speech that results from his mild form of Tourette's Syndrome as one of the voices in the music. This was obviously considered a very different way of composing and producing music. Ashley stated that he wondered since Tourette's Syndrome had to do with "sound-making and because the manifestation of the syndrome seemed so much like a primitive form of composing whether the syndrome was connected in some way to his obvious tendencies as a composer".[25]

Ashley was intrigued by his involuntary speech, and the idea of composing music that was unconscious. Seeing that the speech that resulted from having Tourette's could not be controlled, it was a different aspect from producing music that is deliberate and conscious, and music that is performed is considered "doubly deliberate" according to Ashley.[25] Although there seemed to be a connection between the involuntary speech, and music, the connection was different due to it being unconscious versus conscious.

Ashley's first attempts at recording his involuntary speech were not successful, because he found that he ended up performing the speech instead of it being natural and unconscious. "The performances were largely imitations of involuntary speech with only a few moments here and there of loss of control".[25] However, he was later able to set up a recording studio at Mills College one summer when the campus was mostly deserted, and record 48 minutes of involuntary speech. This was the first of four "characters" that Ashley had envisioned of telling a story in what he viewed as an opera. The other three characters were a French voice translation of the speech, Moog synthesizer articulations, and background organ harmonies. "The piece was Ashley's first extended attempt to find a new form of musical storytelling using the English language. It was opera in the Robert Ashley way".[1]


some afterward listening...


Harpy Speech Recognition System (1976)

Result of the DARPA-funded 1971 "Speech Understanding Research" programme; participants included IBM, Carnegie Mellon University and Stanford Research Institute.


Survey of Speech-to-Text & Voice Analysis Software




Praat, tool for phonetic research




 SPEAR, component sinusoidal analysis & synthesis tool



Sonic Visualiser, extensible audio analysis software



Wavesurfer, sound file analysis & transcription software, seems not to be actively supported



Speech Filing System (Windows only) https://www.phon.ucl.ac.uk/resource/sfs/


T32, speech analysis tool superseding CSpeech. Windows only and appears not to be actively maintained.



 FreeSR, no longer supported



Browser-Based Voice Tools


SpeechTexter (dictation tool, chrome) https://www.speechtexter.com/

SpeechNotes (dictation tool, chrome) https://speechnotes.co/ 

Voice Notebook (voice command tool, chrome) https://voicenotebook.com/


Descript, (Browser/Windows) - dictation-based transcription tool and word processor. https://www.descript.com/



Voice Biometrics, Forensics & Medical Analysis


Easy Voice Biometrics

prohibitively expensive




Proprietary software packages for voice recognition, forensics and analysis.



...looks prohibitively expensive...


Voiceprint Software, Estill Voice International

Clinical software for voice training, recovery and analysis.




Proprietary / Mobile APIs


Cortana (Microsoft)

Siri (Apple)

Alexa (Amazon)


{function: survey}

NLP / Text Parsing

Interesting Text Processing Tools and Libraries


English Language Hate Speech Analysis


Gartner's Hype Cycle graph for 2009, showing speech recognition as approaching the "plateau of productivity"

meta: true
author: JR
kind: reference

function: survey
origin: RC

keywords: [speech to text, voice analysis, speech, speech analysis, phoneme]


Other Interesting Bits and Bobs


Real-time analysis, in SuperCollider, of spectral features of electroglottographic signals. DENNIS JOHANSSON (2016)

Automatic Recognition of Lyrics in Singing, Annamaria Mesaros and Tuomas Virtanen, EURASIP Journal on Audio, Speech, and Music Processing, 2010



Timbral Hauntings, Michael Musick and Tae Hong Park, joint 40th International Computer Music Conference (ICMC) and 11th Sound and Music Computing (SMC) conference, 2014


A Comparison of Sequence-to-Sequence Models for Speech Recognition, Rohit Prabhavalkar et al., Google & NVIDIA, Interspeech, 2017


Exploring Neural Transducers for End-to-End Speech Recognition Eric Battenberg, Jitong Chen, Rewon Child, Adam Coates, Yashesh Gaur, Yi Li, Hairong Liu, Sanjeev Satheesh, David Seetapun, Anuroop Sriram, Zhenyao Zhu. 2017


Speech Recognition with Deep Recurrent Neural Networks Alex Graves, Abdel-rahman Mohamed, Geoffrey Hinton. 2013


Assessing Chronic Stress, Coping Skills, and Mood Disorders through Speech Analysis: A Self-Assessment ‘Voice App’ for Laptops, Tablets, and Smartphones (2016)

OPTIMI, University of Zurich


Vocal Features of Song and Speech: Insights from Schoenberg's Pierrot Lunaire Julia Merrill and Pauline Larrouy-Maestri, frontiers in Psychology, 2017


Can we steal your vocal identity from the internet?: Initial investigation of cloning Obama's voice using GAN, WaveNet and low-quality found data. (2018)


Connectionist Temporal Classification (CTC) decoding algorithms (Python & OpenCL) github repository


Cochlea, (Python) inner-ear biophysical model and simulation of auditory nerve activity. github repository


Learning Resources


Speech Tools Course (2009) University of Illinois


Lecture Notes in Speech Production, Speech Coding, and Speech Recognition. (2000) Mark Hasegawa-Johnson, University of Illinois.


Feature Extraction Techniques for Speech Recognition: A Review (2015). Kishori R. Ghule, R.R. Deshmukh


{function: survey}

Different stages of cepstrum analysis of a pulse train