Evaluating Speech-to-Text Systems

How to Evaluate the Performance of a Speech-to-Text System

Catherine Breslin
3 min read · Mar 30, 2021
Photo by Adam Solomon on Unsplash

Speech-to-Text (STT) is the task of converting audio to words. An automatic STT system will often return not just its best guess as to the words spoken, but additional information like timestamps and confidence measures. Judging how well an STT system performs means measuring the mistakes it makes. Comparing the output of automatic transcription against a reference transcription reveals three kinds of mistake:

  1. They mistake (or substitute) one word for another
  2. They insert words into their transcript that weren’t spoken in the original audio
  3. They miss (or delete) words that were actually spoken in the audio

The most commonly used measure of STT performance is Word Error Rate (WER). It’s calculated by adding up the total number of words inserted (I), deleted (D) and substituted (S), and dividing by the number of words in the reference transcription (N):

WER = (S + D + I) / N

WER is usually computed using the Levenshtein distance to align the reference transcription with the automatic one. There’s no need to use any of the additional information that an STT system returns, like the timestamps or confidence measures.
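To make the alignment and counting concrete, here is a minimal Python sketch of WER computed via a Levenshtein alignment. It's an illustration rather than production code (widely used scoring tools such as NIST's sclite add many refinements); the wer function name and its return values are my own choices.

def wer(reference, hypothesis):
    """Count substitutions, deletions and insertions between two strings
    of space-separated words, and return them with the word error rate."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edit distance between the first i reference words
    # and the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        d[i][0] = i                                   # delete every reference word
    for j in range(1, len(hyp) + 1):
        d[0][j] = j                                   # insert every hypothesis word
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + sub,      # correct word or substitution
                          d[i - 1][j] + 1,            # deletion
                          d[i][j - 1] + 1)            # insertion
    # Trace back through the table to count each error type
    i, j = len(ref), len(hyp)
    subs = dels = ins = 0
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            subs += ref[i - 1] != hyp[j - 1]
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            dels += 1
            i -= 1
        else:
            ins += 1
            j -= 1
    return subs, dels, ins, (subs + dels + ins) / len(ref)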

Suppose someone says “It’s nice and sunny at the sandy beach today” and the STT system recognises this audio as “It’s bright and sunny at sandy beach hut today yes”. The alignment of these two word sequences looks like:

Reference: It's nice   and sunny at the sandy beach     today
STT:       It's bright and sunny at     sandy beach hut today yes
           C    S      C   C     C  D   C     C     I   C     I

The bottom row shows a label for each word: correct (C), inserted (I), substituted (S) or deleted (D). In this example there is 1 substitution, 1 deletion and 2 insertions, and 9 words in the reference transcription. Therefore, the WER of this example utterance is (1 + 1 + 2) / 9 ≈ 44%.
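Running the sketch above on this example (after lowercasing both sides, so capitalisation doesn't count as an error) reproduces the same result:

subs, dels, ins, rate = wer(
    "it's nice and sunny at the sandy beach today",
    "it's bright and sunny at sandy beach hut today yes",
)
print(subs + dels + ins, round(rate, 3))   # 4 errors over 9 reference words: WER ≈ 0.444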

Often, text is not written exactly as it’s spoken. Suppose someone says “It’s July 1st 2020” and the automatic STT returns “It’s July first twenty twenty”. The two are clearly different written forms of the same spoken words. Aligning them gives:

Reference: It's July 1st   2020
STT:       It's July first twenty twenty
           C    C    S     S      I

A naive application of WER scores this example at a whopping 75% (2 substitutions and 1 insertion against only 4 reference words), although there’s really nothing wrong with the STT transcription! The issue is that the reference is in written form while the STT output is in spoken form, so the naive WER calculation is artificially inflated. Usually, handcrafted text transformation rules are applied to convert between written and spoken forms before scoring.
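A hypothetical sketch of such rules, reusing the wer function above. The mapping below is invented for this one example; real systems use far larger rule sets covering numbers, dates, currencies, abbreviations and so on.

# Invented written-to-spoken rules for this example only
SPOKEN_FORMS = {"1st": "first", "2020": "twenty twenty"}

def normalise(text):
    """Lowercase the text and rewrite written forms as spoken forms before scoring."""
    words = []
    for w in text.lower().split():
        words.extend(SPOKEN_FORMS.get(w, w).split())
    return " ".join(words)

subs, dels, ins, rate = wer(normalise("It's July 1st 2020"),
                            normalise("It's July first twenty twenty"))
print(rate)   # 0.0 once both sides are in spoken form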

To estimate the overall WER of an STT system, we look at performance over a large corpus of audio rather than at individual utterances. Typically, the error counts are pooled over many minutes or hours of audio and divided by the total number of reference words, giving a far more reliable measure of performance than any single utterance can.
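As a sketch, corpus-level WER can be computed by pooling the error counts, again reusing the wer function above (the corpus_wer name and list-of-pairs input are just illustrative):

def corpus_wer(pairs):
    """Pool errors over (reference, hypothesis) pairs so that longer
    utterances contribute proportionally more to the overall WER."""
    total_errors = total_words = 0
    for reference, hypothesis in pairs:
        subs, dels, ins, _ = wer(reference, hypothesis)
        total_errors += subs + dels + ins
        total_words += len(reference.split())
    return total_errors / total_words

Note that this pooled figure is not the same as averaging each utterance’s individual WER, which would weight a two-word utterance as heavily as a two-minute one.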

I work as a consultant in voice, language & AI technology, and you can hire me! If your organisation uses voice technology, AI & machine learning, please get in touch.

