For anyone working with speech recognition technology, understanding Word Error Rate (WER) is crucial. WER is a common metric used to evaluate the performance of speech recognition systems, including AI transcription services.
It measures how far an AI-generated transcript diverges from a human-made transcript, also known as the “ground truth” or reference, expressed as a proportion of the words in that reference.
In simple terms, WER tells you how many errors a system makes when converting spoken words to text. It counts three types of errors:
- Substitutions: When a word is recognized incorrectly (e.g., “sea” instead of “see”).
- Insertions: When the system adds extra words that weren’t spoken (e.g., “brown fox” becomes “the brown a fox”).
- Deletions: When the system misses words entirely (e.g., “quick brown fox” becomes “quick fox”).
Here’s the formula:
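WER = (Substitutions + Insertions + Deletions) / Number of Words in the Reference
A lower WER indicates better performance. For instance, a WER of 10% means the number of errors equals 10% of the words in the reference transcript. Because insertions count as errors, WER can exceed 100% when the system outputs many extra words.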
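To make the formula concrete, here is a minimal sketch in Python that computes WER as a word-level edit distance divided by the reference length. The function name `word_error_rate`, the lowercasing, and the whitespace tokenization are assumptions made for illustration; real evaluation pipelines usually apply more careful text normalization (punctuation, numbers, contractions) before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    # Assumption: simple lowercasing and whitespace splitting as normalization.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the minimum number of edits needed to turn the
    # reference words processed so far into the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from output
                curr[j - 1] + 1,     # insertion: extra word in output
                prev[j - 1] + cost,  # match (cost 0) or substitution (cost 1)
            )
        prev = curr
    return prev[-1] / len(ref)


reference = "the quick brown fox"
print(word_error_rate(reference, "the quick fox"))          # one deletion     -> 0.25
print(word_error_rate(reference, "the quick brown a fox"))  # one insertion    -> 0.25
print(word_error_rate(reference, "the quick brown box"))    # one substitution -> 0.25
```

Each hypothesis contains exactly one error against a four-word reference, so all three score 0.25 (25%). The minimum edit distance is used because it gives the smallest total of substitutions, insertions, and deletions that explains the difference between the two word sequences.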
What counts as an acceptable WER varies with the specific requirements and the nature of the audio and text. However, here are some general guidelines:
- A WER of 5-10% is generally considered good quality and ready to use.
- A WER of around 20% is acceptable, but you might want to consider additional training.
- A WER of 30% or more signals poor quality and calls for customization and training.
WER serves a valuable purpose:
- Comparing Systems: It enables a fair comparison of the performance between different speech recognition or machine translation engines.
- Tracking Improvements: Developers can monitor WER over time to assess how their system’s accuracy is progressing.
However, WER has limitations. It doesn’t account for:
- Severity of Errors: Not all errors are created equal. A misspelling might be less critical than a complete omission.
- Semantic Accuracy: WER doesn’t ensure the output text conveys the intended meaning.
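The limitations above can be illustrated with a short sketch. The sentences below are invented examples, and the helper `wer` is a hypothetical, compact word-level edit distance written only for this demonstration (it assumes a non-empty reference and short inputs). One hypothesis has a harmless plural error, the other drops “not” and reverses the meaning, yet both receive the same score.

```python
from functools import lru_cache


def wer(reference: str, hypothesis: str) -> float:
    """Compact word-level WER for short sentences (assumes a non-empty reference)."""
    ref, hyp = tuple(reference.split()), tuple(hypothesis.split())

    @lru_cache(maxsize=None)
    def dist(i: int, j: int) -> int:
        # Minimum edits needed to align ref[i:] with hyp[j:].
        if i == len(ref):
            return len(hyp) - j          # remaining hypothesis words are insertions
        if j == len(hyp):
            return len(ref) - i          # remaining reference words are deletions
        if ref[i] == hyp[j]:
            return dist(i + 1, j + 1)    # words match, no edit needed
        return 1 + min(
            dist(i + 1, j),              # deletion
            dist(i, j + 1),              # insertion
            dist(i + 1, j + 1),          # substitution
        )

    return dist(0, 0) / len(ref)


reference = "do not open the valve"
minor_error = "do not open the valves"   # one substitution, meaning preserved
severe_error = "do open the valve"       # one deletion, meaning reversed

print(wer(reference, minor_error))   # 0.2
print(wer(reference, severe_error))  # 0.2
```

Both transcripts score 20%, even though only the second one would mislead a reader, which is why WER is often paired with human review or task-specific checks.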