Word error rate (WER): how transcription accuracy is measured
Understand word error rate (WER), the standard metric for measuring transcription accuracy. Learn how WER is calculated and what makes a good score.
Word error rate (WER) is the standard metric for measuring the accuracy of automatic speech recognition (ASR) systems -- it calculates the percentage of words in a transcript that differ from a verified reference transcript through substitutions, deletions, and insertions.
Whether you are evaluating transcription services, benchmarking ASR models, or trying to understand what "95% accuracy" actually means in practice, WER is the number that matters. This guide explains how WER works, what constitutes a good score, and why the metric has both strengths and important limitations.
What is word error rate?
Word error rate measures how many words a transcription system got wrong compared to a ground-truth reference transcript. It is expressed as a percentage, where lower values indicate better accuracy: a WER of 5% means the system made errors on 5 out of every 100 words.
The formula for WER is:
WER = (S + D + I) / N x 100%
Where:
- S (Substitutions): Words that were replaced with a different word. The reference says "cat" but the transcript says "cap."
- D (Deletions): Words present in the reference that are missing from the transcript. A word was spoken but not transcribed.
- I (Insertions): Words in the transcript that do not appear in the reference. The system added a word that was never spoken.
- N: The total number of words in the reference transcript.
A WER of 0% means the transcript perfectly matches the reference. A WER of 100% means the number of errors equals the total number of reference words. WER can actually exceed 100% if the system inserts more words than the reference contains, though this is uncommon with modern systems.
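The formula is simple enough to sketch directly in a few lines of Python (a hypothetical helper for illustration; `word_error_rate` is not a standard library function):

```python
def word_error_rate(substitutions, deletions, insertions, reference_length):
    """WER = (S + D + I) / N, returned as a fraction (multiply by 100 for a percentage)."""
    if reference_length == 0:
        raise ValueError("the reference must contain at least one word")
    return (substitutions + deletions + insertions) / reference_length

# 5 errors spread over a 100-word reference -> 5% WER
print(word_error_rate(substitutions=3, deletions=1, insertions=1, reference_length=100))  # 0.05

# Insertions alone can push WER past 100%
print(word_error_rate(substitutions=0, deletions=0, insertions=15, reference_length=10))  # 1.5
```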
Why these three error types matter
Each error type reflects a different failure mode in speech recognition:
- Substitutions are the most common error type. They happen when the acoustic model confuses similar-sounding words ("their" vs. "there"), when the language model picks a statistically likely but incorrect word, or when accents and dialects cause misrecognition.
- Deletions occur when the system misses words entirely. This is common with filler words ("um," "uh"), rapid speech, overlapping speakers, or low-volume passages.
- Insertions happen when the system hallucinates words that were not spoken. Background noise, echo, or music can trigger false word detections.
Understanding the breakdown of S, D, and I errors is often more useful than the aggregate WER number alone, because it reveals where the system is failing and what might be done to improve results.
How WER is calculated
WER calculation relies on dynamic programming to find the minimum edit distance between the reference transcript and the hypothesis (system output). This is the same algorithm used for string edit distance (Levenshtein distance), applied at the word level.
Here is a step-by-step example.
Reference transcript (what was actually said):
The quick brown fox jumps over the lazy dog
Hypothesis transcript (what the system produced):
The quik brown fox jump over a lazy dock
Step 1: Align the transcripts word by word.
| Reference | The | quick | brown | fox | jumps | over | the | lazy | dog |
|---|---|---|---|---|---|---|---|---|---|
| Hypothesis | The | quik | brown | fox | jump | over | a | lazy | dock |
| Error type | -- | S | -- | -- | S | -- | S | -- | S |
Step 2: Count each error type.
- Substitutions (S): 4 ("quick" -> "quik", "jumps" -> "jump", "the" -> "a", "dog" -> "dock")
- Deletions (D): 0 (no words were omitted)
- Insertions (I): 0 (no extra words were added)
Step 3: Apply the formula.
WER = (4 + 0 + 0) / 9 x 100% = 44.4%
The total number of words in the reference (N) is 9. With 4 substitution errors, the WER is 44.4%.
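The alignment above can be computed with the same dynamic program used for Levenshtein distance, tracking the three edit types separately. A minimal sketch (the function name and the tie-breaking preference for substitutions are illustrative choices, not a standard):

```python
def count_edits(reference: str, hypothesis: str):
    """Word-level minimum-edit alignment; returns (S, D, I) counts."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # dp[i][j] = (S, D, I) needed to turn the first i reference words
    # into the first j hypothesis words
    dp = [[None] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    dp[0][0] = (0, 0, 0)
    for i in range(1, len(ref) + 1):
        dp[i][0] = (0, i, 0)  # delete every reference word
    for j in range(1, len(hyp) + 1):
        dp[0][j] = (0, 0, j)  # insert every hypothesis word
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]  # exact match, no edit
            else:
                s, d, ins = dp[i - 1][j - 1], dp[i - 1][j], dp[i][j - 1]
                dp[i][j] = min(
                    (s[0] + 1, s[1], s[2]),        # substitution
                    (d[0], d[1] + 1, d[2]),        # deletion
                    (ins[0], ins[1], ins[2] + 1),  # insertion
                    key=sum,
                )
    return dp[-1][-1]

s, d, i = count_edits("The quick brown fox jumps over the lazy dog",
                      "The quik brown fox jump over a lazy dock")
print(f"S={s} D={d} I={i}, WER = {(s + d + i) / 9:.1%}")  # S=4 D=0 I=0, WER = 44.4%
```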
A more complex example
Consider a case with all three error types.
Reference: "She sells sea shells by the seashore"
Hypothesis: "She sell sea shells on seashore today"
Alignment:
| Reference | She | sells | sea | shells | by | the | seashore | -- |
|---|---|---|---|---|---|---|---|---|
| Hypothesis | She | sell | sea | shells | on | -- | seashore | today |
| Error type | -- | S | -- | -- | S | D | -- | I |
- S = 2 ("sells" -> "sell", "by" -> "on")
- D = 1 ("the" was deleted)
- I = 1 ("today" was inserted)
- N = 7
WER = (2 + 1 + 1) / 7 x 100% = 57.1%
In practice, the alignment step is computed algorithmically because manually aligning long transcripts with many insertions and deletions is error-prone. Research tools like NIST's sclite and Python's jiwer library automate this process.
What is a good WER?
WER benchmarks vary significantly depending on audio quality, domain, number of speakers, and language. Here is a general guide for English transcription.
| WER range | Quality level | Typical scenario |
|---|---|---|
| Below 5% | Excellent | Studio-quality audio, single speaker, clear speech, common vocabulary |
| 5% -- 10% | Good | Professional recordings, meetings in quiet rooms, interviews with good microphones |
| 10% -- 15% | Acceptable | Conference calls, webinars, moderate background noise |
| 15% -- 20% | Fair | Noisy environments, accented speech, multiple overlapping speakers |
| Above 20% | Poor | Very noisy audio, heavy accents, poor microphone quality, distant speech |
For reference, professional human transcriptionists typically achieve a WER of 4% -- 6% under favorable conditions. The gap between human and machine performance has narrowed dramatically in recent years, with the best AI systems now matching or approaching human-level accuracy on clean audio.
The quality level you need depends on your use case. A 10% WER might be perfectly acceptable for meeting notes where participants can fill in context, but it would be insufficient for legal depositions or medical transcripts where every word matters.
WER benchmarks for modern AI
Modern automatic speech recognition systems have improved substantially since 2020. Here are approximate WER figures for well-known ASR systems on standard English benchmarks.
| System | Approximate WER (clean speech) | Notes |
|---|---|---|
| OpenAI Whisper (large-v3) | 3% -- 5% | Open-source, multilingual, strong on diverse accents |
| Google Cloud Speech-to-Text (v2) | 4% -- 6% | Cloud API, supports real-time and batch transcription |
| AWS Amazon Transcribe | 5% -- 8% | Cloud API, includes speaker diarization |
| Microsoft Azure Speech | 4% -- 7% | Cloud API, customizable language models |
| Deepgram Nova-2 | 3% -- 5% | Optimized for speed and accuracy |
| Meta MMS | 5% -- 10% | Open-source, covers 1,100+ languages |
These numbers are approximate and come from published benchmarks, research papers, and independent evaluations. Actual performance varies significantly based on audio conditions, domain vocabulary, accent, and language. A system that achieves 4% WER on a clean TED talk may produce 15%+ WER on a noisy phone call.
It is also worth noting that vendors often report WER on carefully selected benchmarks. Real-world performance -- with background noise, cross-talk, domain-specific jargon, and varied recording equipment -- is typically higher than published figures. When evaluating a transcription service, test it on your own audio rather than relying solely on benchmark claims.
Limitations of WER
WER is a useful but imperfect metric. Understanding its limitations helps you interpret accuracy claims more critically.
WER ignores semantic correctness
WER treats all word errors equally. Transcribing "I need to book a flight" as "I need to book the flight" counts as one substitution error, exactly the same as transcribing it as "I need to cook a flight" -- but the first error is essentially harmless while the second changes the meaning. WER has no concept of how much an error damages comprehension.
Punctuation and capitalization are excluded
Standard WER evaluation strips punctuation and normalizes case before comparison. This means a transcript with perfect words but missing periods, commas, and question marks would score 0% WER despite being difficult to read. Conversely, a transcript with correct punctuation but word errors is penalized fully.
Formatting and structure are invisible
WER does not account for paragraph breaks, speaker labels, timestamps, or any structural formatting. Two transcripts with identical text but vastly different readability (one is a wall of text, the other is properly segmented by speaker) would receive the same WER score. For use cases like meeting transcripts where structure matters, WER alone is insufficient.
Short utterances inflate WER
WER is a ratio, so short phrases produce volatile scores. If the reference is "Yes, absolutely" (2 words) and the system outputs "Yes, definitely," that single substitution produces a 50% WER. The same type of error in a 200-word passage would contribute only 0.5% to WER. This makes WER less meaningful for evaluating short-form transcription tasks.
Normalization differences cause inconsistency
How you normalize text before computing WER affects the result. Should "Dr." and "Doctor" be treated as a match? What about "100" vs. "one hundred"? Different evaluation pipelines make different normalization choices, which is why WER numbers from different sources are not always directly comparable.
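As a concrete illustration, here is one minimal normalization pass (the specific choices -- lowercasing, stripping punctuation, collapsing whitespace -- are a convention, not a standard; `normalize` is a hypothetical helper):

```python
import string

def normalize(text: str) -> str:
    """One possible normalization convention: lowercase, strip punctuation,
    collapse whitespace. Other evaluation pipelines make different choices."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())

print(normalize("Dr. Smith arrived."))    # dr smith arrived
print(normalize("Doctor Smith arrived"))  # doctor smith arrived
```

Under this convention "Dr." and "dr" now match, but "Dr." and "Doctor" still score as a substitution -- which is exactly why two pipelines can report different WER for the same transcript.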
Other accuracy metrics
Researchers and practitioners have developed several alternative and complementary metrics to address WER's limitations.
Character error rate (CER)
CER applies the same substitution/deletion/insertion formula at the character level rather than the word level. CER is particularly useful for languages without clear word boundaries (such as Chinese, Japanese, and Thai) and for evaluating the severity of errors. A substitution of "cat" for "car" is 1 error in WER but only 1 character error in CER, while "cat" for "elephant" is still 1 WER error but many character errors.
CER = (Sc + Dc + Ic) / Nc x 100%
Where Sc, Dc, Ic are character-level substitutions, deletions, and insertions, and Nc is the total number of characters in the reference.
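The character-level calculation can be sketched with the classic two-row Levenshtein dynamic program (`levenshtein` and `cer` here are illustrative helpers, not library functions):

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (substitutions + deletions + insertions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    return levenshtein(reference, hypothesis) / len(reference)

print(f"{cer('cat', 'car'):.1%}")       # 33.3% -- a mild, one-character error
print(f"{cer('cat', 'elephant'):.1%}")  # 200.0% -- severe; CER, like WER, can exceed 100%
```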
Match error rate (MER)
MER adjusts the WER formula to account for the total number of matches rather than just the reference length. It provides a more balanced view of accuracy when the hypothesis and reference differ significantly in length.
MER = (S + D + I) / (S + D + I + C) x 100%
Where C is the number of correct (matching) words. Because the denominator includes insertions, MER is bounded at 100% even when WER is not.
Word information lost (WIL)
WIL measures how much information is lost in the transcription process. Unlike WER, which focuses on errors, WIL considers both precision (how much of the hypothesis is correct) and recall (how much of the reference was captured). WIL ranges from 0 (perfect) to 1 (complete information loss).
Semantic distance metrics
Newer evaluation approaches use language models to measure the semantic similarity between reference and hypothesis transcripts rather than exact word matching. These metrics better capture whether the meaning was preserved, even if the exact words differ. Research in this area is active but these metrics are not yet standardized.
How to improve your transcription WER
Whether you are using AI transcription or human transcription, audio quality is the single biggest factor affecting accuracy. Here are practical steps to improve your WER.
Record with a good microphone
Use a dedicated microphone rather than a laptop's built-in mic. For solo recordings, a USB condenser microphone positioned 6-12 inches from the speaker produces dramatically better results than a webcam mic across the room. For meetings, a conference speakerphone with beamforming microphones improves recognition accuracy for all participants.
Minimize background noise
Record in a quiet environment whenever possible. Close windows, turn off fans and air conditioners, and avoid locations with ambient music or conversation. Even modern noise-robust ASR models perform measurably better with clean audio. For tips on handling unavoidable noise, see our guide on transcribing noisy audio.
Speak clearly and at a moderate pace
Rapid speech, mumbling, and trailing off at the end of sentences all increase WER. When recording content that will be transcribed, maintain a consistent speaking pace and enunciate clearly. This does not mean speaking unnaturally slowly -- just avoid rushing through important points.
Use a higher audio bitrate
Encode speech at 128 kbps or higher. Heavily compressed audio (64 kbps or below) discards acoustic detail that ASR systems rely on for accurate recognition. If you are recording specifically for transcription, 256 kbps or lossless formats preserve the most useful signal.
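When you control the export step, the bitrate is usually a one-flag change. For example, with ffmpeg (assuming it is installed; the file names are placeholders):

```shell
# Export speech as 192 kbps MP3 -- comfortably above the 128 kbps floor
ffmpeg -i recording.m4a -b:a 192k recording.mp3

# Or produce a lossless mono 16 kHz WAV, a common input format for ASR systems
ffmpeg -i recording.m4a -ac 1 -ar 16000 recording.wav
```

Keep in mind that re-encoding already heavily compressed audio at a higher bitrate cannot restore detail that was discarded at recording time.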
Avoid overlapping speech
When multiple people speak simultaneously, even the best diarization systems struggle to separate and transcribe both speakers accurately. In meetings and interviews, establish turn-taking norms. If overlap is unavoidable, using individual microphones for each speaker significantly improves results.
Choose the right transcription tool
Different ASR systems have different strengths. Some handle accented speech better, others excel at domain-specific vocabulary, and some are optimized for noisy conditions. Vocova supports over 100 languages with automatic language detection and speaker diarization, which helps maintain accuracy across diverse recording conditions. Testing your specific audio type with a service before committing to a workflow is always worthwhile.
Post-process with context
After transcription, review the output with the original audio. Domain-specific terms, proper nouns, and acronyms are the most common error categories. Many transcription tools allow you to edit the transcript directly, and some support custom vocabulary lists that reduce errors on known terminology.
Frequently asked questions
What is considered a good word error rate?
A WER below 5% is considered excellent and is comparable to professional human transcription quality. For most business applications -- meeting notes, interview transcripts, content creation -- a WER between 5% and 10% is considered good and produces usable transcripts with minimal editing required.
Can WER be greater than 100%?
Yes. Because insertions add to the error count but not to the reference word count (N), a system that produces many extra words can exceed 100% WER. For example, if the reference is 10 words and the system outputs 25 words with numerous errors, the (S + D + I) / N calculation can produce a value above 1.0. This is rare with modern systems but mathematically possible.
How is WER different from accuracy?
Accuracy is sometimes reported as (1 - WER). A WER of 8% corresponds to 92% accuracy. However, "accuracy" is used loosely in marketing and can refer to different evaluation methodologies. Always ask what metric is being used and how the evaluation was conducted when you see accuracy claims from transcription providers.
Why do different ASR systems report different WER for the same audio?
WER depends on the evaluation dataset, text normalization pipeline, and scoring methodology. One vendor might normalize "Dr. Smith" to "doctor smith" before scoring while another leaves it as-is. One might evaluate on clean read speech while another uses conversational audio. These methodological differences make direct comparisons unreliable unless the same evaluation protocol is used.
Does WER account for punctuation errors?
No. Standard WER evaluation removes all punctuation before alignment and scoring. A transcript with perfect words but no punctuation at all would achieve 0% WER. Punctuation accuracy requires separate evaluation metrics, which are less standardized than WER.
How do I calculate WER for my own transcripts?
The most accessible tool is the Python `jiwer` library. Install it with `pip install jiwer`, then compute WER with a few lines of code:

```python
from jiwer import wer

reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quik brown fox jump over a lazy dock"

error_rate = wer(reference, hypothesis)
print(f"WER: {error_rate:.1%}")  # Output: WER: 44.4%
```
For longer transcripts, you will need a verified reference transcript to compare against. This typically means having a human transcriptionist produce a ground-truth version of the audio.