Word error rate (WER): how transcription accuracy is measured
Understand word error rate (WER), the standard metric for measuring transcription accuracy. Learn how WER is calculated and what makes a good score.
Word error rate (WER) is the standard metric for measuring the accuracy of automatic speech recognition (ASR) systems -- it calculates the percentage of words in a transcript that differ from a verified reference transcript through substitutions, deletions, and insertions.
Whether you are evaluating transcription services, benchmarking ASR models, or trying to understand what "95% accuracy" actually means in practice, WER is the number that matters. This guide explains how WER works, what constitutes a good score, and why the metric has both strengths and important limitations.
What is word error rate?
Word error rate measures how many words a transcription system got wrong compared to a ground-truth reference transcript. It is expressed as a percentage, where lower values indicate better accuracy: a WER of 5% means the system made errors on 5 out of every 100 words.
The formula for WER is:
WER = (S + D + I) / N x 100%
Where:
- S (Substitutions): Words that were replaced with a different word. The reference says "cat" but the transcript says "cap."
- D (Deletions): Words present in the reference that are missing from the transcript. A word was spoken but not transcribed.
- I (Insertions): Words in the transcript that do not appear in the reference. The system added a word that was never spoken.
- N: The total number of words in the reference transcript.
A WER of 0% means the transcript perfectly matches the reference. A WER of 100% means the number of errors equals the total number of reference words. WER can actually exceed 100% if the system inserts more words than the reference contains, though this is uncommon with modern systems.
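The formula is simple enough to sketch directly in a few lines of Python (a hypothetical helper for illustration; `word_error_rate` is not a standard library function):

```python
def word_error_rate(substitutions, deletions, insertions, reference_length):
    """WER = (S + D + I) / N, returned as a fraction (multiply by 100 for a percentage)."""
    if reference_length == 0:
        raise ValueError("the reference must contain at least one word")
    return (substitutions + deletions + insertions) / reference_length

# 5 errors spread over a 100-word reference -> 5% WER
print(word_error_rate(substitutions=3, deletions=1, insertions=1, reference_length=100))  # 0.05

# Insertions alone can push WER past 100%
print(word_error_rate(substitutions=0, deletions=0, insertions=15, reference_length=10))  # 1.5
```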
Why these three error types matter
Each error type reflects a different failure mode in speech recognition:
- Substitutions are the most common error type. They happen when the acoustic model confuses similar-sounding words ("their" vs. "there"), when the language model picks a statistically likely but incorrect word, or when accents and dialects cause misrecognition.
- Deletions occur when the system misses words entirely. This is common with filler words ("um," "uh"), rapid speech, overlapping speakers, or low-volume passages.
- Insertions happen when the system hallucinates words that were not spoken. Background noise, echo, or music can trigger false word detections.
Understanding the breakdown of S, D, and I errors is often more useful than the aggregate WER number alone, because it reveals where the system is failing and what might be done to improve results.
How WER is calculated
WER calculation relies on dynamic programming to find the minimum edit distance between the reference transcript and the hypothesis (system output). This is the same algorithm used for string edit distance (Levenshtein distance), applied at the word level.
Here is a step-by-step example.
Reference transcript (what was actually said):
The quick brown fox jumps over the lazy dog
Hypothesis transcript (what the system produced):
The quik brown fox jump over a lazy dock
Step 1: Align the transcripts word by word.
| Reference | The | quick | brown | fox | jumps | over | the | lazy | dog |
|---|---|---|---|---|---|---|---|---|---|
| Hypothesis | The | quik | brown | fox | jump | over | a | lazy | dock |
| Error type | -- | S | -- | -- | S | -- | S | -- | S |
Step 2: Count each error type.
- Substitutions (S): 4 ("quick" -> "quik", "jumps" -> "jump", "the" -> "a", "dog" -> "dock")
- Deletions (D): 0 (no words were omitted)
- Insertions (I): 0 (no extra words were added)
Step 3: Apply the formula.
WER = (4 + 0 + 0) / 9 x 100% = 44.4%
The total number of words in the reference (N) is 9. With 4 substitution errors, the WER is 44.4%.
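The alignment above can be computed with the same dynamic program used for Levenshtein distance, tracking the three edit types separately. A minimal sketch (the function name and the tie-breaking preference for substitutions are illustrative choices, not a standard):

```python
def count_edits(reference: str, hypothesis: str):
    """Word-level minimum-edit alignment; returns (S, D, I) counts."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # dp[i][j] = (S, D, I) needed to turn the first i reference words
    # into the first j hypothesis words
    dp = [[None] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    dp[0][0] = (0, 0, 0)
    for i in range(1, len(ref) + 1):
        dp[i][0] = (0, i, 0)  # delete every reference word
    for j in range(1, len(hyp) + 1):
        dp[0][j] = (0, 0, j)  # insert every hypothesis word
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]  # exact match, no edit
            else:
                s, d, ins = dp[i - 1][j - 1], dp[i - 1][j], dp[i][j - 1]
                dp[i][j] = min(
                    (s[0] + 1, s[1], s[2]),        # substitution
                    (d[0], d[1] + 1, d[2]),        # deletion
                    (ins[0], ins[1], ins[2] + 1),  # insertion
                    key=sum,
                )
    return dp[-1][-1]

s, d, i = count_edits("The quick brown fox jumps over the lazy dog",
                      "The quik brown fox jump over a lazy dock")
print(f"S={s} D={d} I={i}, WER = {(s + d + i) / 9:.1%}")  # S=4 D=0 I=0, WER = 44.4%
```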
A more complex example
Consider a case with all three error types.
Reference: "She sells sea shells by the seashore"
Hypothesis: "She sell sea shells on seashore today"
Alignment:
| Reference | She | sells | sea | shells | by | the | seashore | -- |
|---|---|---|---|---|---|---|---|---|
| Hypothesis | She | sell | sea | shells | on | -- | seashore | today |
| Error type | -- | S | -- | -- | S | D | -- | I |
- S = 2 ("sells" -> "sell", "by" -> "on")
- D = 1 ("the" was deleted)
- I = 1 ("today" was inserted)
- N = 7
WER = (2 + 1 + 1) / 7 x 100% = 57.1%
In practice, the alignment step is computed algorithmically because manually aligning long transcripts with many insertions and deletions is error-prone. Research tools like NIST's sclite and Python's jiwer library automate this process.
What is a good WER?
WER benchmarks vary significantly depending on audio quality, domain, number of speakers, and language. Here is a general guide for English transcription.
| WER range | Quality level | Typical scenario |
|---|---|---|
| Below 5% | Excellent | Studio-quality audio, single speaker, clear speech, common vocabulary |
| 5% -- 10% | Good | Professional recordings, meetings in quiet rooms, interviews with good microphones |
| 10% -- 15% | Acceptable | Conference calls, webinars, moderate background noise |
| 15% -- 20% | Fair | Noisy environments, accented speech, multiple overlapping speakers |
| Above 20% | Poor | Very noisy audio, heavy accents, poor microphone quality, distant speech |
For reference, professional human transcriptionists typically achieve a WER of 4% -- 6% under favorable conditions. The gap between human and machine performance has narrowed dramatically in recent years, with the best AI systems now matching or approaching human-level accuracy on clean audio.
The quality level you need depends on your use case. A 10% WER might be perfectly acceptable for meeting notes where participants can fill in context, but it would be insufficient for legal depositions or medical transcripts where every word matters.
WER benchmarks for modern AI
Modern automatic speech recognition systems have improved substantially since 2020. Here are approximate WER figures for well-known ASR systems on standard English benchmarks.
| System | Approximate WER (clean speech) | Notes |
|---|---|---|
| OpenAI Whisper (large-v3) | 3% -- 5% | Open-source, multilingual, strong on diverse accents |
| Google Cloud Speech-to-Text (v2) | 4% -- 6% | Cloud API, supports real-time and batch transcription |
| AWS Amazon Transcribe | 5% -- 8% | Cloud API, includes speaker diarization |
| Microsoft Azure Speech | 4% -- 7% | Cloud API, customizable language models |
| Deepgram Nova-2 | 3% -- 5% | Optimized for speed and accuracy |
| Meta MMS | 5% -- 10% | Open-source, covers 1,100+ languages |
These numbers are approximate and come from published benchmarks, research papers, and independent evaluations. Actual performance varies significantly based on audio conditions, domain vocabulary, accent, and language. A system that achieves 4% WER on a clean TED talk may produce 15%+ WER on a noisy phone call.
It is also worth noting that vendors often report WER on carefully selected benchmarks. Real-world performance -- with background noise, cross-talk, domain-specific jargon, and varied recording equipment -- is typically higher than published figures. When evaluating a transcription service, test it on your own audio rather than relying solely on benchmark claims.
Limitations of WER
WER is a useful but imperfect metric. Understanding its limitations helps you interpret accuracy claims more critically.
WER ignores semantic correctness
WER treats all word errors equally. Transcribing "I need to book a flight" as "I need to book the flight" counts as one substitution error, exactly the same as transcribing it as "I need to cook a flight" -- but the first error is essentially harmless while the second changes the meaning. WER has no concept of how much an error damages comprehension.
Punctuation and capitalization are excluded
Standard WER evaluation strips punctuation and normalizes case before comparison. This means a transcript with perfect words but missing periods, commas, and question marks would score 0% WER despite being difficult to read. Conversely, a transcript with correct punctuation but word errors is penalized fully.
Formatting and structure are invisible
WER does not account for paragraph breaks, speaker labels, timestamps, or any structural formatting. Two transcripts with identical text but vastly different readability (one is a wall of text, the other is properly segmented by speaker) would receive the same WER score. For use cases like meeting transcripts where structure matters, WER alone is insufficient.
Short utterances inflate WER
WER is a ratio, so short phrases produce volatile scores. If the reference is "Yes, absolutely" (2 words) and the system outputs "Yes, definitely," that single substitution produces a 50% WER. The same type of error in a 200-word passage would contribute only 0.5% to WER. This makes WER less meaningful for evaluating short-form transcription tasks.
Normalization differences cause inconsistency
How you normalize text before computing WER affects the result. Should "Dr." and "Doctor" be treated as a match? What about "100" vs. "one hundred"? Different evaluation pipelines make different normalization choices, which is why WER numbers from different sources are not always directly comparable.
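As a concrete illustration, here is one minimal normalization pass (the specific choices -- lowercasing, stripping punctuation, collapsing whitespace -- are a convention, not a standard; `normalize` is a hypothetical helper):

```python
import string

def normalize(text: str) -> str:
    """One possible normalization convention: lowercase, strip punctuation,
    collapse whitespace. Other evaluation pipelines make different choices."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())

print(normalize("Dr. Smith arrived."))    # dr smith arrived
print(normalize("Doctor Smith arrived"))  # doctor smith arrived
```

Under this convention "Dr." and "dr" now match, but "Dr." and "Doctor" still score as a substitution -- which is exactly why two pipelines can report different WER for the same transcript.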
Other accuracy metrics
Researchers and practitioners have developed several alternative and complementary metrics to address WER's limitations.
Character error rate (CER)
CER applies the same substitution/deletion/insertion formula at the character level rather than the word level. CER is particularly useful for languages without clear word boundaries (such as Chinese, Japanese, and Thai) and for evaluating the severity of errors. A substitution of "cat" for "car" is 1 error in WER but only 1 character error in CER, while "cat" for "elephant" is still 1 WER error but many character errors.
CER = (Sc + Dc + Ic) / Nc x 100%
Where Sc, Dc, Ic are character-level substitutions, deletions, and insertions, and Nc is the total number of characters in the reference.
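The character-level calculation can be sketched with the classic two-row Levenshtein dynamic program (`levenshtein` and `cer` here are illustrative helpers, not library functions):

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (substitutions + deletions + insertions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    return levenshtein(reference, hypothesis) / len(reference)

print(f"{cer('cat', 'car'):.1%}")       # 33.3% -- a mild, one-character error
print(f"{cer('cat', 'elephant'):.1%}")  # 200.0% -- severe; CER, like WER, can exceed 100%
```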
Match error rate (MER)
MER adjusts the WER formula to account for the total number of matches rather than just the reference length. It provides a more balanced view of accuracy when the hypothesis and reference differ significantly in length.
MER = (S + D + I) / (S + D + I + C) x 100%
Where C is the number of correct (matching) words. Because the denominator includes insertions, MER is bounded at 100% even when WER is not.
Word information lost (WIL)
WIL measures how much information is lost in the transcription process. Unlike WER, which focuses on errors, WIL considers both precision (how much of the hypothesis is correct) and recall (how much of the reference was captured). WIL ranges from 0 (perfect) to 1 (complete information loss).
Semantic distance metrics
Newer evaluation approaches use language models to measure the semantic similarity between reference and hypothesis transcripts rather than exact word matching. These metrics better capture whether the meaning was preserved, even if the exact words differ. Research in this area is active but these metrics are not yet standardized.
How to improve your transcription WER
Whether you are using AI transcription or human transcription, audio quality is the single biggest factor affecting accuracy. Here are practical steps to improve your WER.
Record with a good microphone
Use a dedicated microphone rather than a laptop's built-in mic. For solo recordings, a USB condenser microphone positioned 6-12 inches from the speaker produces dramatically better results than a webcam mic across the room. For meetings, a conference speakerphone with beamforming microphones improves recognition accuracy for all participants.
Minimize background noise
Record in a quiet environment whenever possible. Close windows, turn off fans and air conditioners, and avoid locations with ambient music or conversation. Even modern noise-robust ASR models perform measurably better with clean audio. For tips on handling unavoidable noise, see our guide on transcribing noisy audio.
Speak clearly and at a moderate pace
Rapid speech, mumbling, and trailing off at the end of sentences all increase WER. When recording content that will be transcribed, maintain a consistent speaking pace and enunciate clearly. This does not mean speaking unnaturally slowly -- just avoid rushing through important points.
Use a higher audio bitrate
Encode speech at 128 kbps or higher. Heavily compressed audio (64 kbps or below) discards acoustic detail that ASR systems rely on for accurate recognition. If you are recording specifically for transcription, 256 kbps or lossless formats preserve the most useful signal.
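When you control the export step, the bitrate is usually a one-flag change. For example, with ffmpeg (assuming it is installed; the file names are placeholders):

```shell
# Export speech as 192 kbps MP3 -- comfortably above the 128 kbps floor
ffmpeg -i recording.m4a -b:a 192k recording.mp3

# Or produce a lossless mono 16 kHz WAV, a common input format for ASR systems
ffmpeg -i recording.m4a -ac 1 -ar 16000 recording.wav
```

Keep in mind that re-encoding already heavily compressed audio at a higher bitrate cannot restore detail that was discarded at recording time.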
Avoid overlapping speech
When multiple people speak simultaneously, even the best diarization systems struggle to separate and transcribe both speakers accurately. In meetings and interviews, establish turn-taking norms. If overlap is unavoidable, using individual microphones for each speaker significantly improves results.
Choose the right transcription tool
Different ASR systems have different strengths. Some handle accented speech better, others excel at domain-specific vocabulary, and some are optimized for noisy conditions. Vocova supports over 100 languages with automatic language detection and speaker diarization, which helps maintain accuracy across diverse recording conditions. Testing your specific audio type with a service before committing to a workflow is always worthwhile.
Post-process with context
After transcription, review the output with the original audio. Domain-specific terms, proper nouns, and acronyms are the most common error categories. Many transcription tools allow you to edit the transcript directly, and some support custom vocabulary lists that reduce errors on known terminology.
Frequently asked questions
What is considered a good word error rate?
A WER below 5% is considered excellent and is comparable to professional human transcription quality. For most business applications -- meeting notes, interview transcripts, content creation -- a WER between 5% and 10% is considered good and produces usable transcripts with minimal editing required.
Can WER be greater than 100%?
Yes. Because insertions add to the error count but not to the reference word count (N), a system that produces many extra words can exceed 100% WER. For example, if the reference is 10 words and the system outputs 25 words with numerous errors, the (S + D + I) / N calculation can produce a value above 1.0. This is rare with modern systems but mathematically possible.
How is WER different from accuracy?
Accuracy is sometimes reported as (1 - WER). A WER of 8% corresponds to 92% accuracy. However, "accuracy" is used loosely in marketing and can refer to different evaluation methodologies. Always ask what metric is being used and how the evaluation was conducted when you see accuracy claims from transcription providers.
Why do different ASR systems report different WER for the same audio?
WER depends on the evaluation dataset, text normalization pipeline, and scoring methodology. One vendor might normalize "Dr. Smith" to "doctor smith" before scoring while another leaves it as-is. One might evaluate on clean read speech while another uses conversational audio. These methodological differences make direct comparisons unreliable unless the same evaluation protocol is used.
Does WER account for punctuation errors?
No. Standard WER evaluation removes all punctuation before alignment and scoring. A transcript with perfect words but no punctuation at all would achieve 0% WER. Punctuation accuracy requires separate evaluation metrics, which are less standardized than WER.
How do I calculate WER for my own transcripts?
The most accessible tool is the Python `jiwer` library. Install it with `pip install jiwer`, then compute WER with a few lines of code:

```python
from jiwer import wer

reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quik brown fox jump over a lazy dock"

error_rate = wer(reference, hypothesis)
print(f"WER: {error_rate:.1%}")  # Output: WER: 44.4%
```
For longer transcripts, you will need a verified reference transcript to compare against. This typically means having a human transcriptionist produce a ground-truth version of the audio.