How accurate is AI transcription? WER results across 50+ languages (2026)
AI transcription accuracy varies wildly by language. This guide compiles published benchmarks for Whisper, NVIDIA Canary, and other models across 50+ languages. See which models are usable for Japanese, Arabic, Vietnamese, and your target language.
Transcription accuracy varies dramatically by language. On clean audio in 2026, the best automatic speech recognition (ASR) systems achieve word error rates (WER) below 6% in English, Spanish, and Mandarin, 6-12% in mid-resource languages like Polish, Korean, and Turkish, and 20-40% or worse in many low-resource languages such as Amharic, Yoruba, or Sinhala. The accuracy gap comes down to training data volume, phonetic complexity, and the diversity of dialects each model has seen.
This guide compiles published WER benchmarks from Whisper, NVIDIA Canary, Google USM, and the Hugging Face Open ASR Leaderboard, organized by language tier. If you are evaluating a transcription tool for a specific language, or trying to understand why your German audio transcribes flawlessly but your Thai audio does not, the data below explains the gap.
TL;DR: accuracy tiers at a glance
| Tier | WER range | Languages (representative) | What to expect |
|---|---|---|---|
| Tier 1 | 2-6% WER | English, Mandarin, Spanish, French, German, Japanese, Italian, Portuguese | Near-human accuracy on clean audio |
| Tier 2 | 6-12% WER | Korean, Dutch, Russian, Arabic, Turkish, Polish, Catalan, Swedish | Production-grade, minor edits needed |
| Tier 3 | 12-20% WER | Vietnamese, Hindi, Thai, Greek, Romanian, Czech, Hebrew, Indonesian | Usable, expect meaningful manual cleanup |
| Tier 4 | 20-40% WER | Tamil, Bengali, Swahili, Filipino, Malay, Urdu, Nepali | Rough draft quality, human review required |
| Tier 5 | >40% WER | Amharic, Yoruba, Sinhala, Khmer, Lao, Burmese, Maltese | Experimental, often unusable without heavy post-editing |
Sources: OpenAI Whisper paper (2022), FLEURS benchmark (Google Research, 2022), Hugging Face Open ASR Leaderboard, NVIDIA Canary-1B-v2 (2025).
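The tier boundaries in the table can be expressed as a small lookup, which is handy when bucketing your own measurements. This is a minimal sketch: the function name `tier_for_wer` is hypothetical, and treating a value that sits exactly on a boundary as the better tier is an assumption, since the table's ranges share their endpoints.

```python
# Upper WER bound (in percent) for Tiers 1-4, taken from the table above.
# Assumption: a value exactly on a boundary is assigned to the better tier.
TIER_UPPER_BOUNDS = [
    (6.0, "Tier 1"),
    (12.0, "Tier 2"),
    (20.0, "Tier 3"),
    (40.0, "Tier 4"),
]


def tier_for_wer(wer_percent: float) -> str:
    """Return the accuracy tier for a WER expressed in percent."""
    if wer_percent < 0:
        raise ValueError("WER cannot be negative")
    for upper_bound, tier_name in TIER_UPPER_BOUNDS:
        if wer_percent <= upper_bound:
            return tier_name
    return "Tier 5"  # anything above 40%


print(tier_for_wer(4.5))   # German
print(tier_for_wer(13.6))  # Vietnamese
print(tier_for_wer(48.0))  # Sinhala
```

A measured value only maps cleanly to a tier if it comes from the same kind of benchmark as the table, so treat the result as a rough label rather than a verdict.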
How WER benchmarks are measured
Every number in this post comes from one of three public benchmark suites. Understanding what each one tests prevents the common mistake of comparing a lab score to real-world performance.
LibriSpeech (English only) uses clean audiobook recordings. It is the easiest benchmark most models are run against, so its numbers represent the best case: the lowest error rate a model reaches under ideal conditions. State-of-the-art English WER on LibriSpeech test-clean is around 1.4-2.7%.
FLEURS (Few-shot Learning Evaluation of Universal Representations of Speech) covers 102 languages with ~12 hours of speech per language. It uses the same sentences across languages (translations of Wikipedia content), which makes cross-language comparison meaningful. FLEURS is the most widely cited multilingual benchmark.
Common Voice (Mozilla) contains crowdsourced recordings across 100+ languages. It is noisier than FLEURS because speakers are non-professionals in varied environments, so Common Voice WER is typically 2-5 points higher than FLEURS on the same language.
Real-world audio, with accents, overlapping speakers, background noise, and imperfect recording equipment, adds another 5-15 WER points on top of benchmark numbers. A model reporting 5% WER on FLEURS may deliver 10-15% on a typical Zoom recording.
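All of these suites report the same underlying metric: the number of word substitutions, deletions, and insertions needed to turn the model output into the reference transcript, divided by the number of reference words. The sketch below shows that calculation with a standard edit-distance table. The function name and the example sentences are illustrative, and it skips the text normalization (casing, punctuation, number formatting) that published benchmarks typically apply before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from output
                curr[j - 1] + 1,     # insertion: extra word in output
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            ))
        prev = curr
    return prev[-1] / len(ref)


reference = "the meeting starts at ten tomorrow morning"
hypothesis = "the meeting starts at tan tomorrow"
# One substitution (ten -> tan) and one deletion (morning) over 7 words.
print(f"{word_error_rate(reference, hypothesis):.1%}")  # 28.6%
```

Because insertions count as errors, WER can exceed 100% when the output is much longer than the reference.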
Tier 1: very high accuracy (2-6% WER)
These languages have the largest training corpora (tens of thousands of hours of labeled audio) and the most attention from model developers. Expect production-ready transcripts on clean audio with minimal editing.
| Language | Whisper large-v3 FLEURS WER | NVIDIA Canary WER (where available) | Notes |
|---|---|---|---|
| English | 4.2% | 6.5% (Canary-1B Common Voice) | Reference language, most benchmarks focus here |
| Spanish | 3.0% | 4.6% | Especially strong on Latin American varieties |
| Mandarin Chinese | 5.7% (CER) | -- | Measured in character error rate, not WER |
| French | 4.7% | 6.0% | European French dominates training data |
| German | 4.5% | 4.8% | Strong on standard German; Swiss/Austrian dialects degrade |
| Italian | 4.0% | 4.2% | Among the best-covered European languages |
| Portuguese | 3.9% | 3.6% | Brazilian Portuguese is the training-data majority |
| Japanese | 4.9% (CER) | -- | Character-level metric; sentence quality is excellent |
Tier 1 languages benefit from commercial application pressure: dubbing, closed captioning, and search have driven dataset creation for decades. If you are transcribing in any of these, the choice of model matters less than the audio quality you feed it.
Tier 2: high accuracy (6-12% WER)
These languages have meaningful training data but either less volume than Tier 1 or more phonetic complexity. Most production use cases work well, but expect to correct occasional misheard proper nouns and technical terms.
| Language | Whisper large-v3 FLEURS WER | Notes |
|---|---|---|
| Korean | 7.0% (CER) | Character-level; sentence accuracy is generally high |
| Dutch | 6.1% | Benefits from proximity to German and English training data |
| Russian | 8.8% | Good on standard Russian; regional accents degrade |
| Arabic | 9.5% (Modern Standard) | Dialectal Arabic (Egyptian, Levantine, Gulf) is much harder |
| Turkish | 9.6% | Agglutinative morphology adds complexity |
| Polish | 8.6% | Well-covered Slavic language |
| Catalan | 5.1% | Punches above its speaker count due to dedicated datasets |
| Swedish | 7.0% | Strong for a smaller language; Nordic corpora are well-curated |
| Norwegian | 9.0% | Two written standards (Bokmål/Nynorsk) complicate evaluation |
| Ukrainian | 10.2% | Significant improvement post-2022 due to dataset growth |
| Danish | 9.6% | Difficult phonetics, but well-represented |
For Tier 2 languages, model choice starts to matter. Whisper large-v3, NVIDIA Canary-1B-v2, and Google USM tend to trade leads depending on the specific language, so benchmark-specific comparisons are worth checking before standardizing a pipeline.
Tier 3: medium accuracy (12-20% WER)
These languages are where AI transcription becomes visibly imperfect. Transcripts are still usable as a first draft, but expect to fix several errors per minute of audio, especially around named entities, numbers, and discourse particles.
| Language | Whisper large-v3 FLEURS WER | Notes |
|---|---|---|
| Vietnamese | 13.6% | Tonal; tone errors are common |
| Hindi | 13.8% | Strong variance across accents and code-switching with English |
| Thai | 13.3% (CER) | No spaces between words complicates tokenization |
| Greek | 13.5% | Smaller training corpus than other European languages |
| Romanian | 14.9% | Improving rapidly as datasets grow |
| Hebrew | 15.9% | Right-to-left script, rich morphology |
| Indonesian | 13.4% | Strong for its resource level |
| Croatian | 17.7% | Shared features with other South Slavic languages help |
| Serbian | 15.7% | Cyrillic and Latin scripts supported |
| Czech | 13.5% | Solid despite morphological complexity |
| Bulgarian | 15.6% | Slavic language with moderate resource level |
Code-switching -- where speakers alternate between two languages in a single utterance -- tends to hit Tier 3 languages harder than Tier 1 because training data is less likely to include the specific language pair.
Tier 4: lower accuracy (20-40% WER)
Languages in this tier often have tens or hundreds of millions of speakers but limited labeled training data. Transcription produces a rough draft that is faster to edit than starting from scratch but requires substantial human review.
| Language | Whisper large-v3 FLEURS WER | Notes |
|---|---|---|
| Tamil | 29.4% | Dravidian language with complex morphology |
| Bengali | 28.8% | Large speaker base but underrepresented in training |
| Telugu | 32.8% | Similar challenges to Tamil |
| Swahili | 34.2% | Lingua franca of East Africa, growing dataset size |
| Filipino (Tagalog) | 22.4% | Heavy English code-switching is common in natural speech |
| Malay | 21.3% | Shared features with Indonesian help |
| Urdu | 26.3% | Related to Hindi but written in Perso-Arabic script |
| Nepali | 30.0% | Small training corpus |
| Punjabi | 29.1% | Punjabi-English code-switching is common |
| Kannada | 33.5% | Dravidian family |
| Marathi | 30.7% | Indo-Aryan language with moderate resources |
For Tier 4 languages, hybrid workflows where AI produces the first draft and a native-speaker editor cleans it up are typically the highest-throughput option. The exception is heavily garbled output: when most of a passage is wrong, transcribing from scratch can be faster than correcting it.
Tier 5: low resource and experimental (>40% WER)
These languages either have very limited labeled data, significant phonetic distance from any language the model was trained on, or both. Transcription in these languages is usable for content indexing and search but not for publishable text.
Examples include Amharic (Ethiopia, ~42% WER), Yoruba (Nigeria, ~43% WER), Sinhala (Sri Lanka, ~48% WER), Khmer (Cambodia, ~50% WER), Lao (Laos, ~52% WER), Burmese (~55% WER), and Maltese (~45% WER). Numbers vary significantly across models and benchmarks. The gap is closing as community datasets grow, but for production use cases in these languages, specialized providers who have invested in language-specific data typically outperform general-purpose models by 5-15 WER points.
What drives the accuracy gap
Three factors explain most of the variance in WER across languages.
Training data volume is the single strongest predictor. Whisper was trained on 680,000 hours of audio, but roughly 65% of that was English. Higher-resource languages get tens of thousands of hours; the lowest-resource languages get a few hundred. The Whisper paper estimates that WER halves for roughly every 16x increase in a language's training data, so closing the gap takes far more than a doubling of data.
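To get a feel for how slowly data volume moves WER, the relationship can be treated as a power law. The sketch below is a back-of-the-envelope extrapolation, not a prediction. The function name `projected_wer` is hypothetical, and the default `halving_factor` of 16 follows the log-log fit reported in the Whisper paper, which is an assumption you should adjust for other models.

```python
import math


def projected_wer(current_wer: float, current_hours: float,
                  target_hours: float, halving_factor: float = 16.0) -> float:
    """Extrapolate WER assuming it halves for every `halving_factor`x more data."""
    if current_wer <= 0 or current_hours <= 0 or target_hours <= 0:
        raise ValueError("WER and hour counts must be positive")
    if halving_factor <= 1:
        raise ValueError("halving_factor must be greater than 1")
    # Number of halvings implied by the growth in training hours.
    halvings = math.log(target_hours / current_hours, halving_factor)
    return current_wer * 0.5 ** halvings


# Illustrative numbers: a language at 30% WER with 100 hours of data,
# grown 16x to 1,600 hours.
print(round(projected_wer(30.0, 100, 1600), 1))  # 15.0
```

Under this assumption, moving a language from 30% WER into single digits takes on the order of a hundredfold increase in labeled audio, which is why low-resource tiers improve slowly.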
Phonetic and morphological complexity creates ceiling effects even with abundant data. Tonal languages (Mandarin, Vietnamese, Thai, Yoruba) force the model to distinguish phonetically similar words by pitch contour. Agglutinative languages (Turkish, Finnish, Swahili) construct long words from many morphemes, which interact with tokenization. Writing systems that do not separate words with spaces (Chinese, Japanese, Thai) shift the metric from WER to character error rate (CER) and change what counts as a substitution. Right-to-left scripts (Arabic, Hebrew) are still scored with WER in the tables above.
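For text written without spaces between words, whitespace tokenization treats a whole sentence as a single word, which is why character error rate is reported instead. A minimal sketch of the difference, using a hypothetical Japanese example in which the model drops one character (the helper names are illustrative):

```python
def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, ref_tok in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_tok in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                         # deletion
                curr[j - 1] + 1,                     # insertion
                prev[j - 1] + (ref_tok != hyp_tok),  # substitution or match
            ))
        prev = curr
    return prev[-1]


def error_rate(ref_tokens: list[str], hyp_tokens: list[str]) -> float:
    """Edit distance divided by the number of reference tokens."""
    return edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


reference = "今日は晴れです"   # "It is sunny today", 7 characters, no spaces
hypothesis = "今日は晴です"    # the same sentence with one character dropped

wer = error_rate(reference.split(), hypothesis.split())  # one "word" each
cer = error_rate(list(reference), list(hypothesis))      # per character

print(f"WER {wer:.0%}, CER {cer:.1%}")  # WER 100%, CER 14.3%
```

The same one-character slip reads as a total failure under WER and a minor error under CER, so CER figures in the tables should not be compared directly with WER figures for other languages.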
Audio domain match matters as much as language. A model trained primarily on read-aloud audiobook data will underperform on spontaneous conversation in the same language. For business transcription use cases (meetings, interviews, podcasts), model choice should be informed by whether the provider fine-tunes on conversational or broadcast audio rather than only clean monologue.
How to improve accuracy for lower-tier languages
There are practical steps that meaningfully reduce WER for any language, though the impact is larger when the baseline is higher.
Improve the audio before transcribing. Noise reduction, speaker isolation, and consistent recording levels can cut WER by 2-5 points on real-world audio. This audio-quality guide covers the fastest wins.
Provide domain context. Many transcription APIs accept a list of technical terms, proper nouns, or phrases likely to appear in the audio. These biased vocabularies reduce substitution errors for industry jargon and named entities by 10-30% when configured correctly.
Choose the right model per language. Whisper leads on some languages, NVIDIA Canary on others, and language-specific providers on a few (particularly Japanese, Korean, and Arabic). If a specific language is critical to your workflow, testing 2-3 providers on a representative sample is worth the hour.
Use a human editor for the last mile. For Tier 3 and below, a native-speaker editor reviewing an AI transcript is roughly 5-8x faster than transcribing from scratch, and the final accuracy lands above 98%.
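The provider comparison suggested above can be sketched as follows. It assumes you have already scored each provider's output against a hand-checked reference and recorded the error and word counts per sample. The provider names and numbers are made up for illustration, and `SampleScore` and `corpus_wer` are hypothetical helpers.

```python
from dataclasses import dataclass


@dataclass
class SampleScore:
    errors: int     # substitutions + deletions + insertions vs. the reference
    ref_words: int  # number of words in the reference transcript


def corpus_wer(scores: list[SampleScore]) -> float:
    """Pool errors across samples so longer recordings carry more weight."""
    total_words = sum(s.ref_words for s in scores)
    if total_words == 0:
        raise ValueError("no reference words to score against")
    return sum(s.errors for s in scores) / total_words


# Illustrative results for three test recordings per provider.
results = {
    "provider_a": [SampleScore(12, 150), SampleScore(30, 200), SampleScore(9, 100)],
    "provider_b": [SampleScore(15, 150), SampleScore(22, 200), SampleScore(8, 100)],
}

# Rank providers from lowest to highest pooled WER.
ranking = sorted(results, key=lambda name: corpus_wer(results[name]))
for name in ranking:
    print(f"{name}: {corpus_wer(results[name]):.1%}")
```

Pooling errors before dividing avoids letting one short, unusually bad clip dominate an average of per-file percentages. With only a handful of samples, small differences between providers are within noise.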
Platforms like Vocova support transcription across 100+ languages with automatic language detection, which removes the friction of choosing the right model per language. The detection happens before transcription begins, so you do not need to tag audio files by language in advance.
Frequently asked questions
Which language has the most accurate transcription?
English has the most accurate AI transcription in 2026, with state-of-the-art models reaching 1.4-2.7% WER on clean LibriSpeech audio and around 4% WER on the FLEURS benchmark. Spanish, Mandarin, French, German, Italian, and Portuguese are close behind in the 3-6% WER range.
How accurate is Whisper across languages?
Whisper large-v3 achieves below 10% WER on approximately 30 languages on the FLEURS benchmark, including all Tier 1 and most Tier 2 languages in this guide. Its accuracy degrades sharply below that tier, with some low-resource languages exceeding 50% WER.
What WER is considered "good"?
For most business applications, a WER below 10% produces a transcript that is faster to correct than to retype from the audio. Below 5% is generally considered near-human accuracy. Above 20% requires significant manual correction to be usable as published text.
Why is my German transcription more accurate than my Thai transcription?
German is a Tier 1 language with tens of thousands of hours of training data, shared phonetic features with English (which has the largest dataset), and wide adoption in commercial transcription. Thai is a tonal, space-free language with significantly less labeled training data. Even the best models have a 7-10 point WER gap between the two.
Can I improve transcription accuracy for my specific language?
Yes. Audio quality improvements, custom vocabularies, and speaker-specific training data can all reduce WER by 5-15% in most languages. For Tier 3 and below, using a hybrid AI + human editor workflow produces final accuracy above 98% at a fraction of pure human transcription cost.
Are transcription benchmarks from FLEURS and Common Voice comparable to real-world audio?
Not directly. FLEURS audio is read speech recorded in relatively clean conditions, and although Common Voice is crowdsourced and noisier, it is still read rather than spontaneous. Real-world audio (meetings, phone calls, street interviews) typically produces 5-15 points higher WER than benchmark audio for the same language and model.
Summary
AI transcription accuracy in 2026 is a function of language tier, audio quality, and model-task fit. Tier 1 languages deliver near-human accuracy on clean audio; Tier 3 requires editing; Tier 5 is experimental. The gap between best and average performance on real-world audio has widened as top models have improved faster than mid-tier ones, making tool selection more consequential than it was three years ago.
If you are building or choosing a transcription pipeline, the most useful thing you can do is test your specific language and audio domain on 2-3 representative samples before committing. Benchmarks are a starting point, not a decision.
Sources and further reading
- OpenAI, "Robust Speech Recognition via Large-Scale Weak Supervision" (Whisper paper, 2022)
- Google Research, "FLEURS: Few-shot Learning Evaluation of Universal Representations of Speech" (2022)
- Hugging Face Open ASR Leaderboard
- NVIDIA, Canary-1B-v2 model card
- Mozilla Common Voice datasets
- Vocova on multilingual transcription
