What is speaker diarization? How AI identifies speakers in audio
Learn what speaker diarization is and how AI automatically identifies different speakers in audio recordings. Understand the technology behind speaker labels.
Speaker diarization is the process of automatically identifying and segmenting different speakers within an audio recording, answering the question "who spoke when." It is a core component of modern automatic speech recognition pipelines, enabling transcripts that attribute each spoken segment to the correct individual without requiring any prior knowledge of the speakers' identities.
Whether you are reviewing a meeting recording, transcribing a podcast episode, or analyzing a legal deposition, speaker diarization transforms a flat wall of text into a structured, readable document where every sentence is tied to the person who said it.
What is speaker diarization?
Speaker diarization, sometimes spelled "diarisation," partitions an audio stream into homogeneous segments according to the identity of the speaker. The term derives from the word "diary" -- just as a diary records who did what and when, diarization records who said what and when within a conversation.
In technical terms, a diarization system takes raw audio as input and produces a set of time-stamped labels such as "Speaker A: 0.0s -- 4.2s," "Speaker B: 4.3s -- 7.8s," and so on. The system does not need to know the speakers' names or have heard their voices before. It simply groups segments that belong to the same voice together under a consistent label.
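The output described above can be represented as a simple list of labeled time spans. Here is a minimal sketch; the `SpeakerTurn` structure and `total_speaking_time` helper are illustrative, not any particular tool's API:

```python
from dataclasses import dataclass

@dataclass
class SpeakerTurn:
    """One diarization output segment: an anonymous label plus a time span."""
    speaker: str  # consistent but anonymous label, e.g. "A"
    start: float  # seconds
    end: float    # seconds

# Hypothetical output for a short two-speaker exchange
turns = [
    SpeakerTurn("A", 0.0, 4.2),
    SpeakerTurn("B", 4.3, 7.8),
    SpeakerTurn("A", 7.9, 10.5),
]

def total_speaking_time(turns, speaker):
    """Sum the duration of every turn attributed to one label."""
    return sum(t.end - t.start for t in turns if t.speaker == speaker)
```

Because the labels are consistent within a recording, simple aggregates like per-speaker talk time fall out of the structure for free.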
Speaker diarization is distinct from speaker identification (matching a voice to a known identity) and speaker verification (confirming whether a voice belongs to a claimed identity). Diarization operates in an unsupervised fashion: it discovers how many speakers are present and clusters their speech accordingly.
How speaker diarization works
Modern diarization systems follow a multi-stage pipeline. While implementations differ, most share these core steps.
Voice activity detection
The first step is determining which parts of the audio contain human speech versus silence, music, or environmental noise. Voice activity detection (VAD) filters out non-speech regions so downstream components only process relevant audio. High-quality VAD is critical -- missed speech segments can never be recovered, and false positives introduce noise into the pipeline.
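To make the idea concrete, here is a toy energy-based VAD: a frame is marked as speech when its RMS energy exceeds a fixed threshold. This is a simplification for illustration only; production systems use trained neural classifiers rather than a raw energy gate, and the `frame_ms` and `threshold` values below are arbitrary:

```python
import numpy as np

def energy_vad(samples, sample_rate, frame_ms=30, threshold=0.01):
    """Toy VAD: flag a frame as speech when its RMS energy exceeds a
    fixed threshold. Real systems use trained neural classifiers."""
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    flags = []
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        rms = np.sqrt(np.mean(frame ** 2))
        flags.append(bool(rms > threshold))
    return flags  # one boolean per frame: True = "contains speech"

# Synthetic check: half a second of silence followed by a loud tone
sr = 16000
silence = np.zeros(sr // 2)
tone = 0.5 * np.sin(2 * np.pi * 220 * np.arange(sr // 2) / sr)
flags = energy_vad(np.concatenate([silence, tone]), sr)
```

Even this toy version shows why VAD errors are costly: any frame flagged `False` here is dropped before the rest of the pipeline ever sees it.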
Speech segmentation
Once speech regions are identified, the audio is divided into short, uniform segments, typically between 0.5 and 2 seconds in length. These segments form the basic units that the system will analyze and assign to speakers.
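A common way to produce these units is a fixed-length sliding window over each detected speech region. The sketch below assumes illustrative window and hop sizes (1.5 s windows with 50% overlap); real systems tune these values:

```python
def segment_speech(start, end, win=1.5, hop=0.75):
    """Slice a speech region [start, end] (seconds) into fixed-length,
    overlapping analysis windows. win/hop values are illustrative."""
    segments = []
    t = start
    while t + win <= end:
        segments.append((t, t + win))
        t += hop
    return segments

# A 6-second speech region yields seven overlapping 1.5 s windows
segs = segment_speech(0.0, 6.0)
```

Overlapping windows give the later stages multiple looks at each instant of speech, which helps when a speaker change falls mid-window.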
Speaker embedding extraction
Each segment is passed through a neural network that produces a fixed-dimensional vector, called a speaker embedding, that captures the unique vocal characteristics of the speaker. These embeddings encode properties like pitch, timbre, speaking rate, and vocal tract shape into a compact numerical representation.
Early systems used i-vectors for this purpose. Modern systems rely on deep neural network embeddings, particularly d-vectors and x-vectors. X-vectors, introduced by researchers at Johns Hopkins University, use a time-delay neural network architecture and have become a standard in the field. More recent approaches use ECAPA-TDNN (Emphasized Channel Attention, Propagation and Aggregation in Time Delay Neural Networks), which achieves superior performance through multi-scale feature aggregation and channel attention mechanisms.
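Whatever the embedding architecture, the downstream stages compare embeddings with a similarity measure, most commonly cosine similarity. The vectors below are toy 4-dimensional stand-ins (real x-vectors are typically 512-dimensional):

```python
import numpy as np

def cosine_similarity(a, b):
    """Similarity between two speaker embeddings; values near 1.0
    suggest the same voice, lower values suggest different voices."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy embeddings standing in for three segments
seg1 = np.array([0.9, 0.1, 0.3, 0.2])  # one voice
seg2 = np.array([0.8, 0.2, 0.4, 0.1])  # acoustically similar voice
seg3 = np.array([0.1, 0.9, 0.1, 0.8])  # a quite different voice
```

A well-trained embedding network makes same-speaker pairs score high and different-speaker pairs score low, which is exactly what the clustering stage relies on.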
Clustering
With embeddings extracted for every segment, the system groups segments from the same speaker together. This is fundamentally a clustering problem. Common approaches include:
- Agglomerative hierarchical clustering (AHC): Starts with each segment as its own cluster and iteratively merges the two most similar clusters until a stopping criterion is met. This is the most widely used method.
- Spectral clustering: Constructs a similarity graph from embeddings and uses eigenvalue decomposition to find natural groupings.
- k-means clustering: Partitions embeddings into a fixed number of clusters, though this requires knowing the number of speakers in advance.
The choice of clustering algorithm significantly affects both accuracy and the system's ability to estimate the number of speakers automatically.
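As a sketch of the AHC approach, the snippet below clusters synthetic embeddings with SciPy, using a distance threshold rather than a fixed cluster count so the number of speakers is discovered rather than supplied up front. The embeddings and the threshold value are fabricated for illustration:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Toy embeddings: two tight groups standing in for segments of two voices
rng = np.random.default_rng(0)
speaker_a = rng.normal(loc=0.0, scale=0.05, size=(5, 8))
speaker_b = rng.normal(loc=1.0, scale=0.05, size=(5, 8))
embeddings = np.vstack([speaker_a, speaker_b])

# Average-linkage AHC; merging stops once clusters are farther apart
# than the threshold, which implicitly estimates the speaker count
tree = linkage(embeddings, method="average", metric="euclidean")
labels = fcluster(tree, t=1.0, criterion="distance")
```

The threshold is the critical knob: set too high, distinct speakers merge; set too low, one speaker splits into several labels.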
Resegmentation
After initial clustering, a refinement pass re-examines speaker boundaries to correct errors. Segments near speaker transitions are often misassigned during initial clustering. Resegmentation uses Viterbi decoding or similar sequential models to smooth boundaries and enforce temporal consistency.
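As a much-simplified stand-in for that refinement step, the sketch below smooths a frame-label sequence with a majority vote over a small window, suppressing isolated label flips near boundaries. Real resegmentation uses Viterbi decoding over a sequential model, not a majority vote:

```python
def smooth_labels(labels, window=3):
    """Toy resegmentation: replace each frame label with the majority
    label in a small surrounding window. Real systems use Viterbi
    decoding over an HMM rather than this simple vote."""
    half = window // 2
    out = []
    for i in range(len(labels)):
        lo, hi = max(0, i - half), min(len(labels), i + half + 1)
        neighborhood = labels[lo:hi]
        out.append(max(set(neighborhood), key=neighborhood.count))
    return out

# The isolated "B" at index 2 is an implausible one-frame speaker flip
raw = ["A", "A", "B", "A", "A", "B", "B", "B"]
smoothed = smooth_labels(raw)
```

The effect is the same temporal-consistency prior the article describes: speakers rarely change every frame, so single-frame deviations are treated as errors.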
Why speaker diarization matters
Speaker diarization is not merely a technical convenience. It is essential for making audio content truly usable as text.
Meetings and collaboration. In a multi-participant meeting, a transcript without speaker labels is difficult to follow. Diarization lets teams quickly see who raised which points, who agreed to action items, and who asked which questions. This is particularly valuable for remote and hybrid teams reviewing recorded meetings.
Interviews and journalism. Journalists, researchers, and hiring managers need to distinguish interviewer from interviewee. Diarization automates what was previously a tedious manual process of annotating transcripts.
Podcasts and media. Podcast transcripts with speaker labels are more accessible, more searchable, and more useful for show notes and repurposing content. They also improve SEO by making content indexable per speaker.
Legal and compliance. Court depositions, regulatory hearings, and compliance recordings all require accurate attribution of statements to specific individuals. Errors in attribution can have serious consequences.
Healthcare. Clinical conversations between doctors and patients must be accurately documented. Diarization helps automated medical scribes attribute symptoms, diagnoses, and instructions to the correct party.
Accessibility. For deaf and hard-of-hearing users, captioned content with speaker identification is dramatically more useful than undifferentiated text.
Types of diarization approaches
Offline vs. online diarization
Offline diarization processes a complete audio file after recording has finished. It can analyze the entire conversation to make globally optimal decisions about speaker assignments. This approach generally produces higher accuracy because the system has access to all available information.
Online (real-time) diarization processes audio as it arrives, assigning speaker labels with minimal latency. This is necessary for live captioning, real-time meeting assistants, and voice-controlled systems. The trade-off is reduced accuracy, since the system cannot look ahead to resolve ambiguous segments.
End-to-end neural diarization
Traditional diarization pipelines chain multiple independent modules together. End-to-end neural diarization (EEND), pioneered by researchers at Hitachi and NTT, replaces this pipeline with a single neural network that directly outputs speaker labels for each time frame.
EEND models are trained on multi-speaker audio mixtures and learn to jointly handle voice activity detection, overlap detection, and speaker assignment. The EEND-EDA (encoder-decoder attractor) variant can handle flexible numbers of speakers without a fixed upper limit, addressing a key limitation of earlier EEND approaches.
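The key representational difference from clustering pipelines is the output shape: an EEND-style model emits, for each time frame, an activity probability per speaker, so two speakers can be active in the same frame. The matrix below is a fabricated example of that output format, not real model output:

```python
import numpy as np

# Hypothetical EEND-style output: one row per time frame, one column
# per speaker; each entry is the probability that speaker is active.
probs = np.array([
    [0.95, 0.02],  # speaker 1 only
    [0.90, 0.85],  # overlapped speech: both speakers active
    [0.05, 0.92],  # speaker 2 only
])

# Threshold into binary per-frame, per-speaker activity decisions
active = probs > 0.5
```

Because each frame gets independent per-speaker decisions, overlap falls out naturally, whereas a clustering pipeline must assign each frame to exactly one cluster.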
Hybrid approaches
Many state-of-the-art systems combine neural and clustering-based methods. For example, a system might use a neural network for embedding extraction and overlap detection, then apply clustering for speaker assignment, and finally refine results with a neural resegmentation model.
Challenges in speaker diarization
Despite significant progress, several problems remain difficult.
Overlapping speech
When two or more speakers talk simultaneously, traditional diarization systems struggle because each time frame is typically assigned to a single speaker. Overlap-aware models like EEND handle this better, but overlapping speech remains one of the largest sources of error. In natural conversation, overlap can account for 10--20% of speaking time.
Similar voices
Speakers of the same gender, age group, and dialect can produce very similar embeddings, causing the clustering algorithm to merge them into a single speaker. This is especially challenging in homogeneous groups, such as a panel of speakers with similar vocal characteristics.
Short utterances
Very brief turns -- a quick "yes," "right," or "mm-hm" -- provide little acoustic information for embedding extraction. These short segments are frequently misassigned.
Variable recording conditions
Diarization accuracy degrades with background noise, reverberation, low-quality microphones, and varying recording distances. A speaker close to the microphone and a speaker across the room produce very different audio characteristics, yet the system must still assign each of them a consistent label.
Unknown number of speakers
In most real-world scenarios, the number of speakers is not known in advance. The system must jointly estimate the speaker count and assign labels. Overestimating splits one speaker into two; underestimating merges two speakers into one.
How accurate is speaker diarization?
Diarization accuracy is measured using diarization error rate (DER), which combines three types of errors: missed speech (speech that goes undetected), false alarm (non-speech labeled as speech), and speaker confusion (speech attributed to the wrong speaker). Lower DER is better.
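The metric itself is a simple ratio of error durations to reference speech duration. A minimal sketch, with fabricated durations for illustration:

```python
def der(missed, false_alarm, confusion, total_speech):
    """Diarization error rate: summed error durations divided by the
    total duration of reference speech. Usually reported as a percent."""
    return (missed + false_alarm + confusion) / total_speech

# E.g. 2 s missed, 1 s false alarm, 3 s confusion in 100 s of speech
rate = der(2.0, 1.0, 3.0, 100.0)
```

Note that false alarms can push DER above 100% in pathological cases, since the denominator counts only reference speech.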
On well-studied benchmarks, the current state of the art achieves:
- CALLHOME (telephone conversations): DER in the range of 5--10%, depending on the system and evaluation conditions.
- AMI meeting corpus: DER in the range of 10--20% for far-field recordings, lower for close-talk microphones.
- DIHARD challenge (diverse, difficult audio): DER in the range of 15--25%, reflecting the difficulty of real-world conditions including children's speech, web video, and clinical interviews.
For typical two-speaker conversations recorded with decent audio quality, modern systems regularly achieve DER below 5%. Performance degrades as the number of speakers increases, audio quality decreases, or overlap becomes more frequent.
It is worth noting that DER measurements vary significantly depending on evaluation protocol. The forgiveness collar (a small time buffer around speaker transitions that is excluded from scoring) and whether overlap regions are scored both materially affect reported numbers. When comparing systems, ensure the evaluation conditions match.
Speaker diarization in practice
In transcription tools like Vocova, speaker diarization works alongside automatic speech recognition to produce labeled transcripts directly from uploaded audio. You upload a recording -- a meeting, interview, podcast, or any multi-speaker audio -- and the system returns a transcript where each segment is tagged with a speaker label and timestamp.
Vocova processes audio in 100+ languages with automatic language detection and applies diarization to identify individual speakers throughout the recording. The result is a structured transcript that you can export as PDF, SRT, VTT, DOCX, or other formats, with speaker labels preserved. This eliminates the manual work of listening back and annotating who said what.
For teams and individuals who work with multi-speaker recordings regularly, automated diarization can reduce post-recording processing time from hours to minutes.
Frequently asked questions
What is the difference between speaker diarization and speaker recognition?
Speaker diarization segments audio by speaker without knowing who the speakers are. It answers "who spoke when" by grouping speech from the same voice under a consistent label like "Speaker 1" or "Speaker 2." Speaker recognition, by contrast, identifies a specific known individual by matching their voice against a stored voiceprint. Diarization is unsupervised; recognition requires prior enrollment of known speakers.
How many speakers can diarization handle?
There is no hard technical limit, but accuracy decreases as the number of speakers increases. Most systems perform well with 2--6 speakers. Beyond 8--10 speakers, error rates rise significantly due to the difficulty of distinguishing many voices and the increased likelihood of short turns and overlapping speech. For large group recordings, combining diarization with additional metadata (such as microphone assignments) can improve results.
Does speaker diarization work in real time?
Yes, online diarization systems can assign speaker labels with low latency, typically within a few seconds. Real-time diarization is used in live captioning, meeting assistants, and voice analytics platforms. However, real-time systems generally have higher error rates than offline systems that process complete recordings, because they cannot use future context to resolve ambiguous segments.
Can diarization tell me the speakers' names?
Not by itself. Diarization assigns anonymous labels (Speaker 1, Speaker 2, etc.) because it does not know who the speakers are. To map labels to names, you need either speaker identification (matching against known voiceprints) or manual annotation after the fact. Some transcription tools allow you to rename speaker labels after diarization is complete.
How does audio quality affect diarization accuracy?
Audio quality has a substantial impact. High-quality recordings from close-talk microphones in quiet environments yield the best results. Background noise, reverberation, low bitrate compression, and far-field recording (speaker far from the microphone) all degrade accuracy. Phone calls and conference room recordings with a single shared microphone are more challenging than individual headset recordings.
What is diarization error rate (DER)?
Diarization error rate is the standard metric for evaluating diarization systems. It is calculated as the total duration of errors (missed speech + false alarm speech + speaker confusion) divided by the total duration of reference speech. A DER of 0% means perfect diarization. State-of-the-art systems achieve DER in the range of 5--15%, depending on the difficulty of the audio. The metric is defined by NIST and is used across academic benchmarks and industry evaluations. For more on transcription accuracy metrics, see our guide on word error rate.