July 4, 2025
[Audio demo: "Taming The TTS Beast" - before (single candidate, no ranking) vs. after (best of 8 candidates)]
At PlayAI, we're passionate about building frontier conversational models, specializing in text-to-speech (TTS) and speech-to-speech (S2S) synthesis. We began our journey in 2018 at the application layer, but found existing voice models lacking. We then set out to make computers speak like humans, beginning by optimizing transformer architectures for real-time speech applications. Today, a talented ML research team at PlayAI is pushing the boundaries of what's possible in the conversational voice space.
Our path has been paved with experimentation - tens of model architectures and thousands of targeted experiments. One of our key strategies has been the careful application of sophisticated inference-time techniques. These techniques allow us to extract the maximum potential from our models, ensuring reliability and quality, especially when serving businesses that demand consistent performance.
This blog post will dive into the inference-time techniques we've found effective, particularly for autoregressive TTS models, discussing their complexity, effectiveness, and limitations.

The Model Development Flywheel
Progress is not a straight line but a continuous cycle of experimentation and improvement. Our development flywheel is the process that connects foundational research with the demands of flawless user experiences. It's how we learn, build, and improve, ensuring each model is more capable and reliable than the last.

The steps in the flywheel are:
Design: Begin by designing the next generation of models. This process is fueled by the latest research, lessons learned from our current products, and direct customer feedback.
Build: Build the new base model, training it on vast, curated datasets to achieve a new level of naturalness and capability.
Refine & Harden: A raw model has immense potential but also rough edges. This is a critical stage where we truly tame the beast. Through the inference-time techniques detailed in this post, we harden the model - promoting its strengths, addressing its weaknesses, and installing the crucial "guardrails" that ensure it performs reliably in the wild.
Deliver: The hardened model is delivered to our customers. Here, our focus is on a seamless, high-quality experience, served efficiently and at scale.
Then the cycle begins again. The insights we gather from production - every challenge and success - become the fuel for the next design phase.
This post is a deep dive into that crucial Refine & Harden stage - our approach for bridging the gap between a powerful generative model and a truly dependable, high-quality product.
The Core Challenges: Streaming, Latency, and Continuity
Before we explore specific techniques, let's set the stage. Generating speech from text involves feeding a tokenized input (text) into a TTS model. The model then autoregressively generates a sequence of acoustic tokens. These tokens represent discrete units of speech, which are subsequently converted into an audible waveform by a vocoder or audio codec.
Acoustic Tokens & Codecs: These tokens can represent various acoustic features, such as Mel spectrogram frames, or discrete units from neural audio codecs. The choice of token representation impacts generation speed; models typically generate anywhere from 20ms to 100ms of speech per token, depending on the sampling rate and codec.
Streaming for Low Latency: A cornerstone of good speech synthesis is low Time To First Byte (TTFB) and sub-realtime generation (generating audio faster than it's spoken). To achieve this, we stream audio to the user. This is typically done by generating audio tokens in chunks. We start with smaller chunks to minimize initial latency and then continue with larger, N-token chunks. A few challenges stem from this:
Audio Continuity: Chunks must flow smoothly without audible artifacts, clicks, or pops.
Contextual Coherence: Each chunk needs enough context from previous tokens to sound natural and maintain prosodic consistency.
Latency Constraints: All processing for a chunk must happen quickly to keep the stream going.
All the inference techniques we'll discuss operate within this demanding context of streaming chunks of tokens, converting them to speech, and delivering them seamlessly.
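To make the latency trade-off concrete, here is a minimal sketch of a chunk-size schedule in Python. The token counts and the 50 ms-per-token figure are illustrative placeholders within the range quoted above, not our production settings:

```python
def chunk_schedule(first_tokens=8, steady_tokens=32, ms_per_token=50):
    """Yield (tokens, audio_ms) per chunk: a small first chunk keeps
    Time To First Byte low; larger steady-state chunks amortize
    per-chunk ranking and vocoding overhead."""
    yield first_tokens, first_tokens * ms_per_token        # e.g. ~400 ms of audio
    while True:
        yield steady_tokens, steady_tokens * ms_per_token  # e.g. ~1.6 s per chunk
```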
Foundational Strategy: Parallel Candidate Generation
The core of our low-latency, high-reliability strategy lies in parallel candidate generation. Instead of generating a single audio output for a given text chunk, our model generates multiple (typically 4-8) candidate audio chunks in parallel.
Each of these candidates represents a slightly different interpretation of the text, with variations in prosody, pacing, and emotional emphasis. Once generated, these candidates are passed to a selection module, which uses one or more of the ranking techniques described below to select the "best" one. The audio from this winning candidate is then streamed to the user.
Crucially, the internal state of this winning candidate (its token history, attention state, etc.) is used as the starting point for generating the next set of parallel candidates in the subsequent step. This ensures a seamless and contextually coherent transition from one audio chunk to the next.
This transforms TTS into a dynamic process with built-in choice. The critical question then becomes: how do we select the best candidate?
Taming the beast requires a sophisticated toolkit, and ours has evolved significantly over time. The following sections walk through this evolution, from our foundational techniques to our current state-of-the-art. You will see how our approach has matured—often by replacing a successful older method with a more robust successor, and in some cases, by learning to elegantly combine signals from multiple sources into a single, decisive logic.
This strategy of exploring multiple potential futures to enhance the final output is not unique to TTS and reflects a broader trend in generative AI [R1]. At inference time, similar concepts can be seen in Large Language Models for text generation, such as speculative and lookahead decoding, where candidate token sequences are generated and verified in parallel to guide the primary model. Furthermore, the core idea of learning from multiple generated outputs is also being integrated directly into the training phase with techniques like GRPO (Group Relative Policy Optimization), which samples multiple candidate outputs per prompt and optimizes the generation policy using their relative rewards. Whether applied during training or inference, this general principle of "generate-and-rank" is proving to be a powerful paradigm for building more reliable and high-performing generative models.
Chunk-by-Chunk Selection:
This process happens at each chunk boundary during streaming:
Generate Candidates: For the current text segment to be synthesized, we generate N distinct candidate audio chunks.
Rank Candidates: Once all N candidates for the chunk are generated, we apply a ranking strategy (detailed in subsequent sections) to score each candidate and select the "best" one.
Ensure Continuity: To ensure smooth audio flow into the next chunk, the internal state and Key-Value cache of the chosen best candidate are propagated. The next set of N parallel generations all start from this same selected state. This means that even though we explored multiple paths for the current chunk, the next chunk's generation begins from a single point.
Stream and Repeat: The selected best chunk is sent to the vocoder and streamed to the user. The process then repeats for the subsequent text segment.
This method allows us to explore a small, localized "search space" at each step, significantly increasing the chances of finding a high-quality path through the generation process and avoiding getting stuck in a suboptimal generation trajectory.
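Put together, the loop looks roughly like the Python sketch below. `generate_chunk`, `initial_state`, and the candidate objects are hypothetical stand-ins for the real decoder interface, and `rank` is any of the ranking strategies discussed in this post; in practice the N candidates come from one batched forward pass rather than a Python loop:

```python
def stream_best_of_n(model, text_chunks, rank, n_candidates=8):
    """Chunk-by-chunk best-of-N decoding (minimal sketch)."""
    state = model.initial_state()               # token history + KV cache
    for text in text_chunks:
        # Branch N candidate continuations from the single shared state.
        candidates = [model.generate_chunk(text, state.clone())
                      for _ in range(n_candidates)]
        best = max(candidates, key=rank)        # plug in any ranking strategy
        yield best.audio_tokens                 # vocode + stream this chunk
        state = best.state                      # winner's state seeds the next step
```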

Technique: Guided Generation (Steering with Unconditional Prompts)
Even the most powerful TTS models, our beasts, can default to generic-sounding speech or stray from the intended voice or style, especially over longer sentences. Scale alone doesn’t generate human-like speech. To guide them more effectively, we employ a technique inspired by Classifier-Free Guidance (CFG) [R2], a popular method in image generation.
The core idea is straightforward: we show the model not only what we want it to generate (the "conditional" prompt) but also an example of what we don't want, or what it might produce without specific instructions (the "unconditional" or "null" prompt). By contrasting these, we can steer the generation process.
At PlayAI, we've previously tailored this approach for TTS with three types of guidance:
Text Guidance:
Goal: To keep the model strictly on-script and prevent it from mumbling, repeating, or generating hallucinations.
How: We create an "unconditional text prompt" by taking the structure of our original input but replacing the actual text tokens with special placeholder tokens.
Voice Guidance:
Goal: To ensure the generated voice consistently matches the target speaker and doesn't drift towards a generic voice.
How: We make an "unconditional voice prompt" by using the original text and style cues but substituting the target speaker's specific voice identifiers with "generic voice" tokens.
Style Guidance:
Goal: To maintain the desired emotion, prosody, language, or speaking style throughout the audio.
How: We create an "unconditional style prompt" that keeps the original text and speaker information but replaces the specific style identifiers with "generic style" tokens. This captures a default or neutral delivery.
The Steering Mechanism: Adjusting Model Outputs
During generation, for each active guidance, our system essentially gets two sets of predictions (logits) from the model:
Conditional Logits: From the prompt with all our desired specifics (text, voice, style).
Unconditional Logits: From the corresponding "null" prompt (e.g., meaningless text, generic voice, or generic style).
Before the model picks the next sound token, we adjust the conditional logits. For example, with text guidance:
Final Logits = Conditional Logits - (Text_Guidance_Strength * Unconditional_Text_Logits)
We do a similar subtraction for voice and style guidance, each with its own "strength" coefficient. This nudges the model's choices away from the generic/unconditional outputs and towards the specific characteristics we want, making the speech clearer, more consistent in voice, and more expressive in style. These operations are batched for efficiency, running different guidances in parallel.
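As a rough illustration, the adjustment amounts to a few tensor operations. This sketch follows the subtraction formula above; the `uncond_logits` dict, the strength values, and the batching comment are assumptions for illustration, not our exact implementation:

```python
import torch

def apply_guidance(cond_logits: torch.Tensor,
                   uncond_logits: dict[str, torch.Tensor],
                   strengths: dict[str, float]) -> torch.Tensor:
    """Steer next-token logits away from each null prompt's predictions."""
    final = cond_logits.clone()
    for kind, u in uncond_logits.items():   # 'text', 'voice', 'style'
        final = final - strengths[kind] * u
    return final

# In practice the conditional and all active null prompts are stacked into
# one batch, so a single forward pass yields every set of logits at once.
```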
Key Benefits:
This significantly boosts control. It helps our TTS beast stick to the script, maintain speaker identity, and deliver the intended style more reliably.
Challenges and Considerations:
Extra Computation: Each type of guidance adds more computational work, which can impact latency and throughput.
Tuning the Strength: The "strength" coefficients are key. Too low, and the guidance is weak; too high, and the speech can sound unnatural or exaggerated. Finding the right balance is crucial. Combining multiple guidances becomes tricky and usually leads to undesirable effects as they interfere with each other.
Proactive steering is a powerful start, but our next step on the evolutionary ladder was to reactively inspect the generated audio. Our earlier models accomplished this by learning to interpret the TTS model's own "internal monologue." The first such method we employed, which served us well for a time, was to analyze the model's attention patterns.
Ranking Strategy: Attention-Based Selection
Building on our parallel candidate generation strategy, we can employ a ranking method that utilizes an attention head which naturally exhibits strong alignment properties within the TTS model.
Identifying a Suitable Internal Alignment Head:
The first step is to identify an attention head within the model that exhibits a strong, monotonic alignment between text tokens and their corresponding acoustic representations. This usually involves analyzing the attention patterns across different layers and heads for a diverse set of text inputs. Once identified, this attention head becomes our internal “compass.” It's worth noting that while this approach leverages existing model components, our later-generation models have moved towards more advanced techniques. These include training dedicated, lightweight alignment models or even having the generative model produce its own explicit text-to-speech alignment data alongside the audio tokens, offering a more robust and direct signal.
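A simplified version of that search can be expressed as below. The scoring proxy (peak sharpness times monotonic progress) is an illustrative assumption; any metric that rewards a clean diagonal would do:

```python
import numpy as np

def alignment_likeness(attn: np.ndarray) -> float:
    """attn: (audio_frames, text_tokens) attention map for one head,
    rows normalized to probabilities. Rewards maps whose attended text
    position is sharp and moves forward monotonically."""
    peaks = attn.argmax(axis=1)               # attended token per frame
    forward = (np.diff(peaks) >= 0).mean()    # fraction of non-regressing steps
    sharpness = attn.max(axis=1).mean()       # how focused each frame is
    return float(forward * sharpness)

def find_alignment_head(attn_maps: dict) -> tuple:
    """attn_maps: {(layer, head): map averaged over diverse probe texts}."""
    return max(attn_maps, key=lambda k: alignment_likeness(attn_maps[k]))
```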
Ranking with Custom Heuristics:
Generate Candidates: As before, produce N different audio versions for the current text segment using different random seeds.
Extract Attention Maps: For each of the N generated candidates, extract the attention map from the pre-identified internal alignment head.
Apply Custom Heuristics for Scoring: Each candidate is then scored based on the characteristics of its attention map. This scoring relies on:
Custom Heuristics: We define a set of rules and preferences that quantify "good" alignment. For example:
Higher scores for attention patterns that are sharply focused on the current text token.
Penalties for patterns that are overly diffuse, indicating model uncertainty.
Significant penalties if the attention pattern indicates a stall on a previous word (potential repetition) or a premature jump to a future word (potential skip).
Preference for attention that progresses smoothly and monotonically along the expected diagonal.
Optimized Kernels: To efficiently apply these heuristics across numerous candidates in real-time streaming scenarios, we implement custom, optimized compute kernels.
Select the Best Candidate: The audio candidate achieving the highest score based on these alignment heuristics is selected as the "best" for the current chunk.
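A minimal sketch of such a scoring function follows, with placeholder weights that would be tuned per model (the production versions run as fused kernels):

```python
import numpy as np

def attention_score(attn: np.ndarray, prev_pos: int) -> float:
    """Score one candidate's chunk-level attention map (frames x tokens).
    prev_pos is the text position reached by the previous chunk.
    All weights are illustrative placeholders."""
    peaks = attn.argmax(axis=1)
    steps = np.diff(peaks, prepend=prev_pos)
    focus = attn.max(axis=1).mean()                          # reward sharp attention
    diffuseness = -(attn * np.log(attn + 1e-9)).sum(axis=1).mean()
    stall = (steps == 0).all()                               # stuck on one word
    skip = (steps > 2).any()                                 # jumped ahead
    regress = (steps < 0).any()                              # went backward
    return (focus - 0.5 * diffuseness
            - 2.0 * stall - 2.0 * skip - 2.0 * regress)
```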
Key Benefits:
Reduces hallucinations and skips: Like text-based LLMs, speech models can hallucinate by repeating or skipping words. Directly selecting candidates with superior text-to-speech alignment significantly reduces word skipping, repetition, and other errors where the speech deviates from the source text.
Utilizes Model's Intrinsic Signals: This method leverages information already present and learned by the model, avoiding the need for external alignment tools or models for this ranking step. Additionally, the scoring heuristics can be refined and tuned based on the specific behaviors and common failure modes observed in the particular TTS model.
Challenges and Considerations:
Dependence on Overall Model Stability: If the TTS model enters a state of severe misgeneration (e.g., deep hallucination or producing unintelligible noise), the attention patterns from the identified internal alignment head are also likely to become unreliable or meaningless for all candidates. In such scenarios, the attention-based ranking may not be able to discern a truly good candidate. [S1]
Because of these limitations, we found that attention, while conceptually elegant, was not the most reliable signal. We soon evolved to a simpler, more direct indicator of quality that replaced our attention-based methods: analyzing the model's own token-level confidence.
Ranking Strategy: Probability & Entropy-Based Selection
When faced with multiple generated speech candidates, we can use the model's internal confidence scores - specifically, the probabilities it assigns to each chosen acoustic token - to guide our selection towards the most accurate output.
The probability of a chosen token indicates how "sure" the model was about that specific output. Similarly, the entropy of the probability distribution over all possible next tokens provides a complementary measure of the model's uncertainty.
While higher probability often seems better, we've observed that extreme average probabilities for an entire candidate chunk (either too low or too high) frequently signal problems:
Too Low Average Probability (e.g., below 0.15 for the chunk): This indicates widespread model uncertainty. Such chunks are highly prone to mispronunciations, incorrect words, silence where speech should be, or hallucinations.
Too High Average Probability (e.g., above 0.5 for the chunk): While seemingly positive, this "overconfidence" can also be problematic. We have observed that fluent conversational models are usually overconfident when hallucinating.
Our goal is to find candidates that reside in a "sweet spot" of reasonable confidence, avoiding these extremes.
Our Ranking Process:
Filter by Viable Confidence Range: For each candidate audio chunk, we compute its average token probability. We then eliminate any chunks where this average falls outside a pre-defined "viable" range. This step is designed to discard candidates most likely to exhibit the error types mentioned above.
Rank Remaining Candidates by Highest Probability: Among the candidates that fall within our desired confidence range, we then select the one with the highest average token probability.
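In code, the whole strategy is only a few lines. The 0.15/0.5 band mirrors the illustrative thresholds above and must be re-tuned per model; falling back to the full pool when every candidate is filtered out is our own assumption for the sketch:

```python
def rank_by_confidence(candidates, low=0.15, high=0.5):
    """candidates carry per-token probabilities of their chosen acoustic
    tokens (hypothetical .token_probs attribute)."""
    avg_p = lambda c: sum(c.token_probs) / len(c.token_probs)
    viable = [c for c in candidates if low <= avg_p(c) <= high]
    pool = viable or candidates       # fallback if every chunk is extreme
    return max(pool, key=avg_p)       # highest confidence within the band
```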
Key Benefits:
Reduces Common Errors: Helps significantly filter out candidates prone to mispronunciations, incorrect words, unwanted silences, or random hallucinations by avoiding outputs where the model is either too uncertain or suspiciously overconfident. For instance, in our evaluations, we have seen this simple ranking strategy improve Word Error Rate (WER) by 47% relative (from 6.56% to 3.49%) and Character Error Rate (CER) from 3.76% to 1.75% on the CMU Arctic dataset for some of our more unstable model checkpoints.
Computationally Efficient: This ranking uses readily available probability information directly from the model, adding minimal computational overhead to the candidate selection process.
Challenges and Considerations:
Model-Specific Thresholds: The precise probability values defining the "viable range" are empirical and must be carefully tuned for each specific TTS model and its typical output characteristics.
Heuristic Approach: While effective, this is a heuristic. Probability is an indicator, not a perfect predictor, of perceptual quality or absolute correctness. [S2]
While confidence scores are an excellent and efficient way to ensure the model is generating stable and coherent audio, they don't explicitly verify that the model spoke the correct words. To address the critical challenge of textual accuracy - ensuring the beast adheres to the script - we employ a different and more specialized technique from our toolkit: making the model listen to itself.
Ranking Strategy: ASR-based "Self-Correction"
Another approach we have seen work, especially for ensuring textual accuracy and adherence to desired speech characteristics, is an ASR-based ranking method. The core idea is to reverse the TTS task: use the generated audio itself to evaluate candidates and produce ranking scores.
After generating N distinct speech candidates for a given chunk:
Evaluate with the Same Model in "ASR Mode": For each candidate, we use our conversational model to assess how well its generated acoustic tokens correspond to the original input text. Instead of a traditional ASR that outputs text, we essentially ask the model: "Given these generated acoustic tokens, what is the likelihood that they represent this original input text?"
The candidate for which the model assigns the highest likelihood of matching the input text is considered the best in terms of textual fidelity.
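A sketch of that likelihood computation, assuming a hypothetical `asr_logits` interface that runs the model in the reverse, text-predicting direction:

```python
import torch.nn.functional as F

def text_likelihood(model, acoustic_tokens, text_tokens):
    """Average teacher-forced log-probability of the original text given
    one candidate's acoustic tokens. Higher = better textual fidelity."""
    logits = model.asr_logits(acoustic_tokens, text_tokens)   # (T, vocab)
    logp = F.log_softmax(logits, dim=-1)
    return logp.gather(-1, text_tokens.unsqueeze(-1)).mean().item()

# best = max(candidates, key=lambda c: text_likelihood(model, c.tokens, text_ids))
```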
Extending to Speech Characteristics (e.g., Emotion Scoring):
This "self-evaluation" concept can extend beyond mere textual accuracy. If the input prompt included desired speech characteristics (e.g., "speak angrily," "whisper"), and our model is capable of classifying or recognizing these attributes in speech:
For each candidate, we can use the model to evaluate how well it embodies the requested characteristic. For instance, if "angry speech" was requested, the model can analyze each generated audio candidate and provide a score indicating the perceived "anger."
This "description score" can then be factored into the overall ranking, or used as the primary ranking criterion if the attribute is critical.
Key Benefits:
Synergy and Efficiency: Using the same model for both generation and evaluation can be more efficient than employing a completely separate ASR model. It leverages the already-loaded model and its learned representations, and operates directly in acoustic token space.
Attribute Matching: The ability to score against desired speech characteristics like emotion is a powerful way to ensure the model isn't just saying the right words, but saying them in the right way.
Challenges and Considerations:
Computational Cost: Performing this ASR-like evaluation for N candidates adds computational overhead, which can impact latency. While candidate evaluations can be batched, the need for quick decisions on each chunk during streaming places constraints on how much batching is feasible without increasing TTFB or breaking the real-time flow.
Effectiveness on Short Chunks: A significant challenge in streaming is that ASR systems (even when part of the same model) often perform best with more contextual information. Evaluating very short, isolated audio chunks can be less reliable. The model might struggle to accurately assess such brief acoustic segments without sufficient preceding or succeeding audio context. This can make chunk-by-chunk ASR scoring less robust unless the model is specifically trained or adapted for this type of short-context evaluation. [S3]
Each of these methods marked a significant step forward in our journey. Our most robust solution to date, however, comes from introducing a dedicated, expert supervisor. It is here, at the top of our evolutionary ladder, that we move from replacing techniques to elegantly combining them. This state-of-the-art approach integrates signals - like the TTS model's own confidence scores - with the analysis of a specialized aligner model, all governed by complex decision-making logic.

Ranking Strategy: Dedicated Streaming Aligner with Advanced Logic
While internal model signals are useful, one of the most robust techniques for candidate selection, especially in demanding streaming scenarios, employs a separately trained, specialized aligner model [R3] [R4].
Dedicated Aligner Model:
This model is explicitly designed to operate on acoustic tokens, assess their alignment to text, and identify common TTS issues like unwanted silence or hallucinations. However, the aligner's raw output is just the starting point.
Intelligent Post-Aligner Logic:
The true strength of this approach lies in the complex decision logic we've built on top of the aligner's predictions. This logic intelligently combines multiple signals to make the final selection:
Generate Candidates & Get Aligner Scores:
As usual, N parallel speech candidates are generated for the current text chunk.
Each candidate's acoustic tokens are fed to the dedicated aligner, which outputs likelihoods for each acoustic frame corresponding to:
Each token in the input text.
A "silence" state.
A "missing text" or "hallucination" state.
Multi-Factor Candidate Scoring: The aligner's raw probabilities are then processed by a ranking algorithm that considers:
Text Coverage: How well the generated audio covers the intended text tokens, rewarding candidates that make good progress.
Penalizing Undesirables: Explicitly penalizing candidates with a high likelihood of "missing text" (hallucinations) or excessive/misplaced silence, using scales that can be adjusted based on context.
TTS Model Confidence Integration: Incorporating the TTS model's own output probabilities for the generated units, often applying penalties if a candidate's average probability falls outside an empirically determined "healthy range" (as discussed in the Probability & Entropy-Based Selection strategy above). This acts as a cross-check.
Sequential Flow: Analyzing the progression of alignment through the text. Penalties are applied if a candidate seems to get "stuck" on a text token, tries to go backward, or exhibits other non-monotonic behaviors.
Stream State Awareness: The logic adapts based on the streaming context. For instance, it considers whether candidates have naturally "finished" generating and might prioritize these, especially towards the end of an utterance. It also includes logic for truncating audio or forcing an end to the stream if necessary to prevent runaway generation.
Candidate Selection:
The various scores and penalties are combined into a final score for each candidate.
The candidate with the best overall score, reflecting a balance of textual accuracy, avoidance of errors, good acoustic properties (via TTS probabilities), and smooth progression, is selected.
The system also updates its tracking of the current position in the text based on the selected candidate, influencing future decisions.
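Condensed to its skeleton, the per-candidate scoring combines those signals roughly as follows. Every attribute on `cand` and every weight is a hypothetical placeholder standing in for the tuned production logic:

```python
def aligner_score(cand, prev_pos, p_low=0.15, p_high=0.5):
    """Combine aligner posteriors, TTS confidence, and flow checks."""
    coverage = cand.text_pos - prev_pos             # reward forward progress
    halluc = cand.p_hallucination.mean()            # "missing text" likelihood
    silence = cand.p_silence.mean()                 # unwanted/misplaced silence
    score = 1.0 * coverage - 4.0 * halluc - 1.0 * silence

    avg_p = cand.token_probs.mean()                 # TTS confidence cross-check
    if not (p_low <= avg_p <= p_high):
        score -= 2.0                                # outside the "healthy range"
    if cand.regressed or cand.stalled:              # non-monotonic progression
        score -= 3.0
    if cand.finished and cand.near_utterance_end:   # prefer natural endings
        score += 1.0
    return score
```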

Key Benefits:
Optimized for Streaming: The logic is designed to make good decisions chunk-by-chunk, adapting to the evolving state of the generated speech.
Superior Robustness: This multifaceted method significantly enhances robustness and accuracy by effectively mitigating potential flaws in both the TTS model and the aligner. Our evaluations demonstrate these gains, for instance, achieving a WER of 2.20% and a CER of 0.88% on the CMU Arctic dataset with this method.
Balancing Metrics with True Perceptual Quality: Critically, these accuracy improvements are achieved while preserving very natural prosody and expressivity - a balance we consider essential for genuine high-quality synthesis. While it's technically possible to achieve far lower WER/CER figures, this often involves constraining the model to produce less expressive, more monotonous speech. Such an approach can misleadingly improve scores, partly because the ASR systems used for evaluation process simpler, less dynamic audio more consistently and tend to misinterpret rich prosodic variations or expressive nuances in lively speech as errors.
Challenges and Considerations:
Computational Overhead: Running a separate aligner model and complex scoring logic for multiple candidates per chunk is computationally intensive, impacting latency and throughput at high batch sizes.
Complexity: Developing and tuning this intricate system (the aligner model and the extensive ranking logic) is a significant engineering effort. The various scales, thresholds, and penalties within the scoring logic require careful and extensive empirical tuning.
The Ongoing Quest for Perfect, Reliable Speech
"Taming the TTS Beast" is an ongoing journey. While our foundational models are capable of producing impressive speech, bridging the gap to the near-perfect, reliable performance demanded by business-critical applications requires the inference-time strategies we’ve discussed. They are the "final mile" techniques, providing the crucial guardrails and polish needed to ensure every utterance is natural, accurate, and engaging.
Our work in this area is far from over. We are constantly striving to:
Simplify: Streamline these inference-time techniques for greater efficiency without sacrificing quality.
Adapt: Tune and evolve these methods for the latest model architectures we develop.
Integrate: Explore ways to move some of these techniques directly into the model training process, further simplifying the inference stack while maintaining peak speech quality.
This drive to simplify, adapt, and integrate is the very engine of our Development Flywheel, turning today’s inference-time learnings into tomorrow's more capable base models. The goal remains the same: to generate speech that is indistinguishable from human speech, while providing a level of granular control and flawless consistency that is beyond the scope of human performance. ⬢
© Yavor Ivanov · Software Engineer (Machine Learning Division)
—
References and Further Reading
[R1] Tree of Thoughts: Deliberate Problem Solving with Large Language Models. [Source]
[R2] Classifier-Free Diffusion Guidance. [Source]
[R3] MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. [Source]
[R4] Squeeze-and-Excitation Networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). [Source]
Reference Samples
[S1] “Real love should draw no blood from the loved and buckets from the lover. The thing I never understood about love is that it can't be quelled, like lust can.”
Single candidate, no ranking
Attention-Based Selection
[S2] “Real love should draw no blood from the loved and buckets from the lover. The thing I never understood about love is that it can't be quelled, like lust can.”
Single candidate, no ranking. Notice the hallucination in the middle.
Probability and Entropy-Based Selection
[S3] “The universe is a wild beast. You can't tame it. All you can do is try to live inside it.”
Single candidate, no ranking
ASR-based “Self-Correction”