LLMs Can Now Talk in Real Time with Minimal Latency: Chinese Researchers Release LLaMA-Omni2, a Scalable Modular Speech Language Model

Researchers from the Institute of Computing Technology, Chinese Academy of Sciences have introduced LLaMA-Omni2, a family of speech-capable large language models (SpeechLMs) now available on Hugging Face. The work presents a modular framework that enables real-time spoken dialogue by integrating speech perception and speech synthesis with language understanding. Unlike earlier cascaded systems, LLaMA-Omni2 operates as an end-to-end pipeline while retaining modular interpretability and low training cost.
Overview of the LLaMA-Omni2 architecture
LLaMA-Omni2 comprises models ranging from 0.5B to 14B parameters, each built on the Qwen2.5-Instruct series. The architecture includes:
- Speech encoder: Uses Whisper-large-v3 to convert input speech into token-level acoustic representations.
- Speech adapter: Applies a downsampling layer and a feed-forward network to the encoder output to align it with the language model's input space.
- Core LLM: The Qwen2.5 model serves as the main reasoning engine.
- Streaming TTS decoder: Converts LLM outputs into speech tokens with an autoregressive Transformer, then generates mel spectrograms through a causal flow-matching model inspired by CosyVoice 2.
A gating mechanism fuses the LLM's hidden states with the text embeddings before speech synthesis, improving contextual fidelity in the generated audio.
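To make the modular layout concrete, here is a minimal PyTorch sketch of the adapter and gate-fusion steps described above. The layer shapes, the simple sigmoid gate, and the stand-in tensors are illustrative assumptions, not the authors' exact implementation.

```python
# Minimal sketch of the LLaMA-Omni2 modular pipeline (illustrative assumptions only).
import torch
import torch.nn as nn


class SpeechAdapter(nn.Module):
    """Downsample encoder features and project them into the LLM input space."""

    def __init__(self, enc_dim: int, llm_dim: int, stride: int = 2):
        super().__init__()
        self.downsample = nn.Conv1d(enc_dim, enc_dim, kernel_size=stride, stride=stride)
        self.proj = nn.Sequential(nn.Linear(enc_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, T, enc_dim)
        x = self.downsample(x.transpose(1, 2)).transpose(1, 2)  # shorten the time axis
        return self.proj(x)                                     # (B, T // stride, llm_dim)


class GateFusion(nn.Module):
    """Blend LLM hidden states with text embeddings before speech synthesis."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, hidden: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        g = torch.sigmoid(self.gate(torch.cat([hidden, text_emb], dim=-1)))
        return g * hidden + (1 - g) * text_emb


# Toy dimensions for illustration: Whisper-large-v3 features and a Qwen2.5-7B-sized hidden state.
enc_dim, llm_dim = 1280, 3584
adapter = SpeechAdapter(enc_dim, llm_dim)
fusion = GateFusion(llm_dim)

speech_feats = torch.randn(1, 100, enc_dim)   # stand-in for Whisper encoder output
llm_inputs = adapter(speech_feats)            # would be fed to the core LLM
llm_hidden = torch.randn(1, 20, llm_dim)      # stand-in for LLM hidden states
text_emb = torch.randn(1, 20, llm_dim)        # embeddings of the generated text tokens
tts_inputs = fusion(llm_hidden, text_emb)     # representation passed to the streaming TTS decoder
print(tts_inputs.shape)                       # torch.Size([1, 20, 3584])
```

In the real system the stand-in tensors would come from the Whisper encoder, the Qwen2.5 LLM, and its text embedding table; the fused representation is what the streaming TTS decoder consumes.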
Streaming generation with a read-write strategy
The model uses a read-write scheduling strategy for streaming output: for every R text tokens generated by the LLM, the TTS decoder writes W speech tokens. This keeps text and acoustic generation synchronized, minimizing latency without compromising fluency.
Empirical findings show that setting R = 3 and W = 10 provides a favorable trade-off between latency (~583 ms), alignment (ASR-WER: 3.26), and perceived quality (UTMOS: 4.19).
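The following sketch illustrates the read-write scheduling idea under the R = 3, W = 10 setting; the generator interface and the placeholder speech-token function are assumptions for illustration, not the paper's code.

```python
# Sketch of read-write scheduling: after every R text tokens from the LLM,
# the TTS decoder emits W speech tokens. R=3, W=10 mirrors the reported setting.
from typing import Callable, Iterable, Iterator


def read_write_schedule(
    text_tokens: Iterable[str],
    speech_chunk: Callable[[list, int], list],
    r: int = 3,
    w: int = 10,
) -> Iterator[tuple]:
    """Interleave text generation ("read") with speech-token emission ("write")."""
    buffer = []
    for tok in text_tokens:
        buffer.append(tok)
        yield ("text", tok)
        if len(buffer) == r:                   # every R text tokens...
            for s in speech_chunk(buffer, w):  # ...write W speech tokens
                yield ("speech", s)
            buffer.clear()
    if buffer:                                 # flush remaining text at the end
        for s in speech_chunk(buffer, w):
            yield ("speech", s)


# Placeholder "TTS": returns w dummy speech-token ids for a text chunk.
def fake_speech_chunk(chunk: list, w: int) -> list:
    return [f"<spk_{i}>" for i in range(w)]


for kind, tok in read_write_schedule(["Hello", ",", " how", " are", " you", "?"], fake_speech_chunk):
    print(kind, tok)
```

In practice, the "write" step would call the autoregressive TTS decoder and the flow-matching model rather than a placeholder, and the emitted speech tokens would be vocoded into audio as they arrive.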
Training method
Despite its competitive performance, LLaMA-Omni2 was trained on a relatively compact corpus of 200K multi-turn speech-to-speech dialogue samples. These samples were synthesized from text instruction-following datasets (Alpaca, UltraChat), with varied input voices, and consistent output speech was generated using the FishSpeech and CosyVoice 2 models.
The training is performed in two stages (see the sketch after this list):
- Stage 1: The speech-to-text and text-to-speech modules are optimized independently.
- Stage 2: The speech-to-speech generation path is fine-tuned, including the gate fusion and autoregressive decoding components.
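As a rough illustration of this two-stage recipe, the sketch below toggles which modules receive gradients in each stage. Which components are frozen at each point is an assumption on our part, not a detail confirmed by the paper.

```python
# Illustrative two-stage training setup (module freezing choices are assumptions).
import torch


def set_trainable(module: torch.nn.Module, flag: bool) -> None:
    for p in module.parameters():
        p.requires_grad = flag


def configure_stage(stage: int, adapter, llm, fusion, tts_decoder):
    if stage == 1:
        # Stage 1: optimize the speech-to-text (adapter) and text-to-speech (decoder)
        # modules independently; the core LLM stays frozen in this sketch.
        set_trainable(adapter, True)
        set_trainable(tts_decoder, True)
        set_trainable(fusion, False)
        set_trainable(llm, False)
    else:
        # Stage 2: fine-tune the speech-to-speech path, including the gate fusion
        # and the autoregressive decoding components.
        set_trainable(adapter, False)
        set_trainable(llm, False)
        set_trainable(fusion, True)
        set_trainable(tts_decoder, True)
    trainable = [p for m in (adapter, llm, fusion, tts_decoder) for p in m.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=1e-4)


# Toy modules standing in for the real components.
adapter, llm, fusion, tts = (torch.nn.Linear(8, 8) for _ in range(4))
optimizer = configure_stage(1, adapter, llm, fusion, tts)
```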
Benchmark results
The models were evaluated on spoken question answering and related tasks in both speech-to-text (S2T) and speech-to-speech (S2S) modes; the table below reports S2S results.
| Model | Llama Questions (S2S) | Web Questions (S2S) | GPT-4o Score | ASR-WER | Latency (ms) |
|---|---|---|---|---|---|
| GLM-4-Voice (9B) | 50.7 | 15.9 | 4.09 | 3.48 | 1562.8 |
| LLaMA-Omni (8B) | 49.0 | 23.7 | 3.52 | 3.67 | 346.7 |
| LLaMA-Omni2-7B | 60.7 | 31.3 | 4.15 | 3.26 | 582.9 |
Performance scales consistently with model size. Notably, LLaMA-Omni2-14B outperforms all baselines across tasks, despite being trained on far less data than natively speech-trained models such as GLM-4-Voice.
Component Analysis
- Gate fusion module: Removing the gating mechanism increases ASR-WER and reduces speech quality, confirming its role in aligning textual and contextual signals.
- TTS pre-training: Initializing the TTS decoder from Qwen2.5 and fine-tuning it in the streaming setup yields the best performance; training from scratch fails to converge effectively.
- Read/write strategy: Adjusting the R:W ratio trades off latency against quality; larger W improves UTMOS at the cost of response latency.
Furthermore, the study shows that multi-turn dialogue data is more effective than single-turn data for training speech interaction capabilities, with performance plateauing near 200K samples.
Conclusion
LLaMA-Omni2 demonstrates that high-quality, low-latency speech interaction with LLMs is feasible without extensive pre-training on large-scale speech corpora. By combining a modular architecture with autoregressive streaming speech synthesis, the system offers a practical path toward real-time speech applications.
Check out the Paper, the Hugging Face model, and the GitHub page. Also, don't forget to follow us on Twitter.

Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of artificial intelligence for social good. His most recent endeavor is the launch of Marktechpost, an artificial intelligence media platform that provides in-depth coverage of machine learning and deep learning news in a way that is both technically sound and understandable by a wide audience. The platform draws over 2 million views per month, demonstrating its popularity among readers.