Tencent AI Lab Releases Unified Audio-Language Model Covo-Audio
Tencent AI Lab has announced the release of Covo-Audio, a new 7B-parameter model designed to unify speech processing and language intelligence. This model directly processes continuous audio inputs and generates audio outputs within a single architecture, eliminating the need for cascaded Automatic Speech Recognition (ASR), Large Language Model (LLM), and Text-to-Speech (TTS) pipelines. This approach reduces error propagation and information loss.
The system is built upon Qwen2.5-7B-Base, adapted to handle interleaved sequences of continuous acoustic features and textual tokens. To bridge the audio encoder and the LLM, a specialized adapter applies three downsampling modules, reducing the frame rate from 50 Hz to 6.25 Hz. The tokenizer, based on WavLM-large, generates discrete audio tokens at 25 Hz with a codebook of 16,384 entries. The decoder, built on a Flow-Matching (FM) framework with a BigVGAN vocoder, reconstructs high-fidelity 24 kHz waveforms.
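The exact adapter design is not published; the sketch below shows one plausible reading, in which three stacked stride-2 convolution stages halve the frame rate three times (50 Hz → 25 Hz → 12.5 Hz → 6.25 Hz) before projecting into the LLM's embedding space. The module structure and the names `AudioAdapter`, `enc_dim`, and `llm_dim` are illustrative assumptions, not the released architecture.

```python
import torch
import torch.nn as nn

class AudioAdapter(nn.Module):
    """Hypothetical bridge between the audio encoder and the LLM.

    Three strided Conv1d stages each halve the frame rate:
    50 Hz -> 25 Hz -> 12.5 Hz -> 6.25 Hz (an 8x reduction overall).
    """

    def __init__(self, enc_dim: int = 1024, llm_dim: int = 3584):
        super().__init__()
        self.stages = nn.Sequential(
            nn.Conv1d(enc_dim, enc_dim, kernel_size=3, stride=2, padding=1),
            nn.GELU(),
            nn.Conv1d(enc_dim, enc_dim, kernel_size=3, stride=2, padding=1),
            nn.GELU(),
            nn.Conv1d(enc_dim, enc_dim, kernel_size=3, stride=2, padding=1),
            nn.GELU(),
        )
        # Project into the LLM embedding space (3584 is Qwen2.5-7B's hidden size).
        self.proj = nn.Linear(enc_dim, llm_dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, frames @ 50 Hz, enc_dim)
        x = self.stages(feats.transpose(1, 2)).transpose(1, 2)
        return self.proj(x)  # (batch, frames @ 6.25 Hz, llm_dim)

# 10 s of 50 Hz features -> 500 frames in, 63 frames (~6.25 Hz) out
out = AudioAdapter()(torch.randn(1, 500, 1024))
print(out.shape)  # torch.Size([1, 63, 3584])
```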
Hierarchical Tri-modal Speech-Text Interleaving Strategy
A core innovation in this work is the Hierarchical Tri-modal Speech-Text Interleaving strategy. This framework aligns continuous acoustic features (a_c), discrete speech tokens (a_d), and natural language text (t). It operates through two primary patterns: Sequential Interleaving (a_c → t → a_d), where features, text, and tokens form a progressive chain, and Parallel Integration (a_c → t | a_d), where continuous features align with a coupled text-discrete unit block. The hierarchical aspect ensures structural coherence via phrase-level interleaving for fine-grained alignment and sentence-level interleaving to preserve global semantic integrity in long-form utterances.
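As a schematic of how these two patterns could arrange a training sequence (the segment granularity, field names, and flat-list representation are illustrative assumptions, not Covo-Audio's actual data format):

```python
# Schematic construction of the two tri-modal interleaving patterns.
# A "segment" may be a phrase (fine-grained alignment) or a sentence
# (long-form coherence), matching the hierarchical levels described above.

def sequential_interleave(segments):
    """a_c -> t -> a_d: each segment's continuous features, then its
    transcript tokens, then its discrete speech tokens, as a chain."""
    seq = []
    for seg in segments:
        seq += seg["acoustic_feats"]   # a_c: continuous frames (6.25 Hz)
        seq += seg["text_tokens"]      # t:   transcript tokens
        seq += seg["speech_tokens"]    # a_d: discrete units (25 Hz)
    return seq

def parallel_interleave(segments):
    """a_c -> t | a_d: continuous features followed by a coupled
    text / discrete-unit block covering the same span. A real
    implementation would time-align the two streams; zip() here is
    only a stand-in for that coupling."""
    seq = []
    for seg in segments:
        seq += seg["acoustic_feats"]
        for t_tok, a_tok in zip(seg["text_tokens"], seg["speech_tokens"]):
            seq += [t_tok, a_tok]
    return seq
```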
The training process involved a two-stage pre-training pipeline covering a total of 2T tokens. To address the challenge of creating large-scale dialogue data for specific speakers, the research team introduced an Intelligence Speaker Decoupling strategy that separates dialogue intelligence from voice rendering. High-quality TTS recordings are reformulated into pseudo-conversations trained with a masked text loss, which preserves reasoning abilities while inheriting the speaker's naturalness and enables flexible voice customization from minimal TTS data.
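A minimal sketch of how such a masked text loss can be implemented, assuming the standard PyTorch convention of setting ignored label positions to -100 so the reformulated text spans contribute no gradient; the span boundaries and the helper name `build_labels` are hypothetical:

```python
import torch

IGNORE_INDEX = -100  # PyTorch cross-entropy ignores targets with this value

def build_labels(token_ids: torch.Tensor, text_spans) -> torch.Tensor:
    """Hypothetical masked-text-loss labels for a pseudo-conversation
    built from a TTS recording: the reformulated text is excluded from
    the loss, so only the speech-side targets train the model."""
    labels = token_ids.clone()
    for start, end in text_spans:          # spans covering masked text
        labels[start:end] = IGNORE_INDEX   # no gradient from these tokens
    return labels

ids = torch.arange(12)
labels = build_labels(ids, [(0, 5)])       # mask the first 5 text tokens
print(labels)  # tensor([-100, -100, -100, -100, -100, 5, 6, ..., 11])
```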
Covo-Audio-Chat-FD Enhances Conversational Capabilities and Performance Metrics
Covo-Audio has evolved into Covo-Audio-Chat-FD, a full-duplex variant supporting simultaneous listening and speaking. This version employs dedicated control tokens (THINK, SHIFT, and BREAK) to manage complex real-time dynamics, including smooth turn-taking, backchanneling, and user barge-ins. The audio encoder operates in a chunk-streaming manner, with the user and model streams interleaved at a 1:4 ratio, where each chunk represents 0.16 s of audio.
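Notably, the stated rates are mutually consistent: 0.16 s corresponds to exactly 1 frame at the encoder's 6.25 Hz output and 4 tokens at the tokenizer's 25 Hz, which would produce the 1:4 interleave. The sketch below illustrates that layout; the names and per-chunk ordering are assumptions.

```python
# Per 0.16 s chunk, the user stream contributes 1 continuous frame
# (0.16 s * 6.25 Hz) and the model stream 4 discrete tokens
# (0.16 s * 25 Hz), giving the 1:4 interleave described above.
USER_PER_CHUNK, MODEL_PER_CHUNK = 1, 4

def interleave(user_frames, model_tokens):
    """Yield chunks of [1 user frame, 4 model tokens] in stream order."""
    n_chunks = min(len(user_frames) // USER_PER_CHUNK,
                   len(model_tokens) // MODEL_PER_CHUNK)
    for i in range(n_chunks):
        yield user_frames[i:i + 1] + model_tokens[4 * i:4 * i + 4]

user = [f"u{i}" for i in range(5)]    # 5 frames  ~= 0.8 s of user audio
model = [f"m{i}" for i in range(20)]  # 20 tokens ~= 0.8 s of model audio
print(next(interleave(user, model)))  # ['u0', 'm0', 'm1', 'm2', 'm3']
```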
Despite its compact 7B scale, Covo-Audio posts competitive or superior results on several benchmarks. On the MMAU benchmark, it achieved an average score of 75.30%, the highest among evaluated 7B-scale models, excelling in music understanding with a score of 76.05%. On the MMSU benchmark, it achieved a leading 66.64% average accuracy. For empathetic interaction on the Mandarin track of the VStyle benchmark, it achieved state-of-the-art results for anger (4.89), sadness (4.93), and anxiety (5.00). Covo-Audio-Chat also performed strongly on URO-Bench, outperforming models such as Qwen3-Omni on the Chinese track in speech reasoning and spoken dialogue tasks. The model frequently matches or exceeds larger systems, such as 32B-parameter models, in audio and speech understanding tasks.