SoloSpeech: Enhancing Intelligibility and Quality in Target Speech Extraction through a Cascaded Generative Pipeline
Abstract
SoloSpeech, a cascaded generative pipeline, improves target speech extraction and speech separation by addressing artifact introduction, naturalness reduction, and environment mismatches, achieving state-of-the-art intelligibility and quality.
Target Speech Extraction (TSE) aims to isolate a target speaker's voice from a mixture of multiple speakers by leveraging speaker-specific cues, typically provided as auxiliary audio (a.k.a. cue audio). Recent advances in TSE have primarily employed discriminative models, which offer high perceptual quality but often introduce unwanted artifacts, reduce naturalness, and are sensitive to discrepancies between training and testing environments. Generative models for TSE, on the other hand, lag in perceptual quality and intelligibility. To address these challenges, we present SoloSpeech, a novel cascaded generative pipeline that integrates compression, extraction, reconstruction, and correction processes. SoloSpeech features a speaker-embedding-free target extractor that utilizes conditional information from the cue audio's latent space, aligning it with the mixture audio's latent space to prevent mismatches. Evaluated on the widely used Libri2Mix dataset, SoloSpeech achieves new state-of-the-art intelligibility and quality in target speech extraction and speech separation tasks, while demonstrating exceptional generalization to out-of-domain data and real-world scenarios.
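The abstract describes a four-stage cascade (compression, extraction, reconstruction, correction) in which the extractor is conditioned on the cue audio's latent sequence rather than a fixed speaker embedding, so the condition lives in the same latent space as the mixture. Below is a minimal, hypothetical PyTorch sketch of how such a cascade could be wired together; the module internals (a convolutional compressor, a cross-attention extractor, a residual corrector) and all names, shapes, and hyperparameters are illustrative assumptions, not the authors' actual architecture.

```python
# Hypothetical sketch of the cascade described in the abstract:
# compression -> extraction (in latent space, conditioned on cue latents)
# -> reconstruction -> correction. All internals are placeholders.
import torch
import torch.nn as nn


class AudioCompressor(nn.Module):
    """Placeholder audio-to-latent compressor (stand-in for a neural codec/VAE)."""
    def __init__(self, dim=64):
        super().__init__()
        self.encode = nn.Conv1d(1, dim, kernel_size=16, stride=8, padding=4)
        self.decode = nn.ConvTranspose1d(dim, 1, kernel_size=16, stride=8, padding=4)


class TargetExtractor(nn.Module):
    """Placeholder speaker-embedding-free extractor: conditions on the cue
    audio's latent sequence, which shares the mixture's latent space."""
    def __init__(self, dim=64):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, mix_latent, cue_latent):
        # mix_latent, cue_latent: (batch, time, dim)
        attended, _ = self.cross_attn(query=mix_latent, key=cue_latent, value=cue_latent)
        return self.proj(mix_latent + attended)


class Corrector(nn.Module):
    """Placeholder correction stage that refines the reconstructed waveform."""
    def __init__(self):
        super().__init__()
        self.refine = nn.Conv1d(1, 1, kernel_size=9, padding=4)

    def forward(self, wav):
        return wav + self.refine(wav)


def extract_target(mixture, cue, compressor, extractor, corrector):
    """Cascade: compress -> extract in latent space -> reconstruct -> correct."""
    mix_latent = compressor.encode(mixture).transpose(1, 2)   # (B, T', D)
    cue_latent = compressor.encode(cue).transpose(1, 2)       # same latent space as mixture
    target_latent = extractor(mix_latent, cue_latent)
    target_wav = compressor.decode(target_latent.transpose(1, 2))  # reconstruction
    return corrector(target_wav)                                   # correction


if __name__ == "__main__":
    compressor, extractor, corrector = AudioCompressor(), TargetExtractor(), Corrector()
    mixture = torch.randn(1, 1, 16000)  # 1 s of 16 kHz mixture audio
    cue = torch.randn(1, 1, 16000)      # enrollment / cue audio of the target speaker
    out = extract_target(mixture, cue, compressor, extractor, corrector)
    print(out.shape)  # torch.Size([1, 1, 16000])
```

The key design point illustrated here is that the cue is encoded with the same compressor as the mixture, so the extractor's condition and input share one latent space, which is how the abstract frames the avoidance of train/test mismatches.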
Community
The following related papers were recommended by the Semantic Scholar API:
- FlowTSE: Target Speaker Extraction with Flow Matching (2025)
- Plug-and-Play Co-Occurring Face Attention for Robust Audio-Visual Speaker Extraction (2025)
- DiTSE: High-Fidelity Generative Speech Enhancement via Latent Diffusion Transformers (2025)
- FlowSE: Efficient and High-Quality Speech Enhancement via Flow Matching (2025)
- Unified Architecture and Unsupervised Speech Disentanglement for Speaker Embedding-Free Enrollment in Personalized Speech Enhancement (2025)
- C2/AV-TSE: Context and Confidence-aware Audio Visual Target Speaker Extraction (2025)
- LauraTSE: Target Speaker Extraction using Auto-Regressive Decoder-Only Language Models (2025)