Multi-Cue Guided Semi-Supervised Learning toward Target Speaker Separation in Real Environments

Authors: Jiaming Xu, Jian Cui, Yunzhe Hao, Bo Xu
Abstract: To address the cocktail party problem in real multi-talker environments, this paper proposes a multi-cue guided semi-supervised target speaker separation method (MuSS). MuSS integrates three target-speaker-related cues: spatial, visual, and voiceprint cues. Under the guidance of these cues, the target speaker is separated into a predefined output channel, while the interfering sources are separated into the other output channels under the optimal permutation. Both synthetic and real mixtures are used for semi-supervised training. Specifically, for synthetic mixtures, the separated target source and the other separated interfering sources are trained to reconstruct the ground-truth references; for real mixtures, the mixture of two real mixtures is fed into our separation model, and the separated sources are remixed to reconstruct the two real mixtures. In addition, to facilitate fine-tuning and evaluation of the estimated sources on real mixtures, we introduce RealMuSS, a real multi-modal speech separation dataset collected in real-world scenarios, comprising more than one hundred hours of multi-talker mixtures with high-quality pseudo references of the target speakers. Experimental results show that the pseudo references effectively improve fine-tuning efficiency and enable evaluation of the estimated speech on real mixtures, and that various cue-driven separation models are substantially improved in signal-to-noise ratio and speech recognition accuracy under our semi-supervised learning framework.
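To make the semi-supervised objective on real mixtures concrete, below is a minimal sketch of the mixture-of-mixtures remixing step described in the abstract, written in PyTorch. The function name, tensor shapes, and the L1 reconstruction loss are our own illustrative assumptions, not the paper's implementation.

```python
import itertools
import torch

def remix_reconstruction_loss(est_sources, mix1, mix2):
    """Sketch of the remixing loss for real mixtures (illustrative assumption,
    not the paper's code): the model separates mix1 + mix2 into several
    sources, every binary assignment of the sources to the two real mixtures
    is tried, and the assignment whose remixes best reconstruct mix1 and mix2
    gives the training loss.

    est_sources: (B, N, T) sources estimated from the mixture of mixtures
    mix1, mix2:  (B, T) the two real mixtures
    """
    _, n_src, _ = est_sources.shape
    best = None
    for assign in itertools.product([0, 1], repeat=n_src):
        mask = torch.tensor(assign, dtype=est_sources.dtype,
                            device=est_sources.device).view(1, n_src, 1)
        remix1 = (est_sources * (1 - mask)).sum(dim=1)  # sources assigned to mix1
        remix2 = (est_sources * mask).sum(dim=1)        # sources assigned to mix2
        loss = (remix1 - mix1).abs().mean() + (remix2 - mix2).abs().mean()
        best = loss if best is None else torch.minimum(best, loss)
    return best
```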

Our Proposed Method

The figure below shows our proposed multi-cue guided semi-supervised target speaker separation architecture. The separation model is guided by three speaker-related cues: spatial, visual, and voiceprint cues. The training procedure is divided into three stages: Stage 1 and Stage 2 are supervised learning on synthetic mixtures, and Stage 3 is semi-supervised learning on real mixtures. In Stage 3, different proportions of pseudo references are fed in to fine-tune the estimation of the target source.
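As a rough illustration of the supervised stages (Stage 1 and Stage 2) on synthetic mixtures, the sketch below keeps the target speaker in a fixed output channel and assigns the interfering sources to the remaining channels with the best permutation. The tensor shapes, names, and the simple L1 loss are assumptions for illustration only; the paper's actual loss and model are described in the text above.

```python
import itertools
import torch
import torch.nn.functional as F

def supervised_separation_loss(est_sources, target_ref, interf_refs, loss_fn=F.l1_loss):
    """Sketch of a supervised loss for synthetic mixtures (not the paper's exact loss).

    est_sources: (B, 1+K, T) model outputs; channel 0 is the predefined
                 target-speaker channel, channels 1..K hold interference.
    target_ref:  (B, T) ground-truth target speech.
    interf_refs: (B, K, T) ground-truth interfering sources.
    """
    # The target speaker is always trained against the predefined channel 0.
    target_loss = loss_fn(est_sources[:, 0], target_ref)

    # The interfering sources use the best (permutation-invariant) assignment
    # to the remaining output channels.
    n_interf = interf_refs.shape[1]
    best = None
    for perm in itertools.permutations(range(n_interf)):
        loss = sum(loss_fn(est_sources[:, 1 + c], interf_refs[:, i])
                   for i, c in enumerate(perm)) / n_interf
        best = loss if best is None else torch.minimum(best, loss)
    return target_loss + best
```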

Demos   (See More Samples)

Description: There are 15 models in total: our MuSS with 10 possible cue combinations and 5 baseline models. The predicted waveforms and their evaluations on English and Mandarin mixtures are listed below; each table cell reports WER (%) / SI-SDR (dB).
Note: VP: Voiceprint Cue; VIS: Visual Cue; SP: Spatial Cue; BI: Binaural Input
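For context on the numbers below: WER is measured against the transcript, and SI-SDR on real mixtures is presumably computed against the pseudo reference (as described in the abstract). A generic NumPy sketch of the SI-SDR metric (not this project's evaluation script) looks like this:

```python
import numpy as np

def si_sdr(est, ref, eps=1e-8):
    """Scale-invariant SDR in dB between an estimated and a reference signal (1-D arrays)."""
    est = est - est.mean()
    ref = ref - ref.mean()
    # Project the estimate onto the reference to get the scaled target component.
    target = (np.dot(est, ref) / (np.dot(ref, ref) + eps)) * ref
    noise = est - target
    return 10 * np.log10((np.sum(target ** 2) + eps) / (np.sum(noise ** 2) + eps))
```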

| Model | Scenario One | Scenario Two | Scenario Three |
| --- | --- | --- | --- |
| Mixture | 160 / 2.35 | 62.5 / 3.44 | 310 / -0.56 |
| Pseudo Reference | 0 / N/A | 0 / N/A | 0 / N/A |
| Our MuSS with BI-SP-VIS-VP | 0 / 20.39 | 0 / 16.76 | 0 / 9.81 |
| Our MuSS with BI-SP-VIS | 0 / 19.94 | 6.25 / 16.8 | 340 / -3.27 |
| Our MuSS with BI-SP-VP | 0 / 19.32 | 6.25 / 15.51 | 0 / 13.38 |
| Our MuSS with BI-SP | 0 / 19.16 | 6.25 / 16.27 | 350 / -16.99 |
| Our MuSS with BI-VIS-VP | 0 / 17.68 | 0 / 16.25 | 0 / 16.33 |
| Our MuSS with BI-VIS | 0 / 18.98 | 6.25 / 17.22 | 0 / 15.38 |
| Our MuSS with BI-VP | 20 / 17.42 | 6.25 / 15.62 | 0 / 15.96 |
| Our MuSS with VIS-VP | 0 / 11.05 | 56.25 / 5.76 | 0 / 11.22 |
| Our MuSS with VIS | 0 / 11.61 | 56.25 / 7.29 | 0 / 11.33 |
| Our MuSS with VP | 40 / 9.73 | 56.25 / 6.68 | 0 / 11.24 |
| TasNet-MixIT (BI) | 0 / 14.59 | 6.25 / 11.27 | 0 / 11.32 |
| TasNet-MixIT | 40 / 9.06 | 56.25 / 4.41 | 0 / 11.19 |
| Conformer-Large | 0 / 11.36 | 0 / 10.08 | 0 / 9.94 |
| Conformer-Base | 40 / 7.82 | 0 / 9.37 | 0 / 13.83 |
| Sepformer | 20 / 8.53 | 56.25 / 1.5 | 0 / 6.47 |

Dataset



RealMuSS

A 100+ hour multi-modal English and Mandarin dataset for acoustic fine-tuning and evaluation of speech separation.


The data is recorded in a variety of real scenarios.

The ratios of male to female speakers, Mandarin to English utterances, and recordings with the target speaker facing the device versus deviating at an angle are roughly balanced. In addition, the target speaker's reading and dialogue scenes account for 75% and 25% of the data, respectively. The statistics of our RealMuSS dataset are summarized as follows, where M/F SPK NUM denotes the number of male/female speakers, MAX/MIN LEN the maximum/minimum clip length, WER OF MIX/REF the WER of the mixture/pseudo reference, H hours, and S seconds.

We are preparing the RealMuSS dataset, which is scheduled for official release soon.