This post presents "Online Self-Attentive Gated RNNs for Real-Time Speaker Separation", a streaming deep model for speaker voice separation that runs in real time, evaluated in both monaural and binaural settings.
We transform a state-of-the-art separation model [1] to operate causally, in an online streaming manner, on audio speech. We evaluate the performance of the model working online on a stream of data against an offline model, and compare the proposed model to several baseline methods under anechoic, noisy, and noisy-reverberant recording conditions, exploring both monaural and binaural inputs and outputs. Our findings shed light on the relative difference between causal and non-causal models when performing separation. Our stateful implementation for online separation leads to a minor drop in performance compared to the offline model: 0.8 dB for monaural inputs and 0.3 dB for binaural inputs, while reaching a real-time factor of 0.65.
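To make the stateful/stateless distinction concrete, below is a minimal PyTorch sketch of chunked online inference. It is not the SAGRNN architecture from the paper: the `StreamingSeparator` module, the `separate_stream` helper, and all dimensions are illustrative assumptions. The only point it demonstrates is the handling of recurrent state: the stateful variant carries the LSTM state across chunk boundaries, while the stateless variant resets it on every chunk and therefore discards all past context.

```python
import time
import torch
import torch.nn as nn


class StreamingSeparator(nn.Module):
    """Toy causal separator: a unidirectional LSTM plus a mask head.
    A stand-in for the full SAGRNN stack, which is not shown here."""

    def __init__(self, feat_dim=64, hidden_dim=128, n_speakers=2):
        super().__init__()
        self.n_speakers = n_speakers
        self.rnn = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, feat_dim * n_speakers)

    def forward(self, chunk, state=None):
        # chunk: (batch, frames, feat_dim); state: LSTM (h, c) or None
        out, state = self.rnn(chunk, state)
        masks = torch.sigmoid(self.head(out))
        b, t, _ = masks.shape
        masks = masks.view(b, t, self.n_speakers, -1)
        # per-speaker masks applied to the mixture features
        return masks * chunk.unsqueeze(2), state


def separate_stream(model, chunks, stateful=True):
    """Chunk-by-chunk inference. With stateful=True the recurrent state is
    carried across chunk boundaries; with stateful=False it is reset on
    every chunk, so all context before the current chunk is lost."""
    state, outputs = None, []
    with torch.no_grad():
        for chunk in chunks:
            est, new_state = model(chunk, state)
            if stateful:
                state = new_state  # keep context for the next chunk
            outputs.append(est)
    return torch.cat(outputs, dim=1)


if __name__ == "__main__":
    model = StreamingSeparator().eval()
    # pretend this is 1 second of audio features, arriving in 10 chunks
    stream = [torch.randn(1, 100, 64) for _ in range(10)]

    start = time.perf_counter()
    est = separate_stream(model, stream, stateful=True)
    elapsed = time.perf_counter() - start

    print(est.shape)  # torch.Size([1, 1000, 2, 64]): (batch, frames, speakers, feats)
    # real-time factor = processing time / audio duration; values below
    # 1.0 mean the separator keeps up with the incoming stream
    print(f"RTF: {elapsed / 1.0:.2f}")
```

The RTF printout illustrates the metric quoted above: processing time divided by audio duration, so a real-time factor of 0.65 means one second of audio is processed in 0.65 seconds.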
Below are samples from the model on different test datasets:
*(Six audio-sample tables, one per test dataset, each comparing: Mixture input | Ground Truth | Offline | Stateless | Stateful.)*
Code will be available soon in the following repository: https://github.com/facebookresearch/svoice
[1] Ke Tan et al., "SAGRNN: Self-Attentive Gated RNN for Binaural Speaker Separation with Interaural Cue Preservation".