Online Self-Attentive Gated RNNs for Real-Time Speaker Separation



This post presents "Online Self-Attentive Gated RNNs for Real-Time Speaker Separation", a streaming deep model for speaker voice separation that runs in real time. The model is evaluated in both monaural and binaural settings.

We transform a state-of-the-art separation model to operate causally and in an online streaming manner on speech audio. We evaluate the model when it runs online on a stream of data against an offline model. We compare the proposed model to several baseline methods under anechoic, noisy, and noisy-reverberant recording conditions while exploring both monaural and binaural inputs and outputs. Our findings shed light on the relative difference between causal and non-causal models when performing separation. Our stateful implementation for online separation leads to a minor drop in performance compared to the offline model: 0.8 dB for monaural inputs and 0.3 dB for binaural inputs, while reaching a real-time factor of 0.65.
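The real-time factor is commonly defined as processing time divided by the duration of the audio processed, so a value below 1 means the system keeps up with the incoming stream. The sketch below assumes that definition, which may differ in detail from how the paper measures it. The function name `real_time_factor` and the `process` callback are hypothetical stand-ins, not part of the released code.

```python
import time


def real_time_factor(process, chunks, sample_rate):
    """Return processing time divided by audio duration.

    `process` is any callable that takes one chunk (a list of samples);
    it is a hypothetical stand-in for a streaming separation model.
    `chunks` is a list of sample lists, `sample_rate` is in Hz.
    """
    total_samples = sum(len(chunk) for chunk in chunks)
    if total_samples == 0:
        raise ValueError("no audio to process")

    # Time only the model calls, one chunk at a time, as in streaming use.
    start = time.perf_counter()
    for chunk in chunks:
        process(chunk)
    elapsed = time.perf_counter() - start

    audio_seconds = total_samples / sample_rate
    return elapsed / audio_seconds
```

Under this definition, a real-time factor of 0.65 would mean that one second of audio takes about 0.65 seconds to process.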



Results - binaural


[Figure: binaural results]


Results - monaural


[Figure: monaural results]

Following are samples from the model on different test datasets:

  • Mixture input - original mixed audio
  • Ground Truth - original separated samples
  • Offline - A causal SAGRNN [1] model that transforms a whole audio file at once (without online streaming)
  • Stateless - A naive implementation of the causal SAGRNN to work in streaming mode, where the model does not keep internal state between calls
  • Stateful - An implementation of the causal SAGRNN to work in streaming mode, where the model keeps an internal state between calls
* For more information about the stateless and stateful streaming modes of operation, see Section 4 (Method) of our paper.
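The difference between the three modes can be illustrated with a toy example. The sketch below is not SAGRNN: it uses a one-pole recursive filter as a hypothetical stand-in for a causal recurrent model, and all names (`causal_filter`, `run_offline`, `run_stateless`, `run_stateful`) are made up for illustration. It only shows what resetting the state or carrying it across calls means.

```python
def causal_filter(samples, state=0.0, alpha=0.9):
    """Toy causal recurrent model: y[t] = alpha * y[t-1] + (1 - alpha) * x[t].

    Returns the outputs and the final state so the caller can carry it over.
    """
    outputs = []
    for x in samples:
        state = alpha * state + (1.0 - alpha) * x
        outputs.append(state)
    return outputs, state


def run_offline(signal):
    # Process the whole signal in a single call.
    outputs, _ = causal_filter(signal)
    return outputs


def run_stateless(signal, chunk_size):
    # Streaming without memory: every chunk starts from the initial state.
    outputs = []
    for i in range(0, len(signal), chunk_size):
        chunk_out, _ = causal_filter(signal[i:i + chunk_size])
        outputs.extend(chunk_out)
    return outputs


def run_stateful(signal, chunk_size):
    # Streaming with memory: the state from one call feeds the next call.
    outputs, state = [], 0.0
    for i in range(0, len(signal), chunk_size):
        chunk_out, state = causal_filter(signal[i:i + chunk_size], state)
        outputs.extend(chunk_out)
    return outputs


signal = [1.0] * 8
offline = run_offline(signal)
stateless = run_stateless(signal, chunk_size=4)
stateful = run_stateful(signal, chunk_size=4)
```

In this toy, the stateful output matches the offline output exactly, while the stateless output restarts at every chunk boundary. The real model reports a small gap between stateful and offline operation, so the toy conveys only the idea of carrying state between calls, not the measured behavior.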



WSJ clean - binaural

Mixture input Ground Truth Offline Stateless Stateful


WSJ noisy - binaural

Mixture input Ground Truth Offline Stateless Stateful


WSJ noisy-reverberant - binaural

Mixture input Ground Truth Offline Stateless Stateful


WSJ clean - monaural

Mixture input Ground Truth Offline Stateless Stateful


WSJ noisy - monaural

Mixture input Ground Truth Offline Stateless Stateful


WSJ noisy-reverberant - monaural

Mixture input Ground Truth Offline Stateless Stateful

Code

Code will be available soon under the following repo: https://github.com/facebookresearch/svoice

References

[1] Ke Tan et al., "SAGRNN: Self-Attentive Gated RNN for Binaural Speaker Separation with Interaural Cue Preservation".