This post presents "Online Self-Attentive Gated RNNs for Real-Time Speaker Separation", a streaming deep model for speaker voice separation that runs in real time, evaluated in both monaural and binaural settings.
We transform a state-of-the-art separation model [1] to operate causally, in an online streaming manner, on audio speech. We evaluate the performance of the model working online on a stream of data against an offline model, and compare the proposed model to several baseline methods under anechoic, noisy, and noisy-reverberant recording conditions, exploring both monaural and binaural inputs and outputs. Our findings shed light on the relative difference between causal and non-causal models when performing separation. Our stateful implementation for online separation leads to a minor drop in performance compared to the offline model: 0.8 dB for monaural inputs and 0.3 dB for binaural inputs, while reaching a real-time factor of 0.65.
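To make the stateful/stateless distinction concrete, below is a minimal PyTorch sketch of chunked online inference. It is not the SAGRNN architecture from the paper: the `StreamingSeparator` module, the `separate_stream` helper, and all dimensions are illustrative assumptions. The only point it demonstrates is the handling of recurrent state: the stateful variant carries the LSTM state across chunk boundaries, while the stateless variant resets it on every chunk and therefore discards all past context.

```python
import time
import torch
import torch.nn as nn


class StreamingSeparator(nn.Module):
    """Toy causal separator: a unidirectional LSTM plus a mask head.
    A stand-in for the full SAGRNN stack, which is not shown here."""

    def __init__(self, feat_dim=64, hidden_dim=128, n_speakers=2):
        super().__init__()
        self.n_speakers = n_speakers
        self.rnn = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, feat_dim * n_speakers)

    def forward(self, chunk, state=None):
        # chunk: (batch, frames, feat_dim); state: LSTM (h, c) or None
        out, state = self.rnn(chunk, state)
        masks = torch.sigmoid(self.head(out))
        b, t, _ = masks.shape
        masks = masks.view(b, t, self.n_speakers, -1)
        # per-speaker masks applied to the mixture features
        return masks * chunk.unsqueeze(2), state


def separate_stream(model, chunks, stateful=True):
    """Chunk-by-chunk inference. With stateful=True the recurrent state is
    carried across chunk boundaries; with stateful=False it is reset on
    every chunk, so all context before the current chunk is lost."""
    state, outputs = None, []
    with torch.no_grad():
        for chunk in chunks:
            est, new_state = model(chunk, state)
            if stateful:
                state = new_state  # keep context for the next chunk
            outputs.append(est)
    return torch.cat(outputs, dim=1)


if __name__ == "__main__":
    model = StreamingSeparator().eval()
    # pretend this is 1 second of audio features, arriving in 10 chunks
    stream = [torch.randn(1, 100, 64) for _ in range(10)]

    start = time.perf_counter()
    est = separate_stream(model, stream, stateful=True)
    elapsed = time.perf_counter() - start

    print(est.shape)  # torch.Size([1, 1000, 2, 64]): (batch, frames, speakers, feats)
    # real-time factor = processing time / audio duration; values below
    # 1.0 mean the separator keeps up with the incoming stream
    print(f"RTF: {elapsed / 1.0:.2f}")
```

The RTF printout illustrates the metric quoted above: processing time divided by audio duration, so a real-time factor of 0.65 means one second of audio is processed in 0.65 seconds.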
Below are samples from the model on different test datasets:
*(Six audio-sample tables, one per test dataset, each comparing: Mixture input | Ground Truth | Offline | Stateless | Stateful.)*
Code will be available soon in the following repository: https://github.com/facebookresearch/svoice
[1] Ke Tan et al., "SAGRNN: Self-Attentive Gated RNN for Binaural Speaker Separation with Interaural Cue Preservation".