Submitted as a contributed paper to ICASSP 2026
We demonstrate DroFiT's speech enhancement at three UAV (drone) noise SNR settings: -25, -15, and -5 dB.
For the drone noise data, we flew a DJI Flip UAV [1] to record various types of noise.
For the speech dataset, we used clean utterances from male and female speakers in the VoiceBank-DEMAND corpus [2], which is derived from VoiceBank/VCTK [3].
| Module | Speaker | Allocation | SNR / Noise* |
|---|---|---|---|
| Train | Total 28 Speakers (VoiceBank-DEMAND) | Total 7,200 (≈257 per speaker) | -25, -10, -15, -20, -25 dB |
| Validation | Same as Train | Total 800 (≈29 per speaker) | -25, -10, -15, -20, -25 dB |
| Test | 2 Unseen speakers: p232, p257 (1 Male / 1 Female – VCTK) | Total 800 (135 per SNR) | -25, -10, -15, -20, -25, -30 dB |
| Demo in this website | 4 Unseen Speakers (2 Male / 2 Female – VCTK) | | -25, -15, -5 dB |
* Noise segments were randomly sampled from 9,000 s of recordings to mitigate temporal correlation, and mixtures were generated using PyRoomAcoustics.
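As a rough illustration of the footnote above, the sketch below draws a random noise segment from a long recording and scales it to reach a target SNR before adding it to the speech. It uses plain NumPy rather than PyRoomAcoustics, and the helper names (`sample_noise_segment`, `mix_at_snr`, `measured_snr_db`), the 16 kHz sample rate, and the synthetic stand-in signals are all assumptions made for illustration, not the authors' actual pipeline.

```python
import numpy as np


def sample_noise_segment(noise, length, rng):
    """Pick a random contiguous segment of `length` samples from a long recording."""
    if len(noise) < length:
        raise ValueError("noise recording is shorter than the requested segment")
    start = rng.integers(0, len(noise) - length + 1)  # upper bound is exclusive
    return noise[start:start + length]


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise power ratio equals `snr_db`, then add it."""
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # From SNR_dB = 10 * log10(P_speech / P_noise), solve for the noise power we need.
    target_noise_power = speech_power / (10.0 ** (snr_db / 10.0))
    gain = np.sqrt(target_noise_power / noise_power)
    return speech + gain * noise, gain


def measured_snr_db(speech, scaled_noise):
    """SNR in dB between a speech signal and the (already scaled) noise."""
    return 10.0 * np.log10(np.mean(speech ** 2) / np.mean(scaled_noise ** 2))


rng = np.random.default_rng(0)
sr = 16000  # assumed sample rate, not stated in the source

# Synthetic stand-ins: a 1 s tone for the clean utterance, white noise for the drone recording.
speech = 0.1 * np.sin(2 * np.pi * 220 * np.arange(sr) / sr)
recording = rng.standard_normal(10 * sr)

results = {}
for snr_db in (-5, -15, -25):
    segment = sample_noise_segment(recording, len(speech), rng)
    mixture, gain = mix_at_snr(speech, segment, snr_db)
    results[snr_db] = measured_snr_db(speech, gain * segment)
    print(f"target {snr_db} dB, measured {results[snr_db]:.2f} dB")
```

At -25 dB the noise power is roughly 316 times the speech power, which is why the speech in the noisy demo clips can be hard to hear.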
The table below lists the Transformer and TCN block hyperparameters used for the test and demo.
| Module | Hyperparameter | Value | Description |
|---|---|---|---|
| Transformer (MHA) | Model Dim | 32 | Along Frequency Axis |
| | # of Heads | 16 | |
| | # of Layers | 4 | |
| | FFN Dim | 128 | |
| TCN Block | Kernel size | 3 | Along Time Axis |
| | # of Layers | 3 | |
| | Look-ahead | 200 ms | |
| | Receptive field | 15 | |
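The short sketch below checks how two of these values relate to each other. It assumes one dilated 1-D convolution per TCN layer with dilations doubling per layer (1, 2, 4); the source does not state the dilation schedule, so this is only one configuration that happens to reproduce the listed receptive field of 15. The function names are hypothetical.

```python
def attention_head_dim(model_dim, num_heads):
    """Per-head dimension in multi-head attention; model_dim must split evenly."""
    if model_dim % num_heads != 0:
        raise ValueError("model_dim must be divisible by num_heads")
    return model_dim // num_heads


def tcn_receptive_field(kernel_size, dilations):
    """Receptive field (in frames) of stacked dilated 1-D convs, one conv per layer."""
    return 1 + (kernel_size - 1) * sum(dilations)


# Values from the table: Model Dim 32, 16 heads, kernel size 3, 3 TCN layers.
head_dim = attention_head_dim(model_dim=32, num_heads=16)

# Assumption (not stated in the source): dilation doubles per layer, i.e. 1, 2, 4.
dilations = [2 ** i for i in range(3)]
rf = tcn_receptive_field(kernel_size=3, dilations=dilations)

print(head_dim)  # 2
print(rf)        # 15
```

Under this assumption, each attention head works on 2 dimensions, and the three TCN layers together span 15 frames. Without dilation (all dilations equal to 1) the same stack would cover only 7 frames.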
Here are videos of the drone generating noises.
Video: Drone approaching and moving around the microphone.
Please increase your speaker volume: the speech can be inaudible because the low-SNR raw noisy input is normalized to the maximum volume the player supports.
Each of the eight demo samples is presented at the three SNR levels, pairing the noisy input with the DroFiT output:

| SNR [dB] | Noisy Audio | DroFiT |
|---|---|---|
| -5 | [audio] | [audio] |
| -15 | [audio] | [audio] |
| -25 | [audio] | [audio] |
[1] DJI Flip UAV: https://www.dji.com/kr/flip/
[2] C. Valentini-Botinhao, X. Wang, S. Takaki, and J. Yamagishi, “Noisy speech database for training speech enhancement algorithms and TTS models,” dataset, 2017.
[3] J. Yamagishi, C. Veaux, and K. MacDonald, “CSTR VCTK Corpus: English multi-speaker corpus for CSTR voice cloning toolkit,” dataset, 2019.
[4] Z. W. Tan and A. W. H. Khong, “SMoLnet-T: An efficient complex-spectral mapping speech enhancement approach with frame-wise CNN and spectral combination transformer for drone audition,” in Proc. APSIPA ASC, 2024.
[5] H. S. Choi, et al., “Phase-aware speech enhancement with deep complex U-Net,” in Proc. Int. Conf. Learning Representations, 2019.