DroFiT: A Lightweight Band-Fused Frequency Attention Toward Real-Time UAV Speech Enhancement

Submitted as a contributed paper to ICASSP 2026




Overview

We demonstrate DroFiT's speech enhancement at three UAV (drone) noise SNR settings: -25, -15, and -5 dB.

For the drone noise data, we flew a DJI Flip UAV [1] to generate and record various types of noise.

For the speech data, we used clean utterances from male and female speakers in the VoiceBank-DEMAND [2] corpus, which is derived from VoiceBank/VCTK [3].

Dataset configuration for training, validation, test, and demo sets.

| Set | Speaker | Allocation | SNR / Noise* |
|---|---|---|---|
| Train | Total 28 Speakers (VoiceBank-DEMAND) | Total 7,200 (≈257 per speaker) | -25, -10, -15, -20, -25 dB |
| Validation | Same as Train | Total 800 (≈29 per speaker) | -25, -10, -15, -20, -25 dB |
| Test | 2 Unseen speakers: p232, p257 (1 Male / 1 Female – VCTK) | Total 800 (135 per SNR) | -25, -10, -15, -20, -25, -30 dB |
| Demo in this website | 4 Unseen Speakers (2 Male / 2 Female – VCTK) | | -25, -15, -5 dB |

* Noise segments were randomly sampled from 9000s recordings to mitigate temporal correlation, and mixtures were generated using PyRoomAcoustics.
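
The random noise sampling and SNR-controlled mixing described in the footnote can be illustrated with a minimal sketch. This is not our PyRoomAcoustics pipeline: it uses only the Python standard library, skips any room simulation, and assumes that the noise (rather than the speech) is scaled to reach the target SNR. The function names, the 16 kHz sample rate, and the synthetic tone and white-noise signals are hypothetical stand-ins.

```python
import math
import random


def random_noise_segment(noise, length, rng):
    """Pick a random contiguous segment of `length` samples from a long recording."""
    start = rng.randrange(0, len(noise) - length + 1)
    return noise[start:start + length]


def power(signal):
    """Mean power of a signal."""
    return sum(x * x for x in signal) / len(signal)


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise power ratio equals `snr_db`, then add."""
    gain = math.sqrt(power(speech) / (power(noise) * 10 ** (snr_db / 10)))
    mixture = [s + gain * n for s, n in zip(speech, noise)]
    return mixture, gain


rng = random.Random(0)
sr = 16000  # assumed sample rate, for illustration only
# Hypothetical stand-ins: a 440 Hz tone as "speech", white noise as the "drone recording".
speech = [0.1 * math.sin(2 * math.pi * 440 * t / sr) for t in range(sr)]
recording = [rng.gauss(0.0, 1.0) for _ in range(5 * sr)]

for snr_db in (-5, -15, -25):
    segment = random_noise_segment(recording, len(speech), rng)
    mixture, gain = mix_at_snr(speech, segment, snr_db)
    scaled_noise = [gain * n for n in segment]
    achieved = 10 * math.log10(power(speech) / power(scaled_noise))
    print(f"target {snr_db} dB, achieved {achieved:.2f} dB")
```

Drawing a fresh random offset for every mixture is one simple way to keep consecutive mixtures from sharing the same stretch of the recording.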

The table below lists the transformer and TCN block hyperparameters used for the test and demo results.

Model hyperparameter configuration.

| Module | Hyperparameter | Value | Description |
|---|---|---|---|
| Transformer (MHA) | Model Dim | 32 | Along Frequency Axis |
| | # of Heads | 16 | |
| | # of Layers | 4 | |
| | FFN Dim | 128 | |
| TCN Block | Kernel size | 3 | Along Time Axis |
| | # of Layers | 3 | |
| | Look-ahead | 200 ms | |
| | Receptive field | 15 | |
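
As a quick sanity check on these values, the sketch below computes the receptive field of a stack of dilated 1-D convolutions and the per-head dimension of the attention. It assumes one stride-1 convolution per TCN layer with dilations doubling as 1, 2, 4, and a model dimension split evenly across heads. Neither assumption is stated in the table; the dilation schedule is simply one choice that reproduces the listed receptive field of 15. The function name is hypothetical.

```python
def tcn_receptive_field(kernel_size, dilations):
    """Receptive field (in frames) of stacked stride-1 dilated 1-D convolutions."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


# Assumption: dilation doubles per layer (1, 2, 4); the table does not state this.
dilations = [2 ** i for i in range(3)]
receptive_field = tcn_receptive_field(kernel_size=3, dilations=dilations)
print(receptive_field)  # 15

# Assumption: the model dimension is split evenly across the attention heads.
model_dim, num_heads = 32, 16
head_dim = model_dim // num_heads
print(head_dim)  # 2
```

Without dilation, the same three layers with kernel size 3 would cover only 7 frames.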

Drone Noise Recording

The following videos show the drone generating noise.

Drone approaching and moving around the microphone


Hovering in place
Hovering in place close by

Speech Enhancement Audio Demo (with STFT)

Please increase your speaker volume. The raw noisy input has a very low SNR and is normalized to the maximum volume the player supports, so the speech content can be inaudible.
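
To give a sense of how quiet the speech is in the noisy inputs, the short sketch below converts each demo SNR into the RMS amplitude of the speech relative to the noise. The function name is hypothetical, and the calculation follows only from the definition of SNR in decibels, not from our processing code.

```python
def speech_to_noise_amplitude_ratio(snr_db):
    """RMS amplitude of the speech relative to the noise at a given SNR in dB."""
    return 10 ** (snr_db / 20)


for snr_db in (-5, -15, -25):
    ratio = speech_to_noise_amplitude_ratio(snr_db)
    print(f"{snr_db} dB: speech RMS is {100 * ratio:.1f}% of noise RMS")
```

At -25 dB the speech RMS is only about 5.6% of the noise RMS, so when the mixture is normalized to the player's maximum level the speech occupies a small fraction of the available amplitude.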


Female Speaker 1

[Audio players: noisy input and DroFiT output at SNRs of -5, -15, and -25 dB (two sets)]

Male Speaker 1

[Audio players: noisy input and DroFiT output at SNRs of -5, -15, and -25 dB (two sets)]

Female Speaker 2

[Audio players: noisy input and DroFiT output at SNRs of -5, -15, and -25 dB (two sets)]

Male Speaker 2

[Audio players: noisy input and DroFiT output at SNRs of -5, -15, and -25 dB (two sets)]


References

[1] DJI Flip UAV: https://www.dji.com/kr/flip/

[2] C. Valentini-Botinhao, X. Wang, S. Takaki, and J. Yamagishi, “Noisy speech database for training speech enhancement algorithms and TTS models,” dataset, 2017.

[3] J. Yamagishi, C. Veaux, and K. MacDonald, “CSTR VCTK Corpus: English multi-speaker corpus for CSTR voice cloning toolkit,” dataset, 2019.

[4] Z. W. Tan and A. W. H. Khong, “SMoLnet-T: An efficient complex-spectral mapping speech enhancement approach with frame-wise CNN and spectral combination transformer for drone audition,” in Proc. APSIPA ASC, 2024.

[5] H. S. Choi, et al., “Phase-aware speech enhancement with deep complex U-Net,” in Proc. Int. Conf. Learning Representations, 2019.