Overview

We demonstrate DroFiT's speech enhancement at three UAV (drone) noise SNR settings: -25, -15, and -5 dB.

For the drone noise data, we flew a DJI Flip UAV [1] and recorded the various types of noise it generated.

For the speech data, we used clean male and female utterances from the VoiceBank-DEMAND corpus [2], which is derived from VoiceBank/VCTK [3].

Dataset configuration for training, validation, test, and demo sets.
Split	Speaker	Allocation (utterances)	SNR / Noise*
Train	Total 28 Speakers (VoiceBank-DEMAND)	Total 7,200 (≈257 per speaker)	-5, -10, -15, -20, -25 dB
Validation	Same as Train	Total 800 (≈29 per speaker)	-5, -10, -15, -20, -25 dB
Test	2 Unseen Speakers: p232, p257 (1 Male / 1 Female – VCTK)	Total 800 (135 per SNR)	-5, -10, -15, -20, -25, -30 dB
Demo on this website	4 Unseen Speakers (2 Male / 2 Female – VCTK)		-25, -15, -5 dB

* Noise segments were randomly sampled from 9,000 s of recordings to mitigate temporal correlation, and mixtures were generated using PyRoomAcoustics.
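The footnote above describes random noise-segment sampling followed by mixing at a target SNR. The following is a minimal sketch of that idea in plain Python, not the PyRoomAcoustics pipeline used for the actual data. The function names, the 16 kHz sample rate, and the synthetic tone and Gaussian noise standing in for speech and drone recordings are all assumptions made for illustration.

```python
import math
import random


def rms(x):
    """Root-mean-square level of a sequence of samples."""
    return math.sqrt(sum(v * v for v in x) / len(x))


def random_segment(noise, length, rng):
    """Pick a random contiguous segment of `length` samples from `noise`."""
    start = rng.randrange(0, len(noise) - length + 1)
    return noise[start:start + length]


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise ratio equals `snr_db`, then add it.

    Returns the mixture and the gain that was applied to the noise.
    """
    gain = rms(speech) / (rms(noise) * 10 ** (snr_db / 20))
    mixture = [s + gain * n for s, n in zip(speech, noise)]
    return mixture, gain


rng = random.Random(0)
sr = 16000  # assumed sample rate, not stated in the source

# Stand-ins for real data: a 1 s tone as "speech", 5 s of Gaussian noise
# as the long "drone recording" (both hypothetical).
speech = [math.sin(2 * math.pi * 220 * t / sr) for t in range(sr)]
noise_recording = [rng.gauss(0.0, 1.0) for _ in range(5 * sr)]

segment = random_segment(noise_recording, len(speech), rng)
noisy, gain = mix_at_snr(speech, segment, snr_db=-15)

# Verify the achieved SNR from the scaled noise component.
scaled_noise = [gain * n for n in segment]
achieved_snr_db = 20 * math.log10(rms(speech) / rms(scaled_noise))
```

In practice the samples would be NumPy arrays loaded from audio files; lists are used here only to keep the sketch dependency-free.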

For the Transformer and TCN block hyperparameters used for the test and demo sets, see the table below.

Model hyperparameter configuration.
Module	Hyperparameter	Value	Description
Transformer (MHA)	Model Dim	32	Along Frequency Axis
	# of Heads	16
	# of Layers	4
	FFN Dim	128
TCN Block	Kernel size	3	Along Time Axis
	# of Layers	3
	Look-ahead	200 ms
	Receptive field	15
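As a sanity check on the table above, the receptive field of a stack of stride-1 1-D convolutions can be computed from the kernel size and per-layer dilations. The sketch below assumes the three TCN layers use dilations that double per layer (1, 2, 4); the page does not state the dilation pattern, but this assumption reproduces the listed receptive field of 15. The per-head dimension likewise assumes the usual even split of the model dimension across attention heads.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field (in frames) of stacked stride-1 1-D convolutions."""
    return 1 + (kernel_size - 1) * sum(dilations)


# Values from the hyperparameter table.
kernel_size = 3
num_layers = 3

# Assumption: dilation doubles per layer (1, 2, 4); not stated in the source.
dilations = [2 ** i for i in range(num_layers)]
rf = receptive_field(kernel_size, dilations)  # 1 + 2 * (1 + 2 + 4) = 15

# Without dilation, the same stack would cover only 1 + 2 * 3 = 7 frames.
rf_undilated = receptive_field(kernel_size, [1] * num_layers)

# Assumption: the model dimension is split evenly across heads.
model_dim, num_heads = 32, 16
head_dim = model_dim // num_heads  # 2
```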

References

[1] DJI Flip UAV: https://www.dji.com/kr/flip/

[2] C. Valentini-Botinhao, X. Wang, S. Takaki, and J. Yamagishi, “Noisy speech database for training speech enhancement algorithms and TTS models,” dataset, 2017.

[3] J. Yamagishi, C. Veaux, and K. MacDonald, “CSTR VCTK Corpus: English multi-speaker corpus for CSTR voice cloning toolkit,” dataset, 2019.

[4] Z. W. Tan and A. W. H. Khong, “SMoLnet-T: An efficient complex-spectral mapping speech enhancement approach with frame-wise CNN and spectral combination transformer for drone audition,” in Proc. APSIPA ASC, 2024.

[5] H. S. Choi et al., “Phase-aware speech enhancement with deep complex U-Net,” in Proc. Int. Conf. Learning Representations, 2019.

SNR [dB]	Noisy Audio	DroFiT
-5
-15
-25

DroFiT: A Lightweight Band-Fused Frequency Attention Toward Real-Time UAV Speech Enhancement

Drone Noise Recording

Speech Enhancement Audio Demo (with STFT)

Female Speaker 1

Male Speaker 1

Female Speaker 2

Male Speaker 2
