CLC number: TP391.4
On-line Access: 2025-05-06
Received: 2024-01-27
Revision Accepted: 2024-06-27
Crosschecked: 2025-05-06
Lijian GAO, Qing ZHU, Yaxin SHEN, Qirong MAO, Yongzhao ZHAN. Dynamic prompting class distribution optimization for semi-supervised sound event detection[J]. Frontiers of Information Technology & Electronic Engineering, 2025, 26(4): 556-567.
Abstract: Semi-supervised sound event detection (SSED) typically leverages large amounts of unlabeled and synthetic data to improve model generalization during training and to reduce overfitting on the limited labeled data. However, this generalization training is often hampered by noisy interference introduced by pseudo-labels or by domain knowledge gaps. To alleviate such noisy interference in class distribution learning, we propose an efficient semi-supervised class distribution learning method based on dynamic prompt tuning, named prompting class distribution optimization (PADO). Specifically, when modeling the real labeled data, PADO dynamically incorporates independent learnable prompt tokens to capture prior knowledge about the true class distribution. This prior knowledge then serves as prompt information that dynamically interacts with the posterior, noise-corrupted class distribution. In this way, PADO optimizes the class distribution while maintaining model generalization, leading to a significant improvement in the efficiency of class distribution learning. PADO achieves significant performance gains over state-of-the-art methods on the SSED datasets of the DCASE 2019, 2020, and 2021 challenges, and it is readily extendable to other benchmark models.
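To make the abstract's idea more concrete, the following is a minimal sketch (not the authors' implementation) of how learnable prompt tokens could inject prior class-distribution knowledge into an SED backbone and interact with a noisy posterior. The module name, tensor shapes, attention-based prompt interaction, and the gated fusion rule are all illustrative assumptions rather than the method described in the paper.

```python
# Hypothetical sketch of prompt-driven class distribution optimization.
# Assumptions: PyTorch backbone features of shape (B, T, D) and frame-level
# posteriors of shape (B, T, C); the real PADO formulation may differ.
import torch
import torch.nn as nn


class PromptedClassDistribution(nn.Module):
    """Illustrative prompt-tuning head for frame-level sound event posteriors."""

    def __init__(self, feat_dim: int = 256, num_classes: int = 10, num_prompts: int = 4):
        super().__init__()
        # Independent learnable prompt tokens (carriers of the prior knowledge).
        self.prompts = nn.Parameter(torch.randn(num_prompts, feat_dim) * 0.02)
        self.attn = nn.MultiheadAttention(feat_dim, num_heads=4, batch_first=True)
        self.classifier = nn.Linear(feat_dim, num_classes)
        # Learnable gate controlling how strongly the prompt-derived prior
        # corrects the (possibly noisy) posterior from pseudo-labeled data.
        self.gate = nn.Parameter(torch.zeros(num_classes))

    def forward(self, frame_feats: torch.Tensor, noisy_posterior: torch.Tensor) -> torch.Tensor:
        # frame_feats: (B, T, D) backbone features; noisy_posterior: (B, T, C).
        b = frame_feats.size(0)
        prompts = self.prompts.unsqueeze(0).expand(b, -1, -1)          # (B, P, D)
        # Prompts attend to the frame features to extract prior class cues.
        prior_ctx, _ = self.attn(prompts, frame_feats, frame_feats)    # (B, P, D)
        prior = torch.sigmoid(self.classifier(prior_ctx.mean(dim=1)))  # (B, C)
        # Fuse the clip-level prior with the frame-wise noisy posterior.
        g = torch.sigmoid(self.gate)                                   # (C,)
        return g * prior.unsqueeze(1) + (1.0 - g) * noisy_posterior    # (B, T, C)


if __name__ == "__main__":
    head = PromptedClassDistribution()
    feats = torch.randn(2, 156, 256)       # dummy CRNN/Conformer frame features
    posterior = torch.rand(2, 156, 10)     # dummy pseudo-label posteriors
    print(head(feats, posterior).shape)    # torch.Size([2, 156, 10])
```

In this sketch the gate starts at 0.5 (sigmoid of zero), so the prior and the noisy posterior contribute equally at initialization and the balance is learned during training; this is one plausible way to realize the "dynamic interaction" the abstract describes, not necessarily the one used in the paper.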