Smart City Gnosys

Smart city article details

Title A Dynamic Cross-Modal Learning Framework For Joint Text-To-Audio Grounding And Acoustic Scene Classification In Smart City Environments
ID_Doc 1584
Authors Zhang Y.; Wu M.; Cai X.
Year 2025
Published Digital Signal Processing: A Review Journal, 167
DOI http://dx.doi.org/10.1016/j.dsp.2025.105444
Abstract As two fundamental components of smart city acoustic perception frameworks, Text-to-Audio Grounding (TAG) and Acoustic Scene Classification (ASC) demonstrate essential capabilities in enabling robust environmental monitoring and anomaly detection. However, existing methods typically treat these tasks independently, leading to increased system complexity and overlooking potential synergies between tasks. Although there has been progress in multi-task joint learning research, these methods are primarily limited to single audio modality and predefined event category libraries, lacking the ability to utilize multimodal information and struggling to meet the diversity requirements of complex acoustic scenes in open environments. This paper presents the first multimodal joint learning framework that integrates TAG with ASC, effectively addressing three significant challenges: cross-modal feature heterogeneity, global-local objective conflicts, and modal-task feature coupling, thereby achieving deep task collaboration. The core contributions of this work include designing an Adaptive Transformer with Scene-aware Fusion (ATSF) that optimizes audio-text cross-modal interaction through dual-modal feature decoupling and scene-adaptive recombination mechanisms; constructing a Multimodal Progressive Layered Expert Network (PLE) that suppresses negative transfer in multi-task learning through task-specific and shared knowledge separation strategies; and proposing a dynamic gradient-balanced joint optimization strategy to support efficient cross-modal multi-objective training. Experiments on the extended AudioGrounding dataset demonstrate that our framework significantly improves performance compared to single-task baseline models, with TAG task PSDS value increasing from 14.7% to 36.83% and ASC classification accuracy reaching 79.46%. The proposed ATSF-PLE framework provides an efficient and precise solution for intelligent urban acoustic perception systems, demonstrating substantial application value in intelligent security, traffic management, and other scenarios. © 2025 Elsevier Inc.
Author Keywords Acoustic scene classification; Multi-modal feature disentanglement; Multi-task joint learning; Text-to-audio grounding


Similar Articles


Id | Similarity | Authors | Title | Published
35353 | 0.9 | Tang H.; Hu Y.; Wang Y.; Zhang S.; Xu M.; Zhu J.; Zheng Q. | Listen As You Wish: Fusion Of Audio And Text For Cross-Modal Event Detection In Smart Cities | Information Fusion, 110 (2024)
1569 | 0.869 | Fan X.; Khishe M.; Alqahtani A.; Alsubai S.; Alanazi A.; Mohamed Zaidi M. | A Dual Adaptive Semi-Supervised Attentional Residual Network Framework For Urban Sound Classification | Advanced Engineering Informatics, 62 (2024)
58701 | 0.865 | Goulão M.; Bandeira L.; Martins B.; L. Oliveira A. | Training Environmental Sound Classification Models For Real-World Deployment In Edge Devices | Discover Applied Sciences, 6, 4 (2024)
52318 | 0.863 | Nogueira A.F.R.; Oliveira H.S.; Machado J.J.M.; Tavares J.M.R.S. | Sound Classification And Processing Of Urban Environments: A Systematic Literature Review | Sensors, 22, 22 (2022)
47390 | 0.861 | Zhang X. | Scene Semiosis And Generative Intelligence: A Multimodal Recognition And Content Synthesis Framework Based On Optical Vision Sensors | Proceedings of SPIE - The International Society for Optical Engineering, 13682 (2025)
44289 | 0.851 | Saradopoulos I.; Potamitis I.; Ntalampiras S.; Rigakis I.; Manifavas C.; Konstantaras A. | Real-Time Acoustic Detection Of Critical Incidents In Smart Cities Using Artificial Intelligence And Edge Networks | Sensors, 25, 8 (2025)
53206 | 0.85 | Piadyk Y.; Rulff J.; Brewer E.; Hosseini M.; Ozbay K.; Sankaradas M.; Chakradhar S.; Silva C. | Streetaware: A High-Resolution Synchronized Multimodal Urban Scene Dataset | Sensors, 23, 7 (2023)