Smart City Gnosys

Smart city article details

Title Listen as You Wish: Fusion of Audio and Text for Cross-Modal Event Detection in Smart Cities
ID_Doc 35353
Authors Tang H.; Hu Y.; Wang Y.; Zhang S.; Xu M.; Zhu J.; Zheng Q.
Year 2024
Published Information Fusion, 110
DOI http://dx.doi.org/10.1016/j.inffus.2024.102460
Abstract In the era of smart cities, the advent of Internet of Things technology has catalyzed a proliferation of multimodal sensor data, presenting new challenges in cross-modal event detection, particularly in audio event detection via textual queries. This paper focuses on the novel task of text-to-audio grounding (TAG), which aims to precisely localize the sound segments within an untrimmed audio recording that correspond to events described in a textual query. This challenging task requires fusing multimodal (acoustic and linguistic) information as well as reasoning about the cross-modal semantic match between the given audio and the textual query. Unlike conventional methods, which often overlook the nuanced interactions between and within modalities, we introduce the Cross-modal Graph Interaction (CGI) model. This approach leverages a language graph to model complex semantic relationships between query words, enhancing the understanding of textual queries. Additionally, a cross-modal attention mechanism generates snippet-specific query representations, facilitating fine-grained semantic matching between audio segments and textual descriptions. A cross-gating module further refines this process by emphasizing relevant features across modalities and suppressing irrelevant information, optimizing multimodal information fusion. Our comprehensive evaluation on the AudioGrounding benchmark dataset not only demonstrates the CGI model's superior performance over existing methods but also underscores the significance of sophisticated multimodal interaction for improving the efficacy of TAG in smart cities.
Author Keywords Cross-modal learning; Graph neural network; Multimodal information fusion; Smart city; Text-to-audio grounding
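
The abstract names two fusion steps: a cross-modal attention that builds a snippet-specific query representation, and a cross-gating module in which each modality emphasizes relevant features in the other and suppresses the rest. The paper's code is not reproduced here; the PyTorch sketch below is only a minimal illustration of those two generic mechanisms. All module names, dimensions, and the final relevance score are assumptions for illustration, not the authors' implementation, and the language-graph component is omitted.

```python
import torch
import torch.nn as nn


class CrossModalAttention(nn.Module):
    """Snippet-specific query representations (illustrative assumption).

    Each audio snippet attends over the query words, so every snippet
    gets its own weighted summary of the textual query.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.scale = dim ** -0.5

    def forward(self, audio: torch.Tensor, words: torch.Tensor) -> torch.Tensor:
        # audio: (B, n_snippets, D); words: (B, n_words, D)
        scores = torch.bmm(audio, words.transpose(1, 2)) * self.scale  # (B, S, W)
        weights = scores.softmax(dim=-1)          # attention over query words
        return torch.bmm(weights, words)          # (B, S, D): one query vector per snippet


class CrossGating(nn.Module):
    """Cross-gating (illustrative assumption): each modality produces a
    sigmoid gate that scales the other, suppressing irrelevant features."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate_a = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())
        self.gate_q = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, audio: torch.Tensor, query: torch.Tensor):
        # audio and query: (B, n_snippets, D), already aligned per snippet
        gated_audio = audio * self.gate_a(query)  # text gates audio
        gated_query = query * self.gate_q(audio)  # audio gates text
        return gated_audio, gated_query


if __name__ == "__main__":
    B, S, W, D = 2, 50, 8, 256
    audio = torch.randn(B, S, D)   # e.g. per-snippet audio embeddings (assumed)
    words = torch.randn(B, W, D)   # e.g. query word embeddings (assumed)
    attn, gate = CrossModalAttention(D), CrossGating(D)
    snippet_queries = attn(audio, words)           # (B, S, D)
    a, q = gate(audio, snippet_queries)
    relevance = torch.sigmoid((a * q).sum(-1))     # (B, S): per-snippet score in [0, 1]
    print(relevance.shape)
```

Thresholding the per-snippet relevance scores would yield candidate time segments for the queried event; the actual CGI model's scoring and training objective may differ.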


Similar Articles


Id | Similarity | Authors | Title | Published
1584 | 0.9 | Zhang Y.; Wu M.; Cai X. | A Dynamic Cross-Modal Learning Framework for Joint Text-to-Audio Grounding and Acoustic Scene Classification in Smart City Environments | Digital Signal Processing: A Review Journal, 167 (2025)
44289 | 0.865 | Saradopoulos I.; Potamitis I.; Ntalampiras S.; Rigakis I.; Manifavas C.; Konstantaras A. | Real-Time Acoustic Detection of Critical Incidents in Smart Cities Using Artificial Intelligence and Edge Networks | Sensors, 25, 8 (2025)
52316 | 0.855 | Bello J.P.; Mydlarz C.; Salamon J. | Sound Analysis in Smart Cities | Computational Analysis of Sound Scenes and Events (2017)