Abstract
With the rapid advancement of smart city infrastructure and optical sensing technologies, robust multimodal scene understanding has become essential for urban perception systems. This paper presents a unified generative framework that integrates optical priors, acoustic features, and textual semantics to perform accurate scene classification and coherent content generation. Leveraging a cross-modal transformer architecture, the model aligns spatial and temporal information across modalities, enabling consistent interpretation even under incomplete input conditions. A masked attention mechanism enhances resilience to missing visual data, making the system suitable for real-world deployments such as edge surveillance and sensor-based media reporting. Evaluated on three public datasets (TVG-MED, EGOSTREAM, and VAST-MUSE), the proposed method achieves superior classification accuracy and maintains semantic fluency under degraded inputs. Experimental results demonstrate the model's effectiveness in handling partial modality dropout while preserving high performance in both recognition and generation tasks. This work contributes to the development of intelligent multimodal reasoning systems for smart environments, offering practical value in urban monitoring, public communication, and adaptive media synthesis.
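To illustrate the kind of mechanism the abstract describes, the following is a minimal sketch of masked cross-modal attention that ignores tokens from a dropped modality. It assumes a PyTorch-style implementation; the module, tensor names, and dimensions are illustrative placeholders, not the paper's actual architecture or code.

```python
# Hypothetical sketch: cross-modal attention that masks out missing modalities.
import torch
import torch.nn as nn


class MaskedCrossModalAttention(nn.Module):
    """Fuses optical, acoustic, and textual token sequences while masking
    tokens from modalities that are absent at inference time."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, query, modality_tokens, modality_present):
        # modality_tokens: list of (B, T_m, dim) tensors, one per modality
        # modality_present: (B, num_modalities) bool; False = modality dropped
        keys = torch.cat(modality_tokens, dim=1)          # (B, sum T_m, dim)
        # Build a key padding mask so attention skips tokens of missing modalities.
        pad_masks = []
        for m, tokens in enumerate(modality_tokens):
            missing = ~modality_present[:, m]             # (B,)
            pad_masks.append(missing[:, None].expand(-1, tokens.size(1)))
        key_padding_mask = torch.cat(pad_masks, dim=1)    # (B, sum T_m)
        fused, _ = self.attn(query, keys, keys,
                             key_padding_mask=key_padding_mask)
        return self.norm(query + fused)                   # residual fusion


# Usage: simulate dropping the visual stream for part of the batch.
B, dim = 4, 256
visual = torch.randn(B, 49, dim)
audio = torch.randn(B, 32, dim)
text = torch.randn(B, 16, dim)
present = torch.tensor([[True, True, True],
                        [False, True, True],   # visual modality missing
                        [True, True, True],
                        [False, True, True]])
fusion = MaskedCrossModalAttention(dim)
out = fusion(torch.randn(B, 8, dim), [visual, audio, text], present)
print(out.shape)  # torch.Size([4, 8, 256])
```

The key design choice sketched here is that missing modalities are handled at the attention level rather than by imputing inputs, so the remaining modalities continue to drive both classification and generation when the visual stream is unavailable.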