Smart City Gnosys

Smart city article details

Title: Scene Retrieval In Traffic Videos With Contrastive Multimodal Learning
ID_Doc: 47389
Authors: Sadiq T.; Omlin C.W.
Year: 2023
Published: Proceedings - International Conference on Tools with Artificial Intelligence, ICTAI
DOI: http://dx.doi.org/10.1109/ICTAI59109.2023.00153
Abstract: Retrieval of scenes from traffic videos is an important task in intelligent transportation systems (ITS) for efficient traffic management in AI-enabled smart cities. This work proposes natural language-based vehicle retrieval from traffic monitoring videos, emphasizing the significance of temporal information and context. We present contrastive learning as a technique to optimize joint representations of the vision and language modalities within a shared latent space. The approach trains with contrastive losses that keep similar encodings close in the joint feature space by minimizing the distance between positive visual-text pairs and maximizing the distance between negative visual-text pairs. Our study employs state-of-the-art vision models for visual encoding and transformer-based language models for text encoding. We analyze the impact of visual and textual feature selection on retrieval performance. We evaluate the proposed method on the AI City Challenge 2022 dataset for natural language-based vehicle retrieval, achieving a Mean Reciprocal Rank (MRR) of 49.84% on the test set and securing second place on the leaderboard. Our approach highlights the effectiveness of feature selection and contrastive learning for enhancing multimodal retrieval tasks. © 2023 IEEE.
Author Keywords: Computer vision; contrastive multimodal learning; intelligent transportation systems; neural networks
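
The abstract describes the training objective only in words: pull positive visual-text pairs together and push negative pairs apart in a shared space. One common way to realize such an objective is a CLIP-style symmetric InfoNCE loss over a batch, where each clip's own caption is its positive and every other caption in the batch acts as a negative. The sketch below assumes that formulation, cosine similarity, and a temperature of 0.07; the abstract does not state the exact loss, temperature, or encoders, so `symmetric_contrastive_loss` and the toy two-dimensional embeddings are hypothetical stand-ins for real encoder outputs.

```python
# Assumption: CLIP-style symmetric InfoNCE; the paper's exact loss is not given in the abstract.
import math
from typing import List, Sequence

Vector = Sequence[float]


def l2_normalize(v: Vector) -> List[float]:
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def cosine_matrix(a: Sequence[Vector], b: Sequence[Vector]) -> List[List[float]]:
    """sims[i][j] = cosine similarity between a[i] and b[j]."""
    a_n = [l2_normalize(v) for v in a]
    b_n = [l2_normalize(v) for v in b]
    return [[sum(x * y for x, y in zip(u, w)) for w in b_n] for u in a_n]


def diagonal_cross_entropy(logits: List[List[float]]) -> float:
    """Mean cross-entropy where the correct 'class' for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the row max for a numerically stable log-sum-exp
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(
    visual: Sequence[Vector], text: Sequence[Vector], temperature: float = 0.07
) -> float:
    """Pair i is the positive; all other in-batch pairings are negatives."""
    sims = cosine_matrix(visual, text)
    logits = [[s / temperature for s in row] for row in sims]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-visual direction
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits_t))


if __name__ == "__main__":
    visual = [[1.0, 0.0], [0.0, 1.0]]
    text_aligned = [[0.9, 0.1], [0.1, 0.9]]  # each caption closest to its own clip
    text_swapped = [[0.1, 0.9], [0.9, 0.1]]  # captions matched to the wrong clips
    print(symmetric_contrastive_loss(visual, text_aligned)
          < symmetric_contrastive_loss(visual, text_swapped))  # True
```

Minimizing this loss raises each positive pair's similarity relative to the in-batch negatives, which is how the "closer for positives, farther for negatives" behavior arises; a larger batch supplies more negatives per step.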

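The leaderboard figure in the abstract is a Mean Reciprocal Rank: for each text query, the gallery of vehicle tracks is ranked by similarity, and MRR = (1/|Q|) · Σ_q 1/rank_q averages the reciprocal rank of the correct track over all queries, so 49.84% corresponds to an MRR of 0.4984. A minimal sketch follows; the function name `mean_reciprocal_rank`, the score-matrix layout, and the tie-breaking rule are illustrative assumptions, not the challenge's official evaluation script.

```python
# Hypothetical helper; not the official AI City Challenge evaluation code.
from typing import Sequence


def mean_reciprocal_rank(
    scores: Sequence[Sequence[float]], targets: Sequence[int]
) -> float:
    """MRR over text queries.

    scores[q][g] is the similarity of text query q to gallery track g;
    targets[q] is the index of the track that query q actually describes.
    """
    total = 0.0
    for row, target in zip(scores, targets):
        # Rank of the correct track = 1 + number of tracks scored strictly higher.
        # Ties are resolved in the target's favor (an optimistic convention).
        rank = 1 + sum(1 for s in row if s > row[target])
        total += 1.0 / rank
    return total / len(targets)


if __name__ == "__main__":
    scores = [
        [0.9, 0.2, 0.1],  # query 0: correct track 0 is ranked 1st -> 1/1
        [0.8, 0.5, 0.3],  # query 1: correct track 1 is ranked 2nd -> 1/2
    ]
    print(mean_reciprocal_rank(scores, targets=[0, 1]))  # 0.75
```

Because only the rank of the single correct track matters, MRR rewards placing it first far more than moving it from, say, 10th to 9th.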

Similar Articles


Id | Similarity | Authors | Title | Published
38752 | 0.916 | Alzubi T.M.; Mukhtar U.R. | Mvr: Synergizing Large And Vision Transformer For Multimodal Natural Language-Driven Vehicle Retrieval | IEEE Access, 13 (2025)
12878 | 0.895 | Bo X.; Liu J.; Yang D.; Ma W. | Bridging The Gap: Multi-Granularity Representation Learning For Text-Based Vehicle Retrieval | Complex and Intelligent Systems, 11, 1 (2025)
39699 | 0.881 | Du Y.; Zhang B.; Ruan X.; Su F.; Zhao Z.; Chen H. | Omg: Observe Multiple Granularities For Natural Language-Based Vehicle Retrieval | IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, 2022-June (2022)
57381 | 0.877 | Sebastian C.; Imbriaco R.; Meletis P.; Dubbelman G.; Bondarev E.; De With P.H.N. | Tied: A Cycle Consistent Encoder-Decoder Model For Text-To-Image Retrieval | IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (2021)
7243 | 0.866 | Scribano C.; Sapienza D.; Franchini G.; Verucchi M.; Bertogna M. | All You Can Embed: Natural Language Based Vehicle Retrieval With Spatio-Temporal Transformers | IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (2021)