Smart City Gnosys

Smart city article details

Title: All You Can Embed: Natural Language Based Vehicle Retrieval With Spatio-Temporal Transformers
ID_Doc: 7243
Authors: Scribano C.; Sapienza D.; Franchini G.; Verucchi M.; Bertogna M.
Year: 2021
Published: IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops
DOI: http://dx.doi.org/10.1109/CVPRW53098.2021.00481
Abstract: Combining Natural Language with Vision represents a unique and interesting challenge in the domain of Artificial Intelligence. The AI City Challenge Track 5 for Natural Language-Based Vehicle Retrieval focuses on the problem of combining visual and textual information, applied to a smart-city use case. In this paper, we present All You Can Embed (AYCE), a modular solution to correlate single-vehicle tracking sequences with natural language. The main building blocks of the proposed architecture are (i) BERT to provide an embedding of the textual descriptions, and (ii) a convolutional backbone along with a Transformer model to embed the visual information. For the training of the retrieval model, a variation of the Triplet Margin Loss is proposed to learn a distance measure between the visual and language embeddings. The code is publicly available at https://github.com/cscribano/AYCE_2021. © 2021 IEEE.
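The abstract mentions learning a distance measure between visual and language embeddings with a variation of the Triplet Margin Loss. The record does not describe that variation, so the sketch below shows only the standard triplet margin loss on toy vectors, plus distance-based ranking of candidate tracks. The function names (`euclidean_distance`, `triplet_margin_loss`, `rank_tracks`), the margin value, and the two-dimensional embeddings are all illustrative assumptions and are not taken from the AYCE code.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def euclidean_distance(a: Vector, b: Vector) -> float:
    """L2 distance between two embeddings of equal dimension."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_margin_loss(anchor: Vector, positive: Vector,
                        negative: Vector, margin: float = 1.0) -> float:
    """Standard triplet margin loss: max(0, d(a, p) - d(a, n) + margin).

    NOTE: this is the textbook form, not the variation proposed in the paper.
    """
    d_pos = euclidean_distance(anchor, positive)
    d_neg = euclidean_distance(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


def rank_tracks(query: Vector, tracks: Dict[str, Vector]) -> List[str]:
    """Return track ids ordered from closest to farthest from the query."""
    return sorted(tracks, key=lambda tid: euclidean_distance(query, tracks[tid]))


# Hypothetical toy embeddings: one text query and two vehicle tracks.
text_embedding = [1.0, 0.0]
matching_track = [0.9, 0.1]
other_track = [0.0, 1.0]

# The matching track is already much closer than the other one,
# so the loss is zero for a margin of 1.0.
loss = triplet_margin_loss(text_embedding, matching_track, other_track)

# Retrieval then reduces to sorting candidate tracks by distance.
ranking = rank_tracks(text_embedding,
                      {"track_a": other_track, "track_b": matching_track})
```

In this toy setup the text embedding plays the role of the anchor, and the loss becomes positive only when a non-matching track is not at least `margin` farther away than the matching one.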
Author Keywords


Similar Articles


Id | Similarity | Authors | Title | Published
39699 | 0.884 | Du Y.; Zhang B.; Ruan X.; Su F.; Zhao Z.; Chen H. | Omg: Observe Multiple Granularities For Natural Language-Based Vehicle Retrieval | IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, 2022-June (2022)
57381 | 0.878 | Sebastian C.; Imbriaco R.; Meletis P.; Dubbelman G.; Bondarev E.; De With P.H.N. | Tied: A Cycle Consistent Encoder-Decoder Model For Text-To-Image Retrieval | IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (2021)
38752 | 0.877 | Alzubi T.M.; Mukhtar U.R. | Mvr: Synergizing Large And Vision Transformer For Multimodal Natural Language-Driven Vehicle Retrieval | IEEE Access, 13 (2025)
47389 | 0.866 | Sadiq T.; Omlin C.W. | Scene Retrieval In Traffic Videos With Contrastive Multimodal Learning | Proceedings - International Conference on Tools with Artificial Intelligence, ICTAI (2023)