Smart City Gnosys

Smart city article details

Title OMG: Observe Multiple Granularities For Natural Language-Based Vehicle Retrieval
ID_Doc 39699
Authors Du Y.; Zhang B.; Ruan X.; Su F.; Zhao Z.; Chen H.
Year 2022
Published IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, June 2022
DOI http://dx.doi.org/10.1109/CVPRW56347.2022.00352
Abstract Retrieving tracked-vehicles by natural language descriptions plays a critical role in smart city construction. It aims to find the best match for the given texts from a set of tracked vehicles in surveillance videos. Existing works generally solve it by a dual-stream framework, which consists of a text encoder, a visual encoder and a cross-modal loss function. Although some progress has been made, they failed to fully exploit the information at various levels of granularity. To tackle this issue, we propose a novel framework for the natural language-based vehicle retrieval task, OMG, which Observes Multiple Granularities with respect to visual representation, textual representation and objective functions. For the visual representation, target features, context features and motion features are encoded separately. For the textual representation, one global embedding, three local embeddings and a color-type prompt embedding are extracted to represent various granularities of semantic features. Finally, the overall framework is optimized by a cross-modal multi-granularity contrastive loss function. Experiments demonstrate the effectiveness of our method. Our OMG significantly outperforms all previous methods and ranks the 9th on the 6th AI City Challenge Track2. The codes are available at https://github.com/dyhBUPT/OMG. © 2022 IEEE.
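The abstract describes a dual-stream design trained with a cross-modal multi-granularity contrastive objective that pairs visual granularities (target, context, motion) with textual granularities (global, local, color-type prompt). The sketch below shows one plausible form of such an objective as a sum of symmetric InfoNCE terms. It is not the authors' implementation (their code is at https://github.com/dyhBUPT/OMG); the granularity names, the pairing by insertion order, the shared embedding size and the temperature are assumptions made purely for illustration.

```python
# Minimal sketch (PyTorch) of a cross-modal multi-granularity contrastive
# objective in the spirit of the abstract above. NOT the authors' code
# (see https://github.com/dyhBUPT/OMG); granularity names, the pairing by
# insertion order, the embedding size and the symmetric InfoNCE form are
# assumptions for illustration only.
import torch
import torch.nn.functional as F


def info_nce(visual, textual, temperature=0.07):
    """Symmetric InfoNCE between matched visual/textual embeddings.

    visual, textual: (batch, dim) tensors; row i of each is a matched pair.
    """
    v = F.normalize(visual, dim=-1)
    t = F.normalize(textual, dim=-1)
    logits = v @ t.T / temperature                      # (batch, batch) similarities
    targets = torch.arange(v.size(0), device=v.device)  # positives lie on the diagonal
    # Contrast in both directions: vision-to-text and text-to-vision.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))


def multi_granularity_loss(visual_feats, text_feats, weights=None):
    """Sum InfoNCE terms over paired visual/textual granularities."""
    weights = weights or {}
    loss = 0.0
    for (v_name, v), (t_name, t) in zip(visual_feats.items(), text_feats.items()):
        loss = loss + weights.get((v_name, t_name), 1.0) * info_nce(v, t)
    return loss


# Illustrative usage: three granularity pairs, batch of 4, 256-d shared space.
visual = {k: torch.randn(4, 256) for k in ('target', 'context', 'motion')}
textual = {k: torch.randn(4, 256) for k in ('global', 'local', 'prompt')}
print(multi_granularity_loss(visual, textual))
```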
Author Keywords


Similar Articles


Id | Similarity | Authors | Title | Published
12878 | 0.923 | Bo X.; Liu J.; Yang D.; Ma W. | Bridging The Gap: Multi-Granularity Representation Learning For Text-Based Vehicle Retrieval | Complex and Intelligent Systems, 11, 1 (2025)
38752 | 0.908 | Alzubi T.M.; Mukhtar U.R. | Mvr: Synergizing Large And Vision Transformer For Multimodal Natural Language-Driven Vehicle Retrieval | IEEE Access, 13 (2025)
57381 | 0.885 | Sebastian C.; Imbriaco R.; Meletis P.; Dubbelman G.; Bondarev E.; De With P.H.N. | Tied: A Cycle Consistent Encoder-Decoder Model For Text-To-Image Retrieval | IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (2021)
7243 | 0.884 | Scribano C.; Sapienza D.; Franchini G.; Verucchi M.; Bertogna M. | All You Can Embed: Natural Language Based Vehicle Retrieval With Spatio-Temporal Transformers | IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (2021)
47389 | 0.881 | Sadiq T.; Omlin C.W. | Scene Retrieval In Traffic Videos With Contrastive Multimodal Learning | Proceedings - International Conference on Tools with Artificial Intelligence, ICTAI (2023)