Abstract
Multi-camera multi-object tracking is a critical task in applications such as surveillance, autonomous driving, and smart cities, where accurate and robust tracking of multiple objects across different camera views is essential. This work presents a pipeline for multi-camera multi-object tracking that combines deep learning models, specifically YOLOX and OSNet, with traditional algorithms such as the Hungarian algorithm and the Kalman filter. A significant enhancement in this pipeline is the incorporation of the CLIP-ReID model for pedestrian feature extraction, leveraging the power of vision-language models. The proposed approach is evaluated by comparing the effectiveness of CLIP-ReID against traditional image-based feature extraction methods on two distinct camera sequences, Laboratory and Terrace, from the EPFL Multi-camera Pedestrian Videos dataset. The results show a modest improvement in tracking performance: IDF1 increases by 0.4% and MOTA by 0.9% on the Laboratory sequence, and IDF1 increases by 7.2% and MOTA by 0.2% on the Terrace sequence, indicating the potential for incremental accuracy gains in complex tracking scenarios. © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2025.
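The data-association step mentioned above (matching tracks to detections via the Hungarian algorithm over ReID appearance features) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the cosine-distance cost, and the gating threshold are assumptions for demonstration.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(track_feats, det_feats, max_cost=0.5):
    """Match existing tracks to new detections by appearance similarity.

    track_feats, det_feats: (N, D) and (M, D) ReID embeddings
    (e.g. from OSNet or CLIP-ReID). Returns matched (track, detection)
    index pairs whose cosine distance is below max_cost.
    """
    # L2-normalize so the dot product equals cosine similarity
    t = track_feats / np.linalg.norm(track_feats, axis=1, keepdims=True)
    d = det_feats / np.linalg.norm(det_feats, axis=1, keepdims=True)
    cost = 1.0 - t @ d.T                      # cosine-distance cost matrix
    rows, cols = linear_sum_assignment(cost)  # Hungarian algorithm
    # Gate out matches that are too dissimilar to be the same identity
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] < max_cost]
```

In a full tracker, unmatched tracks and detections from this step would feed the Kalman-filter update and track initialization/termination logic.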