Abstract
Accurate and efficient vehicle detection is critical to safety assurance, traffic optimization, and the development of intelligent transportation systems. However, robust detection under occlusion, dynamic illumination, and the sizeable intra-class variation in vehicle shape and size remains challenging. Conventional object detection systems are often inaccurate in intricate cityscapes, where false alarms and missed detections render them unreliable. This work introduces a new deep learning model, the focal-pooling vision transformer (FoPViT), which addresses these issues by combining the advantages of focal transformers and pooling-based vision transformers. A key contribution is the gradient-aware pooling tuner (GAPT), a new mechanism that dynamically adjusts pooling kernel sizes based on gradient signals produced during training. This adaptive policy enables effective feature extraction across scales, preserving fine details and ensuring correct detection of vehicles of different sizes and orientations. The model's innovation is twofold: focal attention suppresses non-essential regions, while GAPT optimizes spatial feature pooling for more accurate results at lower computational cost. The proposed system extends the capabilities of vehicle detection models and sets a new standard for intelligent detection in dynamic scenes. Experimental results show that FoPViT achieves 98% accuracy, with precision, recall, and F1-score of 97.5%, 97.8%, and 98%, respectively. Regarding efficiency, the model completes inference in 15 ms, making it suitable for real-time use. © The Author(s), under exclusive licence to Springer Nature Switzerland AG 2025.
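The abstract describes GAPT only at a high level: pooling kernel sizes are adjusted based on gradient signals observed during training. The paper's actual formulation is not given here, so the following is a hypothetical minimal sketch of that idea in plain Python. Every name and threshold (`select_kernel_size`, `avg_pool2d`, the 0.5 cutoff, the 2/4 kernel sizes) is an illustrative assumption, not the authors' method: strong gradients are taken to indicate fine detail worth preserving (small kernel), weak gradients allow more aggressive pooling (large kernel).

```python
# Hypothetical GAPT-style sketch (assumptions, not the paper's method):
# pick a pooling kernel size from the mean gradient magnitude, then
# apply ordinary non-overlapping average pooling with that kernel.

def select_kernel_size(grad_magnitudes, small=2, large=4, threshold=0.5):
    """Large mean gradient -> fine detail -> keep a small kernel;
    small mean gradient -> smooth region -> pool more aggressively."""
    mean_grad = sum(grad_magnitudes) / len(grad_magnitudes)
    return small if mean_grad >= threshold else large

def avg_pool2d(feature_map, k):
    """Non-overlapping k x k average pooling on a 2D list of floats."""
    h, w = len(feature_map), len(feature_map[0])
    pooled = []
    for i in range(0, h - k + 1, k):
        row = []
        for j in range(0, w - k + 1, k):
            window = [feature_map[i + di][j + dj]
                      for di in range(k) for dj in range(k)]
            row.append(sum(window) / (k * k))
        pooled.append(row)
    return pooled

# Strong gradients select the small kernel, preserving resolution.
k = select_kernel_size([0.9, 0.7, 0.8])            # -> 2
fmap = [[float(c) for c in range(4)] for _ in range(4)]
pooled = avg_pool2d(fmap, k)                        # 4x4 -> 2x2 output
```

In a real implementation this decision would be made per layer on accumulated parameter gradients (e.g., via PyTorch's `.grad` tensors), with pooling handled by the framework rather than by hand.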