| Title |
MRPVT: A Novel Multiscale Rain-Cutting and Pooling Vision Transformer Model for Object Detection on Drone-Captured Scenarios |
| ID_Doc |
38029 |
| Authors |
Qiu Y.; Yang J.; Shao R.; Sha Q. |
| Year |
2024 |
| Published |
2024 2nd International Conference on Computer, Vision and Intelligent Technology, ICCVIT 2024 - Proceedings |
| DOI |
http://dx.doi.org/10.1109/ICCVIT63928.2024.10872469 |
| Abstract |
Drone imagery analysis is an essential part of smart city construction and management, and object detection in UAV images has recently become a popular research topic. Images captured by drones exhibit large scale variations because they are shot while flying at different altitudes, which burdens traditional deep learning networks. Transformer-based networks have proven successful in computer vision tasks thanks to their strong long-sequence modeling capabilities. This article designs a Transformer-based network for object detection and analysis of drone images. The main challenge in using Transformers is the high computational cost of processing image sequences, so a common remedy is to replace global attention with local attention and to use a hierarchical structure to compensate for the loss of global vision. However, the large amount of information lost by local attention cannot be fully recovered by this measure alone. This paper analyzes multiscale and hierarchical structure concepts together with attention mechanisms, and embeds the multiscale concept into the local attention structure to further recover the lost global-view information. To achieve this, we use the pooling concept to build multiscale pooling layers combined with a pyramid structure, and apply them to the multi-head self-attention module. At the same time, we precut the image features entering the pooling block to shorten the image token sequence. While reducing the computation of the vision Transformer, the global field-of-view information is increased as much as possible to obtain better contextual features. On the VisDrone dataset, MRPVT as a backbone achieves 47.98% AP50 and 25.20% AP75, and on the car category of UAVDT it achieves 98.84% AP50 and 89.27% AP75. As expected, MRPVT delivers strong performance on object detection and visual aesthetic assessment for drone-captured images. © 2024 IEEE. |
| Author Keywords |
drone-captured; efficient attention; multiscale pooling; object detection; Transformer |
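
The abstract's core cost-saving idea (attending from full-resolution queries to a pooled, shortened key/value sequence) can be illustrated with a minimal single-head sketch. This is not the paper's actual MRPVT module; the function name, the use of plain average pooling, and the identity Q/K/V projections are simplifying assumptions made here for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def pooled_attention(x, h, w, pool=2):
    """Single-head self-attention where keys/values come from an
    average-pooled feature map, shrinking the K/V sequence (and hence
    the attention matrix) by a factor of pool**2.

    x: (N, C) token sequence for an h x w feature map, N = h * w.
    Q/K/V projections are omitted (identity) to keep the sketch short.
    """
    n, c = x.shape
    # reshape tokens back to a spatial grid and average-pool it
    grid = x.reshape(h, w, c)
    ph, pw = h // pool, w // pool
    pooled = (grid[:ph * pool, :pw * pool]
              .reshape(ph, pool, pw, pool, c)
              .mean(axis=(1, 3)))
    kv = pooled.reshape(ph * pw, c)           # shortened K/V sequence
    q, k, v = x, kv, kv
    # attention matrix is (N, N / pool**2) instead of (N, N)
    attn = softmax(q @ k.T / np.sqrt(c))
    return attn @ v                           # (N, C), same shape as input
```

With `pool=2` the attention matrix has a quarter as many columns, which is the kind of quadratic-cost reduction the abstract attributes to its multiscale pooling layers; stacking such layers at several pooling rates in a pyramid would recover context at multiple scales.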