Sunday, September 8, 2024

The UAV Benchmark: Compact Detection of Vehicles in Urban Scenarios | IEEE Journals & Magazine | IEEE Xplore

Summary

A team of researchers at Wuhan University, Wuhan, China, led by Hanjiang Xiong has defined a new baseline for detecting vehicles in UAV imagery of urban scenes. Here are the key points from the research paper on the UAV Benchmark for vehicle detection:

1. The authors propose a new dataset called PARA for vehicle detection in oblique UAV imagery of urban scenes.

2. They introduce a new annotation method called parallelogram bounding box (PBB) to more accurately represent vehicles with perspective distortion in oblique views.

3. The PARA dataset contains:
   - 1025 high-resolution UAV images
   - 117,122 manually annotated vehicle bounding boxes
   - Images from complex urban backgrounds at different angles
   - Objects categorized as static vehicles, dynamic vehicles, and pedestrians

4. They compare detection algorithms using horizontal bounding box (HBB) and PBB representations on PARA.

5. Key findings:
   - PBB representation outperforms HBB, especially for the stricter mAP75 metric
   - PBB better fits distorted vehicle shapes and reduces background noise
   - Detection is more challenging for static vehicles due to occlusion

6. The dataset aims to address limitations of existing aerial vehicle datasets by providing oblique views and a more flexible annotation method.

7. The authors establish baseline detection performance using mainstream object detection models adapted for the PBB representation.

8. They argue PARA will help advance vehicle detection in real-world UAV imagery with perspective distortion.

In summary, this paper introduces a new UAV vehicle detection benchmark with a novel annotation method to better handle oblique views, aiming to improve detection in challenging urban scenes.

UAV Benchmarks

UAV Benchmarks for vehicle detection are datasets and evaluation frameworks specifically designed to advance the development and testing of algorithms for detecting vehicles in images captured by Unmanned Aerial Vehicles (UAVs). Here are the key aspects of UAV Benchmarks for vehicle detection:

1. Purpose:
   - To provide a standardized dataset for training and evaluating vehicle detection algorithms in UAV imagery
   - To reflect real-world conditions and challenges faced in UAV-based vehicle detection

2. Key Components:
   - Large-scale dataset of UAV images
   - Annotated vehicle instances
   - Evaluation metrics (e.g., mAP50, mAP75)
   - Baseline detection results using state-of-the-art algorithms

3. Unique Characteristics:
   - High-resolution images
   - Various urban scenarios (e.g., roads, parking lots, intersections)
   - Different viewing angles and flight heights
   - Complex backgrounds
   - Diverse vehicle types, scales, and orientations

4. Annotation Methods:
   - The paper introduces a new Parallelogram Bounding Box (PBB) annotation
   - Traditional methods like Horizontal Bounding Box (HBB) are also used for comparison

5. Challenges Addressed:
   - Perspective distortion in oblique views
   - Dense object distributions
   - Varying scales of vehicles
   - Complex urban backgrounds
   - Occlusions and shadows

6. Evaluation Metrics:
   - Mean Average Precision (mAP) at different Intersection over Union (IoU) thresholds
   - Typically includes mAP50 and mAP75 to assess detection accuracy

7. Baseline Algorithms:
   - Adaptation of mainstream object detection algorithms (e.g., Faster R-CNN, RetinaNet)
   - Comparison of performance using different annotation methods (e.g., HBB vs. PBB)

8. Applications:
   - Traffic monitoring
   - Urban planning
   - Surveillance
   - Parking management

UAV Benchmarks for vehicle detection, like the PARA dataset introduced in this paper, aim to push the boundaries of computer vision and object detection in challenging real-world scenarios, particularly focusing on the unique aspects of UAV imagery such as oblique views and varying altitudes.

How Used

UAV Benchmarks like PARA would be valuable for UAV vision testing in several ways:

1. Algorithm Development and Evaluation:
   - Researchers and developers can use the dataset to train and test new computer vision algorithms specifically designed for UAV-based vehicle detection.
   - The benchmark provides a standardized way to compare different algorithms' performance.

2. Robustness Testing:
   - The diverse scenarios in the dataset allow testing of algorithms under various conditions (e.g., different lighting, angles, urban layouts).
   - This helps identify strengths and weaknesses of vision systems in real-world situations.

3. Perspective Handling:
   - The oblique views in the dataset challenge vision systems to handle perspective distortion, which is common in UAV imagery but less present in traditional datasets.
   - Algorithms can be fine-tuned to better handle these distorted views.

4. Scale Adaptation:
   - With images captured at different altitudes, the benchmark tests an algorithm's ability to detect vehicles at various scales.
   - This is crucial for UAVs that may operate at different heights during missions.

5. Performance Metrics:
   - The established evaluation metrics (like mAP50 and mAP75) provide quantitative measures of a system's detection accuracy.
   - This allows for objective comparison between different vision systems or iterations of the same system.

6. Real-world Simulation:
   - The complex urban backgrounds simulate real-world conditions, helping assess how well a vision system might perform in actual deployments.

7. Occlusion and Crowding Handling:
   - The dataset includes scenarios with occlusions and densely packed vehicles, testing a system's ability to distinguish individual vehicles in challenging conditions.

8. Annotation Method Comparison:
   - By providing both traditional (HBB) and new (PBB) annotation methods, the benchmark allows testing of which approach leads to better detection results in UAV scenarios.

9. Generalization Testing:
   - A vision system trained on other datasets can be tested on this benchmark to evaluate how well it generalizes to UAV-specific imagery.

10. Hardware-Software Integration:
    - The benchmark can be used to test not just software algorithms, but also how well they integrate with specific UAV hardware and cameras.

11. Operational Parameter Optimization:
    - By analyzing performance across different scenarios in the dataset, operators can optimize flight parameters (e.g., altitude, angle) for best detection results in various environments.

12. Continuous Improvement:
    - As a standardized benchmark, it allows for continuous testing and improvement of UAV vision systems over time, tracking progress in the field.

In essence, this benchmark provides a comprehensive testing ground that simulates real-world challenges in UAV-based vehicle detection, allowing developers to rigorously evaluate and improve their vision systems before deployment in actual UAV operations.

Authors and Institutions:


1. Haitao Lv
   - Affiliation: State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing (LIESMARS), Wuhan University, China
   - Currently pursuing an M.S. degree

2. Xianwei Zheng (Member, IEEE)
   - Affiliation: Professor at Wuhan University, China
   - Research interests: indoor and outdoor scene parsing, 3-D computer vision and reconstruction, and geovisualization

3. Xiao Xie
   - Affiliation:
     - Key Lab of Environmental Computing and Sustainability, Liaoning, China
     - Assistant Professor at Institute of Applied Ecology, Chinese Academy of Sciences, Beijing, China
     - Postdoctoral Researcher at School of Geodesy and Geomatics, Wuhan University
   - Research interests: 3-D GIS and smart cities

4. Xueye Chen
   - Affiliation: Not explicitly stated, but involved in GIS, remote sensing, digital government, and smart city technologies

5. Hanjiang Xiong
   - Affiliation: Full Professor in 3-D GIS, Wuhan University, China
   - Research interests: geospatial data management, 3-D visualization, augmented reality, indoor and outdoor GIS

Artifacts Used and Generated:

1. PARA Dataset (generated):
   - 1025 high-resolution UAV images
   - 117,122 manually annotated vehicle bounding boxes
   - Images from various urban scenarios (roads, parking lots, crossings, buildings, highways)
   - Annotations in both Horizontal Bounding Box (HBB) and Parallelogram Bounding Box (PBB) formats

2. Annotation Tool (used/modified):
   - Developed a new annotation tool based on labelme for outlining parallelogram-like bounding boxes

3. Detection Algorithms (used/modified):
   - Modified versions of Faster RCNN and RetinaNet to work with PBB annotations
   - Various other mainstream object detection algorithms (e.g., DAB-DETR, Cascade RCNN, RTMDet, YOLOv3, SSD, EfficientNet, Deformable DETR, FCOS)

4. Evaluation Metrics (used):
   - Mean Average Precision (mAP), specifically mAP50 and mAP75

5. Benchmark Results (generated):
   - Performance comparisons of various detection algorithms on the PARA dataset
   - Analysis of PBB vs HBB annotation methods

6. Visualization Tools (used):
   - Tools for creating various charts and graphs to display data distribution and results

7. MMDetection Framework (used):
   - An open-source object detection toolbox used for implementing and evaluating the detection algorithms

The research primarily generated the PARA dataset and associated benchmark results, while using and modifying existing tools and algorithms to create and evaluate this new resource for UAV-based vehicle detection.

The UAV Benchmark: Compact Detection of Vehicles in Urban Scenarios

 H. Lv, X. Zheng, X. Xie, X. Chen and H. Xiong, "The UAV Benchmark: Compact Detection of Vehicles in Urban Scenarios," in IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 17, pp. 14836-14847, 2024, doi: 10.1109/JSTARS.2024.3443268.

Abstract: Vehicle detection in unmanned aerial vehicle (UAV) images is a fundamental task in photogrammetry and remote sensing. While great success has been achieved, this task remains challenging due to two aspects: 

  1. the limitation of existing annotation methods in compactly enclosing targets with large perspective distortions in oblique UAV images; 
  2. the lack of vehicle detection datasets under oblique perspectives.

To this end, we propose an oblique UAV benchmark for the precise expression and localization of distorted vehicles in urban scenarios.

The benchmark consists of 

  1. a new parallelogramlike bounding box (PBB) annotation for compactly representing vehicles in oblique UAV images; and 
  2. a large-scale UAV dataset (namely PARA) for vehicle detection with PBB representation.

Our PBB representation frees the angle flexibility to allow a compact depiction of vehicles under various perspective distortion, thus overcoming the inherent limits of rectangular representation [like horizontal bounding box (HBB)] used in traditional annotation methods.

PARA comprises 1025 high-resolution images and 117 122 manually annotated object bounding boxes obtained from different UAV platforms. The annotated images are collected from scenarios with complex urban backgrounds and different shooting angles to reflect real-world conditions. Moreover, we compared detection algorithms based on the mainstream HBB and PBB representations on the PARA dataset and established a baseline for UAV oblique image-based vehicle detection. Experimental results validate the effectiveness of PBB representation and highlight the challenges posed by PARA. 

keywords: {Autonomous aerial vehicles;Detectors;Annotations;Feature extraction;Shape;Remote sensing;Object detection;Benchmark testing;object detection;remote sensing;unmanned aerial vehicle (UAV) image},

URL: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10636261&isnumber=10330207


Published in: IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing ( Volume: 17)

Page(s): 14836 - 14847
Date of Publication: 14 August 2024


Publisher: IEEE

SECTION I.

Introduction

The use of unmanned aerial vehicles (UAVs) to acquire high-resolution images has become an indispensable supplement to satellite remote sensing. In recent years, growing interest has turned to the detection of objects in UAV imagery, due to some attractive properties of UAVs, e.g., high flexibility, various views, and the ability to acquire both the top and side information of objects. Detecting vehicles from UAV images facilitates a variety of modern urban applications, including crowd detection [1], [2], surveillance [3], [4], traffic monitoring [5], [6], and search and rescue [7], [8]. However, vehicles in UAV images are generally captured in an oblique view and often suffer from drastic perspective deformation [9]. Hence, vehicle detection in UAV imagery faces not only the challenge of arbitrary orientation that is common in ground or satellite images, but also the issue of notable shape and appearance distortion of objects.

Over the past decades, the rapid development of deep learning and its success in computer vision have attracted increased attention to precisely locating and representing vehicles in aerial images [10], [11], [12], [13], [14], [15], [16]. In the early stages, many deep detectors adopted the horizontal bounding box (HBB) for vehicle detection in natural imagery due to its simplicity and low cost. These detectors can generally yield satisfactory results in scenes with sparse objects and a relatively simple background. However, when applying these detectors to remote sensing scenes where the object instances are densely crowded and arbitrarily oriented, especially in urban areas with high occlusion, their performance is often dramatically degraded. Inspired by oriented text detection benchmarks [17], [18], the oriented bounding box (OBB) was then introduced to address the challenge of detecting crowded objects in aerial images. OBB adds an additional parameter Θ to the HBB representation to describe the orientation of remotely sensed objects in aerial images.

Nevertheless, unlike the nearly vertical shooting angles of orthographic aerial images, the oblique views of UAV images inevitably bring large geometric distortions to the captured objects. As a result, rectangle-based annotation methods like HBB and OBB are incapable of precisely and compactly enclosing the vehicles in UAV images and often introduce extraneous background noise into learned regional features. A comparison between different annotation methods is shown in Fig. 1(a) and (b). It is clear that the rectangle-based annotation methods fail to correctly delineate the shape of vehicles and include much redundant information in the bounding boxes in oblique UAV images. Although some researchers [19] have employed mask segmentation to achieve more precise vehicle representations in low-altitude UAV images, this annotation significantly increases costs in terms of target annotations and network parameters.

Fig. 1. Comparison between different vehicle annotation methods. (a), (b), and (c) represent HBB, RBB, and our PBB annotation of the same vehicle in the oblique UAV image, respectively.

Besides the inaccurate representation of vehicles, the lack of available datasets for training vehicle detectors for oblique UAV images is another obstacle in this research field. Although several vehicle detection datasets have been developed on aerial images [20], [21], [22], [23], [24], [25], [26], they still have difficulty meeting the requirements of practical applications. In the early period, vehicle datasets [20], [21] had a limited number of instances and low spatial resolution, which restricted their applicability to the detection algorithms. With the fast development of sensor technology, vehicle datasets such as DLR-3K [22], HighD [23], and CARPK [26] began to focus on single or simple natural scenes with high-resolution images. The contained images were collected by low-altitude UAVs on highways or parking lots. However, most of their annotated images are captured in ideal conditions (clear simple backgrounds and without crowded instances), which are inadequate to reflect the complex real-world scenes. To remedy this problem, several large-scale vehicle datasets, such as COWC [24], UCAS-AOD [25], and HRRSD [27], were proposed, which involve more complex backgrounds and a larger number of targets. Nevertheless, these datasets often observe objects from a nearly vertical view, making it difficult to reflect the characteristics of objects in oblique view images. In contrast to the single perspective of natural and aerial images, the views of UAVs vary as the shooting pose changes. As a result, the objects in oblique UAV images frequently suffer from large perspective distortion, posing a significant challenge to the detection algorithms. Due to the lack of annotated UAV images, many methods [28], [29], [30] rely on transfer learning with large-scale natural image datasets (such as ImageNet [31], COCO [32], and PASCAL VOC [33]) for vehicle detection in UAV imagery. Unfortunately, the bias between the UAV datasets and natural datasets makes it hard to learn useful features for UAV objects from natural samples.

Based on the above analysis, we propose a new parallelogramlike representation, namely, parallelogram bounding box (PBB), to compactly enclose vehicles in UAV images. As shown in Fig. 1(c), our PBB can fit well with the shape and orientation of vehicles in UAV images. As a result, the image region contained by our PBB can better reflect the appearance and size of vehicles under the oblique views compared with conventional annotation methods. This will bring benefits to the learning of more target-related features for vehicle detection in UAV images, as the interference of backgrounds is greatly reduced. We also observe that rectangular vehicles struggle to keep axis-aligned under the oblique views and the parallelogramlike representation can effectively encode the centrosymmetric shape of objects in UAV imagery.

To facilitate the precise detection of vehicles from oblique UAV images with PBB, we further propose a corresponding UAV dataset called PARA for training deep detectors. We collected 1025 UAV images from complex urban scenarios captured by different sensors and platforms. The images in PARA contain vehicles of different appearances, scales, and orientations, with sizes ranging from 1280 × 1280 pixels to 4000 × 6000 pixels. All images in PARA are manually annotated by experts in image interpretation with a total of 117 122 PBBs and HBBs. In addition, PARA has some distinctive properties: 1) containing massive vehicle instances under oblique perspectives, which effectively supplements the lack of samples under the multiview observations; and 2) using a novel annotation method that yields a compact vehicle representation that accurately reflects the size and shape of vehicles. We evaluate several mainstream detection algorithms on PARA to illustrate the effectiveness of PBB and build a baseline for vehicle detection in oblique UAV images. The contributions of this work mainly lie in three aspects.

  1. We proposed a new PBB representation for compact and precise vehicle detection from oblique UAV images. PBB can better fit the shape and orientation of vehicle objects under large perspective deformation than the existing HBB and RBB representations.

  2. We built a UAV dataset (PARA) to facilitate the detection of vehicles with PBB representation, which contains a large number of images reflecting the oblique nature of UAV data in real-world conditions. In addition, we also divided vehicles into two categories, dynamic vehicles and static vehicles, as a complement to the existing vehicle datasets.

  3. We evaluated several mainstream object detection methods on PARA and proposed a baseline detector, which can provide references for subsequent research on vehicle detection.

SECTION II.

Related Work

A. Vehicle Detection Method

Vehicle detection is a fundamental subtask in object detection, which aims at locating and classifying vehicles in various types of scene images. With the rapid development of modern detectors, vehicle detection has made significant progress in recent years. To date, the most widely used detectors are deep learning-based ones, which can be divided into two-stage detectors and one-stage detectors. The two-stage detectors first generate candidate regions and then extract the regional features to regress the final bounding box for the target object. These detectors are typically built upon CNN-based models, like RCNN [34], Fast RCNN [35], and Faster RCNN [36]. To improve the detection of small targets, FPN [37] aggregates multiscale semantic information using a feature pyramid network, improving the robustness of the detectors. Cascade RCNN [38] utilizes a multistage network with different IoU thresholds to achieve more precise detection.

Unlike the two-stage detectors, the one-stage detectors treat object detection as a regression problem and use a single-stage network to directly predict the class and position of the target object. The representative one-stage detectors are YOLO series [39], [40], [41], [42], and SSD [43]. Other detectors, like DSSD [44] and FSSD [45], explore how to better fuse multiscale features to improve the detection accuracy of small targets. To solve the problem of extreme imbalance between foreground and background bounding boxes in object detection, RetinaNet [46] introduces focal loss to force the network to pay more attention to hard samples.

Recent advancements in remote sensing have made aerial images with wide coverage and a large number of ground objects widely available. Many studies are devoted to detecting objects with arbitrary orientations in aerial images [47], [48], [49], [50], [51], [52], [53], [54], [55]. In the early stages, many conventional detection methods were developed and modified to predict rotated objects in aerial images. For instance, Rotated Faster RCNN [47] and R2CNN [48] adjust the original Faster RCNN [36] to predict the rotated bounding box (RBB) of the aerial objects. However, limited by the original RPN network, which can only generate horizontal candidate regions, most detectors cannot achieve satisfactory performance in rotated object detection. To solve this problem, several works have been presented to modify the original RPN network [49], [50]. In this line of research, RRPN [18] generates prior rotated boxes of various sizes and aspect ratios onto the feature maps and then feeds prior boxes into the rotated RPN network to generate high-quality rotated candidate regions. ROI transformer [51] devises an ROI transformer module to transform horizontal ROIs into rotated ROIs, thus avoiding producing a large number of anchors and alleviating misalignment problems. Oriented RCNN [53] uses an oriented RPN network to directly generate rotation ROIs, which eliminates the accuracy loss incurred by transforming horizontal ROIs into oriented ROIs.

B. Aerial UAV Datasets

Over the past decades, several UAV datasets have been proposed and employed in various tasks such as object counting, detection, and tracking. Robicquet et al. [56] proposed the STANFORD CAMPUS, which collects image and video data from eight scenes at the Stanford campus. The dataset includes six common categories, such as pedestrians and cars. Zhu et al. [57] introduced the VISDRONE dataset, a high-resolution UAV dataset that comprises more than 200 video clips and 10 209 UAV images. This dataset provides rich auxiliary data such as bounding boxes, categories, and occlusion ratios. Bozcan et al. [58] proposed the first outdoor multimodal UAV dataset for object detection, i.e., AU-AIR, which contains target location and attribute information as well as UAV flight statistics. Du et al. [59] proposed a large-scale UAV image dataset named UAVDT for target detection and tracking. The dataset includes more than 80 000 key frames selected from a 10-hour video, consisting of roughly 2700 vehicles annotated with approximately 840 000 bounding boxes for detection and tracking. Lyu et al. [60] proposed UAVid, a high-resolution UAV dataset for semantic segmentation. UAVid is composed of 300 UAV images taken from 30 video sequences captured in urban areas, with annotations in eight categories.

To promote the development of small object detection, Akshatha et al. [61] proposed a large-scale UAV pedestrian detection dataset, namely, Manipal-UAV. This dataset includes 33 videos captured by UAVs at flight heights of 10–50 m. They selected 13 462 images and annotated 153 112 pedestrian targets. Mueller et al. [62] proposed UAV123, a low-altitude UAV dataset for target tracking. This dataset aims to identify different types of objects and serve applications such as target tracking and trajectory prediction. Wang et al. [63] proposed UAVBD, a low-altitude UAV dataset aimed at detecting abandoned plastic bottles in the wild. This dataset comprises 25 407 UAV images with different backgrounds and 34 791 rotated bounding boxes for bottles. Du et al. [64] proposed UA-DETRAC, a large-scale UAV image dataset for multiobject tracking. They manually annotated 8250 bounding boxes of vehicles in Beijing and Tianjin and provided auxiliary information such as location, illumination, and shooting angles.

C. Vehicle Detection Datasets

In the last decade, increasing attention has been paid to reflecting real-world conditions within datasets, with vehicles being a commonly studied object. In the early stage, datasets such as TAS [20] and OIRDS [21] facilitated the advancement of automated vehicle detection by employing vehicles as detection categories in satellite remote sensing images. However, the low image resolution of these datasets makes it difficult to accurately reflect real-world scenarios. With the development of sensor technology, vehicle detection datasets with high-resolution images such as UCAS-AOD [25], VEDAI (2015), and DLR-3K [22] were proposed. Nevertheless, their limited sample quantities hampered their practical application. Similarly, COWC [24] provides a large number of detection targets, 32 716 in total, but its low image resolution and center point-based annotation method impeded the applications of detection algorithms.

The advent of deep learning has resulted in a higher demand for large-scale datasets. Consequently, CARPK [26] was built as a large-scale aerial dataset for vehicle detection and counting. It was collected from a parking lot by a drone, with a total of 89 777 vehicle targets. The limitation of this dataset is that its scenes are too similar to reflect the complexity of the real world. The HighD [23] dataset used drones to capture orthographic images above German highways, ranging from 100 to several hundred meters in elevation. However, its utility is restricted by its simple background and its inability to generalize to complex scenes.

Recently, many datasets have been dedicated to reflecting complex scenes of the real world, which contain more complex background information and instances. For example, DOTA [47] is a large-scale dataset for object detection in aerial images, mainly containing 2806 aerial images captured from different sensors and platforms. DOTA provides the ability to evaluate object detection and rotated object detection in aerial images. Vehicles are considered a major detection category in this dataset, with a total of 43 462 objects. MOHR [12] collected 10 631 UAV images from suburban areas, including 12 602 trucks and 25 204 cars annotated for detection evaluation. VAID [69] collected 6000 aerial images under different lighting conditions in Taiwan. It classifies vehicles into seven categories: sedan, minibus, truck, pickup, bus, cement truck, and trailer. Currently, the largest dataset for vehicle detection in aerial images is EAGLE [67], which involves 8820 aerial images shot from aircraft under various weather, lighting, and humidity conditions. EAGLE contains a total of 215 986 detection targets, including 208 963 small vehicles and 7023 large vehicles.

Although the above datasets cover many real-world scenarios, few datasets pay attention to the influence of shooting angles on the shape and appearance of vehicles. They often use a single vertical view, ignoring the influences caused by the various oblique views. In contrast, the proposed PARA contains a large number of oblique view images and uses a novel object annotation, PBB, to compactly enclose the vehicles in the oblique UAV imagery.

SECTION III.

PARA Dataset

In this section, we will introduce the proposed PARA dataset, including the source of images, the selection of categories, and the specially designed annotation method. We also make a comprehensive comparison between PARA and other related benchmark datasets in vehicle detection, which is presented in Table I.

TABLE I. Comparison Between PARA and Other Vehicle Datasets

A. Image Collection

The PARA dataset aims to reflect complex urban scenarios with UAV images taken from various views, and thus enhance the generalization ability of current detection methods. To this end, we collected 1025 UAV images from a variety of urban scenes, including urban roads, parking lots, crossings, buildings, and highways. For clarity, some original images in PARA are shown in Fig. 2. All the images are captured by different camera-equipped drones, such as the DJI Air 3, under different illumination, resolution, and background conditions to increase the diversity of PARA. Moreover, to avoid a single vertical downward viewing angle, we ensure that the drones collect images at different flight heights and observation angles. The varying flight heights and views allow our dataset to cover a wide range of real-world scenes, with vehicles differing in several aspects.

Fig. 2. Examples of different urban scenarios in PARA. The first row is the residential areas; the second row is the main roads; the third row is the crossroads; and the fourth row shows the parking lots.

B. Category Selection

In the PARA dataset, we divide vehicles into two categories, i.e., static vehicles and dynamic vehicles, to make our dataset flexible enough to serve different applications. This is also a supplement to the existing datasets for vehicle detection. Existing vehicle datasets often choose several common categories (e.g., large vehicles, small vehicles, etc.) based on the size of vehicles. Such categories can meet the needs of basic applications such as vehicle counting and object detection. However, they struggle to serve more specific applications. For example, traffic monitoring and management are highly complex tasks due to the drastic increase in the number of vehicles, and they need to determine whether vehicles are moving. The limited categories of existing vehicle datasets make it hard for them to solve traffic-related problems, which are common in modern urban applications. Therefore, we divide the annotated vehicles in the PARA dataset into static vehicles and dynamic vehicles. They are labeled according to whether the vehicle is moving, as judged by experts in UAV image interpretation. Moreover, to ensure the diversity of categories in the PARA dataset, we also include pedestrians as a category, which plays an important role in exploring the real world.

In Fig. 3(a), we show the quantity distribution of objects in PARA, with a total of 58 561 annotated instances. The instances in PARA include 23 118 static vehicles, 21 647 dynamic vehicles, and 13 796 pedestrians, and each instance has two bounding boxes, HBB and PBB. The number of each class of objects is shown in Fig. 3(b). It can be seen that static vehicles and dynamic vehicles constitute the majority of the samples in the dataset and their distribution is relatively balanced.

Fig. 3. Distribution of different object categories in PARA. (a) Proportion of each category in PARA. (b) Quantity of each category in PARA.

C. Data Processing

Data processing is important for building a highly usable dataset. Before annotation, we first manually discard PARA images with poor quality, such as blurred or corrupted images. Then, to ensure that the images cover vehicles of various scales and aspect ratios, UAV images captured at different flight heights are uniformly selected. At the same time, orientation information is also considered: the selected images are kept as diverse in vehicle orientation as possible.

Moreover, we have developed a new annotation tool based on labelme to outline the parallelogramlike bounding boxes conveniently. When annotating, the annotator only needs to manually mark three points on the outline of a single object. Then, our tool automatically generates the complete bounding box and computes the orientation angle of the annotated box. Therefore, we provide not only the coordinates of the original vertices but also the orientation information of all PARA objects. The complete annotation of a single object contains the coordinates of three adjacent corners as well as the orientation in degrees, which ranges between 0 and 360 and indicates the angle of the object head with respect to the trigonometric circle.
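To make the annotation procedure concrete, here is a minimal Python sketch (our own illustration, not the authors' labelme-based tool) that completes a parallelogram from three annotated corners and derives an orientation angle. The corner ordering and the convention that the first marked edge points toward the vehicle head are assumptions for illustration, as are the example coordinates.

    import math

    def complete_pbb(p1, p2, p3):
        """Complete a parallelogram bounding box from three annotated corners.

        Assumed convention: p1 -> p2 runs along one side of the vehicle and p3 is
        the corner adjacent to p1 on the opposite side. The fourth vertex then
        follows the constraint x4 = x3 + (x2 - x1), y4 = y3 + (y2 - y1).
        """
        (x1, y1), (x2, y2), (x3, y3) = p1, p2, p3
        p4 = (x3 + (x2 - x1), y3 + (y2 - y1))
        return [p1, p2, p4, p3]  # vertices in cyclic order

    def orientation_deg(p1, p2):
        """Orientation of the p1 -> p2 edge in degrees, mapped to [0, 360)."""
        dx, dy = p2[0] - p1[0], p2[1] - p1[1]
        return math.degrees(math.atan2(dy, dx)) % 360.0

    # Hypothetical corners clicked by an annotator
    corners = complete_pbb((100, 200), (180, 230), (110, 250))
    angle = orientation_deg((100, 200), (180, 230))
    print(corners, round(angle, 1))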

D. Annotation Method

In computer vision, the annotation method determines the representation of the instances and the parameters that the detectors need to learn. Compact annotated bounding boxes help separate densely crowded objects and provide accurate semantic information to detectors. The HBB is widely used in different vision tasks and can be denoted by (xc, yc, w, h), where (xc, yc) and (w, h) are the center location and the size of a bounding box, respectively. Although objects in natural images can be well represented by HBB, the majority of instances with arbitrary orientations in aerial images cannot be compactly outlined by this method. In order to solve this problem, the RBB was proposed [48] for precisely locating rotated objects in aerial images. RBB adds an additional parameter to denote the orientation of the bounding box, which can effectively separate packed objects and reduce the background noise of the annotation in orthographic aerial images.

While RBB can address the problem of detecting crowded objects in orthographic aerial images, this method struggles to enclose rotated objects with large geometric distortions in oblique UAV images. Objects in oblique UAV images often suffer from large perspective deformation compared with objects in natural images and orthographic aerial images. The rectangular vehicles are usually unable to remain axis-aligned under the oblique views. As a result, HBB and RBB are prone to failing to represent the accurate shape of vehicles due to the limitation of the right angle. Considering the complex backgrounds in urban scenarios and the geometric distortion of vehicles, we develop a more flexible annotation method called PBB to accurately represent the vehicles in oblique UAV images. PBB is a simple and effective object representation, denoted by {(xi, yi) | i = 1, 2, 3}, where (xi, yi) represents the i-th vertex of the PBB. The fourth vertex of the PBB satisfies the following constraint: x4 = x3 + (x2 - x1); y4 = y3 + (y2 - y1). When annotating, we keep the long side and short side of PBBs aligned with the length and width of vehicles, respectively. With no restriction on the right angle, PBB can effectively encode the shape of objects with distortions caused by linear perspective in UAV imagery. To better illustrate the effectiveness of PBB, we show the different annotation methods applied to crowded vehicles in oblique UAV images in Fig. 4. Both HBB and RBB introduce more background noise into the bounding box compared with PBB. In contrast, PBB can effectively reduce the background noise and the overlap between crowded bounding boxes. In addition, to meet the needs of different research, we also provide manually annotated HBBs of objects in PARA.
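To make the background-noise argument concrete, the short sketch below (our own illustration, with hypothetical coordinates) compares the area of a PBB with the area of its axis-aligned HBB; the difference is exactly the extra background an HBB annotation would include for a sheared vehicle footprint.

    import numpy as np

    def polygon_area(pts):
        """Shoelace formula for the area of a simple polygon given as (N, 2) vertices."""
        x, y = np.asarray(pts, dtype=float).T
        return 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))

    def hbb_of(pts):
        """Axis-aligned bounding box (xmin, ymin, xmax, ymax) of a set of vertices."""
        pts = np.asarray(pts, dtype=float)
        return pts[:, 0].min(), pts[:, 1].min(), pts[:, 0].max(), pts[:, 1].max()

    # Hypothetical sheared vehicle footprint in an oblique view (cyclic vertex order)
    pbb = [(100, 200), (180, 230), (190, 280), (110, 250)]
    xmin, ymin, xmax, ymax = hbb_of(pbb)
    hbb_area = (xmax - xmin) * (ymax - ymin)
    pbb_area = polygon_area(pbb)
    print(f"background fraction inside the HBB: {1 - pbb_area / hbb_area:.2f}")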

Fig. 4. Comparison of HBB, RBB, and PBB. (a)–(c) Differently annotated vehicles in crowded scenes. PBB can more compactly enclose the vehicles than HBB and RBB, which effectively reduces the noise from the background and the overlap between bounding boxes.

SECTION IV.

Properties of PARA

This section illustrates the main characteristics of PARA, including large-scale, high-resolution images, various views, differently scaled instances, and so on. Fig. 5 displays the statistical information of PARA in detail.

Fig. 5. Image information and object statistics for instances in PARA. AR denotes the aspect ratio. Other images refer to those in the dataset with sizes ranging from 1000 to 3000 pixels. (a) Statistics of image resolution. (b) Distribution of vehicle orientation. (c) Distribution of vehicle length. (d) Density distribution of vehicles in each image. (e) Distribution of AR for HBBs. (f) Distribution of AR for RBBs.

A. Large Scale

PARA consists of 1025 UAV images and 117 122 manually annotated bounding boxes, covering several common vehicle categories. The majority of original images in PARA have sizes of 3956 × 5280 pixels, 4000 × 6000 pixels, and 3648 × 5472 pixels, whereas in natural image datasets the image sizes rarely exceed 1000 × 1000 pixels (e.g., COCO [32] and PASCAL VOC [33]). In Fig. 5(a), we display the different resolutions of images in PARA. The high-resolution PARA images ensure a realistic representation of natural scenarios.

B. Various Orientations of Instances

The orientation is an important attribute of instances in object detection from UAV images. The orientation of instances not only represents the relative relationship between objects in the real world, but also has a significant impact on rotation-invariant feature extraction. PARA provides abundant vehicle orientation information for detectors. As shown in Fig. 5(b), the orientation angles of PARA vehicles are fully distributed between 0 and 360 degrees.

C. Multiscale Instances

Different UAV altitudes result in different instance sizes. Because the actual sizes of vehicles in the real world do not differ significantly, we varied the flight height of the UAVs between 50 and 350 m during the collection process. The varying flight heights ensure that PARA captures vehicles of different sizes in natural scenes, which is helpful for training a robust detector. Fig. 5(c) illustrates the notable size differences of objects in PARA.

D. Various Density Distribution of Instances

PARA is designed specifically for vehicle detection in urban areas. We include several typical natural scenes in modern cities, such as highways, parking lots, intersections, and residential areas. Different urban scenes have different background information, and the vehicles they contain exhibit varying density distributions. As shown in Fig. 5(d), a single image in PARA may contain only a few vehicles or more than 200 vehicles, making PARA highly challenging.

E. Various Aspect Ratios of Instances

The aspect ratio (AR) is an important attribute of the dataset, which provides essential information about the shape and size of instances. In anchor-based detection algorithms, AR serves as an auxiliary factor that affects model design and algorithm effectiveness. For example, YOLOv3 [42] employs the k-means algorithm [70] to cluster the initial anchor sizes and ratios. We calculate two kinds of AR for all objects in PARA to provide a reference for subsequent research. Fig. 5(e) and (f) illustrate the aspect ratios of manually annotated RBBs and PBBs in PARA.
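As a rough illustration of how such aspect-ratio statistics and anchor priors could be derived from PARA-style annotations (the box values below are hypothetical and this is not the authors' code), one could cluster box sizes in the YOLOv3 style with k-means:

    import numpy as np
    from sklearn.cluster import KMeans

    # Hypothetical HBB annotations as (width, height) in pixels
    boxes_wh = np.array([[42, 18], [60, 25], [35, 15], [90, 40], [28, 12],
                         [75, 33], [50, 22], [110, 48], [38, 16], [64, 27]], dtype=float)

    # Aspect ratios, e.g. to summarize a distribution like Fig. 5(e)-(f)
    aspect_ratios = boxes_wh[:, 0] / boxes_wh[:, 1]
    print("mean AR:", round(float(aspect_ratios.mean()), 2))

    # k-means over (w, h), as YOLOv3-style anchor priors are typically derived
    km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(boxes_wh)
    print("anchor priors (w, h):", km.cluster_centers_.round(1))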

SECTION V.

Experiments

In this section, we evaluate the mainstream detectors on PARA by detecting objects with HBB and PBB, respectively. In the following, we will introduce the experimental setup, baselines of different detection tasks, experimental results, and analysis in detail.

A. Experimental Setup

In our experiment, we randomly split the samples in PARA into three parts: a training set of 511 images, a validation set of 168 images, and a testing set of 346 images. As the original PARA images are too large to be fed into existing detectors for training, we crop them into patches of 1024 × 1024 pixels, with 50% overlap between neighboring patches. Finally, we get 32 903, 11 433, and 21 809 patches for training, validation, and testing, respectively. To make a fair comparison between the baseline detectors, all models are implemented with the open-source MMDetection [71] and trained on a single GeForce RTX 3090Ti GPU. We select and evaluate Faster RCNN [36], DAB-DETR [37], Cascade RCNN [38], RTMDet [72], YOLOv3 [42], SSD [43], EfficientNet [73], RetinaNet [46], Deformable DETR [74], and FCOS [75] with ResNet50, EfficientNet-B3, VGG16, CSPNeXt, and DarkNet53 backbones as the baseline detectors. Specifically, we modify the original Faster RCNN [36] and RetinaNet [46] to detect vehicles with PBBs denoted by {(xi, yi), i = 1, 2, 3}. All the model settings are kept the same as the default setups in MMDetection [71].
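The cropping step described above can be sketched as follows; this is an illustrative reimplementation rather than the authors' preprocessing code, and the border-handling convention (shifting edge patches inward so every patch is full-sized) is our own assumption.

    import numpy as np

    def crop_patches(image, patch=1024, overlap=0.5):
        """Tile an (H, W, C) image into square patches with the given overlap.
        Patches at the right/bottom border are shifted inward so every patch
        is full-sized (one simple convention; the paper does not specify)."""
        h, w = image.shape[:2]
        stride = int(patch * (1 - overlap))  # 512 px for 50% overlap
        ys = list(range(0, max(h - patch, 0) + 1, stride))
        xs = list(range(0, max(w - patch, 0) + 1, stride))
        if h > patch and ys[-1] != h - patch:
            ys.append(h - patch)
        if w > patch and xs[-1] != w - patch:
            xs.append(w - patch)
        return [(y, x, image[y:y + patch, x:x + patch]) for y in ys for x in xs]

    # Example: a blank image with the size of a typical PARA photo
    patches = crop_patches(np.zeros((4000, 6000, 3), dtype=np.uint8))
    print(len(patches), "patches")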

B. Evaluation Metric

For the evaluation metric, we adopt mAP, the mainstream protocol in the field of object detection, to evaluate the performance of the selected baseline detectors. The mAP is the mean of the average precision (AP), which is calculated as the area under the precision-recall (PR) curve. The PR curve is traced out by the precision and recall scores obtained at different detection confidence thresholds. The calculation of AP is as follows:

Precision = TP / (TP + FP)    (1)

Recall = TP / (TP + FN)    (2)

AP = ∫₀¹ P(R) dR    (3)

where TP (true positive) refers to the number of correctly predicted bounding boxes, i.e., with an IoU score higher than the IoU threshold. FP (false positive) and FN (false negative) are the number of bounding boxes that are predicted incorrectly and not detected, respectively. The IoU threshold is often set to 0.5 and 0.75, and the corresponding mAP is denoted as mAP50 and mAP75. In addition to the commonly used mAP50 proposed in PASCAL VOC [33], we also choose mAP75 as the evaluation metric to judge whether the bounding boxes compactly enclose the targets.
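As an illustration of how these quantities translate into an mAP-style score, the following numpy sketch computes AP at a single IoU threshold from ranked detections. The all-point interpolation of the PR curve is our assumption, since the paper only states the integral definition, and the toy inputs are hypothetical.

    import numpy as np

    def average_precision(scores, is_tp, num_gt):
        """AP at one IoU threshold from ranked detections.

        scores : confidence of each detection
        is_tp  : 1 if the detection matches an unmatched ground truth with IoU
                 above the threshold (0.5 for mAP50, 0.75 for mAP75), else 0
        num_gt : number of ground-truth boxes of this class
        """
        order = np.argsort(-np.asarray(scores))
        hits = np.asarray(is_tp, dtype=float)[order]
        tp = np.cumsum(hits)
        fp = np.cumsum(1.0 - hits)
        recall = tp / max(num_gt, 1)
        precision = tp / np.maximum(tp + fp, 1e-9)
        # Area under the precision-recall curve (all-point interpolation)
        prec_env = np.maximum.accumulate(precision[::-1])[::-1]
        recall = np.concatenate(([0.0], recall))
        prec_env = np.concatenate(([prec_env[0]], prec_env))
        return float(np.sum(np.diff(recall) * prec_env[1:]))

    # Toy example: 5 detections over 4 ground-truth boxes
    print(average_precision([0.9, 0.8, 0.7, 0.6, 0.5], [1, 1, 0, 1, 0], num_gt=4))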

C. HBB Baseline

Most datasets for rotated object detection in aerial imagery (such as DOTA [47] and EAGLE [67]) generate HBB ground truths of instances by calculating the HBBs of RBBs. We find that the HBBs derived from PBBs are larger than the actual instances in oblique UAV images due to geometric distortions. Therefore, we obtain accurate HBBs of vehicles in PARA by manual annotation.

We train the baseline models with their default hyperparameters and strategies to ensure a fair comparison. Table II displays the results of HBB prediction. Cascade RCNN [38] outperforms all the other detectors with an mAP of 79.79%, owing to its effective training strategy based on a multistage network. The other detectors show impressive performance on our dataset with mAP50 over 77%, except for EfficientNet [73]. We suspect that this can be attributed to the default backbone EfficientNet-B3 in MMDetection [71], which is sensitive to the size of the input images. It is worth mentioning that the one-stage detectors, such as YOLOv3 [42] and SSD [43], spend more time on training in comparison with the two-stage algorithms, which may be why they can achieve comparable performance to the two-stage detectors. Compared with mAP50, the mAP scores of all the detectors decrease by over 10% under the stricter mAP75 metric. Benefiting from its anchor-free strategy, FCOS [75] achieves the best performance with an mAP score of 68.66%. In the table, EfficientNet [73] again shows poor performance compared with the other detectors. We find that the decrease in mAP under the mAP75 metric is mainly caused by the difficulty of detecting pedestrians, whose sizes are too small to be accurately located. Moreover, all the detectors achieve better performance in detecting dynamic vehicles with clear backgrounds (like roads) than static vehicles with complex backgrounds.

TABLE II. Benchmark of the State-of-the-Art on the HBB Under mAP50 and mAP75 Metrics

D. PBB Baseline

Most mainstream detectors are designed for objects represented by HBB or RBB, so it is not feasible to directly apply them as a benchmark for the PBB-based detection task. Thus, we choose and modify Faster RCNN [36] and RetinaNet [46] as the baselines for PBB detection due to their efficiency.

We modified the region proposal network (RPN) and the head of the convolutional neural network (CNN) in the original Faster RCNN. Region proposals generated by the modified RPN, denoted by (rxc, ryc, rw, rh), are matched to the HBBs of the PBB ground truths. Then, each region proposal is fed into the CNN head and assigned a single PBB ground truth represented by (gxc, gyc, gw, gh, gt, gl). In detail, (gxc, gyc, gw, gh) denotes the external bounding box of the PBB and (gt, gl) represents the relative offsets of the top vertex and the left vertex of the PBB from the top-left vertex of the external bounding box. Finally, the 6-D target vector produced by the head of the modified CNN can be written as P = {pxc, pyc, pw, ph, pt, pl}, where

pxc = (gxc - rxc) / rw,    pyc = (gyc - ryc) / rh    (4)

pw = log(gw / rw),    ph = log(gh / rh)    (5)

pt = gt / rw,    pl = gl / rh    (6)

Similarly, we modify the original RetinaNet to regress the offsets of parallelograms to their corresponding external bounding boxes. To make a comprehensive evaluation of our modified PBB-based baselines, we also train the other mainstream detectors based on HBB annotations and evaluate the predicted results with the PBB ground truths for convenience in the PBB task.
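Writing the target encoding out explicitly may help: the sketch below computes the 6-D regression targets of (4)-(6) for one proposal and one PBB ground truth. The variable names and example values are ours for readability; this is not the authors' released code.

    import math

    def encode_pbb_targets(proposal, gt):
        """Compute the 6-D regression targets of equations (4)-(6).

        proposal : (rxc, ryc, rw, rh), a horizontal region proposal
        gt       : (gxc, gyc, gw, gh, gt_off, gl_off), where the first four values
                   describe the external HBB of the ground-truth PBB and
                   (gt_off, gl_off) are the offsets of the top and left PBB
                   vertices from the top-left corner of that external HBB
        """
        rxc, ryc, rw, rh = proposal
        gxc, gyc, gw, gh, gt_off, gl_off = gt
        pxc = (gxc - rxc) / rw
        pyc = (gyc - ryc) / rh
        pw = math.log(gw / rw)
        ph = math.log(gh / rh)
        pt = gt_off / rw
        pl = gl_off / rh
        return pxc, pyc, pw, ph, pt, pl

    # Hypothetical proposal and ground truth
    print(encode_pbb_targets((100, 100, 64, 32), (104, 98, 70, 30, 12, 7)))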

Table III displays the results of PBB-based vehicle detection. The two PBB baselines outperform the other original state-of-the-art detectors trained with HBB. The improvement is particularly notable for the mAP75 metric, showing an increase of approximately 16% in mAP score. The results of the PBB-based detection show significant differences between the PBB and HBB representations for vehicles with large geometric distortion. We also observe that the improvement in mAP score is mainly attributed to the categories of static vehicle and dynamic vehicle, with increases of around 30% and 21% in mAP, respectively. In comparison, the detection accuracy of pedestrians does not demonstrate a substantial increase due to their small size in UAV images. Overall, our findings indicate that for objects with large deformation in oblique UAV images, the PBB representation is superior to HBB for precise and compact detection of vehicles.

TABLE III. Benchmark of the State-of-the-Art on the PBB Under mAP50 and mAP75 Metrics
SECTION VI.

Discussion and Analysis

In this section, we will present some interesting discussions and analysis of our experimental results. When comparing the HBB detection results in Table II with the PBB detection results in Table III, we observe that the detectors trained with HBB ground truths achieve similar performance in both HBB and PBB detection tasks under the metric of mAP50. To explain the reason behind this phenomenon, we visualize the prediction results of the different tasks in Fig. 6(a) and (b). The IoU scores beside the predicted boxes in Fig. 6 can be used to evaluate the accuracy of the predictions, with higher values indicating better overlap between the predicted and ground-truth bounding boxes. We find that the predicted PBBs in Fig. 6(a) compactly fit the ground truths, while the predicted HBBs enclose large background regions. However, the predicted HBBs are still regarded as TP samples under the mAP50 metric despite poor IoU scores of 0.5 or 0.6. The results indicate that the mAP50 metric has limitations and cannot accurately reflect how well the predicted boxes match the ground truths in oblique UAV images. A more stringent evaluation metric like mAP75 should be adopted for object detection in UAV images, where objects often suffer from large distortion. Moreover, this finding also reveals that the same detectors can achieve completely different accuracy under the mAP75 metric, with many low-quality HBB boxes being filtered out. In contrast, predicted PBB boxes match the ground truths well, with IoU scores converging in the range of 0.85–0.95. They not only fit the targets closely but also significantly reduce overlap with one another.
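The IoU scores in this comparison are polygon IoUs against PBB ground truths rather than axis-aligned IoUs. A small sketch using shapely (our own illustration with hypothetical boxes) shows how such a score can be computed and why a loose HBB prediction can hover near the 0.5 threshold:

    from shapely.geometry import Polygon

    def polygon_iou(box_a, box_b):
        """IoU between two boxes given as lists of (x, y) vertices in cyclic order."""
        pa, pb = Polygon(box_a), Polygon(box_b)
        inter = pa.intersection(pb).area
        union = pa.union(pb).area
        return inter / union if union > 0 else 0.0

    # Hypothetical PBB ground truth and an axis-aligned (HBB) prediction
    gt_pbb = [(100, 200), (180, 230), (190, 280), (110, 250)]
    pred_hbb = [(100, 200), (190, 200), (190, 280), (100, 280)]
    # Roughly 0.5: counted as a TP under mAP50 despite enclosing much background
    print(round(polygon_iou(gt_pbb, pred_hbb), 2))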

Fig. 6. Comparison of the IoU scores between HBB and PBB predicted boxes with PBB ground truths. The results show that the IoU of predicted HBB boxes is relatively low, while the predicted PBB achieves much better performance.

To quantitatively evaluate how well the HBB and PBB predicted boxes match the PBB ground truths, we collect over 9000 predicted boxes from the different tasks on the validation set and visualize the distribution of their IoU scores with the ground truths in Fig. 7. We consider a predicted box with an IoU score above 0.8 as a positive sample. In Fig. 7(a), we observe that nearly half of the HBB predicted boxes have a high IoU score over 0.8, with the IoU scores of the remaining boxes distributed in the range of 0.4–0.8. In comparison, the majority of PBB predicted boxes have IoU scores that converge in the range of 0.8–1.0, and the proportion of low-quality predicted boxes is smaller than for the HBB predicted boxes. In Fig. 7(b), we use box plots to depict the distribution of IoU scores between the PBB/HBB predicted boxes and the ground truths. The IoU score distribution of PBB is clustered, with a minimum around the threshold of 0.5. In comparison, the IoU distribution of HBB is scattered, showing poor performance in matching the ground truths. Therefore, PBB is more capable of accurately representing rotated vehicles in oblique UAV images.

Fig. 7. Quantitative comparison between the HBB and PBB predicted boxes. (a) IoU distributions of the HBB and PBB predictions evaluated by PBB ground truths. (b) Boxplot of the IoU distribution of PBB and HBB boxes.

In Fig. 8, we select different urban scenes and compare the detection results obtained with PBB and HBB. We observe that HBB detectors classify several flower beds and trees as vehicles, because the excessive background noise contained in the bounding boxes disturbs network learning. For densely arranged vehicles, localization with PBB is clearly more accurate than with HBB. HBB detectors tend to suppress densely packed detected boxes through postprocessing operations like NMS. PBB addresses this problem well by compactly enclosing crowded vehicles, resulting in better performance in crowded vehicle detection. Moreover, the loose representation of HBB causes predicted boxes to overlap with each other, while PBB correctly reflects the actual orientation and size of vehicles in oblique UAV images. In terms of categories, we find that the detection accuracy of static vehicles is slightly lower than that of dynamic ones, because static vehicles are often occluded by surrounding objects, whereas dynamic vehicles with clear backgrounds are relatively easy for detectors to detect.

Fig. 8. Visualization results of testing on PARA using well-trained original and modified Faster RCNN. (a) and (b), respectively, illustrate the predicted HBB and PBB boxes in different urban scenarios.

SECTION VII.

Conclusion

We build a large-scale dataset for vehicle detection in UAV images, namely PARA, which features a novel representation of vehicle objects under oblique UAV views. Compared with traditional annotation methods, the proposed PBB can compactly enclose the targets and provide accurate semantic information to detectors. In addition to the PBB representation, we collect a large number of high-resolution images captured in complex urban environments and manually annotate many rotated vehicles with different bounding boxes. We also evaluate the performance of several mainstream object detectors on PARA to establish a benchmark for precise vehicle detection in urban scenarios. Experimental results demonstrate that it remains challenging for detectors to accurately detect vehicles with significant deformations in complex urban scenarios.

High-resolution oblique UAV images are now easily accessible with inexpensive drones and provide rich information for many practical urban applications such as traffic monitoring, vehicle management, and urban planning. However, objects in oblique UAV images often suffer from large perspective deformation, posing a significant challenge for detection and analysis. We believe that the findings with the PARA dataset for compact vehicle detection can not only benefit the handling of urban issues but also attract more attention to object detection in oblique UAV images.
