Summary
- The paper presents a novel computer vision-based method for enhancing UAV navigation by enabling accurate height and location estimation without relying on GPS.
- The approach uses a deep neural network that takes a pair of images from a UAV camera to estimate the UAV's height.
- It employs a pyramid stereo-matching network to extract image features and generate a disparity map, which is then processed to achieve precise height estimation.
- The researchers collected a custom dataset using a Phantom 4 Pro drone, capturing images at 10 different heights from 100-280m over a university campus.
- The proposed network (PSMHENet) outperformed existing methods, achieving an average error of 4.4m on the validation set and 4.6m on the test set.
- Key challenges addressed included illumination differences, lack of prominent features in some images, and insufficient image overlap.
- The method enables 3D localization of UAVs by combining the height estimation with 2D localization using image matching against an orthomosaic.
- Potential applications include enabling successful UAV flights in GPS-denied or challenging environments.
- Future work could focus on improving feature extraction, using more diverse datasets, and further enhancing accuracy.
The paper demonstrates a promising vision-based approach for UAV height estimation and 3D localization that reduces reliance on GPS for navigation.
Architecture
- Upper Part (Height Estimation):
1. Input:
- A pair of overlapping images from the UAV camera serves as the input.
2. Feature Extraction:
- Both images are processed through a shared-weight convolutional neural network.
- This network extracts relevant features from both images.
3. Cost Volume Calculation:
- The extracted features are used to create a 4D cost volume.
- This volume represents the matching costs between features in the two images across different disparities.
4. 3D Convolutions:
- The cost volume is processed using 3D convolutional layers.
- These layers help in learning complex patterns in the disparity space.
5. 2D Convolutions:
- Further processing is done using 2D convolutional layers.
- This helps in refining the disparity estimates.
6. Fully Connected Layers:
- The processed features are fed into fully connected layers.
- These layers produce the final height prediction.
- Lower Part (2D Localization):
1. Input:
- One image from the stereo pair, together with a pre-stored orthomosaic of the area, serves as the input.
2. Template Matching:
- The input image is matched against the orthomosaic.
- This process determines the 2D position of the UAV within the known area.
How it works:
1. Height Estimation:
- The stereo images are used to estimate the height of the UAV.
- By comparing the disparity between matching features in the two images, the system can calculate the distance from the ground.
- The neural network learns to refine this estimation based on the patterns it observes in the training data.
2. 2D Localization:
- By matching the current view with the pre-stored orthomosaic, the system determines where the UAV is positioned horizontally.
3. 3D Localization:
- The height estimation from the upper part is combined with the 2D position from the lower part.
- This combination provides a complete 3D localization of the UAV.
The key innovation here is the use of deep learning techniques to improve the accuracy of height estimation, which is then combined with traditional image matching for horizontal positioning. This approach allows for accurate 3D localization without relying on GPS, which can be crucial in GPS-denied environments or when GPS signals are unreliable.
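To make the disparity-to-height step concrete, the standard pinhole stereo relation Z = f·B/d links height to focal length, camera baseline, and disparity. The network described in the paper learns this mapping from data rather than applying the formula explicitly, so the sketch below, with made-up focal length, baseline, and disparity values, is only a conceptual reference.

```python
def height_from_disparity(focal_px: float, baseline_m: float, disparity_px: float) -> float:
    """Pinhole stereo relation Z = f * B / d (conceptual only; PSMHENet regresses
    height directly from learned features instead of applying this formula)."""
    return focal_px * baseline_m / disparity_px

# Hypothetical values: 1000 px focal length (after downscaling), 40 m baseline between
# consecutive shots, 200 px disparity for a matched ground feature -> 200 m height.
print(height_from_disparity(1000.0, 40.0, 200.0))  # 200.0
```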
Authors
1. Mansoor Khurshid
- Affiliation: School of Electrical Engineering and Computer Science (SEECS), National University of Sciences and Technology (NUST), Islamabad, Pakistan
- Education: Received a master's degree in computer science from NUST in 2023
- Current role: Project Manager responsible for managing IT and AI projects
- Research interests: Computer vision and deep learning
2. Muhammad Shahzad (Senior Member, IEEE)
- Affiliations:
- School of Electrical Engineering and Computer Science (SEECS), NUST, Islamabad, Pakistan
- Chair of Data Science in Earth Observation, Technical University of Munich (TUM), Germany
- Education:
- B.E. in electrical engineering from NUST
- M.Sc. in autonomous systems (robotics) from Bonn Rhein Sieg University of Applied Sciences, Germany
- Ph.D. in radar remote sensing and image analysis from TUM
- Prior work: Guest Scientist at the Institute for Computer Graphics and Vision, Technical University of Graz, Austria (2015-2016)
- Research interests: Applications of deep learning in remote sensing and computer vision, processing 3D point clouds, optical RGBD images, and high-resolution radar data
3. Hasan Ali Khattak (Senior Member, IEEE)
- Affiliations:
- School of Electrical Engineering and Computer Science (SEECS), NUST, Islamabad, Pakistan
- Faculty of Computing and Information Technology, Sohar University, Oman
- Education: Ph.D. in electrical and computer engineering from Politecnico di Bari, Italy (2015)
- Research interests: Future internet architectures, web of things, data sciences and social engineering for smart cities, machine learning applications in healthcare and transportation
4. Muhammad Imran Malik
- Affiliation: School of Electrical Engineering and Computer Science (SEECS), NUST, Islamabad, Pakistan
- Education:
- Master's and Ph.D. in artificial intelligence from the University of Kaiserslautern, Germany
- Prior work: German Research Center for Artificial Intelligence GmbH (DFKI), Kaiserslautern
- Current role: Associate Professor and Head of the Computer Science Department at NUST
- Research interests: Machine learning and pattern recognition, image analysis and understanding
5. Muhammad Moazam Fraz (Senior Member, IEEE)
- Affiliation: School of Electrical Engineering and Computer Science (SEECS), NUST, Islamabad, Pakistan
- Education:
- B.S. in software engineering from Foundation University, Islamabad
- M.S. in software engineering from NUST
- Ph.D. in computer science from Kingston University London, UK
- Prior work:
- Software Development Engineer at Elixir Technologies Corporation
- Postdoc Research Fellow at Kingston University in collaboration with St George's University of London and UK BioBank
- Current roles:
- Full Professor at NUST
- Rutherford Visiting Fellow at The Alan Turing Institute, UK
- Research interests: Applications of machine learning/computer vision techniques for diagnostic retinal image analysis
The authors have a strong background in computer science, electrical engineering, and artificial intelligence, with a focus on applications in computer vision, remote sensing, and image analysis. Their institutional affiliations span multiple countries, indicating international collaboration and expertise.
Data Bases and Artifacts
1. Custom Dataset:
- Collected using a Phantom 4 Pro drone
- Location: NUST Main Campus, H-12 Islamabad, Pakistan
- 10 different flight heights: 100m to 280m, in 20m increments
- Initially 1781 images collected
2. Data Augmentation:
- Original dataset was augmented using flipping and rotation techniques
- Augmented dataset details are presented in Table II (specific numbers not provided in the summary)
3. Dataset Split:
- Training Set: Data from 8 heights (100, 120, 140, 160, 200, 240, 260, and 280 m)
- Testing Set: Data from 2 heights (180 and 220 m)
4. Data Preprocessing:
- Images scaled from original 5472 x 3648 resolution to 512 x 256
- Contrast Limited Adaptive Histogram Equalization (CLAHE) applied to address illumination variations
- Removal of image pairs with less than 70% overlap
- Elimination of image pairs lacking prominent features
5. Hardware:
- Training conducted on a Linux machine with a 22-GB GPU
The paper does not mention the use of any pre-existing databases. Instead, it emphasizes the creation and use of a custom dataset specifically collected for this research. The authors note that this custom dataset was made publicly available for the benefit of the research community, though specific details on how to access it are not provided in the summary.
The PSMHENet Network
PSMHENet:
- Based on the Pyramid Stereo-Matching Network (PSMNet)
- Tailored specifically for UAV height estimation
- Uses a pair of images to extract features, calculate disparity, and estimate height
- Incorporates spatial pyramid pooling (SPP) and 3D convolutions for better context understanding
- Combines height estimation with 2D localization for complete 3D positioning
Alternatives Evaluated:
1. PSMNet (original):
- Base network used for comparison
- Designed for stereo matching and disparity estimation
2. Siamese-based image-matching technique:
- Uses a patch-based approach for image matching
- Calculates disparity between matched features for height estimation
3. Vision-based 3D localization technique by Yol et al.:
- Uses mutual information (MI) for robust matching
- Designed to handle local and global scene variations
Why PSMHENet performed better:
1. Full Image Context: Unlike the Siamese network's patch-based approach, PSMHENet considers the entire image, providing a more comprehensive understanding of the scene.
2. Spatial Pyramid Pooling: This allows the network to capture multi-scale information, improving feature extraction and matching.
3. 3D Convolutions: These help in learning complex patterns in the disparity space, leading to more accurate height estimations.
4. Tailored for UAV Imagery: The network was specifically designed and trained on UAV imagery, making it more suitable for this application compared to general-purpose stereo matching networks.
5. End-to-end Training: The network is trained in an end-to-end manner, optimizing all components together for the specific task of height estimation.
6. Illumination Invariance: The use of CLAHE preprocessing helped in handling illumination variations in the dataset.
7. Data Cleaning: The authors implemented a data cleaning process to remove problematic images, improving overall performance.
Performance Comparison:
- PSMHENet achieved an average error of 4.4m on the validation set and 4.6m on the test set.
- This was a significant improvement over the initial results (10m average error) and outperformed the alternatives.
In summary, PSMHENet's superior performance can be attributed to its tailored design for UAV height estimation, consideration of full image context, robust feature extraction and matching techniques, and careful data preprocessing and cleaning. These factors combined to create a more accurate and reliable system for UAV 3D localization without relying on GPS.
Vision-Based 3-D Localization of UAV Using Deep Image Matching
Abstract:
Unmanned aerial vehicles (UAVs) have revolutionized various industries by providing efficient and automated flight capabilities. However, reliance on GPS and traditional navigation systems poses challenges in scenarios where signal interference or failures occur. In this research, we present a novel computer-vision-based method to enhance UAV navigation, enabling accurate height and location estimation.
Our approach utilizes a sophisticated network that leverages a pair of images to estimate UAV height. The pyramid stereo-matching network is employed to extract robust image features and generate a disparity map. Subsequently, a custom network processes and convolves these data, employing diverse computer vision techniques to achieve precise height estimation. To evaluate the effectiveness of our proposed method, we collected a comprehensive dataset by conducting flights with a Phantom 4 Pro drone over the NUST Main campus, H-12 Islamabad. The dataset encompasses images captured at 10 different heights, spanning from 100 to 280 m, with flights evenly spaced 20 m apart. In rigorous evaluations, our approach demonstrates promising results compared to existing methods. By liberating UAVs from reliance on GPS, this vision-based 3-D localization technique holds immense potential to ensure successful flights even in challenging environments.
Introduction
Image matching is a fundamental task in machine vision with significant applications in depth estimation, motion detection, tracking, 3-D object reconstruction, and height estimation. It involves comparing pairs of images captured by a camera with a slight shift in camera position. Feature matching is employed to measure the similarity between image pairs and calculate the disparity, which is subsequently used to determine the depth maps.
One specific application of image matching is in the field of vision-based autonomous navigation, where unmanned aerial vehicles (UAVs) [1] utilize visual cues for self-localization. In scenarios where GPS and altimeter failures occur due to factors such as jamming, technical issues, or unforeseen circumstances, reliable localization becomes crucial for UAVs. As UAV localization heavily relies on onboard GPS and altimeter systems, malfunctions can render the UAV ineffective. Thus, alternative approaches are essential to aid autonomous flight in such scenarios.
UAVs are typically equipped with high-resolution cameras that can be leveraged to estimate the UAV height from the ground. This involves an image-matching procedure, where features are extracted from two images to establish correspondences. The resulting correspondences are then utilized to calculate the disparity and depth maps. However, this pipeline is complex due to factors such as viewpoint changes, variations in illumination, and dynamic prominent features, making image matching and depth estimation highly challenging.
Various models based on mathematical transforms (e.g., Fourier and wavelet) [2], [3] and rotation- and scale-invariant feature descriptors (e.g., SIFT, SURF, and histogram of dominant gradients) [4], [5], [6], [7] have been proposed. These models work by extracting key points from two images to obtain feature maps, which then enable feature matching by establishing correspondence among those images. Approximate nearest-neighbor search algorithms, coupled with postprocessing procedures (hierarchical
Motivated by the concepts of deep CNNs and end-to-end learning, this article presents a complete pipeline for end-to-end learning and 3-D localization of UAVs, as illustrated in Fig. 1. The primary focus is to utilize this technique for estimating the UAV height from the ground and integrating it with 2-D data to achieve 3-D localization. The proposed approach involves utilizing a pair of images (UAV camera) to extract features and establish correspondence, enabling disparity calculation and subsequent determination of the UAV depth/height from the ground level. Within the proposed architecture, the key contributions include the following.
Designing an end-to-end trainable network that estimates the UAV height using pairs of images for 3-D localization. The pyramid stereo-matching network [17] is employed as the base network for feature extraction from image pairs and subsequent disparity map calculation, which is adopted and enhanced to utilize for the estimation of the UAV height.
Demonstrating the proposed model on real-time UAV imagery by acquiring training and testing data through multiple drone flights.
Collecting a unique custom dataset by flying a Phantom 4 Pro drone within the premises of the NUST Main Campus H-12 Islamabad, Pakistan. This dataset is made publicly available for the benefit of the research community.
Related Work
Numerous researchers have made substantial contributions to the fields of image matching and depth estimation, employing a wide array of techniques, including invariant feature descriptors and CNN-based approaches. Our research specifically focuses on two significant research areas: 1) image matching to calculate disparity; and 2) depth estimation.
Image matching forms the cornerstone of computer vision, playing a pivotal role in essential tasks, such as image correspondence, optical flows/disparity, and person reidentification, among others. Traditional image-matching techniques employ Siamese-based networks to find correspondences. However, recent advancements have seen the emergence of more sophisticated methods. Mughal et al. [19] explore the use of deep convolutional neural networks for real-time 2-D localization of UAVs, leveraging the UAV camera and locally stored orthomosaic images. For disparity estimation, Mayer et al. [20] contribute to the field by providing three realistic, diverse, and large-scale synthetic stereo video datasets, enabling effective convolutional network training for real-time disparity estimation. Zhang et al. [21] focus on efficient disparity estimation, incorporating squeezed cost volumes and attention-based spatial residual modules. Chang and Chen [17] utilize a pyramid stereo-matching network to establish image correspondences for disparity calculation. Their network employs shared weights to extract features from two images, and a 4-D cost volume is employed to derive the disparity.
For depth estimation from monoscopic imagery, Haseeb et al. [22] present a machine learning configuration that gives an obstacle detection system a way to calculate the distance between a monocular camera and the object being viewed. To enhance self-supervised monocular distance estimation on fisheye and pinhole camera images, Kumar et al. [23] offer a unique multitask learning technique. Ciganek et al. [24] give a general overview of measuring techniques for determining the distance between an object and a camera; they first identify the corners in an image and the distances between them and then use mathematical equations to find the distance of the camera from the object. Pavlovic et al. [25] present the application of two different techniques (sensor based and vision based) to find the distance between a robot and an object, using radar/LiDAR for the sensor-based technique and both single images and stereo pairs for the vision-based technique. Yoon et al. [26] use both a single image and a pair of images to find depth with a depth fusion network.
For depth estimation from a pair of images, many methods for cost volume optimization and matching cost computation have been proposed in the literature. Wang et al. [14] introduce the pseudo-LiDAR network, which estimates depth for 3-D object generation using point cloud data. Building on this, the pseudo-LiDAR++ network uses image calibration data for depth estimation [15]. Garg et al. [16] present a novel neural network architecture utilizing a loss function derived from the Wasserstein distance to output arbitrary depth values. Maximov et al. [27] worked on generalizing depth estimation beyond the training set; to achieve accurate results, they applied direct supervision of domain-invariant defocus and used a convolutional neural network to learn from pairs of images with different points of focus.
For vision-based height estimation of the UAV, Shabayek et al. [28] compared various techniques, including horizon-based methods, optical flow, stereoscopic techniques, and approaches based on vanishing points. Gökçe et al. [5] worked on distance estimation as well as detection of micro unmanned aerial vehicles in different scenarios, including intruder UAVs in a protected environment and controlling UAVs for environmental surveillance and monitoring; they tested local binary patterns, histograms of oriented gradients (HOG), and Haar-like features using a cascade of boosted classifiers. Pan et al. [18] used both monocular and stereo images to estimate the approach angle and height of the UAV for autonomous landing: the approach angle is obtained by extracting vanishing lines with the Hough transform and the RANSAC algorithm, while the height is computed through feature-based matching of Harris corners extracted from the stereo images, followed by 3-D reconstruction using the approach angle; a Kalman filter model, built by analyzing the motion characteristics of the UAV, yields accurate height estimation. Mondragon et al. [1] worked on estimating UAV height, pose, and motion using visual aid to control the aircraft from the ground; their objective was to demonstrate that computer vision can be successfully used in control loops, which required fast image processing algorithms, i.e., a processing rate of 15 frames per second, to close the gap between real-time control and visual control. They showed that, to enable UAV navigation based on visual input, traditional image-collection techniques can be combined with ad hoc image processing and fuzzy controllers to achieve good results. Dhahbane et al. [29] conducted a survey in the guidance, navigation, and control domain on determining the orientation of an aircraft in space with respect to another object, comparing different techniques, including computer vision, for determining the attitude (roll, pitch, and yaw) of an aircraft. Yang et al. [30] worked on determining the position and attitude of an aircraft with respect to the horizon by integrating a visual odometer (computer vision) with GPS so as to minimize the trajectory estimation error. Wan et al. [31] provide a study on a UAV localization technique for autonomous navigation based on matching onboard UAV image sequences against a preinstalled reference satellite image. Because the compared images are not taken under the same illumination conditions, they used an illumination-invariant phase-correlation technique, derived mathematically, that estimates the current and next positions of the UAV and also applies self-coarse correction in case the UAV deviates from the planned path. Liu et al. [32] worked on finding the height of the UAV, focusing on optical-flow issues caused by rotational and translational movement, using a gated recurrent unit neural network. Yol et al. [33] localize the UAV in 3-D using a deep image-matching technique.
It is worth noting that while many of the earlier algorithms primarily concentrate on calculating UAV height from ground sensors or during indoor flights, there has been limited exploration specifically dedicated to vision-based height estimation of UAVs using the onboard camera. This particular aspect forms a crucial aspect of our research, where we aim to extend the existing methodologies to enable precise distance estimation for UAV navigation. In the subsequent sections, we will present our approach, methodology, and experimental findings to address this gap in the existing literature.
Methodology
Depth calculation constitutes a comprehensive pipeline that encompasses data acquisition from a pair of images. Subsequently, these images are processed through the network to calculate feature maps, which are further integrated into a 3-D cost volume. This cost volume facilitates the computation of disparity, ultimately leading to the determination of depth from the camera's perspective. The implementation process is divided into several distinct steps, each of which plays a crucial role in comprehending the overall development of the project. In the following sections, we present further details of the proposed architecture.
A. Overview
In this research, we adopt the PSMNet as our base network and tailor it to effectively establish correspondences and calculate disparities between image pairs while simultaneously estimating the height from the ground. Our network is referred to as the Pyramid Stereo-Matching and Height Estimation Network (PSMHENet). Initially, we compare PSMHENet performance with that of the original PSMNet, gradually refining our network's design step by step to achieve improved outcomes. We conduct a comparative analysis between our results and those obtained using PSMNet. Our main contributions encompass the modification of PSMNet and the design of a network specifically dedicated to height estimation, which is appended at the end of PSMNet, effectively integrating the two functionalities.
To perform a comprehensive comparison with our proposed network (PSMHENet), we also employed two state-of-the-art algorithms. First, the Siamese-based image-matching technique [34] (with minor modifications) was employed to establish correspondences, calculate the disparity between the matched features, and subsequently estimate height. The Siamese-based matching network [35] employs a patch-based image-matching technique, which does not consider the full context of the image; however, the full context is crucial for accurately determining the true height of the drone. In contrast, our technique (PSMHENet) not only utilizes the entire image for comparison but also incorporates its full context through two essential techniques: 1) spatial pyramid pooling (SPP) and 2) 3-D convolutions, in combination with an hourglass network. This comprehensive approach enables the network to attain the global context of the image, resulting in more accurate and reliable height estimation.
Second, we compared our approach with a vision-based UAV localization technique proposed by Yol et al. [33]. Their method relies solely on visual information and employs a multipass localization algorithm. However, many existing approaches are susceptible to scene variations caused by changes in season or environment due to their utilization of Sum of Squared Differences (SSD). To address this issue, the authors opted for mutual information (MI) as a more robust alternative, capable of accommodating both local and global scene variations.
B. Network Diagram
The network diagram presented in Fig. 1 consists of two distinct parts. The upper part of the network showcases the input of stereo images, which are processed with shared weights to extract features from both images. These features are then amalgamated to form a 4-D cost volume, encompassing width, height of feature maps, and disparity at each location. Subsequently, this 4-D volume is convolved through various layers, reducing the feature map size while increasing the number of feature maps. Following this, the data are passed through fully connected layers to estimate the height.
On the other hand, the lower part of the network employs one image from the stereo pair, along with an orthomosaic, to conduct 2-D localization of the image within that orthomosaic. It highlights the proposed localization pipeline that fundamentally relies on feature point learning and a neighborhood consensus strategy to refine the matches between the template image patch and the prestored orthomosaic. This is accomplished by leveraging the correlation information between the convolutional features of the two images and then applying probabilistic constraints to determine the point-to-point correspondences. Here, the convolutional feature maps embodying both the local and global information are extracted using a deep feature extractor and are later used to generate the correlation matrix that incorporates the feature matches for every extracted feature point. Subsequently, the probabilistic constraints followed by a soft-argmax layer are applied to these established correspondences to link each feature point in the source image with the feature points in the orthomosaic. For further details, refer to our publication [19].
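The full matching pipeline is detailed in [19]; as a rough illustration of the correlation-plus-soft-argmax idea described above, the sketch below correlates template features with orthomosaic features and reads out an expected match location per feature point. The feature shapes, normalization, and temperature value are assumptions, not the published configuration.

```python
import torch
import torch.nn.functional as F

def soft_argmax_match(template_feats, ortho_feats, temperature=0.05):
    """Conceptual sketch: correlate template features with orthomosaic features and
    read out a soft-argmax (expected) match location for every template feature point.
    template_feats: (C, Ht, Wt); ortho_feats: (C, Ho, Wo)."""
    t = F.normalize(template_feats.flatten(1), dim=0)   # (C, Ht*Wt), unit feature vectors
    o = F.normalize(ortho_feats.flatten(1), dim=0)      # (C, Ho*Wo)
    corr = t.t() @ o                                     # (Ht*Wt, Ho*Wo) correlation matrix
    prob = F.softmax(corr / temperature, dim=1)          # match distribution per template point
    ho, wo = ortho_feats.shape[1:]
    ys = torch.arange(ho).repeat_interleave(wo).float()  # orthomosaic row index per column of corr
    xs = torch.arange(wo).repeat(ho).float()             # orthomosaic column index per column of corr
    # Expected orthomosaic (y, x) coordinates for every template feature point.
    return torch.stack([prob @ ys, prob @ xs], dim=1)    # (Ht*Wt, 2)

coords = soft_argmax_match(torch.randn(32, 16, 16), torch.randn(32, 64, 64))
print(coords.shape)  # torch.Size([256, 2])
```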
C. Feature Extraction
The pair of images (an example pair shown in Fig. 2) is fed into a custom feature extractor with shared weights. This feature extractor processes the images sequentially and performs convolutions on them. Subsequently, SPP is applied to the feature map, allowing the extraction of features with a broader context. By incorporating SPP with different scales, the extracted features can fully benefit from the global context of the image. Specifically, 32 features are extracted from each point of the feature map for comparison with the feature map of the other image. This approach facilitates enhanced feature matching and disparity calculation between the image pair, leading to more accurate height estimation for the UAV.
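As an illustration of how SPP can broaden the context of the extracted features, the sketch below pools the feature map at several scales, projects each pooled map, upsamples it back, and fuses everything into 32 features per location. The pooling bin sizes and channel counts are assumed values rather than the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SPPBranchSketch(nn.Module):
    """Illustrative spatial pyramid pooling: pool at several scales, project each pooled
    map, upsample back to the input size, and fuse into 32 output features per location."""

    def __init__(self, in_ch=128, out_ch=32, bins=(64, 32, 16, 8)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Sequential(nn.AvgPool2d(b, stride=b), nn.Conv2d(in_ch, 32, 1), nn.ReLU(inplace=True))
            for b in bins
        )
        self.fuse = nn.Conv2d(in_ch + 32 * len(bins), out_ch, 3, padding=1)

    def forward(self, x):
        h, w = x.shape[-2:]
        pyramids = [F.interpolate(branch(x), size=(h, w), mode="bilinear", align_corners=False)
                    for branch in self.branches]
        # Concatenate the original map with all pyramid levels, then fuse to 32 features.
        return self.fuse(torch.cat([x] + pyramids, dim=1))

feat = SPPBranchSketch()(torch.randn(1, 128, 64, 128))
print(feat.shape)  # torch.Size([1, 32, 64, 128])
```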
D. Disparity Correlation Matrix
After generating the feature maps from both images, they are combined in a manner that preserves the dimensions of the feature map. This is accomplished by concatenating the 32 features from each image, resulting in a total of 64 features at each location of the feature map. The subsequent crucial step involves forming a 4-D tensor, with dimensions one-fourth that of the image, encompassing 64 features from both images and 48 disparity values. Initially, these disparity values are loaded with default values and then 3-D convolutions are employed for further processing. This process ensures the incorporation of essential information from both images and disparity values, allowing for accurate disparity estimation, which is instrumental in height calculation for the UAV.
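A minimal sketch of this step is shown below, assuming 32-channel feature maps at quarter resolution and 48 disparity levels as described; the zero-padding convention for unmatched positions is an assumption.

```python
import torch

def build_cost_volume(left_feat: torch.Tensor, right_feat: torch.Tensor, max_disp: int = 48) -> torch.Tensor:
    """Concatenate 32 left + 32 right features at every location for each candidate
    disparity, giving a (B, 64, 48, H/4, W/4) volume; unmatched positions stay zero."""
    b, c, h, w = left_feat.shape
    cost = left_feat.new_zeros(b, 2 * c, max_disp, h, w)
    for d in range(max_disp):
        cost[:, :c, d, :, d:] = left_feat[:, :, :, d:]        # left features kept in place
        cost[:, c:, d, :, d:] = right_feat[:, :, :, : w - d]  # right features offset by disparity d
    return cost

left = torch.randn(1, 32, 64, 128)   # assumed quarter-resolution maps for a 256 x 512 input
right = torch.randn(1, 32, 64, 128)
print(build_cost_volume(left, right).shape)  # torch.Size([1, 64, 48, 64, 128])
```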
E. Tridimensional Convolutions and Hierarchical Hourglass Framework
During this step, 3-D convolutions are employed, followed by batch normalization and ReLU activation. This process is iterated multiple times to convolve the network. Consequently, the features at each point in the feature map are reduced from 64 to 32. Subsequently, an hourglass architecture (encoder–decoder) is introduced to learn the maximum context of the feature map. The hourglass network operates with a top-down and bottom-up approach, where the feature map, already reduced to 1/4th of the image size, is further downsized to 1/8th and then 1/16th of the image size, aiming to attain a comprehensive context of the feature map. In the bottom-up process, the feature map is upsampled using transpose convolutions to achieve 1/8th and then 1/4th of the image size. This approach facilitates the acquisition of contextual information for improved height estimation.
Following the hourglass process, 3-D convolutions are performed once more, further reducing the features from 32 to 1. Consequently, the feature map is converted from 3-D to 2-D, with dimensions of 48 × 1/4 H × 1/4 W. This step enhances the feature map's representational power and prepares it for the final stage of height estimation.
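The downsample/upsample pattern described in this section can be sketched as follows; channel counts, kernel sizes, and the placement of skip connections are assumptions rather than the exact PSMHENet configuration.

```python
import torch
import torch.nn as nn

def conv3d_bn_relu(in_ch, out_ch, stride=1):
    return nn.Sequential(nn.Conv3d(in_ch, out_ch, 3, stride=stride, padding=1),
                         nn.BatchNorm3d(out_ch), nn.ReLU(inplace=True))

class HourglassSketch(nn.Module):
    """Encoder-decoder over the (disparity, H/4, W/4) cost volume: shrink the grid twice
    to gather context, then restore it with transposed convolutions and skip connections."""

    def __init__(self, ch=32):
        super().__init__()
        self.pre = conv3d_bn_relu(64, ch)                      # 64 -> 32 features per cell
        self.down1 = conv3d_bn_relu(ch, ch * 2, stride=2)      # 1/4 -> 1/8 of image size
        self.down2 = conv3d_bn_relu(ch * 2, ch * 2, stride=2)  # 1/8 -> 1/16
        self.up1 = nn.ConvTranspose3d(ch * 2, ch * 2, 3, stride=2, padding=1, output_padding=1)
        self.up2 = nn.ConvTranspose3d(ch * 2, ch, 3, stride=2, padding=1, output_padding=1)
        self.final = nn.Conv3d(ch, 1, 3, padding=1)            # 32 -> 1 feature per cell

    def forward(self, cost):
        x = self.pre(cost)
        d1 = self.down1(x)
        d2 = self.down2(d1)
        u1 = torch.relu(self.up1(d2) + d1)                     # skip connection at 1/8 scale
        u2 = torch.relu(self.up2(u1) + x)                      # skip connection at 1/4 scale
        return self.final(u2).squeeze(1)                       # (B, 48, H/4, W/4) map

out = HourglassSketch()(torch.randn(1, 64, 48, 16, 32))        # small demo volume
print(out.shape)  # torch.Size([1, 48, 16, 32])
```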
F. Planar Convolutions and Dense Connections for Height Estimation
At this stage of the network, the size of the feature map is 48 × 1/4 H × 1/4 W. The objective here is to further reduce the feature map while increasing the number of planes (filters) to gather maximum information for the subsequent fully connected layers at the end. Initially, 2-D convolutions followed by MaxPool are employed to reduce the feature map to 1/8th of its original size while increasing the planes to 64.
This process of 2-D convolutions is stacked iteratively to gradually reduce the feature map to a single point and simultaneously increase the number of planes to 512. Eventually, AvgPool is applied to reduce the feature map to a single point while maintaining 512 planes, resulting in dimensions of 512 × 1/64 H × 1/64 W.
The next stage entails fully connected layers, where the output of 512 is connected to 256, then 128, followed by 64, and eventually to 32, culminating in a single-output value. This final output represents the estimated height of the UAV from the ground, providing valuable information for 3-D localization and navigation.
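A compressed sketch of this head is given below: repeated 2-D convolution and MaxPool stages shrink the 48-channel map while the plane count climbs to 512, AvgPool collapses the map to a single vector, and the fully connected cascade 512 -> 256 -> 128 -> 64 -> 32 -> 1 emits the height. The exact number of convolution stages is an assumption.

```python
import torch
import torch.nn as nn

def conv_pool(in_ch, out_ch):
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True), nn.MaxPool2d(2))

class HeightHeadSketch(nn.Module):
    """48 x H/4 x W/4 disparity features -> a single height value per pair."""

    def __init__(self):
        super().__init__()
        # Conv+pool stages shrink the map while the plane count climbs 48 -> 64 -> 128 -> 256 -> 512.
        self.reduce = nn.Sequential(conv_pool(48, 64), conv_pool(64, 128),
                                    conv_pool(128, 256), conv_pool(256, 512),
                                    nn.AdaptiveAvgPool2d(1))   # collapse to 512 x 1 x 1
        # Fully connected cascade down to one output value (the estimated height).
        self.fc = nn.Sequential(nn.Linear(512, 256), nn.ReLU(inplace=True),
                                nn.Linear(256, 128), nn.ReLU(inplace=True),
                                nn.Linear(128, 64), nn.ReLU(inplace=True),
                                nn.Linear(64, 32), nn.ReLU(inplace=True),
                                nn.Linear(32, 1))

    def forward(self, x):
        return self.fc(self.reduce(x).flatten(1)).squeeze(1)   # (B,) heights

print(HeightHeadSketch()(torch.randn(2, 48, 64, 128)).shape)   # torch.Size([2])
```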
G. Integration of Planar Localization and Height Estimation for 3-D Localization
In order to achieve 3-D localization of the UAV, we have integrated a dedicated 2-D localization network with the height estimation network, as depicted in Fig. 1. To ensure optimal performance, a pretrained network on the same geographical area was utilized. Notably, the input data for these networks differ: The height estimation network employs a pair of images with identical dimensions while the 2-D localization network requires an image along with an orthomosaic. Within the 2-D localization network, only one of the images from the image pair is utilized.
Upon combining both networks, we are able to accurately predict the height of the UAV while simultaneously achieving precise localization of the image within the orthomosaic. This seamless integration facilitates comprehensive 3-D localization of the UAV, thereby aiding in its navigational capabilities.
H. Network Architecture
The overall architecture can be divided into three primary components. In the first part, we utilize a custom feature extractor to generate feature maps. Subsequently, in the second part, these feature maps are amalgamated to form a 4-D cost volume, which represents the disparity map. Finally, the third part involves the processing of this 4-D cost volume through multiple convolution layers, culminating in the accurate estimation of the UAV height. This modular approach allows for efficient and effective height determination, enhancing the overall performance of the system.
I. Implementation Details
The implementation details encompass a comprehensive set of hyperparameters utilized during the training process. Initially, the model was trained for 150 epochs. During training, the learning rate was initially set to 0.00001 and was later fine-tuned to 0.000001 for model optimization. Beta values of 0.9 and 0.999 were used, and the weight decay was set to 0.000001. The Adam optimizer was employed initially, but subsequently the AdamW optimizer was adopted. AdamW, a stochastic gradient descent method, combines adaptive estimation of first- and second-order moments with decoupled weight decay, effectively countering overfitting and yielding favorable outcomes.
To ensure robust results, multiple training strategies were adopted. While the L2 loss was initially utilized, the presence of outliers prompted a transition to L1 loss (absolute error), which proved to be more effective in generating accurate results. The model underwent extensive training with numerous epochs, consistently updating to enhance its performance. In the final iteration, the model was trained for 190 epochs, culminating in the desired outcomes for height estimation and localization of the UAV.
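Putting the stated hyperparameters together (AdamW with betas 0.9/0.999, weight decay 1e-6, a learning rate of 1e-5 later lowered to 1e-6, L1 loss, 190 epochs), a training loop might be configured roughly as below. The model, the data, and the epoch at which the learning rate is lowered are placeholders, not the authors' actual setup.

```python
import torch
import torch.nn as nn

# Placeholder model and data; the real pipeline feeds image pairs through PSMHENet.
model = nn.Sequential(nn.Flatten(), nn.Linear(6 * 64 * 64, 1))
pairs = torch.randn(8, 6, 64, 64)        # hypothetical batch of concatenated image pairs
heights = torch.full((8, 1), 180.0)      # ground-truth heights in metres

# Stated hyperparameters: AdamW with betas (0.9, 0.999), weight decay 1e-6,
# initial learning rate 1e-5 later lowered to 1e-6 for fine-tuning.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5, betas=(0.9, 0.999), weight_decay=1e-6)
criterion = nn.L1Loss()                  # L1 (absolute error) replaced the initial L2 loss

for epoch in range(190):                 # final run reported as 190 epochs
    if epoch == 150:                     # assumed switch point for the fine-tuning rate
        for group in optimizer.param_groups:
            group["lr"] = 1e-6
    optimizer.zero_grad()
    loss = criterion(model(pairs), heights)
    loss.backward()
    optimizer.step()
```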
Experimental Evaluation
In this research, comprehensive experiments were conducted to assess the performance of the proposed approach, aiming to simulate real-world conditions encountered in UAV imagery within our test dataset. Extensive investigations into algorithmic details and hyperparameters were also undertaken to identify the optimal configuration of the model concerning efficiency and accuracy. The results of this thorough analysis, referred to as the ablation study, are presented in this section.
A. Data Curation
Data collection emerged as one of the most challenging aspects of this research undertaking. The project necessitated the acquisition of a dataset comprising downward-looking drone camera images exhibiting a minimum overlapping of 75%–80% between consecutive images. Unfortunately, the existing stereo image datasets available in the market mainly consisted of street views or images of objects in confined spaces with limited variations in depth. Consequently, the creation of a custom dataset became imperative to cater to the specific requirements and objectives of this project.
1) Data Acquisition
The data gathering process necessitated the utilization of a professional drone equipped with the capability to fly at various heights, including higher altitudes. In addition, a suitable location was chosen for the drone flights, leading to the selection of the NUST Main Campus, H-12 Islamabad. A total of 10 flights were meticulously planned to cover a diverse range of 10 different heights. The drone employed for data collection was the Phantom 4 Pro, which conducted multiple flights over the NUST Main Campus to collect the necessary data. The dataset encompassed data gathered at 10 distinct heights, commencing from 100 m and ascending in increments of 20 m up to 280 m.
To optimize the drone flights, the area of interest was defined using the Pix4Dcapture mobile app [36], and its Grid 2D option was used to set the flight paths. In this app, once the area on the map, start point, end point, height, and front/side overlap are selected, the drone is automatically dispatched to capture images over the chosen area, as shown in Fig. 3. Consecutive (overlapping) images were then used as pairs. Approximately 80% front overlap (between consecutive images) and 20% side overlap were used to capture the maximum number of images from a given area; front overlap is the more relevant parameter for consecutive images, and consecutive images with 80% overlap were paired. As for the baseline ("base") distance, it was automatically adjusted according to the height of the UAV so that the overlap remained constant at 80%, and the base distance was not used as a parameter during training. For further technical specifics of the dataset, refer to Table I.
2) Data Preparation
Upon successful completion of all flights, a total of 1781 images were collected. However, this dataset was deemed insufficient for the training of deep CNNs. Consequently, data augmentation techniques were employed to augment the dataset. During the data augmentation process, images were subjected to flipping and rotation. Careful consideration was given to ensure that the dimensions of the images remained unchanged; thus, rotation was limited to 180°.
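A sketch of this dimension-preserving augmentation is given below; the specific set of flips and the way transformations are applied jointly to both images of a pair are assumptions consistent with the constraints described above.

```python
import numpy as np

def augment_pair(left: np.ndarray, right: np.ndarray):
    """Yield dimension-preserving variants of an image pair: horizontal flip, vertical flip,
    and 180-degree rotation. Arbitrary-angle rotations are avoided because they would
    change the 512 x 256 frame."""
    yield left, right
    yield np.fliplr(left), np.fliplr(right)        # horizontal flip
    yield np.flipud(left), np.flipud(right)        # vertical flip
    yield np.rot90(left, 2), np.rot90(right, 2)    # 180-degree rotation

left = np.zeros((256, 512, 3), dtype=np.uint8)
right = np.zeros((256, 512, 3), dtype=np.uint8)
print(len(list(augment_pair(left, right))))  # 4 variants per original pair
```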
3) Dataset Split
In order to achieve optimal results during the training of the network, a conventional 80%–20% split on mixed data was deemed unsuitable due to the dataset containing data of various heights. Mixing data of distinct heights in the training set could lead to suboptimal performance. Therefore, a novel approach was adopted to split the dataset based on the specific height values. By organizing the split in this manner, data of distinct heights could be segregated, ensuring a more effective training process.
a) Training Set
For the training phase, data from eight distinct heights were utilized. In order to ensure a uniform distribution of data, the selected heights were nonconsecutive. The heights used for training the network were as follows: 100, 120, 140, 160, 200, 240, 260, and 280 m. This selection was made to enhance the diversity of the training set and optimize the learning process.
b) Testing Set
For validation and testing purposes, data from two distinct heights was utilized. To ensure a representative evaluation of the model's performance, the heights for the testing set were selected from within the overall distribution, rather than focusing on either the tail or head data. The selected heights for the testing set were 180 and 220 m. This approach aims to provide a comprehensive assessment of the model's generalization capabilities across a range of heights in the dataset.
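The height-based split can be expressed as a simple filter over per-pair height labels, as sketched below; the metadata layout is an assumption.

```python
TRAIN_HEIGHTS = {100, 120, 140, 160, 200, 240, 260, 280}   # metres, used for training
TEST_HEIGHTS = {180, 220}                                   # metres, held out for validation/testing

def split_by_height(samples):
    """samples: iterable of (image_pair_id, height_m). Returns (train, test) lists,
    keeping every flight height entirely on one side of the split."""
    train = [s for s in samples if s[1] in TRAIN_HEIGHTS]
    test = [s for s in samples if s[1] in TEST_HEIGHTS]
    return train, test

train, test = split_by_height([("pair_0001", 100), ("pair_0002", 180), ("pair_0003", 220)])
print(len(train), len(test))  # 1 2
```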
4) Data Preprocessing
Prior to initiating the training of the network, several preprocessing steps were employed to optimize the training process, conserve time, and save computational resources, including processing RAM. The major steps involved in the preprocessing are outlined as follows.
a) Image Scaling
Considering the high resolution of the original images captured (5472 × 3648), which can be computationally demanding for processing in a CNN, it became essential to address the memory and processing constraints. The large image size not only consumes significant CPU resources but also limits the possibility of using large batch sizes, consequently extending the training time for the network. To mitigate these challenges, a preprocessing step involved scaling the images using interarea interpolation to a more manageable resolution of 512 × 256, aligning with the design of the network. This scaling process optimized the computational efficiency and facilitated faster training without compromising on performance.
b) Image Illumination
Given that the dataset was collected at various times of the day, significant variations in image illumination were observed. Certain images exhibited bright light with accompanying shadows while others were captured under normal lighting conditions. To address the issue of illumination invariance across the dataset, the contrast limited adaptive histogram equalization (CLAHE) technique was applied. By utilizing CLAHE, the luminance channel of the images was equalized, effectively mitigating the illumination discrepancies and enhancing the overall consistency of the dataset. This preprocessing step contributed to the improvement of the network's learning mechanism, resulting in enhanced results during subsequent stages of processing.
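One way to reproduce the scaling and illumination steps with OpenCV is sketched below, assuming CLAHE is applied to the luminance channel in LAB space; the clip limit, tile grid, and file name are assumptions, while the 512 × 256 target size and inter-area interpolation follow the text.

```python
import cv2

def preprocess(image_bgr):
    """Downscale to 512 x 256 with inter-area interpolation, then equalise only the
    luminance channel with CLAHE to reduce illumination differences between flights."""
    small = cv2.resize(image_bgr, (512, 256), interpolation=cv2.INTER_AREA)
    lab = cv2.cvtColor(small, cv2.COLOR_BGR2LAB)
    l, a, b = cv2.split(lab)
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))   # assumed parameters
    equalised = cv2.merge((clahe.apply(l), a, b))
    return cv2.cvtColor(equalised, cv2.COLOR_LAB2BGR)

# Usage (hypothetical file name): preprocess(cv2.imread("DJI_0001.JPG")) before forming pairs.
```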
c) Nonoverlapping Image Pair Elimination
In order to ensure dataset uniformity and alleviate potential issues arising from insufficient overlapping, a preprocessing step was implemented to remove all image pairs captured at the edges of each flight path with less than 70% overlapping, as depicted in Fig. 3. To facilitate efficient management of the dataset, a comprehensive list of image pairs was compiled. By eliminating nonoverlapping pairs, the dataset was refined to enhance the reliability and effectiveness of subsequent stages in the research process.
d) Weak Feature Elimination
To enhance the quality and relevance of the dataset, a final preprocessing step involved the removal of image pairs that lacked prominent features, specifically those captured on plain grounds with minimal distinct characteristics. By eliminating such image pairs, the dataset was refined to include only those with significant features, thus ensuring the efficacy of subsequent processing stages. In addition, all image pairs captured at the edges of each flight path were discarded, retaining only those with a substantial 70%–80% overlapping. A visual comparison of images taken from different heights but the same location is depicted in Fig. 4, further emphasizing the significance of this preprocessing step in curating a robust dataset for analysis and training purposes.
B. Results and Discussion
The proposed approach has been tested in terms of the correctness of height estimated by comparing two overlapping images taken from a UAV camera. In the subsequent paragraphs, a detailed account of the step-by-step process employed to enhance the results will be presented.
1) Luminance Disparities
The dataset comprises images captured at different times of the day, exhibiting varying illumination and lighting conditions. To address these variations and enhance the learning process of the network, CLAHE is employed to equalize the luminance channel of the color images. Equalizing solely the luminance channel is found to be superior to equalizing all the channels of the BGR image. As a result, all the images in the dataset are equalized before being fed into the network, leading to improved results.
2) Model Overadaptation
The primary challenge lies in the network's complexity relative to the limited size of the dataset. As a consequence, the network initially demonstrated signs of overfitting to the dataset, leading to unsatisfactory results on the validation dataset. To mitigate this issue, data augmentation techniques were employed. Considering the fixed input dimensions of our network, the images were subjected to 180° rotation and flipping.
3) Preliminary Findings
The initial results, as presented in Table III, exhibited an improvement compared to PSMNet, yet they remained unsatisfactory due to the substantial average error of 10 m during height estimation. The primary objective of this research was to achieve an average error below 5 m. Further analysis revealed inconsistencies in the dataset, attributed to the imagery being captured at particular locations with varying heights during drone flights at the NUST Main Campus. Several issues were identified as follows.
Variation in Heights: Certain areas within the dataset exhibited varying heights, contributing to the overall inconsistency.
Variability in Image Features: Some image pairs lacked prominent features (see Fig. 5) while others contained significant features, introducing challenges in accurate matching (see Fig. 6).
Insufficient Overlapping: In instances where the drone changed its flight direction at specific locations, the image pair overlapping proved inadequate for effective comparison.
4) Conclusive Outcomes
Addressing these identified issues would be essential for enhancing the performance of the proposed approach and achieving the desired accuracy in UAV height estimation. Consequently, a data-cleaning process was employed to identify images lacking prominent features (as depicted in Fig. 5). These identified images, along with their corresponding augmented versions, were subsequently removed from the dataset. Following the completion of this data-cleaning procedure, a notable improvement in overall results was observed, as illustrated in Table III. The implementation of the data-cleaning process not only facilitated faster network convergence but also led to significant improvements in overall results, as depicted in Figs. 7 and 8.
In conjunction with the data-cleaning procedure, continuous updates were made to the network parameters, encompassing shapes and other relevant settings. As a result of these enhancements, the average error on the validation dataset notably reduced to 4.4 m, while on the test dataset it approached approximately 4.6 m. In Table IV, a comparative analysis of our network, denoted PSMHENet, is presented against the Siamese-based image-matching network [35] and the vision-based 3-D localization technique of Yol et al. [33]. Fig. 8 illustrates the convergence of both networks, showing the improvement in mean absolute error after 190 epochs, and Fig. 9 presents a scatter plot visualizing the spread and standard deviation of the testing set.
5) Hardware, Running Times, and Evaluation Mechanism
The training of the model was conducted on a Linux machine equipped with a 22-GB GPU. The initial training process extended over several weeks, during which continuous updates were made to the model, along with parameter tuning to optimize its performance. Subsequently, the final epochs required approximately one week to complete. To evaluate the model's performance, the mean absolute height error was calculated.
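For reference, the mean absolute height error is simply the average of the absolute differences between predicted and true heights over a split; the numbers in the minimal example below are invented.

```python
def mean_absolute_height_error(predicted, actual):
    """Average absolute difference (in metres) between predicted and true UAV heights."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)

print(mean_absolute_height_error([176.2, 224.9, 183.5], [180.0, 220.0, 180.0]))  # ~4.07 m
```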
Conclusion
In this research, we propose a novel approach for estimating the height of UAVs by leveraging a pair of images. We have devised a comprehensive pipeline to process these image pairs and accurately determine the UAV height. The process commenced with a stereo image dataset, where the image pair was input into our network to extract essential features. These features were then utilized for comparison and matching, allowing us to compute the disparity at each location on the feature map. This comparison was conducted using a 4-D tensor to calculate the cost, enabling accurate disparity estimation. Subsequently, we focused on refining the network by applying consecutive convolution and ReLU operations to enhance the output, reduce the feature size, and increase the number of feature maps. Through these improvements, we successfully estimated the height of the UAV from the ground. Following the training phase, we evaluated our network's performance on a separate dataset (test dataset), which consisted of heights not used during training and validation. The achieved results were promising, demonstrating the efficacy of our proposed approach.
Our research offers promising avenues for further enhancing the results and incorporating various improvements. First, ongoing efforts in refining methods and techniques for feature extraction and disparity calculation hold the potential to yield better and more efficient outcomes. Second, utilizing a dataset that encompasses diverse areas, including densely populated regions like the plains of Punjab with minimal height variations, could significantly contribute to improved results. Such areas would provide richer features for matching and disparity calculation, thus enhancing the accuracy and reliability of the estimation process.