TransFG: A Cross-View Geo-Localization of Satellite and UAVs Imagery Pipeline Using Transformer-Based Feature Aggregation and Gradient Guidance
SECTION I. Introduction
Unmanned aerial vehicles (UAVs) were originally developed through the 20th century for military missions too “dull, dirty or dangerous” for humans, and by the 21st century, their use had expanded to numerous real-life applications. These include aerial photography, product deliveries, agriculture, science, disaster relief, policing, surveillance [1], [2], [3], [4], etc. Regardless of the application, precise navigation [5] is necessary and usually dependent on a global navigation satellite system (GNSS) such as GLONASS.
The most important challenge in matching drone views against preexisting imagery is the significant difference in perspective between them. Satellite imagery usually captures ground data from a vertical angle, as shown in Fig. 1 (left column), while drones capture ground data at a slanted angle, as depicted in the right column. Existing image matching algorithms [8], [9], [10] have enabled significant progress in determining the position of UAVs and continue to push forward the state-of-the-art. Even so, current solutions for cross-view matching of satellite and UAV imagery are still far from being practically useful, arguably due to the viewpoint gap. Such cross-view image matching suffers from complex distortions, such as lighting and perspective changes, background changes, partial occlusions, and nonrigid distortions of objects appearing in the scene.
The dominant image matching mechanisms are feature point matching [11], [12], [13] and feature space matching [8], [9], which are widely applied in geo-localization work. The former compares the descriptors of drone images with those of reference satellite images, while the latter learns to map matching image pairs closer in the feature space and nonmatching image pairs farther apart. These works have shown promising results for UAV localization, yet the issue that viewpoint differences impoverish feature representations remains. In addition, the number of annotations, the consistency of cross-view feature point annotations, the cost of manual annotation, etc. all greatly affect the accuracy and efficiency of feature point matching. Feature space matching methods are relatively less restrictive in terms of annotation. In this article, we adopt feature space matching to achieve cross-view image matching.
Most previous works [1], [14], [15], [16], [17] address geo-localization using the feature space matching method. Existing work on this task follows the traditional supervised deep learning approach [14], [15], [16], [17], [18], [19], [20]. Some early methods [21], [22], [23], [24] focus on extracting hand-crafted features, which are too limited to distinguish the feature differences presented by diverse perspectives. Inspired by the success of convolutional neural networks (CNNs) on ImageNet, researchers have resorted to deeply learned features in recent years. Further works [15], [16] explore deep neural networks with metric learning to learn discriminative features. Specifically, the network is designed to learn a feature space that pulls matched image pairs closer and pushes nonmatched pairs far apart [25], [26], [27]. Feature extraction through CNNs facilitates cross-view image matching, exemplified by prominent benchmarks such as CVM-Net [17], orientation [28], LPN [29], and others. However, images from different views undergo positional transformations, such as rotation, scaling, and offsetting. Consequently, existing CNN methods have some potential problems. First, cross-view image matching requires extracting relevant information from context. While attention mechanisms and directional information have been widely applied in network design [1], [28], [30], the existing CNN-based attention methods primarily concentrate on the central building, incorporating global information through aggregate functions but often overlooking contextual information. Second, the CNN downsampling operation reduces image resolution and destroys fine-grained features of the image. Therefore, it is necessary to understand the semantic information of the global and regional context.

Similarly, in natural language processing (NLP), attention models [31] efficiently use the given context information. Inspired by these cases, the vision transformer (ViT) [32] introduced the attention concept to vision, using a multiheaded attention mechanism to focus on contextual information. In view of this, ViT, as a strong context-sensitive feature extractor, can play a role in cross-view image matching [10].
We propose a novel and efficient transformer-based pipeline, TransFG, for cross-view image matching, as shown in Fig. 2, which deals with the limitations mentioned above. TransFG incorporates the feature aggregation (FA) and gradient guidance (GG) modules and synergistically takes advantage of both, achieving an effective balance between feature representation and alignment. Specifically, the FA module implicitly learns salient features and dynamically aggregates contextual features from the ViT through weighted averaging, channel attention, and feature rendering. Meanwhile, the GG module automatically calculates feature gradients to implement patch-level segmentation and instance-level alignment. Furthermore, we combine triplet loss [33] and center loss [34] (TC-Loss) to bring the data distributions of different views closer. In addition, the symmetric self-distillation JS-Loss [35] is implemented in our task, further improving the performance of cross-view image matching. In summary, the main contributions of this article are as follows.
We propose an innovative transformer-based pipeline, TransFG, for robust cross-view image matching, which incorporates the FA and GG modules. TransFG synergistically takes advantage of FA and GG, achieving an effective balance between feature representation and alignment.
To tackle the insufficient extraction of feature representations between different views, the proposed FA module implicitly learns salient features and dynamically aggregates contextual features from the ViT through weighted averaging, channel attention, and feature rendering.
To address position offset and distance-scale uncertainty, the proposed GG module enhances the model’s understanding of instance distribution. Without additional supervision, the GG module automatically calculates feature gradients to implement patch-level segmentation and instance-level alignment.
To further improve the performance of cross-view image matching, we explore JS-Loss and TC-Loss to bring the data distributions of different views closer. Our approach establishes a new state-of-the-art (SOTA) on the large-scale University-1652 dataset, with significant improvements on both the drone-view target localization and drone navigation tasks.
SECTION II. Related Work
Cross-view geo-localization has received significant interest in recent years. For locating a UAV image within a satellite database, existing works primarily draw on techniques from image matching. We briefly review essential cross-view geo-localization methods, covering feature extraction, aggregation, alignment, and specific loss functions.
A. Feature Extraction
Initial works on cross-view geo-localization have explored extracting handcrafted features, such as GIS [21] and DT [22]. However, these methods with hand-crafted features struggle to reconcile the significant variations in appearance across different views.
1) CNN-Based Method:
The advancement of CNNs enhances cross-view matching performance. The work by Workman and Jacobs [36] is the first attempt to extract cross-view image features using a pretrained CNN, demonstrating the significance of high-level semantic information and its superiority over hand-crafted features. Workman et al. [37] achieved higher performance by fine-tuning a pretrained CNN to minimize the distance between pairs of images from different views. Inspired by face verification approaches, Lin et al. [38] use a modified Siamese network [39] to optimize network parameters with a contrastive loss [26]. Hu et al. [17] insert Net-VLAD [40] into the CNN, which improves the robustness of image features under large viewpoint changes. Cai et al. [41] integrate channel and spatial attention modules into a residual network trained with a hard-example-reweighted triplet loss, which highlights salient features in both views. In a recent work, Zheng et al. [42] combine image convolutional features and semantic word vectors and introduce attention mechanisms and bilinear techniques to enhance information for multiclass classification tasks. Furthermore, Luo et al. [43] enhance the differential representation between dual-temporal HSIs through multiscale feature fusion and temporal feature learning. Besides, Guo et al. [44] combine CNN and LSTM to extract spatial–spectral features, addressing the issue of insufficient global properties and spectral information in HSI classification. Nevertheless, the inherent local properties of CNNs hinder the analysis and extraction of geospatial information in images. As a result, these methods fail to generate features discriminative enough to handle drastic viewpoint changes.
2) Transformer-Based Method:
Inspired by the success of ViT, L2LTR [45] and TransGeo [46] apply transformer blocks to model global dependencies. Chen et al. [47] design a cross-view transformer for feature mapping together with a cross-view consistency loss. Furthermore, Yang et al. [45] introduced a self-cross-attention mechanism to enhance feature representation. Dai et al. [10] apply the ViT structure to enhance cross-view geo-localization accuracy by focusing on contextual features. Thus, compared with CNN-based methods, transformer-based methods prioritize contextual features, so we adopt transformer-based contextual feature extraction for image matching in our study.
B. Feature Aggregation
To obtain more discriminative representation, there have been several efforts focused on the design of FA. Hu and Lee [48] use a dual-branch CNN with independent Net-VLAD layers to better encode local features. Feature representations from both the views are integrated into a shared embedding space for image matching. Sun et al. [49] combine ResNet with the capsule layer to represent high-level semantics. Shi et al. [50] apply the multihead attention module as the FA model to encode spatial information. In our research, we apply weighted averaging, channel attention, and feature rendering to aggregate features, resulting in a robust feature representation.
C. Feature Alignment
The primary objective of these studies is to explicitly exploit image and feature correspondences, learning viewpoint adaptation and feature alignment to address domain discrepancies. Liu and Li [28] use orientation as supporting information for feature alignment. Shi et al. [15] introduce a feature-transport layer for learning feature transformations, which includes an efficient polar transformation algorithm to align UAV images with others. Emphasizing fine-grained information about different parts helps the model learn comprehensive cross-viewpoint geo-localization features. Part-based fine-grained features have been demonstrated to be reliable in matching tasks, as shown in various studies [51], [52], [53], [54], [55]. AlignedReID++ [56] automatically aligns slice information without introducing additional supervision to address pedestrian misalignment due to occlusions, view changes, and pose biases. PCB [57] applies a horizontal segmentation approach to extract high-level segmentation features from human body parts. LPN [29] uses a square-ring feature partition strategy based on prealigned datasets, with neighboring areas as supplementary information. Dai et al. [10] calculate and align features in different regions of the image. Inspired by these part-based feature alignment works, we propose the GG module, which uses gradient segmentation and instance alignment to significantly enhance image matching accuracy.
D. Matching Loss
Metric learning via deep networks is highly relevant for cross-view image matching tasks, which designs different training objectives to learn the discriminative representation. Vo and Hays [58] use an orientation regression loss for performance enhancement. Hu et al. [17] use a weighted soft ranking loss to expedite training convergence and improve matching accuracy. In contrast, Zheng et al. [1] tackle cross-view image matching as a classification task, optimizing the network using instance loss [59] for competitive results. In our work, we use TC-Loss for performance improvement. In addition, we introduce JS-Loss to symmetrically align different data distributions.
SECTION III. Methodology
To enhance the performance of cross-view geo-localization, we propose an innovative transformer-based pipeline, TransFG, which consists of a feature extraction module, an FA module, a GG module, and supervised learning, as shown in Fig. 2. TransFG is highly general in dealing with drone cross-view geo-localization. In Section III-A, we present our feature extraction model, which uses the ViT structure to generate global and local image features. In Section III-B, we describe the proposed FA module, which obtains a richer feature representation through several aggregation algorithms. In Section III-C, the proposed GG module serves as an auxiliary task that performs gradient segmentation and instance alignment. Our approach incorporates JS-Loss and TC-Loss, further enhancing cross-view image matching accuracy, as elaborated in Section III-D.
A. Feature Extraction Module
Cross-view image matching requires the module to deeply understand feature information, but the local properties inherent in CNNs hinder the analysis and extraction of contextual features. The success of the transformer is attributed to its multiheaded attention mechanism, which attends effectively to contextual features. With the increasing popularity of ViT in image fields, we find that ViT also achieves very good results in cross-view image matching.
Given the input
Extra Learnable Embedding:
The advantage of transformers lies in their stronger ability to extract contextual information. Therefore, the cls_token, an additional learnable parameter (Fig. 2), captures context information and serves as the representation of the global feature, labeled as
Transformer Encoder: The transformer encoder extracts the contextual semantic relationships between each patch, incorporating positional embeddings as input. The output is a feature vector of the same dimension as the original input after multihead attention.
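To make the extraction step concrete, the following is a minimal PyTorch sketch (not the authors' exact implementation; all module and parameter names are our own assumptions) of a ViT-style backbone that returns the cls_token output as the global feature and the patch-token outputs as local features.

```python
import torch
import torch.nn as nn


class ViTFeatureExtractor(nn.Module):
    """Minimal ViT-style extractor: one global (cls) feature plus per-patch local features."""

    def __init__(self, img_size=256, patch_size=16, dim=768, depth=12, heads=12):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # Patch embedding via a strided convolution.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        # Extra learnable cls_token and positional embeddings.
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, x):
        b = x.size(0)
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)    # (B, N, dim) patch tokens
        cls = self.cls_token.expand(b, -1, -1)                     # (B, 1, dim)
        tokens = torch.cat([cls, tokens], dim=1) + self.pos_embed  # add positional embeddings
        out = self.encoder(tokens)                                 # contextualized tokens
        return out[:, 0], out[:, 1:]                               # global feature, local features


# Usage: a 256x256 drone or satellite image yields one global and 256 local features.
g, l = ViTFeatureExtractor()(torch.randn(2, 3, 256, 256))
print(g.shape, l.shape)  # torch.Size([2, 768]) torch.Size([2, 256, 768])
```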
B. FA Module
To tackle the problem caused by insufficient extraction of feature representations, the proposed FA module enhances the robustness of image features across views. As shown in Fig. 2, the FA module consists of three parts: first, the local feature optimizer (LFO) assigns the global feature to local features and then optimizes them via weighted averaging. Next, the channel attention optimizer (CAO) enhances the robustness of local features through channel attention. Finally, the global feature optimizer (GFO) fuses the attention-enhanced local features with the global features through feature rendering.
1) Local Feature Optimizer:
Inspired by the successful PCB technique [57], we enhance local feature robustness by integrating global features through weighted averaging. Meanwhile, a novel parameter,
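To illustrate the weighted-averaging idea only (the exact formulation with the novel parameter above is not reproduced here), the sketch below blends each patch-level local feature with the global cls feature using a scalar weight alpha, which we treat as the parameter studied later in the ablations; the convex-combination form is our assumption.

```python
import torch


def local_feature_optimizer(global_feat, local_feats, alpha=0.5):
    """LFO sketch: blend the global (cls) feature into every local patch feature.

    global_feat: (B, D); local_feats: (B, N, D); alpha weights the local term.
    The convex-combination form is an illustrative assumption.
    """
    return alpha * local_feats + (1.0 - alpha) * global_feat.unsqueeze(1)


g, l = torch.randn(2, 768), torch.randn(2, 256, 768)
print(local_feature_optimizer(g, l, alpha=0.7).shape)  # torch.Size([2, 256, 768])
```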
2) Channel Attention Optimizer:
Inspired by SE-Net [60] and the similarity between ViT’s patch representation and CNN multichannel convolutional feature representation, as shown in Fig. 3, we cast the CAO onto the ViT model. This extension enables ViT to effectively capture the correlations between the different channels of each patch feature according to the needs of cross-view image matching, thereby enhancing the robustness of ViT’s local features.
By combining the max-pooling and average-pooling methods [61], as depicted in Fig. 2, the CAO spatially compresses local features to calculate the channel attention map, denoted as
Here,
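The channel attention itself can be sketched in a CBAM/SE style adapted to patch tokens: the patch (spatial) dimension is compressed by both average and max pooling, a small shared MLP produces the per-channel weights, and the resulting attention map reweights every patch feature. The reduction ratio and MLP form below are our assumptions.

```python
import torch
import torch.nn as nn


class ChannelAttentionOptimizer(nn.Module):
    """CAO sketch: channel attention over ViT patch tokens (assumed SE/CBAM-style form)."""

    def __init__(self, dim=768, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, dim // reduction),
                                 nn.ReLU(inplace=True),
                                 nn.Linear(dim // reduction, dim))

    def forward(self, local_feats):                        # (B, N, D)
        avg = self.mlp(local_feats.mean(dim=1))            # average-pool over patches
        mx = self.mlp(local_feats.max(dim=1).values)       # max-pool over patches
        attn = torch.sigmoid(avg + mx).unsqueeze(1)        # (B, 1, D) channel attention map
        return local_feats * attn                          # reweight each channel of every patch
```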
3) Global Feature Optimizer:
Motivated by GRU [62], which uses a gate mechanism to update and forget sequential features, we propose a GFO to enhance the representation of global features through feature rendering. It enables global features to selectively forget irrelevant information while retaining essential details. As shown in Fig. 2, the novel learnable parameters,
Memory weights
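A speculative sketch of the gate idea follows: a GRU-like update gate decides how much of the aggregated local context is written into the global feature and how much of the original global feature is retained. The specific gate equations and the way local context is summarized are our assumptions, not the paper's formulas.

```python
import torch
import torch.nn as nn


class GlobalFeatureOptimizer(nn.Module):
    """GFO sketch: GRU-inspired gate that renders local context into the global feature."""

    def __init__(self, dim=768):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)   # memory/forget weights (assumed form)
        self.cand = nn.Linear(2 * dim, dim)   # candidate update from local context

    def forward(self, global_feat, local_feats):           # (B, D), (B, N, D)
        ctx = local_feats.mean(dim=1)                      # summarize local features
        z = torch.sigmoid(self.gate(torch.cat([global_feat, ctx], dim=-1)))
        h = torch.tanh(self.cand(torch.cat([global_feat, ctx], dim=-1)))
        return (1 - z) * global_feat + z * h               # selectively forget/retain
```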
Supervised Learning: We regard the classification results obtained by the robust TransFG pipeline as supervised information. Furthermore, we apply the cross-entropy loss without label smoothing as the ID loss.
C. GG Module
Previous work has demonstrated that feature segmentation and alignment methods are helpful for image matching, but the existing methods of segmentation are unable to handle the problem caused by position offset and distance-scale uncertainty. Inspired by the success of the part-level segmentation method, we propose the GG module to achieve instance-level feature segmentation and alignment. The GG module consists of the instance segmentation module (ISM) and the instance alignment module (IAM), corresponding to instance segmentation and alignment, respectively.
1) Instance Segmentation Module:
Inspired by the fact that image gradient calculations are good at contour extraction, we apply the ISM to the local features
In the first step, we aggregate local features along the channel dimension to obtain an e-value representation as follows:
Here,
In the second step, we enhance gradient descent convergence in our module by regularizing local features
Here,
In the third step, we compute the second-order gradient feature between adjacent local e-values. The experiment demonstrates that the second-order gradient achieves the best results in instance segmentation. It can be formulated as follows:
Here,
After computing the second-order gradient features, we perform instance segmentation on local features. Initially, we sort these gradient features in descending order and equally divide the patches into
Here,
Fig. 4. (Left column) Input images from both drone and satellite views of the same location. (Middle column) Feature maps generated by instance segmentation. (Right column) Feature vectors obtained through average pooling of each instance.
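The three steps above can be sketched as follows; the channel aggregation, normalization choice, zero-padding of the gradient, and the equal-size split of the sorted patches are our assumptions for illustration.

```python
import torch
import torch.nn.functional as F


def instance_segmentation(local_feats, num_parts=3, eps=1e-6):
    """ISM sketch: gradient-guided grouping of patch features into instance parts.

    local_feats: (B, N, D) patch features; returns integer part labels of shape (B, N).
    """
    b, n, _ = local_feats.shape
    # Step 1: aggregate each patch along the channel dimension into a scalar e-value.
    e = local_feats.mean(dim=-1)                                    # (B, N)
    # Step 2: regularize the e-values (standardization is our assumed choice).
    e = (e - e.mean(dim=1, keepdim=True)) / (e.std(dim=1, keepdim=True) + eps)
    # Step 3: second-order gradient between adjacent e-values.
    first = e[:, 1:] - e[:, :-1]                                    # (B, N-1)
    second = first[:, 1:] - first[:, :-1]                           # (B, N-2)
    second = F.pad(second, (1, 1))                                  # pad back to length N
    # Sort patches by gradient magnitude (descending) and split equally into parts.
    order = torch.argsort(second.abs(), dim=1, descending=True)     # (B, N)
    part_of_rank = (torch.arange(n, device=local_feats.device) * num_parts) // n
    labels = torch.zeros_like(order)
    labels.scatter_(1, order, part_of_rank.expand(b, n).contiguous())
    return labels                                                    # values in {0, ..., num_parts-1}
```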
2) Instance Alignment Module:
To improve instance recognition, we introduce the IAM as alignment supervision to enhance the model’s ability to distinguish instances (see Fig. 4). For each instance class
Here,
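As a companion sketch to the segmentation above, the instance-level vectors that the IAM supervises can be obtained by average pooling the patch features within each instance class; the pooling choice mirrors the figure caption, while the function name is ours.

```python
import torch


def instance_alignment_features(local_feats, labels, num_parts=3):
    """IAM input sketch: average-pool patch features per instance class.

    local_feats: (B, N, D); labels: (B, N) from the ISM sketch above.
    Returns (B, num_parts, D) instance-level vectors used for alignment supervision.
    """
    parts = []
    for k in range(num_parts):
        mask = (labels == k).unsqueeze(-1).float()                        # (B, N, 1)
        pooled = (local_feats * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)
        parts.append(pooled)
    return torch.stack(parts, dim=1)                                      # (B, num_parts, D)
```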
D. Loss
1) TC-Loss:
In our model evaluation, we use the Euclidean distance to assess instance similarity. To facilitate domain alignment, we use triplet loss, defined as
Center-loss minimizes feature distances to class centers. It can be defined as follows:
Here,
Given this, we propose TC-Loss, a combination of center loss and triplet loss, denoted as follows:
Here,
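A hedged sketch of combining the two terms is given below: a standard PyTorch triplet margin loss (margin 0.25, as used in our training settings) plus a center-loss term with learnable class centers; the relative weighting is an assumption.

```python
import torch
import torch.nn as nn


class TCLoss(nn.Module):
    """TC-Loss sketch: triplet loss plus center loss (weighting scheme assumed)."""

    def __init__(self, num_classes, feat_dim, margin=0.25, center_weight=0.01):
        super().__init__()
        self.triplet = nn.TripletMarginLoss(margin=margin)        # Euclidean by default
        self.centers = nn.Parameter(torch.randn(num_classes, feat_dim))
        self.center_weight = center_weight

    def forward(self, anchor, positive, negative, labels):
        # Triplet term: pull matched cross-view pairs together, push nonmatched apart.
        tri = self.triplet(anchor, positive, negative)
        # Center term: minimize distances between features and their class centers.
        ctr = ((anchor - self.centers[labels]) ** 2).sum(dim=1).mean()
        return tri + self.center_weight * ctr
```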
2) JS-Loss:
Cross-view geo-localization is a multiple-input, multiple-output task that mainly accomplishes image matching across different domains. Given this, we introduce the self-distillation method to establish learning relationships and reduce the distance between the same instances across different domains. The KL divergence formula is given as
Here,
To ensure symmetry in the twin-branch structure, we use the symmetric JS-Loss for mutual learning
Here,
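The symmetric objective can be written compactly as the JS-style average of two KL terms against the mean distribution of the two branches; the sketch below assumes each branch outputs classification logits.

```python
import torch
import torch.nn.functional as F


def js_loss(logits_a, logits_b):
    """Symmetric JS-style mutual-learning loss between two view branches (sketch)."""
    p_a = F.softmax(logits_a, dim=1)
    p_b = F.softmax(logits_b, dim=1)
    m = 0.5 * (p_a + p_b)                                   # mean distribution
    kl_a = F.kl_div(m.log(), p_a, reduction="batchmean")    # KL(p_a || m)
    kl_b = F.kl_div(m.log(), p_b, reduction="batchmean")    # KL(p_b || m)
    return 0.5 * (kl_a + kl_b)
```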
The overall loss function is defined as follows:
Here,
SECTION IV. Experiment
In our experiments, we evaluate the performance of the TransFG pipeline for cross-view image matching on the University-1652 dataset and compare it with state-of-the-art CNN and transformer models in terms of AP and R@1 scores.
A. Experimental Settings
Our model is implemented in the PyTorch framework, and all the experiments are performed on an NVIDIA RTX 3090 Ti GPU.
1) In Terms of Network Structure:
We propose an innovative transformer-based pipeline, TransFG, for robust cross-view image matching. It synergistically takes advantage of FA and GG, achieving an effective balance between feature representation and alignment. The FA module implicitly learns salient features and dynamically aggregates contextual features from the ViT. Meanwhile, the GG module automatically calculates feature gradients to implement patch-level segmentation and instance-level alignment.
2) In Terms of Training Strategy:
We set an initial learning rate of 0.0001 over 140 epochs, with LR decay by a factor of 0.1 at the 80th and 130th epochs. The classifier module used Kaiming initialization [63]. Data augmentation included resizing to
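The learning-rate schedule described above maps directly onto a MultiStepLR scheduler; the optimizer choice and the placeholder classifier head below are illustrative assumptions, while the milestones and decay factor follow the text.

```python
import torch
from torch import nn, optim
from torch.optim.lr_scheduler import MultiStepLR

model = nn.Linear(768, 701)                 # placeholder classifier head (assumption)
nn.init.kaiming_normal_(model.weight)       # Kaiming initialization for the classifier module
optimizer = optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)   # optimizer type assumed
scheduler = MultiStepLR(optimizer, milestones=[80, 130], gamma=0.1)

for epoch in range(140):
    # ... forward/backward passes over drone-satellite training pairs would go here ...
    optimizer.step()        # placeholder update; real code steps once per batch
    scheduler.step()        # decay LR by 0.1 at the 80th and 130th epochs
```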
3) In Terms of the Loss Function:
We use cross-entropy loss for classification. Moreover, we use TC-Loss (margin = 0.25) to minimize the distance between the same instances in different views. In addition, we use the JS divergence loss to bring the data distributions of different views closer.
4) In Terms of Testing:
We use Euclidean distance to calculate the similarity between the query image and the candidate images in the satellite gallery.
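A small illustrative snippet (function name ours) of this retrieval step, ranking gallery candidates by pairwise Euclidean distance:

```python
import torch


def rank_gallery(query_feats, gallery_feats):
    """Rank satellite-gallery candidates for each query by Euclidean distance (smaller is better)."""
    dists = torch.cdist(query_feats, gallery_feats, p=2)   # (Q, G) pairwise distances
    return dists.argsort(dim=1)                            # ranked gallery indices per query
```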
B. Datasets and Evaluation Protocols
We conducted extensive experiments using the University-1652 dataset [1], which comprises images from various perspectives. The University-1652 dataset collects 1652 buildings from 72 universities around the world and contains images from three perspectives: satellite, ground, and drone views. The training set includes 701 buildings from 33 universities, while the test set contains 951 buildings from 39 universities. The training data consist of 701 satellite, 37854 drone, and 11640 street-view images, totaling 50195 images. No overlap exists between the training and test sets. The dataset defines two tasks: drone-view target localization (Drone → Satellite) and drone navigation (Satellite → Drone).
Task 1 [Drone-View Target Localization (Drone → Satellite)]: given a drone-view query image, retrieve the corresponding satellite-view image of the same geographic target.
Task 2 [Drone Navigation (Satellite → Drone)]: given a satellite-view query image, retrieve the drone-view images of the same geographic target.
To evaluate method performance, we use two metrics: Recall@K (R@K) [64] and average precision (AP) [65]. R@K quantifies the likelihood of correct matches within the top-k ranked image results, while AP provides a mean measure of matching performance across multiple ground truths.
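For reference, both metrics can be computed from the distance-ranked gallery indices as follows; this is a standard retrieval-style implementation, not code from the paper.

```python
import torch


def recall_at_k(ranked_idx, query_labels, gallery_labels, k=1):
    """R@K: fraction of queries with at least one true match among the top-k results."""
    topk = gallery_labels[ranked_idx[:, :k]]                          # (Q, k) labels of top-k hits
    return (topk == query_labels.unsqueeze(1)).any(dim=1).float().mean().item()


def mean_average_precision(ranked_idx, query_labels, gallery_labels):
    """AP averaged over queries: precision evaluated at the rank of each true match."""
    aps = []
    for q in range(ranked_idx.size(0)):
        hits = (gallery_labels[ranked_idx[q]] == query_labels[q]).float()
        if hits.sum() == 0:
            aps.append(0.0)
            continue
        ranks = torch.arange(1, hits.numel() + 1, dtype=torch.float)
        precision_at_hit = hits.cumsum(0) / ranks
        aps.append((precision_at_hit * hits).sum().item() / hits.sum().item())
    return sum(aps) / len(aps)
```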
C. Comparison With SOTA
On the University-1652 dataset, we evaluate our TransFG on both the Satellite → Drone and Drone → Satellite tasks and compare it with existing SOTA methods, as reported in Table I.
D. Ablation Studies
To verify the effectiveness of our proposed method, we designed several comparative experiments, including the effect of the FA module in TransFG, the effect of the number of GG module parts, the impact of overlap in functionality among the various modules, the effect of TC-Loss and JS-Loss, the robustness of the GG module to position shifting, the effect of the input image size, the effect of various extreme image conditions, the effect of the drone distance to the geographic target, the effect of gradients on the division of instances, the effect of transformer-based and CNN-based methods on cross-view image matching, the effect of the parameter α in the LFO, and the effect of various ViT backbones.
1) Effect of the FA Module in TransFG:
We propose the FA module to enhance cross-view feature representation. Comparative analysis reveals a significant performance improvement with the inclusion of the FA module in our pipeline, as presented in Table II. We also evaluated the impact of combining the three FA module components, concluding that the optimal results are achieved when all three components are used together. Our findings demonstrate the positive influence of the FA module on enhancing feature representation for cross-view image matching.
2) Effect of the Number of GG Module Parts:
The number of divided instances is an important indicator of the robustness of our segmentation network model. We experimented to verify the effect of the number of divided instances, as shown in Table III. The results demonstrate optimal performance when partitioning into
3) Impact of Overlap in Functionality Among Various Modules:
We conducted extensive experiments, and the results are shown in Table IV. These ablation studies verify the effectiveness of each component and their combinations in our proposed scheme (FA, GG, TC-Loss, JS-Loss). For instance, as illustrated in Table IV, the second, fourth, and sixth rows show that ViT + FA + GG + TC-Loss > ViT + FA + TC-Loss > ViT + TC-Loss, indicating that there is no redundancy between the FA and GG modules. Analogously, by comparing the first, second, and fourth rows, we find that ViT + FA + TC-Loss > ViT + TC-Loss > ViT + triplet-loss, demonstrating that there is no redundancy between the FA module and TC-Loss. Our incremental combination experiments isolate the contribution of each module to the overall effect. Therefore, these results indicate that there is no duplication of effects between modules.
4) Effect of TC-Loss and JS-Loss:
In the cross-view image matching task, we used the TC-Loss and JS-Loss strategies with a 0.25 margin, enhancing matching performance as shown in Table V. TC-Loss alone improved Drone
5) Robustness of GG Module to Position Shifting:
To verify the robustness of the GG module to position shifting, we compared the relative position shifting results of the GG module and FSRA (2022). Verification methods for positional shifting include BlackPad (BP) and FlipPad (FP), as shown in Fig. 5. BP inserts a black block of width
6) Effect of the Input Image Size:
To balance the input image size and memory usage, we conducted experiments on the GG module with a region number of
7) Effect of Various Extreme Image Conditions:
To assess the robustness of our method under various extreme image conditions, as shown in Fig. 6, we applied Gaussian noise, Gaussian blur, low light, and thin cloud simulation data augmentation to the entire test dataset. The results in Table VIII indicate that, although performance is slightly lower than under normal conditions, our method still outperforms FSRA (2022) evaluated on unperturbed images. Therefore, our approach still achieves satisfactory results under various extreme conditions.
8) Effect of the Drone Distance on the Geographic Target:
While the scale of the University-1652 satellite-view images is fixed, the scale of the drone-view images varies dynamically with the distance of the drone from the geographic target. Based on the distance between the drone and the target building, the University-1652 dataset is divided into three parts: short, medium, and long. We compared the effects of the GG module and FSRA at the three distance levels, as shown in Table IX. Accuracy is lowest at long distances and highest at medium distances. Therefore, the GG module has better scale-adaptive capabilities.
9) Effect of Gradients on the Division of Instances:
We conducted ablation experiments on gradient-based instance division. As shown in Table X, we find that the first-order gradient is better than using no gradient, improving AP and R@1 by about 1%. The second-order gradient feature achieves the best results. We conclude that using second-order gradient features can effectively divide instances and improve the accuracy of cross-view image matching.
10) Effect of Transformer-Based and CNN-Based Methods on Cross-View Image Matching:
We conducted comparative experiments on the impact of transformer-based and CNN-based methods on cross-view image matching, comparing our TransFG pipeline with transformer-based and CNN-based networks. As shown in Table I, TransFG not only outperforms the existing transformer-based and CNN-based methods in terms of accuracy but also maintains competitive inference times. For example, TransFG outperforms ResNet-50 by 22.92%, ResNet-101 by 19.18%, and ViT-S/16 by 12.53%. This demonstrates that transformer-based methods surpass CNN-based approaches by capturing rich contextual features in cross-view image matching. In addition, it demonstrates that TransFG, as an effective feature extraction structure, can handle cross-view image matching well.
11) Effect of the Parameter α in the LFO:
We evaluated different
12) Effect of Various ViT Backbones:
There are several variants of ViT, which outperform the basic ViT model. Therefore, we systematically assessed the performance of various transformer models in cross-viewpoint image matching using different ViT variants as the backbone. As shown in Table XII, it is evident that all the ViT variants outperform the basic ViT model. Notably, T2T-ViT exhibits the most promising results. Based on these findings, we are considering adopting T2T-ViT as a replacement for the basic ViT model in future work.
E. Visualization of Qualitative Results
For the two basic tasks of the University-1652 dataset, drone-view target localization and drone navigation, we visualize some matching results in Fig. 7. For drone-view target localization, we randomly selected eight test drone-view images and successfully retrieved the top five matching images from the gallery set, yielding completely accurate results [Fig. 7(a)]. In the drone navigation task, we similarly selected eight test satellite-view images and obtained perfect matches by selecting the top five similar images from the gallery set [Fig. 7(b)]. These results exemplify the effectiveness of our model in cross-view scenarios, emphasizing its robustness and practicality in real-world contexts.
SECTION V. Conclusion
In this article, we propose an innovative transformer-based pipeline, TransFG, which incorporates the FA and GG modules and synergistically takes advantage of both. The FA module enhances the robustness of feature representation by learning salient features and dynamically aggregating contextual features. Furthermore, the GG module addresses position offset and distance-scale uncertainty through instance-level feature alignment, enhancing the practicality of cross-view image matching. Moreover, we explore TC-Loss and JS-Loss to implicitly bring the data distributions of different views closer, further improving cross-view image matching performance.
Extensive experiments on the University-1652 dataset demonstrate three aspects.
1) TransFG outperforms the existing methods, achieving a Recall@1 of 90.16% with an AP of 84.61% in the Satellite → Drone task and a Recall@1 of 84.01% with an AP of 86.31% in the Drone → Satellite task.
2) To assess generalizability, we applied Gaussian noise, Gaussian blur, low light, and thin cloud data augmentation to the entire test dataset. The experiments demonstrate that our pipeline also outperforms the current SOTA under these conditions.
3) Evidently, the TransFG pipeline is overall a better choice for image matching tasks in cross-view geo-localization of satellite and UAV imagery.
Given the challenging nature of comprehensive geographic datasets, we believe the benefits afforded by our pipeline are a significant step toward the advancement of cross-view geo-localization. We envisage that the proposed effective TransFG pipeline will also have a positive impact on feature representation, as well as on addressing positional shifts and distance-scale uncertainty. In addition, we also endeavor to establish a more densely populated dataset to attain precise localization.