Thursday, February 1, 2024

TransFG: A Cross-View Geo-Localization of Satellite and UAVs Imagery Pipeline Using Transformer-Based Feature Aggregation and Gradient Guidance



Published in: IEEE Transactions on Geoscience and Remote Sensing



Abstract:

Cross-view geo-localization between satellite and unmanned aerial vehicle (UAV) imagery has attracted extensive attention due to its tremendous potential for navigation in global navigation satellite system (GNSS)-denied environments. However, inadequate feature representation across different views, coupled with positional shifts and distance-scale uncertainty, remains a key challenge. Most existing research focuses on extracting comprehensive and fine-grained information, yet effective feature representation and alignment deserve equal importance. In this article, we propose an innovative transformer-based pipeline, TransFG, for robust cross-view image matching, which incorporates a feature aggregation (FA) module and a gradient guidance (GG) module. TransFG synergically takes advantage of FA and GG, achieving an effective balance between feature representation and alignment. Specifically, the proposed FA module implicitly learns salient features and dynamically aggregates contextual features from the vision transformer (ViT). The proposed GG module uses the gradient information of local features to further enhance the cross-view feature representation and aligns specific instances across different views. Extensive experiments demonstrate that our pipeline outperforms existing methods in cross-view geo-localization, achieving an impressive improvement in R@1 and AP over the state-of-the-art (SOTA) methods. The code has been released at https://github.com/happyboy1234/TransFG .
 
Article Sequence Number: 4700912
Date of Publication: 10 January 2024
Publisher: IEEE

SECTION I. Introduction

Unmanned aerial vehicles (UAVs) were originally developed through the 20th century for military missions too “dull, dirty or dangerous” for humans, and by the 21st century their use had expanded to numerous real-life applications, including aerial photography, product delivery, agriculture, science, disaster relief, policing, and surveillance [1], [2], [3], [4]. Regardless of the application, precise navigation [5] is necessary and usually depends on a global navigation satellite system (GNSS) such as GPS, GLONASS, or BeiDou for geo-localization. Since the signal is vulnerable to intentional jamming and spoofing attacks by an adversary, and naturally susceptible to blockages and reflections in radio signal paths, visual geo-localization has become crucial for UAVs in GNSS-denied environments. The essence of visual geo-localization is cross-view image matching between UAV imagery and preexisting aerial or satellite imagery [6], [7]. An image matching system that achieves comparable accuracy would lower costs and benefit a variety of UAV platforms. We therefore concern ourselves with overcoming these external navigation failures through cross-view geo-localization.

The most important challenge between drone views and preexisting imagery is the significant difference in perspective. Satellite imagery usually captures ground data from a vertical angle, as shown in Fig. 1 (left column), while drones typically capture ground data at a slanted angle, as depicted in the right column. Existing image matching algorithms [8], [9], [10] have enabled significant progress in determining the position of UAVs and are now pushing forward the state-of-the-art as well. Even so, current solutions for cross-view matching of satellite and UAV imagery are still far from being practically useful, arguably due to the viewpoint gap. Such cross-view image matching suffers from complex distortions, such as lighting and perspective changes, background changes, partial occlusions, and nonrigid distortions of objects appearing in the scene.

Fig. 1. Images from University-1652: (left column) satellite view and (right column) UAV view.

The dominant image matching mechanisms are feature point matching [11], [12], [13] and feature space matching [8], [9], both widely applied in geo-localization work. The former compares descriptors of drone images with descriptors of reference satellite images, while the latter learns to map matching image pairs closer in the feature space and nonmatching image pairs farther apart. These works have shown promising results for UAV localization, yet the paucity of feature representation caused by viewpoint differences remains. In addition, the number of annotations, the consistency of cross-view feature point annotations, the need for manual annotation, and related factors greatly affect the accuracy and efficiency of feature point matching. Feature space matching methods are relatively less restrictive in terms of annotation. In this article, we adopt feature space matching to achieve cross-view image matching.

Most previous works [1], [14], [15], [16], [17] address geo-localization using the feature space matching method. The existing work on this task has followed the traditional approach of supervised deep learning [14], [15], [16], [17], [18], [19], [20]. Some early methods [21], [22], [23], [24] focus on extracting hand-crafted features, which are too monotonous to distinguish the feature differences presented from diverse perspectives. Inspired by the success of convolutional neural networks (CNNs) on ImageNet, researchers have resorted to deeply learned features in recent years. Further works [15], [16] explore deep neural networks with metric learning to learn discriminative features. Specifically, the network is designed to learn a feature space that pulls matched image pairs closer and pushes nonmatched pairs far apart [25], [26], [27]. Feature extraction through CNNs facilitates cross-view image matching, exemplified by prominent benchmarks such as CVM-Net [17], orientation [28], LPN [29], and other main benchmarks. However, images from different views undergo positional transformations, such as rotation, scaling, and offsetting. Consequently, there are potential problems in the existing CNN methods. First, cross-view image matching requires the extraction of relevant information from context. While attention mechanisms and directional information have been widely applied in network design [1], [28], [30], the existing CNN-based attention methods primarily concentrate on the central building, incorporating global information through aggregate functions but often overlooking contextual information. Second, the CNN downsampling operation reduces image resolution and destroys fine-grained features of the image. Therefore, it is necessary to understand the semantic information of the global and regional context. Similarly, in natural language processing (NLP), attention models [31] efficiently use the given context information. Inspired by these cases, the vision transformer (ViT) [32] brought the attention concept to vision, using a multiheaded attention mechanism to focus on contextual information. In view of this, ViT, as a strong context-sensitive information extractor, can play a role in cross-view image matching [10].

We propose a novel and efficient transformer-based pipeline, TransFG, for cross-view image matching, as shown in Fig. 2, which deals with the limitations mentioned above. TransFG incorporates the feature aggregation (FA) and gradient guidance (GG) modules and synergically takes advantage of both, achieving an effective balance between feature representation and alignment. Specifically, the FA module implicitly learns salient features and takes the contextual features from the ViT a step further through weighted averaging, channel attention, and feature rendering. Meanwhile, the GG module automatically calculates the feature gradient to implement patch-level segmentation and instance-level alignment. Furthermore, we combine triplet loss [33] and center loss [34] (TC-Loss) to bring the data distributions of different views closer. In addition, the symmetric self-distillation JS-Loss [35] is implemented in our task, further improving the performance of cross-view image matching. In summary, the main contributions of this article are as follows.

  1. We propose an innovative transformer-based pipeline, TransFG, for robust cross-view image matching, which incorporates the FA and GG modules. TransFG synergically takes advantage of FA and GG, achieving an effective balance between feature representation and alignment.

  2. To tackle the insufficient extraction of feature representation between different views, the proposed FA module implicitly learns salient features and takes the contextual features from the ViT a step further through weighted averaging, channel attention, and feature rendering.

  3. To address position offset and distance-scale uncertainty, the proposed GG module enhances the model’s understanding of instance distribution. Without additional supervision, the GG module automatically calculates feature gradients to implement patch-level segmentation and instance-level alignment.

  4. To further improve the performance of cross-view image matching, we explore JS-Loss and TC-Loss to bring the data distributions of different views closer. Our approach establishes a new state-of-the-art (SOTA) on the large-scale University-1652 dataset, with significant improvements on both the drone-view target localization and drone navigation tasks.

Fig. 2. Overview of the proposed TransFG. Given an image pair (satellite and drone), we use ViT to extract both local and global features. The proposed FA module comprises three components: LFO, CAO, and GFO modules. Global features are used for supervised cross-view image matching, while local features undergo processing through the GG module. The proposed GG module uses ISM to divide local features into instances, and IAM aligns these instance features. In addition, TC-Loss is applied to each branch to minimize the distance between matching feature contents.

SECTION II. Related Work

Cross-view geo-localization has received significant interest in recent years. For locating a UAV image within a satellite database, existing works primarily draw on techniques from image matching. We briefly review essential cross-view geo-localization methods, covering feature extraction, aggregation, alignment, and specific loss functions.

A. Feature Extraction

Initial works on cross-view geo-localization have explored extracting handcrafted features, such as GIS [21] and DT [22]. However, these methods with hand-crafted features struggle to reconcile the significant variations in appearance across different views.

1) CNN-Based Method:

The advancement of CNNs enhances cross-view matching performance. The work by Workman and Jacobs [36] is the first attempt to extract cross-view image features using a pretrained CNN, demonstrating the significance of high-level semantic information and its superiority over hand-crafted features. Workman et al. [37] achieved higher performance by fine-tuning a pretrained CNN to minimize the distance between pairs of different-view images. Inspired by face verification approaches, Lin et al. [38] use a modified Siamese network [39] to optimize network parameters with a contrastive loss [26]. Hu et al. [17] insert Net-VLAD [40] into the CNN, which improves the robustness of image features under large viewpoint changes. Cai et al. [41] integrate channel and spatial attention modules into a residual network trained with a hard-example-reweighted triplet loss, which is able to highlight salient features in both views. In a recent work, Zheng et al. [42] combine image convolutional features and semantic word vectors and introduce attention mechanisms and bilinear techniques to enhance information for multiclass classification tasks. Furthermore, Luo et al. [43] enhance the differential representation between dual-temporal HSIs through multiscale feature fusion and temporal feature learning. Besides, Guo et al. [44] combine CNN and LSTM to extract spatial–spectral features, addressing the issue of insufficient global properties and spectral information in HSI classification. Nevertheless, the inherent local properties of CNNs hinder the analysis and extraction of geospatial information in images. As a result, these methods fail to generate features discriminative enough to handle drastic viewpoint changes.

2) Transformer-Based Method:

Inspired by the success of ViT, L2LTR [45] and TransGeo [46] apply transformer blocks to model global dependencies. Chen et al. [47] design a cross-view transformer for feature mapping together with a cross-view consistency loss. Furthermore, Yang et al. [45] introduced a self-cross-attention mechanism to enhance feature representation. Dai et al. [10] apply the ViT structure to enhance cross-view geo-localization accuracy by focusing on contextual features. Thus, compared with CNN-based methods, transformer-based methods prioritize contextual features, so we adopt transformer-based contextual feature extraction for image matching in our study.

B. Feature Aggregation

To obtain more discriminative representation, there have been several efforts focused on the design of FA. Hu and Lee [48] use a dual-branch CNN with independent Net-VLAD layers to better encode local features. Feature representations from both the views are integrated into a shared embedding space for image matching. Sun et al. [49] combine ResNet with the capsule layer to represent high-level semantics. Shi et al. [50] apply the multihead attention module as the FA model to encode spatial information. In our research, we apply weighted averaging, channel attention, and feature rendering to aggregate features, resulting in a robust feature representation.

C. Feature Alignment

The primary objective of these studies is to explicitly exploit image and feature correspondences, learning viewpoint adaptation and feature alignment to address domain discrepancies in an explicit manner. Liu and Li [28] use orientation as supportive information for feature alignment. Shi et al. [15] introduce a feature-transport layer for learning feature transformations, which includes an efficient polar transformation algorithm to align UAV images with others. Emphasizing fine-grained information about different parts supports the model in learning comprehensive cross-viewpoint geo-localization features. Part-based fine-grained features have been demonstrated to be reliable in matching tasks, as shown in various studies [51], [52], [53], [54], [55]. AlignedReID++ [56] automatically aligns slice information without introducing additional supervision to address pedestrian misalignment due to occlusions, view changes, and pose biases. PCB [57] applies a horizontal segmentation approach to extract high-level segmentation features from human body parts. LPN [29] uses a square-ring feature partition strategy based on prealigned datasets with neighboring areas as supplementary information. Dai et al. [10] calculate and align features in different regions of the image. Inspired by these part-based feature alignment works, we propose the GG module, which uses gradient segmentation and instance alignment to significantly enhance image matching accuracy.

D. Matching Loss

Metric learning via deep networks is highly relevant for cross-view image matching tasks, designing different training objectives to learn discriminative representations. Vo and Hays [58] use an orientation regression loss for performance enhancement. Hu et al. [17] use a weighted soft ranking loss to expedite training convergence and improve matching accuracy. In contrast, Zheng et al. [1] tackle cross-view image matching as a classification task, optimizing the network using instance loss [59] for competitive results. In our work, we use TC-Loss for performance improvement. In addition, we introduce JS-Loss to symmetrically align different data distributions.

SECTION III. Methodology

To enhance the performance of cross-view geo-localization, we propose an innovative transformer-based pipeline, TransFG, which consists of a feature extraction module, an FA module, a GG module, and supervised learning, as shown in Fig. 2. TransFG is highly general in dealing with drone cross-view geo-localization. In Section III-A, we present our feature extraction model, which uses the ViT structure to generate global and local image features. In Section III-B, we describe the proposed FA module, which obtains a richer feature representation through several aggregation algorithms. In Section III-C, the proposed GG module serves as an auxiliary task that performs gradient segmentation and instance alignment. Our approach incorporates JS-Loss and TC-Loss, further enhancing cross-view image matching accuracy, as elaborated in Section III-D.

A. Feature Extraction Module

Cross-view image matching requires the model to deeply understand feature information, but the local properties inherent in CNNs hinder the analysis and extraction of contextual features. The success of the transformer is attributed to the fact that the multiheaded attention mechanism pays effective attention to contextual features. With the increasing popularity of ViT in image fields, we find that ViT also achieves very good results in cross-view image matching.

Given an input $x \in \mathbb{R}^{H \times W \times C}$, where $H$, $W$, and $C$ represent the height, width, and channels, respectively, the image is divided into $N$ fixed-size patches $\{L_i \mid i = 1, 2, \ldots, N\}$, which are flattened into a sequence that captures local information for extracting local features. The output sequence serves as the local image features ($L$), while an additional learnable embedding token, represented as $x_{cls}$, is integrated with spatial information to extract robust global image features ($G$). Moreover, positional information is incorporated into each patch through learnable positional embeddings. The input sequence is denoted as follows:

$$Z_0 = [G, L_1, L_2, \ldots, L_N] + p \tag{1}$$
where $Z_0$ represents the input sequence embeddings and $p \in \mathbb{R}^{(N+1) \times D}$ is the position embedding; a linear projection maps each patch to the $D$-dimensional embedding space. The multiheaded attention mechanism of the transformer gives each transformer layer global insight, which overcomes the limitation of the CNN's receptive field. Meanwhile, no further downsampling operations are required, so the fine-grained features of the image are well preserved.

Extra Learnable Embedding: The advantage of transformer lies in their stronger ability to extract contextual information. Therefore, the cls_token, an additional learnable parameter (Fig. 2), captures context information, serving as the representation for global features, labeled as G . In addition, the sequence of N fixed-size patches in the output characterizes local features, referred to as L .

Transformer Encoder: The transformer encoder extracts the contextual semantic relationships between each patch, incorporating positional embeddings as input. The output is a feature vector of the same dimension as the original input after multihead attention.
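For concreteness, the following minimal PyTorch sketch shows how a ViT-style extractor produces one global (cls) feature and N local patch features as described above; it is an illustrative stand-in under assumed layer sizes, not the released TransFG implementation.

```python
import torch
import torch.nn as nn

class ViTFeatureExtractor(nn.Module):
    """Toy ViT-style extractor: returns a global (cls) feature and N local patch features."""
    def __init__(self, img_size=256, patch_size=16, in_chans=3, dim=768, depth=2, heads=8):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2                 # N
        self.patch_embed = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))            # extra learnable embedding G
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, dim))  # p in Eq. (1)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, x):
        B = x.size(0)
        patches = self.patch_embed(x).flatten(2).transpose(1, 2)         # (B, N, D)
        cls = self.cls_token.expand(B, -1, -1)                           # (B, 1, D)
        z0 = torch.cat([cls, patches], dim=1) + self.pos_embed           # Eq. (1)
        z = self.encoder(z0)
        return z[:, 0], z[:, 1:]                                         # global G, local L

# usage
G, L = ViTFeatureExtractor()(torch.randn(2, 3, 256, 256))
print(G.shape, L.shape)   # torch.Size([2, 768]) torch.Size([2, 256, 768])
```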

B. FA Module

To tackle the problem caused by insufficient extraction of feature representation, the proposed FA module enhances the robustness of image features across views. As shown in Fig. 2, the FA module consists of three parts: first, the local feature optimizer (LFO) blends global features into local features via weighted averaging. Then, the channel attention optimizer (CAO) enhances the robustness of local features through channel attention. Finally, the global feature optimizer (GFO) fuses attention-enhanced local features with global features through feature rendering.

1) Local Feature Optimizer:

Inspired by the successful PCB technique [57], we enhance local feature robustness by integrating global features through weighted averaging. Meanwhile, a novel parameter, α , is introduced to control the influence range of global features on local ones. The LFO, depicted in Fig. 2, blends global features (G ) and local features (L ) using the equation below

$$L_{\text{output\_LFO}\_i} = \frac{L_i + \alpha G}{1 + \alpha}, \quad i = 1, \ldots, N \tag{2}$$
where $N$ is the length of the local feature sequence. The experimental results show that $\alpha = 8$ yields the best performance.
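A minimal sketch of this weighted averaging, assuming local features of shape (B, N, D) and a global feature of shape (B, D); names and shapes are illustrative.

```python
import torch

def local_feature_optimizer(L, G, alpha=8.0):
    """LFO sketch, Eq. (2): blend each local patch feature with the global feature.
    L: (B, N, D) local features, G: (B, D) global feature, alpha: influence weight."""
    return (L + alpha * G.unsqueeze(1)) / (1.0 + alpha)
```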

2) Channel Attention Optimizer:

Inspired by SE-Net [60] and the similarity between ViT's patches and CNN multichannel convolutional feature representations, as shown in Fig. 3, we cast the CAO onto the ViT model. This extension enables ViT to effectively capture the correlations between the different channels of each patch feature according to the needs of cross-view image matching, thereby enhancing the robustness of ViT's local features.

Fig. 3. Similarity between ViT’s patch and CNN multichannel convolutional feature representation.

By combining the max-pooling and average-pooling methods [61], as depicted in Fig. 2, the CAO spatially compresses local features to calculate the channel attention map, denoted as $M_c$

$$M_c(F) = \sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))\big). \tag{3}$$

Here, σ represents the sigmoid function, and the MLP weights (W0 and W1 ) are shared. Each channel has unique weights, which are multiplied by the channel attention map Mc to obtain the final local features

$$L_{\text{output\_CAO}\_i} = M_c(L_{\text{output\_LFO}\_i}) \otimes L_{\text{output\_LFO}\_i}, \quad i \in (1, N) \tag{4}$$
where $\otimes$ denotes element-wise multiplication.
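The sketch below illustrates one plausible CBAM-style implementation of Eqs. (3) and (4). Following Fig. 3, it treats each of the N patch tokens as a "channel" and its D-dimensional vector as the spatial extent; this reading, and the reduction ratio, are assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class ChannelAttentionOptimizer(nn.Module):
    """CAO sketch, Eqs. (3)-(4): channel attention with shared MLP weights (W0, W1)."""
    def __init__(self, num_patches, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(                       # shared W0, W1
            nn.Linear(num_patches, num_patches // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(num_patches // reduction, num_patches),
        )

    def forward(self, L):                               # L: (B, N, D)
        avg = self.mlp(L.mean(dim=2))                   # AvgPool over D -> (B, N)
        mx = self.mlp(L.amax(dim=2))                    # MaxPool over D -> (B, N)
        Mc = torch.sigmoid(avg + mx)                    # channel attention map, Eq. (3)
        return L * Mc.unsqueeze(-1)                     # element-wise weighting, Eq. (4)
```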

3) Global Feature Optimizer:

Motivated by GRU [62], which uses a gate mechanism to update and forget sequential features, we propose the GFO to enhance the representation of global features through feature rendering. It enables global features to selectively forget irrelevant information while retaining essential details. As shown in Fig. 2, we introduce novel learnable parameters: $\lambda_0$ (akin to an update gate) for global features and $\lambda_i$ (akin to a forget gate) for local features, with $\lambda$ representing the memory weights for both

$$\lambda_0 = W_0 G + b_0 \tag{5}$$
$$\lambda_i = W_i L_{\text{output\_CAO}\_i} + b_i, \quad i \in (1, N) \tag{6}$$
$$\lambda = \lambda_0 + \lambda_i, \quad i \in (1, N). \tag{7}$$

Memory weights λ are applied to global feature G to dynamically update crucial features and forget irrelevant ones

$$G_{\text{output\_GFO}} = G \otimes (1 - \lambda) + L_{\text{output\_CAO}\_i} \otimes \lambda, \quad i \in (1, N). \tag{8}$$
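A minimal sketch of this GRU-like rendering, Eqs. (5)-(8). The sigmoid bounding of the memory weights and the final averaging of the per-patch outputs back to a single global vector are assumptions made only to keep the example self-contained.

```python
import torch
import torch.nn as nn

class GlobalFeatureOptimizer(nn.Module):
    """GFO sketch, Eqs. (5)-(8): gate-style fusion of local information into the global feature."""
    def __init__(self, dim):
        super().__init__()
        self.w0 = nn.Linear(dim, dim)   # update-gate weights for the global feature, Eq. (5)
        self.wi = nn.Linear(dim, dim)   # forget-gate weights for the local features, Eq. (6)

    def forward(self, G, L):            # G: (B, D), L: (B, N, D)
        lam0 = self.w0(G)                                 # Eq. (5)
        lam_i = self.wi(L)                                # Eq. (6), per patch
        lam = torch.sigmoid(lam0.unsqueeze(1) + lam_i)    # Eq. (7), bounded memory weights (assumption)
        out = G.unsqueeze(1) * (1 - lam) + L * lam        # Eq. (8)
        return out.mean(dim=1)                            # aggregate patches back to (B, D) (assumption)
```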

Supervised Learning: We regard the classification results obtained by the robust TransFG pipeline as supervised information. Furthermore, we apply the cross-entropy loss without label smoothing as the ID loss.

C. GG Module

Previous work has demonstrated that feature segmentation and alignment methods are helpful for image matching, but the existing methods of segmentation are unable to handle the problem caused by position offset and distance-scale uncertainty. Inspired by the success of the part-level segmentation method, we propose the GG module to achieve instance-level feature segmentation and alignment. The GG module consists of the instance segmentation module (ISM) and the instance alignment module (IAM), corresponding to instance segmentation and alignment, respectively.

1) Instance Segmentation Module:

Inspired by the fact that image gradient calculations are good at contour extraction, we apply the ISM to the local features $L_i \in \mathbb{R}^{B \times N \times D}$ (where $B$ stands for the batch size, $N$ for the number of patches, and $D$ for the length of the local feature corresponding to each patch).

In the first step, we aggregate local features along the channel dimension to obtain an e-value representation as follows:

$$L_{\text{output\_ISM1}\_i} = \sum_{j=1}^{D} L^{j}_{\text{output\_CAO}\_i}, \quad i \in (1, N). \tag{9}$$

Here, $L_{\text{output\_ISM1}\_i}$ denotes the e-value for the $i$th patch, and $L^{j}_{\text{output\_CAO}\_i}$ represents the $j$th element of the local feature corresponding to the $i$th patch. After this operation, the dimension of $L_{\text{output\_ISM1}}$ is $B \times N \times 1$.

In the second step, we enhance gradient descent convergence in our module by regularizing local features

$$L_{\text{output\_ISM2}\_i} = \frac{L_{\text{output\_ISM1}\_i} - \mu}{\sigma}, \quad i \in (1, N). \tag{10}$$

Here, $\mu$ represents the mean and $\sigma$ the standard deviation of all the e-values. This regularization strategy expedites convergence, promoting efficient optimization.

In the third step, we compute the second-order gradient feature between adjacent local e-values. The experiment demonstrates that the second-order gradient achieves the best results in instance segmentation. It can be formulated as follows:

$$P_i = \nabla^2 L_{\text{output\_ISM2}\_i} = \nabla L_{\text{output\_ISM2}\_{i+1}} - \nabla L_{\text{output\_ISM2}\_i}, \quad i = 1, 2, \ldots, (N-1) \tag{11}$$
where $\nabla L_{\text{output\_ISM2}\_i} = L_{\text{output\_ISM2}\_{i+1}} - L_{\text{output\_ISM2}\_i}$ is the first-order gradient between adjacent e-values.

Here, Pi denotes the second-order gradient feature of the i th patch.

After computing second-order gradient features, we perform instance segmentation on local features. Initially, we sort these gradient features in descending order and equally divide patches into n instances, where n is a hyperparameter. The partitioning is defined as follows:

$$N_i = \begin{cases} \left\lfloor \dfrac{N}{n} \right\rfloor, & i = 1, 2, \ldots, n-1 \\[4pt] N - (n-1)\left\lfloor \dfrac{N}{n} \right\rfloor, & i = n. \end{cases} \tag{12}$$

Here, $N_i$ represents the number of patches for the $i$th instance, and $\lfloor \cdot \rfloor$ denotes the floor function. We categorize the local features $L_i$ into $n$ classes, as depicted in Fig. 4 ($n = 2$), each corresponding to an instance. Furthermore, we assign a unique category label to each instance.
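The following sketch puts the four ISM steps together (Eqs. (9)-(12)); the exact form of the second-order gradient follows the discrete-difference reading reconstructed above, and the tensor shapes, padding, and label assignment are assumptions.

```python
import torch

def instance_segmentation(L, n=2):
    """ISM sketch, Eqs. (9)-(12): assign each patch to one of n instances by sorting a
    second-order gradient score of its aggregated e-value.
    L: (B, N, D) local features; returns labels of shape (B, N) in {0, ..., n-1}."""
    B, N, D = L.shape
    e = L.sum(dim=2)                                                      # Eq. (9): e-values, (B, N)
    e = (e - e.mean(dim=1, keepdim=True)) / (e.std(dim=1, keepdim=True) + 1e-6)  # Eq. (10)
    g1 = e[:, 1:] - e[:, :-1]                                             # first-order gradient
    g2 = torch.zeros_like(e)
    g2[:, :-2] = g1[:, 1:] - g1[:, :-1]                                   # second-order gradient (end padded)
    order = g2.argsort(dim=1, descending=True)                            # sort patches by gradient score
    labels = torch.zeros(B, N, dtype=torch.long, device=L.device)
    chunk = N // n                                                        # Eq. (12): equal split, remainder last
    for k in range(n):
        idx = order[:, k * chunk:] if k == n - 1 else order[:, k * chunk:(k + 1) * chunk]
        labels.scatter_(1, idx, k)
    return labels
```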

Fig. 4. (Left column) Input images from both drone and satellite views of the same location. (Middle column) Feature maps generated using instance segmentation ($n = 2$). (Right column) Feature vectors obtained through average pooling of each instance.

2) Instance Alignment Module:

To improve instance recognition, we introduce the IAM, which provides alignment supervision and enhances the model's ability to distinguish instances (see Fig. 4). For each instance class $i$, represented as $f^i$, average pooling is applied to obtain the feature content $V_i \in \mathbb{R}^{B \times N_i \times S}$, where $B$ is the batch size, $N_i$ is the number of patches in class $i$, and $S$ is the feature dimension. The equation describing this process is as follows:

$$V_i = \frac{1}{N_i}\sum_{j=1}^{N_i} f_j^{i}, \quad i \in \{1, 2, \ldots, n\}. \tag{13}$$

Here, $n$ represents the number of instance classes, and $f_j^i$ denotes the feature vector of the $j$th patch in the $i$th instance class. In summary, $V_i$ is computed by averaging the feature vectors of all the patches within each instance class to achieve feature alignment. We first obtain instance class representations and then pass each instance feature through a ClassifierLayer. Finally, we apply TC-Loss across all the classes to minimize interinstance distances, as depicted in Fig. 2.
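A minimal sketch of the instance-level average pooling in Eq. (13), reusing the labels produced by the ISM sketch above; the masked-mean formulation is an implementation choice.

```python
import torch

def instance_align_features(L, labels, n=2):
    """IAM sketch, Eq. (13): average-pool the patch features of every instance class.
    L: (B, N, D), labels: (B, N) from the ISM; returns a list of n tensors of shape (B, D)."""
    feats = []
    for k in range(n):
        mask = (labels == k).float().unsqueeze(-1)                            # (B, N, 1)
        feats.append((L * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1))    # V_k
    return feats

# Each V_k is then passed through its own ClassifierLayer (e.g., nn.Linear(D, num_classes))
# and supervised with TC-Loss, as in Fig. 2.
```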

D. Loss

1) TC-Loss:

In our model evaluation, we use the Euclidean distance to assess instance similarity. To facilitate domain alignment, we use triplet loss, defined as

$$L_{\text{Tri}} = \big[d(a, p) - d(a, n) + M\big]_+ \tag{14}$$
$$d(a, x) = \|a - x\|_2 \tag{15}$$
where $d(a, x)$ represents the Euclidean distance ($\ell_2$-norm) between vectors $a$ and $x$, and $[\cdot]_+$ denotes the ReLU operation. We set the margin $M$ to 0.25 in all the experiments.

Center-loss minimizes feature distances to class centers. It can be defined as follows:

$$L_{\text{center}} = \frac{1}{2}\sum_{t=1}^{m} \big\|L_t - c_{y_t}\big\|_2^2. \tag{16}$$

Here, $c_{y_t}$ is the center for class $y_t$, and $L_t$ is the feature before the final fully connected layer.

Given this, we propose TC-Loss, a combination of center loss and triplet loss, denoted as follows:

$$L_{\text{TC}} = \lambda L_{\text{Tri}} + L_{\text{center}}.$$

Here, λ (experimentally validated as 0.2) balances the two losses.
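A hedged sketch of TC-Loss that combines PyTorch's built-in triplet margin loss with a simple learnable-centers version of the center loss; the center implementation and its update rule are assumptions, not the paper's released code.

```python
import torch
import torch.nn as nn

class TCLoss(nn.Module):
    """TC-Loss sketch: lambda * triplet loss + center loss (margin 0.25, lambda 0.2 as in the paper)."""
    def __init__(self, num_classes, feat_dim, margin=0.25, lam=0.2):
        super().__init__()
        self.triplet = nn.TripletMarginLoss(margin=margin, p=2)          # Eqs. (14)-(15)
        self.centers = nn.Parameter(torch.randn(num_classes, feat_dim))  # learnable class centers (assumption)
        self.lam = lam

    def forward(self, anchor, positive, negative, feats, targets):
        l_tri = self.triplet(anchor, positive, negative)                 # triplet term
        l_center = 0.5 * (feats - self.centers[targets]).pow(2).sum()    # Eq. (16)
        return self.lam * l_tri + l_center
```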

2) JS-Loss:

Cross-view geolocation is a multiple input and multiple output task, which mainly accomplishes image matching among different domains. Given this, we introduce the self-distillation method to establish learning relationships and reduce the distance for the same instances among different domains. The KL divergence formula is given as

$$\mathrm{KL}(O^1 \,\|\, O^2) = \sum_{i=1}^{N} p(O_i^1)\log\!\left(\frac{p(O_i^1)}{q(O_i^2)}\right) \tag{17}$$
$$p(x_i) = \log\!\left(\frac{e^{x_i}}{\sum_j e^{x_j}}\right), \qquad q(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}}. \tag{18}$$

Here, $O^1$ is the target output, and $O^2$ is the model's output. This formula quantifies the divergence between these distributions.

To ensure symmetry between the twin branches, we use the symmetric JS-Loss for mutual learning

$$\mathrm{JS}(P \,\|\, Q) = \frac{1}{2}\,\mathrm{KL}\!\left(P \,\Big\|\, \frac{P + Q}{2}\right) + \frac{1}{2}\,\mathrm{KL}\!\left(Q \,\Big\|\, \frac{P + Q}{2}\right). \tag{19}$$

Here, P represents the output of the drone-view image, and Q represents the output of the satellite-view image after forward propagation.
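A minimal sketch of the symmetric JS-Loss in Eqs. (17)-(19) between the two branch outputs; the batchmean reduction is an implementation choice.

```python
import torch.nn.functional as F

def js_loss(logits_drone, logits_satellite):
    """Symmetric JS divergence between the drone-view and satellite-view outputs, Eqs. (17)-(19)."""
    p = F.softmax(logits_drone, dim=1)        # P: drone branch distribution
    q = F.softmax(logits_satellite, dim=1)    # Q: satellite branch distribution
    m = 0.5 * (p + q)
    kl_pm = F.kl_div(m.log(), p, reduction="batchmean")   # KL(P || M)
    kl_qm = F.kl_div(m.log(), q, reduction="batchmean")   # KL(Q || M)
    return 0.5 * kl_pm + 0.5 * kl_qm
```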

The overall loss function is defined as follows:

$$L_{\text{total}} = L_{\text{CE}} + \lambda L_{\text{TC}} + L_{\text{JS}}. \tag{20}$$

Here, $L_{\text{CE}}$ represents the cross-entropy loss, $L_{\text{TC}}$ denotes the TC-Loss weighted by $\lambda = 0.2$, and $L_{\text{JS}}$ is the JS divergence loss.

SECTION IV. Experiment

In our experiments, we evaluate the performance of the TransFG pipeline for cross-view image matching on the University-1652 dataset and compare it with state-of-the-art CNN and transformer models in terms of AP and R@1 scores.

A. Experimental Settings

Our model is based on the PyTorch framework, and all the experiments are performed on an NVIDIA RTX 3090 Ti GPU.

1) In Terms of Network Structure:

We propose an innovative transformer-based pipeline TransFG for robust cross-view image matching. It synergically takes advantage of FA and GG, achieving an effective balance in feature representation and alignment. The FA module implicitly learns salient features and dynamically aggregates contextual features from the ViT. Meanwhile, the GG module automatically calculates feature gradients to implement patch-level segmentation and instance-level alignment.

2) In Terms of Training Strategy:

We set the initial learning rate to 0.0001 and trained for 140 epochs, decaying the learning rate by a factor of 0.1 at the 80th and 130th epochs. The classifier module used Kaiming initialization [63]. Data augmentation included resizing to 256×256, random padding, inversion, and cropping. Stochastic gradient descent was used with a batch size of 8, momentum of 0.9, and weight decay of 0.0005 for optimization.
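The optimizer and schedule above can be expressed directly in PyTorch; the sketch below uses a placeholder module in place of the full TransFG network.

```python
import torch

# Training-strategy sketch matching the settings above (the model is a placeholder stand-in).
model = torch.nn.Linear(768, 701)   # stands in for the TransFG backbone + classifier (701 training classes)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[80, 130], gamma=0.1)

for epoch in range(140):
    # ... forward pass with batch size 8, compute L_total (Eq. (20)), optimizer.step() ...
    scheduler.step()
```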

3) In Terms of the Loss Function:

We use cross-entropy loss for classification. Moreover, we use TC-Loss (margin = 0.25) to minimize the distance between the same instances in different views. In addition, we use JS divergence loss to bring the data distributions of different views closer.

4) In Terms of Testing:

We use Euclidean distance to calculate the similarity between the query image and the candidate images in the satellite gallery.
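A minimal retrieval sketch for this step, assuming precomputed feature vectors for the query and the satellite gallery:

```python
import torch

def rank_gallery(query_feat, gallery_feats):
    """Rank satellite-gallery features by Euclidean distance to the query.
    query_feat: (D,), gallery_feats: (M, D); returns gallery indices from best to worst match."""
    dists = torch.cdist(query_feat.unsqueeze(0), gallery_feats).squeeze(0)  # (M,)
    return torch.argsort(dists)
```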

B. Datasets and Evaluation Protocols

We conducted extensive experiments using the University-1652 dataset [1], which comprises images from various perspectives. The University-1652 dataset collects 1652 buildings from 72 universities around the world and contains three perspectives: satellite, ground, and drone views. The training set includes 701 buildings from 33 universities, while the test set contains 951 buildings from the remaining 39 universities. The training data consist of 701 satellite, 37854 drone, and 11640 street-view images, totaling 50195 images. No overlap exists between the training and test sets. The dataset defines two tasks: drone-view target localization (Drone → Satellite) and drone navigation (Satellite → Drone).

Task 1 [Drone-View Target Localization (Drone → Satellite)]: Given one drone-view image or video, the task aims to find the most similar satellite-view image to localize the target building in the satellite view.

Task 2 [Drone Navigation (Satellite → Drone)]: Given one satellite-view image, the drone intends to find the most relevant places (drone-view images) that it has passed by. According to its flight history, the drone can then be navigated back to the target place.

To evaluate method performance, we use two metrics: Recall@K (R@K) [64] and average precision (AP) [65]. R@K quantifies the likelihood of correct matches within the top-k ranked image results, while AP provides a mean measure of matching performance across multiple ground truths.
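For reference, these two metrics can be computed per query as in the short sketch below (a simplified single-ground-truth-label version; the paper's exact evaluation script may differ).

```python
def recall_at_k(ranked_labels, true_label, k):
    """R@K: 1.0 if a correct match appears in the top-k ranked results, else 0.0."""
    return 1.0 if true_label in ranked_labels[:k] else 0.0

def average_precision(ranked_labels, true_label):
    """AP over the ranked list for a single query (averaged over its ground-truth matches)."""
    hits, precisions = 0, []
    for rank, label in enumerate(ranked_labels, start=1):
        if label == true_label:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0
```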

C. Comparison With SOTA

On the University-1652 dataset, we evaluate TransFG on the Satellite → Drone and Drone → Satellite tasks in Table I. In the Satellite → Drone task, our approach achieves 90.16% Recall@1 and 84.61% AP. In the Drone → Satellite task, we attain 84.01% Recall@1 and 86.31% AP. Our experiments are trained only on drone-view and satellite-view images, yet they outperform the SOTA methods, including a 3% improvement in Recall@1 and AP for the Satellite → Drone task and a 2% improvement for the Drone → Satellite task compared with FSRA (SOTA 2022). Meanwhile, we used the running time of the ResNet-50 model as the baseline, as shown in Table I. FSRA and our TransFG run 1.37 and 1.35 times as long as the baseline, respectively. These results indicate that TransFG not only surpasses FSRA in terms of accuracy but also achieves competitive efficiency.

TABLE I Comparison With SOTA Results Which Have Been Reported on University-1652

D. Ablation Studies

To verify the effectiveness of our proposed method, we designed several comparative experiments, including the effect of the FA module in TransFG, the effect of the number of GG module parts, the impact of overlap in functionality among various modules, the effect of TC-Loss and JS-Loss, the robustness of GG module to position shifting, the effect of various extreme image conditions, the effect of drone distance on the geographic target, the effect of gradients on the division of instances, the effect of transformer-based and CNN-based methods on cross-view image matching, the effect of the parameter α in the LFO, and the effect of various ViT backbones.

1) Effect of the FA Module in TransFG:

We propose the FA module to enhance cross-view feature representation. Comparative analysis revealed significant performance improvement with the inclusion of FA module in our pipeline, as presented in Table II. We also evaluated the impact of combining the three FA module components, concluding that the optimal results are achieved when all the three components are used together. Our findings demonstrate the positive influence of FA module on enhancing feature representation for cross-view image matching.

TABLE II Ablation Study to Verify the Robustness of the Proposed FA

2) Effect of the Number of GG Module Parts:

The number of divisional instances is an important indicator of the robustness of our segmentation network model. We experimented to verify the performance of the number of divided instances, as shown in Table III. The results demonstrate optimal performance when partitioning into n = 2 instances (Fig. 4). For n = 1, the GG module aligns with the global ViT branch with global average pooling. For n = 2, the GG module classifies into foreground and background, yielding the best results. For n = 3, the GG module divides images into buildings, roads, and trees, achieving good performance.

TABLE III Ablation Study to Verify the Effect of the Number of Instances. The Experimental Results are Based on the Image_Size = 256

3) Impact of Overlap in Functionality Among Various Modules:

We conducted extensive experiments, and the results are shown in Table IV. These ablation studies verify the effectiveness of each component and their combinations in our proposed scheme (FA, GG, TC-Loss, JS-Loss). For instance, as illustrated in Table IV, comparing the second, fourth, and sixth rows shows that ViT + FA + GG + TC-Loss > ViT + FA + TC-Loss > ViT + TC-Loss, indicating that there is no redundancy between the FA and GG modules. Analogously, comparing the first, second, and fourth rows shows that ViT + FA + TC-Loss > ViT + TC-Loss > ViT + triplet-loss, demonstrating that there is no redundancy between the FA module and TC-Loss. Our incremental combination experiments isolate the contribution of each module to the overall effect. Therefore, these results indicate that there is no duplication of effects between modules.

TABLE IV Ablation Experiment Aimed to Substantiate the Efficacy of Combining Various Modules

4) Effect of TC-Loss and JS-Loss:

In the cross-view image matching task, we used the TC-Loss and JS-Loss strategies with a 0.25 margin, enhancing matching performance as shown in Table V. TC-Loss alone improved the Drone → Satellite / Satellite → Drone tasks by 1.73%/3.71% AP and 2.04%/2.1% R@1, while JS-Loss alone improved them by 2.81%/3.18% AP and 1.06%/0.85% R@1. Combining TC-Loss and JS-Loss yielded a 3.01%/4.74% AP and 2.82%/3.14% R@1 improvement. This indicates that the integration of TC-Loss and JS-Loss significantly enhances cross-view image matching accuracy.

TABLE V Ablation Study to Verify the Effects of the Loss. The Experimental Results are Based on the Triplet-Center Loss With Margin = 0.25, the Weighting Factor λ = 0.2

5) Robustness of GG Module to Position Shifting:

To verify the robustness of the GG module to position shifting, we compared the relative position shifting results of the GG module and FSRA (2022). Verification methods for positional shifting include BlackPad (BP) and FlipPad (FP), as shown in Fig. 5. BP inserts a black block of width P on the left and crops P width on the right, while FP flips the left P-width section and crops the right P-width section. The comparison results are shown in Table VI. When the padding size P increases, the accuracy of the GG module decreases much more slowly than that of FSRA. In addition, the accuracy under BP decreases more slowly than under FP, which may be because FP adds confusing information at the edges, resulting in an uneven distribution of content. In Table VI, when BP = 60, the AP of FSRA drops by 29.08%, while the AP of the GG module drops by only 15.88%, about 13 points less than FSRA. When FP = 60, the AP of FSRA drops by 26.81%, while the AP of the GG module drops by 19.50%, about 7.3 points less than FSRA. These results confirm the GG module's superior anti-offset performance.
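For reproducibility, BP and FP can be sketched as simple tensor operations; the paper does not give the exact padding implementation, so this is an illustrative version.

```python
import torch

def black_pad(img, p):
    """BP sketch: insert a black block of width p on the left and crop p columns on the right.
    img: (C, H, W) tensor in [0, 1]."""
    pad = torch.zeros(img.size(0), img.size(1), p, device=img.device, dtype=img.dtype)
    return torch.cat([pad, img[:, :, :-p]], dim=2)

def flip_pad(img, p):
    """FP sketch: mirror the leftmost p columns, prepend them, and crop p columns on the right."""
    pad = torch.flip(img[:, :, :p], dims=[2])
    return torch.cat([pad, img[:, :, :-p]], dim=2)
```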

TABLE VI Conducted Ablation Study on BP and FP, Evaluating GG and FSRA Models for AP Accuracy Across Various Pad Sizes and Decline Rates
Fig. 5. The original drone image is on the left. The middle image has 20 pixels of black padding on the left and is cropped on the right. The right image has a mirrored, padded 20-pixel portion on the left and is cropped on the right. The red dotted line marks the padding division.

6) Effect of the Input Image Size:

To balance the input image size and memory usage, we conducted experiments on the GG module with a region number of n = 2. The experimental results for different input sizes are shown in Table VII. We observed progressive performance improvements when the input image size was increased from 224 to 512, with considerable improvements in AP when the input size was changed from 256 to 320. Given limited hardware resources, we hope that this ablation experiment will serve as a reference for selecting an appropriate input image size.

TABLE VII Analyze Input Size Effects on University-1652

7) Effect of Various Extreme Image Conditions:

To assess the robustness of our method under various extreme image conditions, as shown in Fig. 6, we applied Gaussian noise, Gaussian blur, low-light, and thin-cloud simulation data augmentation to the entire test dataset. The results in Table VIII indicate that, although performance is slightly lower than under normal conditions, our method still outperforms FSRA (2022) evaluated under normal conditions. Therefore, our approach still achieves satisfactory results under various extreme conditions.

TABLE VIII Impact and Effectiveness of Varying Image Conditions on Cross-View Geographic Localization
Fig. 6. Various data augmentation techniques are used to simulate distinct image conditions.

8) Effect of the Drone Distance on the Geographic Target:

While the scale of the University-1652 satellite-view images is fixed, the scale of the drone-view images varies dynamically with the distance of the drone from the geographical target. Based on the distance between the drone and the target building, the University-1652 dataset is divided into three parts: short, medium, and long. We compared the effects of the GG module and FSRA at these three distance levels, as shown in Table IX. Accuracy is lowest at long distances and highest at medium distances. Overall, the GG module shows better scale-adaptive capability.

TABLE IX Robustness Assessment of the Proposed Framework at Varying Drone-to-Target Distances on University-1652

9) Effect of Gradients on the Division of Instances:

We conducted ablation experiments on gradient-based instance division. As shown in Table X, the first-order gradient performs better than no gradient, with about a 1% improvement in AP and R@1, and the second-order gradient feature achieves the best results. We conclude that second-order gradient features can effectively divide instances and improve the accuracy of cross-view image matching.

TABLE X Ablation Study on the Effect of Different Gradient Division Instances on Image Features

10) Effect of Transformer-Based and CNN-Based Methods on Cross-View Image Matching:

We conducted comparative experiments on the impact of transformer-based and CNN-based methods on cross-view image matching, comparing the proposed TransFG pipeline with transformer-based and CNN-based networks. As shown in Table I, TransFG not only outperforms the existing transformer-based and CNN-based methods in terms of accuracy but also maintains competitive inference times. For example, TransFG outperforms ResNet-50 by 22.92%, ResNet-101 by 19.18%, and ViT-S/16 by 12.53%. This demonstrates that transformer-based methods surpass CNN-based approaches by capturing rich contextual features in cross-view image matching, and that TransFG, as an effective feature extraction structure, can handle cross-view image matching well.

11) Effect of the Parameter α in the LFO:

We evaluated different α values in the LFO and found that α = 8, which aligns local features with the central 80% range of global features, yields the best results (Table XI). We infer that an appropriate amount of global feature guidance for local features is beneficial for obtaining a good feature representation.

TABLE XI Ablation Study to Verify the Parameter α in the LFO

12) Effect of Various ViT Backbones:

There are several variants of ViT, which outperform the basic ViT model. Therefore, we systematically assessed the performance of various transformer models in cross-viewpoint image matching using different ViT variants as the backbone. As shown in Table XII, it is evident that all the ViT variants outperform the basic ViT model. Notably, T2T-ViT exhibits the most promising results. Based on these findings, we are considering adopting T2T-ViT as a replacement for the basic ViT model in future work.

TABLE XII Ablation Experiments of ViT Variant Models on Cross-View Image Matching

E. Visualization Of Qualitative Result

For the two basic tasks of the University-1652 dataset, drone-view target localization and drone navigation, we visualize some matching results in Fig. 7. For drone-view target localization, we randomly selected eight test drone-view images and successfully retrieved the top five matching images from the gallery set, yielding completely accurate results [Fig. 7(a)]. In the drone navigation task, we similarly selected eight test satellite-view images and obtained perfect matches among the top five similar images from the gallery set [Fig. 7(b)]. These results exemplify the effectiveness of our model in cross-view scenarios, emphasizing its robustness and practicality in real-world contexts.

Fig. 7. Qualitative image match results. (a) Top-5 match results of drone-view target localization on University-1652. (b) Top-5 match results of drone navigation on University-1652. The red box indicates the true-matched image, and the blue box indicates the false-matched image.

SECTION V. Conclusion

In this article, we propose an innovative transformer-based pipeline TransFG, which incorporates the FA and GG modules. TransFG synergically takes advantage of FA and GG. The FA module enhances the robustness of feature representation by learning salient features and dynamically aggregating contextual features. Furthermore, the GG module addresses position offset and distance-scale uncertainty through instance-level feature alignment, enhancing the practicality of cross-view image matching. Moreover, we also explore the TC-Loss and JS-Loss to implicitly bring different view data distributions closer, further improving cross-view image matching performance.

Extensive experiments on the University-1652 datasets demonstrate three aspects.

  1. TransFG outperforms the existing methods, achieving a Recall@1 of 90.16% with an AP of 84.61% in the Satellite → Drone task, and a Recall@1 of 84.01% with an AP of 86.31% in the Drone → Satellite task.

  2. To assess generalizability, we performed Gaussian noise, Gaussian blur, low light, and thin cloud data augmentation in the entire test dataset. The experiments demonstrate that our pipeline also outperforms the current SOTA.

  3. Evidently, the TransFG pipeline is overall a better choice for image matching tasks in cross-view geo-localization of satellite and UAV imagery.

Given the challenging nature of comprehensive geographic datasets, we believe the benefits afforded by our pipeline are a significant step toward the advancement of cross-view geo-localization. We envisage that the proposed effective TransFG pipeline will also have a positive impact on feature representation, as well as on addressing positional shifts and distance-scale uncertainty. In addition, we also endeavor to establish a more densely populated dataset to attain precise localization.

 

 
