VisIRNet: Deep Image Alignment for UAV-Taken Visible and Infrared Image Pairs
S. Özer and A. P. Ndigande, "VisIRNet: Deep Image Alignment for UAV-Taken Visible and Infrared Image Pairs," in IEEE Transactions on Geoscience and Remote Sensing, vol. 62, pp. 1-11, 2024, Art no. 5403111, doi: 10.1109/TGRS.2024.3367986.
keywords: {Feature extraction; Cameras; Autonomous aerial vehicles; Prediction algorithms; Image resolution; Deep learning; Computer architecture; Corner-matching; deep learning; image alignment; infrared image registration; Lucas–Kanade (LK) algorithms; multimodal image registration; unmanned aerial vehicle (UAV) image processing}
Abstract:
This article proposes a deep-learning-based solution for multimodal image alignment regarding unmanned aerial vehicle (UAV)-taken images. Many recently proposed state-of-the-art alignment techniques rely on using Lucas–Kanade (LK)-based solutions for a successful alignment. However, we show that we can achieve state-of-the-art results without using LK-based methods.
Our approach carefully utilizes a two-branch convolutional neural network (CNN) built on feature embedding blocks. We propose two variants of our approach, where in the first variant (Model A), we directly predict the new coordinates of only the four corners of the image to be aligned; and in the second one (Model B), we predict the homography matrix directly.
Applying alignment on the image corners forces the algorithm to match only those four corners as opposed to computing and matching many (key) points, since the latter may cause many outliers, yielding less accurate alignment. We test our proposed approach on four aerial datasets and obtain state-of-the-art results when compared to the existing recent deep LK-based architectures.
Introduction
Recent advancements in unmanned aerial vehicle (UAV), computing, and sensor technologies have allowed the use of UAVs for various earth observation applications. Many UAV systems are equipped with multiple cameras today, as cameras provide reasonable and relatively reliable information about the surrounding scene in the form of multiple images or image pairs. Such image pairs can be taken by different cameras, at different viewpoints, in different modalities, or at different resolutions. In such situations, the same objects or the same features might appear at different coordinates on each image and, therefore, an image alignment (registration) step is needed before applying many other image-based computer vision applications such as image fusion, object detection, segmentation, or object tracking as in [43], [44], and [45].
The infrared spectrum and visible spectrum may reflect different properties of the same scene. Consequently, images taken in those modalities typically differ from each other. On many digital cameras, the visible spectrum is captured and stored in the form of a red-green-blue (RGB) image model, and a typical visible spectrum camera captures visible light ranging from approximately 400 to 700 nm in wavelength [6], [35]. Infrared cameras, on the other hand, capture wavelengths longer than those of visible light, falling between 700 and 10 000 nm [9]. Infrared images can be further categorized into different wavelength ranges as near-infrared (NIR), mid-infrared (MIR), and far-infrared (FIR), capturing different types of information in the spectrum [9], [10], [15], [16].
Image alignment is, essentially, the process of mapping the pixel coordinates from different coordinate system(s) into one common coordinate system. This problem is studied under different names, including image registration and image alignment; we use the terms alignment and registration interchangeably in this article. Typically, alignment is done on image pairs, mapping from one image (source) onto the other one (target) [18]. Image alignment is a common problem that exists in many image-based applications, where both the target and source images can be acquired by sensors using the same modality or using different modalities. There is a wide range of applications of image alignment in many fields, including medical imaging [1], [21], UAV applications [22], [36], image stitching [8], and remote-sensing applications [5], [29], [30], [37].
Image alignment, in many cases, can be reduced to the problem of estimating the parameters of the perspective transformation between two images acquired by two separate cameras, where we assume that the cameras are located on the same UAV system. Fig. 1 summarizes such an image alignment process, where the input consists of a higher resolution RGB image and a lower resolution infrared image to be aligned.
In the relevant literature, registering RGB and IR image pairs is done both by using classical techniques (such as the scale-invariant feature transform (SIFT) [33] along with the random sample consensus (RANSAC) [14] algorithm as in [3]) and by using more recent deep-learning-based techniques as in [7], [34], [52]. Classical techniques include feature-based [40], [50] and intensity-based [39] methods. Feature-based [40], [50] methods essentially find correspondences between the detected salient features from images [47]. Salient features are computed by using approaches such as SIFT [32], speeded-up robust features (SURF) [4], the Harris corner detector [19], and the Shi–Tomasi corner detector [24] in each image. The features from both images are then matched to find the correspondences as in [41], [42], and [46], and to compute the transformation parameters in the form of a homography matrix. In the literature, the RANSAC [46] algorithm is commonly used to compute the homography matrix that minimizes the total number of outliers. Intensity-based [39] methods compare intensity patterns in images via similarity metrics. By estimating the movement of each pixel, optical flow is computed and used to represent the overall motion parameters. The works in [2] and [13] use Lucas–Kanade (LK)-based algorithms that take the initial parameters and iteratively estimate a small change in the parameters to minimize the error. A typical intensity-based registration technique essentially uses a form of similarity as its metric or registration criterion, including mean squared error (MSE) [17], cross correlation [28], the structural similarity index (SSIM), and the peak signal-to-noise ratio (PSNR) [51]. Such metrics are not sufficient when the source image and target image are acquired with different modalities, which can yield poor performance when such intensity-based methods are used.
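For reference, the classical SIFT + RANSAC pipeline mentioned above can be sketched in a few lines with OpenCV. This is a minimal illustrative baseline, not the authors' code; the variable names (img_rgb, img_ir) and the 0.75 ratio-test threshold are our own assumptions.

```python
import cv2
import numpy as np

def classical_homography(img_rgb, img_ir):
    """Estimate a homography mapping img_ir onto img_rgb with SIFT + RANSAC."""
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(img_ir, None)
    kp2, des2 = sift.detectAndCompute(img_rgb, None)
    if des1 is None or des2 is None:
        return None  # no salient features found (common for multimodal pairs)

    # Brute-force matching with Lowe's ratio test.
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    matches = matcher.knnMatch(des1, des2, k=2)
    good = []
    for pair in matches:
        if len(pair) == 2 and pair[0].distance < 0.75 * pair[1].distance:
            good.append(pair[0])
    if len(good) < 4:
        return None  # not enough correspondences to fit a homography

    src = np.float32([kp1[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([kp2[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    return H
```

As noted in the text, this pipeline breaks down when the two modalities produce few repeatable keypoints, which is exactly the failure mode observed for RGB–IR pairs.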
Overall, such classical approaches are typically based on finding and matching similar salient keypoints in image pairs and, therefore, they can yield unsatisfactory results in various multimodal registration applications.
Relevant deep alignment approaches use a form of keypoint matching, template matching, or LK-based approaches as in [7] and [27]. Those techniques typically consider multiple points or important regions in images to compute the homography matrix H, which contains the transformation parameters. However, having the information of four matching points, represented by their corresponding 2-D coordinates, is already sufficient to compute the homography matrix.
In this article, we propose a novel deep approach for registering IR and RGB image pairs, where instead of predicting the homography matrix directly, we predict the location of the four corner points of the entire image directly. This approach removes the additional iterative steps introduced by LK-based algorithms and eliminates the steps of computing and finding important keypoints. Our main contributions can be listed as follows.
We introduce a novel deep approach for alignment problems of IR images onto RGB images taken by UAVs, where the resolutions of the input images differ from each other.
We introduce a novel two-branch-based deep solution for registration without relying on Lucas–Kanade-based iterative methods.
Instead of predicting the homography matrix directly, we predict the corresponding coordinates of the four corner points of the smaller image on the larger image.
We study and report the performance of our approach on multiple aerial datasets and present state-of-the-art results.
Related Work
Many recent techniques performing image alignment rely on deep learning. Convolutional neural networks (CNNs) form a pipeline of convolutional layers where filters learn unique features at distinct levels of the network. For example, DeTone et al. [12] proposed a deep image homography estimation network (DHN) that uses CNNs to learn meaningful features in both images and directly predicts the eight transformation parameters of the homography matrix. Later, Le et al. [26] proposed using a series of networks to regress the homography parameters in their approach. The latter networks in their proposed architecture aim to gradually improve the performance of the earlier networks. Their method builds on top of DHN [12]. Another work in [7] proposed incorporating the LK algorithm in the deep-learning pipeline.
Zhao et al. [52] used a CNN-based network and introduced a learning-based LK block. In their work, they designed modality-specific pipelines for the source and template images, respectively. At the end of each block, there is a unique feature construction function. Instead of using direct output feature maps, they constructed features based on the eigenvalues and eigenvectors of the output feature maps. The features constructed from the source and template network channels have a similar learned representation. Transformation parameters found at a lower scale are given as input to the next level, and the LK algorithm iterates until a certain threshold is reached. In another work, Deng et al. [11] utilized disentangled convolutional sparse coding to separate domain-specific and shared features of multimodal images to improve registration accuracy. Multiscale generative adversarial networks (GANs) are also used to estimate homography parameters as in [34].
The architectural comparisons of the above-mentioned networks are provided in Fig. 2. In DHN [12], the image to be transformed (denoted IIR in the figure) is padded to have the same dimensions as the target image (IRGB), and the two images are concatenated channel-wise. The concatenated images are given to the deep homography network (DHN) for the direct regression of the eight values of the homography matrix. On the other hand, multiscale homography estimation (MHN) [26] uses a series of networks to gradually refine the homography estimate.
Proposed Approach: VisIRNet
In our proposed approach, we aim at performing accurate single- and multimodal image registration that is free of the iterative nature of LK-based algorithms. We name our network VisIRNet, where we aim to predict the locations of the corners of the input image on the target image directly, since having four matching points is sufficient to compute the homography parameters. In our proposed architecture, we assume that there are two input images with different resolutions. The overview of our architecture is given in Fig. 3. Our approach first processes the two inputs separately by passing them through their respective feature embedding blocks and extracts representative features. Those features are then combined and given to the regression block as input. The goal of the regression block is to compute the transformation parameters accurately. The output of the regression block is eight-dimensional (which can represent the total number of homography parameters or the coordinates of the four corner points of the source image on the target image).
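The data flow described above can be summarized with a minimal PyTorch sketch. The layer sizes, channel counts, and module names (rgb_branch, ir_branch, regression_block) are our own illustrative assumptions, not the authors' exact configuration; the sketch only shows the two-branch embedding followed by an 8-D regression output.

```python
import torch
import torch.nn as nn

class TwoBranchAligner(nn.Module):
    def __init__(self, embed_channels=64):
        super().__init__()
        # Identical architectures, separate parameters (see the MMFEB description).
        self.rgb_branch = nn.Sequential(
            nn.Conv2d(3, embed_channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(embed_channels, embed_channels, 3, padding=1), nn.ReLU())
        self.ir_branch = nn.Sequential(
            nn.Conv2d(1, embed_channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(embed_channels, embed_channels, 3, padding=1), nn.ReLU())
        # Regression head: maps the combined embedding to 8 values
        # (4 corner coordinates for Model A, or 8 homography entries for Model B).
        self.regression_block = nn.Sequential(
            nn.AdaptiveAvgPool2d(8), nn.Flatten(),
            nn.Linear(embed_channels * 2 * 8 * 8, 256), nn.ReLU(),
            nn.Linear(256, 8))

    def forward(self, rgb, ir):
        f_rgb = self.rgb_branch(rgb)                 # target-image features
        f_ir = self.ir_branch(ir)                    # source-image features
        f_ir = nn.functional.interpolate(f_ir, size=f_rgb.shape[-2:])  # match sizes
        fused = torch.cat([f_rgb, f_ir], dim=1)      # combined representation
        return self.regression_block(fused)          # 8-D prediction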
A. Preliminaries
1) Perspective Transformation:
Here, by perspective transformation, we mean a linear transformation in the homogeneous coordinate system which, in some sense, warps the source image onto the target image. The homography matrix consists of the transformation parameters needed for the perspective transformation. The elements of the 3×3 homography matrix define this mapping in homogeneous coordinates.
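For concreteness, the sketch below applies a 3×3 homography H to a single 2-D point in homogeneous coordinates; the function name is our own.

```python
import numpy as np

def warp_point(H, x, y):
    """Apply a 3x3 homography to a single 2-D point."""
    p = np.array([x, y, 1.0])            # homogeneous coordinates
    q = H @ p                            # linear map in homogeneous space
    return q[0] / q[2], q[1] / q[2]      # perspective divide back to 2-D
```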
B. Network Structure
Our proposed network is composed of a multimodal feature embedding block (MMFEB) and a regression block (see Fig. 3). The regression block is responsible for predicting the eight homography matrix parameters, directly or indirectly. In this article, we study the performance of two variants of our proposed model, which we call Model A and Model B. Model A predicts the coordinates of the corner points, while its variant, Model B, predicts the homography parameters directly. In Model A, four corners are enough to find the homography matrix; therefore, the last layer has eight neurons for the (x, y) coordinates of the four corner points.
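Four corner correspondences are indeed sufficient to recover the full homography with a direct 4-point solve, as the small sketch below illustrates. The corner values here are purely illustrative.

```python
import cv2
import numpy as np

n = 128  # source (IR) patch size, as used in the experiments
src_corners = np.float32([[0, 0], [n - 1, 0], [n - 1, n - 1], [0, n - 1]])
# Predicted positions of those corners on the target (RGB) image (example values):
dst_corners = np.float32([[21.5, 30.2], [150.1, 28.7], [148.9, 158.4], [23.0, 160.0]])
H = cv2.getPerspectiveTransform(src_corners, dst_corners)  # 3x3 homography
```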
1) Multimodal Feature Embedding Backbone:
The MMFEB is responsible for producing a combined representative feature set formed of fine-level features for both of the input images. The network then uses that combined representative feature set to transform the source image onto the target image. We adopt the idea of giving the RGB and infrared modalities separate branches as in [52]. We use two identical networks (branches) with the same structure but with different parameters for the RGB and infrared images, respectively. Therefore, the multimodal feature embedding block has two parallel branches with identical architectures (however, they do not share parameters), namely the RGB branch and the infrared branch. We first train the multimodal feature embedding backbone by using the average similarity loss Lsim.
Algorithm 1 summarizes the training steps of the MMFEB as nested loops over training epochs and image batches.
The MMFEB is trained by using the similarity loss Lsim defined below.
Algorithm 3 summarizes how the Lsim loss is computed over the image pairs.
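Since the pseudocode of Algorithm 1 is not reproduced in full here, the loop below is only a hedged reconstruction of the described procedure: both branches are optimized so that registered RGB/IR pairs produce similar embeddings under Lsim. The interface mmfeb(rgb, ir) returning one feature map per branch is an assumption.

```python
import torch

def train_mmfeb(mmfeb, loader, similarity_loss, epochs=10, lr=1e-4):
    """Sketch of the MMFEB training loop (Algorithm 1, reconstructed)."""
    opt = torch.optim.Adam(mmfeb.parameters(), lr=lr)
    for _ in range(epochs):                  # outer loop over epochs
        for rgb, ir in loader:               # inner loop over registered pairs
            f_rgb, f_ir = mmfeb(rgb, ir)     # per-branch feature embeddings
            loss = similarity_loss(f_rgb, f_ir)
            opt.zero_grad()
            loss.backward()
            opt.step()
```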
2) Regression Block:
The second main stage of our pipeline is the regression block, which is responsible for making the final prediction. The prediction is the four corner locations for Model A, or the unknown parameters of the homography matrix for Model B.
The architecture for the regression block is further divided into two subparts as shown in Fig. 3.
The first part is composed of six levels. Apart from the last level, each level is composed of two sublevels followed by a max-pooling layer. Each sublevel is a convolution layer followed by a batch normalization layer and a ReLU activation function.
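A minimal sketch of one such level is given below; the channel widths and the treatment of the last level (which omits the pooling layer) are our own assumptions based on the description above.

```python
import torch.nn as nn

def sublevel(in_ch, out_ch):
    """One sublevel: convolution + batch normalization + ReLU."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU())

def level(in_ch, out_ch, last=False):
    """One level: two sublevels, followed by max pooling (except the last level)."""
    layers = [sublevel(in_ch, out_ch), sublevel(out_ch, out_ch)]
    if not last:
        layers.append(nn.MaxPool2d(2))
    return nn.Sequential(*layers)
```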
C. Loss
While the MMFEB uses a similarity loss, we use two loss terms, based on the corner error and the homography, for the regression head.
1) Similarity Loss:
The similarity loss Lsim is used to train the MMFEB; it measures the average similarity between the feature embeddings produced by the RGB and infrared branches for registered image pairs.
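The exact formula of Lsim is not reproduced here, so the function below is only a plausible stand-in under that description: an average similarity loss that penalizes dissimilarity between the two embeddings of an already-registered pair.

```python
import torch.nn.functional as F

def average_similarity_loss(f_rgb, f_ir):
    """1 - mean cosine similarity between per-pixel feature vectors (assumed form)."""
    sim = F.cosine_similarity(f_rgb, f_ir, dim=1)   # (B, H, W) similarity map
    return 1.0 - sim.mean()
```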
Algorithm 2 summarizes the training step of the regression block, where M denotes the regression block and the loops iterate over training epochs and batches.
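Algorithm 2 is likewise not reproduced in full, so the step below is a hedged reconstruction: the embedding backbone is assumed frozen after its own training stage, and the regression block M is optimized on the corner or homography loss.

```python
import torch

def regression_training_step(mmfeb, M, rgb, ir, targets, loss_fn, opt):
    """Sketch of one training step of the regression block (Algorithm 2, reconstructed)."""
    with torch.no_grad():
        f_rgb, f_ir = mmfeb(rgb, ir)               # embeddings from the (frozen) MMFEB
    pred = M(torch.cat([f_rgb, f_ir], dim=1))      # 8-D prediction
    loss = loss_fn(pred, targets)                  # corner or homography loss
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```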
2) L2 Homography Loss Term:
Model B is trained to predict the values of the elements of the homography matrix. Therefore, its output is the eight unknown elements of the 3×3 homography matrix (the ninth element is fixed to 1).
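A hedged sketch of such an L2 loss term is given below: the mean squared error between the eight predicted entries and the eight entries of the ground-truth homography, normalized so that h33 = 1. The exact weighting used by the authors is not reproduced here.

```python
import torch

def homography_l2_loss(pred_8, H_gt):
    """MSE over the eight free homography entries (assumed normalization h33 = 1)."""
    H_gt = H_gt / H_gt[..., 2:3, 2:3]      # fix the scale so the last entry is 1
    gt_8 = H_gt.reshape(-1, 9)[:, :8]      # drop the constant last entry
    return torch.mean((pred_8 - gt_8) ** 2)
```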
3) Average Corner Error:
The average corner error (ACE) is computed as the average sum of squared differences between the predicted and ground-truth locations of the corner points. For Model B, we use the predicted homography matrix to transform the four corners of the infrared image onto the coordinate system of the RGB image and, together with the ground-truth locations, we compute the ACE. In Model B, we predict the homography parameters and obtain the corner locations from them, whereas Model A predicts the corner locations directly.
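Following that description, the ACE can be sketched as below for the Model B case, where both the predicted and ground-truth homographies are used to warp the four IR-image corners; the function name and signature are illustrative.

```python
import numpy as np
import cv2

def average_corner_error(H_pred, H_gt, n=128):
    """Average sum of squared differences between warped corner locations."""
    corners = np.float32([[0, 0], [n - 1, 0], [n - 1, n - 1], [0, n - 1]]).reshape(-1, 1, 2)
    pred = cv2.perspectiveTransform(corners, H_pred)
    gt = cv2.perspectiveTransform(corners, H_gt)
    return float(np.mean(np.sum((pred - gt) ** 2, axis=-1)))
```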
In addition to these loss functions, we also used additional loss functions in the MMFEB block during our ablation study; those functions are discussed together with the ablation results (Table V).
Experiments
In this section, we describe our experimental procedures, the datasets used, and our metrics. Below, we first describe the datasets.
A. Datasets
In our experiments, we use the SkyData dataset containing RGB and IR image pairs, MSCOCO [31], Google Maps and Google Earth (as taken from DLKFM [52]), and VEDAI [48]. Refer to Table III for more details about the datasets used in our experiments. SkyData is originally a video-based dataset that provides each frame of the videos in image format.
B. Generating the Training and Test Sets
To train the algorithms, we need unregistered and registered (ground truth) image pairs. For SkyData, we randomly select frames from the videos to form such pairs.
For each dataset that we use, we generate the training and test sets as follows (a code sketch of this procedure is given after the list).
1) Select a registered image pair at a higher resolution.
2) Sample (crop) regions around the center of the image to get smaller patches of 192×192 pixels. This process is done in parallel for the visible and infrared images. If the extracted patches are not sufficiently aligned, manually align them.
3) For each pair, select a subset of the IR image by randomly selecting four distinct locations on the image.
4) Find the perspective transformation parameters that map those randomly chosen points to the following fixed locations: (0, 0), (n−1, 0), (n−1, n−1), (0, n−1), so that they correspond to the corners of the unregistered IR image patch, where we assume that the unregistered IIR is n×n dimensional (in our experiments, n is set to 128). This process creates an unregistered infrared patch (from the already registered ground truth) that needs to be placed back to its true position. Use those four initially selected points as the ground-truth corners for the registered image.
5) Repeat this process k times to create k different image pairs. This newly created dataset is then split into training and test sets. The RGB images are used as the target set and the transformed infrared patches are used as the source set (for both training and testing).
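The sketch below follows the steps above for a single registered IR patch: pick four random points, compute the homography that maps them to the corners of an n×n patch, and warp to create the unregistered source patch whose ground-truth corners are the points originally picked. Restricting the random points to regions near the patch corners (max_offset) is our own assumption to avoid degenerate configurations.

```python
import numpy as np
import cv2

def make_training_pair(ir_registered, n=128, max_offset=32, rng=np.random):
    """Create one (unregistered IR patch, ground-truth corners) pair."""
    h, w = ir_registered.shape[:2]                       # e.g., 192 x 192
    # Four random, distinct locations on the registered IR image (one per corner region).
    pts = np.float32([
        [rng.uniform(0, max_offset),     rng.uniform(0, max_offset)],
        [rng.uniform(w - max_offset, w), rng.uniform(0, max_offset)],
        [rng.uniform(w - max_offset, w), rng.uniform(h - max_offset, h)],
        [rng.uniform(0, max_offset),     rng.uniform(h - max_offset, h)]])
    corners = np.float32([[0, 0], [n - 1, 0], [n - 1, n - 1], [0, n - 1]])
    H = cv2.getPerspectiveTransform(pts, corners)        # maps picked points to corners
    ir_unregistered = cv2.warpPerspective(ir_registered, H, (n, n))
    return ir_unregistered, pts                          # pts = ground-truth corners
```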
C. Evaluation Metrics
As shown in Table IV, we quantitatively evaluate the performance of our models using the ACE and the homography error. We compute each algorithm's result distribution in terms of quantiles, mean, standard deviation, and min–max values for a given test set. Quartiles are a set of descriptive statistics that summarize the central tendency and variability of the data [49]. Quartiles are a specific type of quantiles that divide the data into four equal parts. The three quartiles are denoted as Q1, Q2 (also known as the median), and Q3; the 25% (Q1), 50% (Q2), and 75% (Q3) percentiles indicate that the corresponding percentage of the data falls below that value.
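For completeness, the reported distribution statistics can be computed as in the small sketch below, assuming ace is a 1-D array of per-sample ACE values; the helper name is illustrative.

```python
import numpy as np

def summarize(ace):
    """Distribution statistics reported per test set (min, quartiles, max, mean, std)."""
    return {
        "min": np.min(ace), "Q1": np.percentile(ace, 25),
        "median": np.percentile(ace, 50), "Q3": np.percentile(ace, 75),
        "max": np.max(ace), "mean": np.mean(ace), "std": np.std(ace)}
```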
Table V shows an ablation study on using different loss functions in each block of our architecture. The metric used in the table is the ACE, and the best values are shown in bold. The loss functions in each row are used to train the MMFEB block.
Next, we provide experimental results on the effect of the hyperparameters that we studied for both Model A and Model B. Table VI summarizes those results. In particular, we studied the effect of using different loss functions.
Fig. 7 uses a box plot, also known as a box-and-whisker plot, to display the distribution of the ACE for different datasets and different models. It provides a summary of key statistical measures such as the minimum, first quartile (Q1), median (Q2), third quartile (Q3), and maximum. The length of the box indicates the spread of the middle 50% of the data, and the line inside the box represents the median (Q2). The whiskers extend from the box and represent the variability of the data beyond the quartiles; in our case, they represent the minimum and maximum values.
Fig. 5 shows the performance of six methods (SIFT, DHN, MHN, CLKN, DLKFM, and Ours) on the SkyDataV1 dataset in terms of the ACE. SkyData has RGB and infrared image pairs. In this figure, we aim to show that feature-based registration techniques such as SIFT perform poorly, whereas methods that leverage neural networks and learn representations are superior.
Fig. 4 gives detailed qualitative results of our experiments. Each row represents a sample taken from a different dataset, and the columns represent the inputs and the results for the different approaches.
Table IV illustrates the results of using different approaches for each dataset, separately. In Table IV(e), since MSCOCO is a single-modality dataset, SIFT performs relatively well, but there are cases where the algorithm could not find a homography due to an insufficient number of matched pairs. Google Earth in (c) also has RGB image pairs, but from different seasons; the SIFT algorithm is still able to pick enough salient features, and therefore, its performance remains reasonable. (d) Google Maps, (a) SkyData, and (b) VEDAI have pairs with significant modality differences. Deep-learning-based approaches were able to perform registration, often with a high number of outliers. Our approach was able to perform registration on both single- and multimodal image pairs; specifically, we were able to keep the maximum error low, as opposed to the LK-based approaches.
Conclusion and Discussion
In this article, we introduce a novel image alignment algorithm that we call VisIRNet. VisIRNet has two branches and does not have any stage to compute keypoints. Our experimental results show that our proposed algorithm achieves state-of-the-art results when compared to the LK-based deep approaches.
Our method’s main advantages can be listed as follows.
Number of iterations during inference: The above-mentioned LK-based methods, after the training stage, also iterate a number of times during the inference stage, and at each iteration, they try to minimize the loss. However, those methods are not guaranteed to converge to the optimal solution, and the number of iterations, chosen as a hyperparameter, is often an arbitrary number during the inference stage. Such iterative approaches introduce uncertainty in the processing time, as convergence can happen after the first iteration in some situations and after the last iteration in others. Such uncertainty also affects the real-time processing of images, as it can introduce varying frames-per-second values. Our method uses a single pass during inference, which makes it more applicable to real-time applications.
Dependence on the initial H estimate: In addition to the above-mentioned difference, the LK-based algorithms require an initial estimate of the homography matrix, and their performance (and the number of iterations required for convergence) directly depends on that initial estimate; therefore, it is typically given as an input (hyperparameter). While we also initialize the weights of our architecture, we do not need an initial estimate of the homography matrix as an input to the architecture.
Image alignment on image pairs taken by different onboard cameras on UAVs is a challenging and important topic for various applications. When the images to be aligned are acquired with different modalities, classic approaches, such as the SIFT and RANSAC combination, can yield insufficient results. Deep-learning techniques can be more reliable in such situations, as our results demonstrate. LK-based deep techniques have recently shown promise; however, we demonstrate with our approach (VisIRNet) that, without designing any LK-based block and by focusing only on the four corner points, we can sufficiently train deep architectures for image alignment.
ACKNOWLEDGMENT
This article has been produced benefiting from the 2232 International Fellowship for Outstanding Researchers Program of TUBITAK (Project No: 118C356). However, the entire responsibility of the article belongs to the owner of the article.