Thursday, March 7, 2024

VisIRNet: Deep Image Alignment for UAV-Taken Visible and Infrared Image Pairs | IEEE Journals & Magazine | IEEE Xplore

Fig. 1. Overview of the image alignment process. On the left, the input RGB image, $I_{\text{RGB}}$ ($192 \times 192$ pixels), and the IR image, $I_{\text{IR}}$ ($128 \times 128$ pixels, shown in pseudocolors), are shown. Both images are given as input to the registration stage, where the transformation parameters represented by the homography matrix ($H$) are predicted. After the registration process, $I_{\text{IR}}$ is transformed (warped) onto the $I_{\text{RGB}}$ space by locating the positions of $c_1$, $c_2$, $c_3$, and $c_4$ as $c_1'$, $c_2'$, $c_3'$, and $c_4'$. The warped $I_{\text{IR}}$ is overlaid (with $\alpha = 0.4$) on $I_{\text{RGB}}$.

Summary

This paper proposes a novel deep learning approach called VisIRNet for aligning visible and infrared image pairs taken from unmanned aerial vehicles (UAVs). Many existing techniques rely on iterative Lucas-Kanade algorithms and keypoint matching, which can be computationally expensive and inaccurate for multimodal images.

VisIRNet uses a two-branch convolutional neural network to extract modality-specific features from the visible and infrared images separately. Rather than predicting the full homography matrix, its primary variant directly predicts the coordinates of the four corner points of the infrared image on the visible image, which removes the need for iterative refinement and keypoint extraction steps.

The authors evaluate VisIRNet on several aerial datasets containing visible-infrared image pairs. Their approach achieves state-of-the-art registration accuracy compared to existing deep learning techniques like DHN, MHN, CLKN and DLKFM. VisIRNet exhibits smaller standard deviations in registration errors and can handle both single and multimodal image pairs effectively. The key advantages are faster inference without iterations, no dependence on initial homography estimates, and accurate corner prediction instead of full homography estimation.

Authors

Sedat Özer is listed as a Senior Member of IEEE. He is currently an Assistant Professor in the Department of Computer Science at Ozyegin University in Istanbul, Turkey. 


Alain P. Ndigande received the B.Eng. degree from Kocaeli University, İzmit, Turkey, in 2022. He is currently pursuing the M.Sc. degree at Ozyegin University, İstanbul, Türkiye. His research interests include deep learning, image registration, and remote sensing.

The work was supported by a project (Project No: 118C356) under TUBITAK, the Scientific and Technological Research Council of Turkey, indicating the authors' institutional affiliation with universities and research centers in Turkey.

Artifacts and Data

The paper does not provide any details about live testing or the specific sensors used for data collection. The experiments seem to be conducted purely on existing aerial datasets containing visible and infrared image pairs.

The paper mentions using the following datasets for training and evaluation:

  1. SkyData - Contains RGB and infrared image pairs from UAV videos.
  2. MSCOCO - Microsoft Common Objects in Context, A widely used dataset for object detection/segmentation tasks.
  3. Google Maps/Earth - Satellite imagery.
  4. VEDAI - A vehicle detection dataset with aerial imagery.

Other datasets that could serve a similar purpose include:

  • VisDrone DroneVehicle — a drone-based RGB-infrared cross-modality vehicle detection dataset, introduced in "Drone-Based RGB-Infrared Cross-Modality Vehicle Detection via Uncertainty-Aware Learning" (paper).

However, there is no mention of the authors collecting their own data using specific sensors on UAVs or other platforms for live testing of their VisIRNet approach. The paper focuses solely on evaluating the proposed deep learning model on pre-existing multi-modal aerial datasets.

The authors do not provide any information about releasing their code, trained models or any artifacts from their work. The paper lacks details on reproducing their results or accessing any supplementary materials beyond what is presented in the manuscript itself.


S. Özer and A. P. Ndigande, "VisIRNet: Deep Image Alignment for UAV-Taken Visible and Infrared Image Pairs," in IEEE Transactions on Geoscience and Remote Sensing, vol. 62, pp. 1-11, 2024, Art no. 5403111, doi: 10.1109/TGRS.2024.3367986. 

keywords: {Feature extraction; Cameras; Autonomous aerial vehicles; Prediction algorithms; Image resolution; Deep learning; Computer architecture; Corner-matching; deep learning; image alignment; infrared image registration; Lucas–Kanade (LK) algorithms; multimodal image registration; unmanned aerial vehicle (UAV) image processing}

Abstract:

This article proposes a deep-learning-based solution for multimodal image alignment regarding unmanned aerial vehicle (UAV)-taken images. Many recently proposed state-of-the-art alignment techniques rely on using Lucas–Kanade (LK)-based solutions for a successful alignment. However, we show that we can achieve state-of-the-art results without using LK-based methods. 

Our approach carefully utilizes a two-branch-based convolutional neural network (CNN) based on feature embedding blocks. We propose two variants of our approach, where in the first variant (Model A), we directly predict the new coordinates of only the four corners of the image to be aligned; and in the second one (Model B), we predict the homography matrix directly. 

Applying alignment on the image corners forces the algorithm to match only those four corners as opposed to computing and matching many (key) points, since the latter may cause many outliers, yielding less accurate alignment. We test our proposed approach on four aerial datasets and obtain state-of-the-art results when compared to the existing recent deep LK-based architectures.

Article Sequence Number: 5403111
Date of Publication: 20 February 2024
Publisher: IEEE
SECTION I.

Introduction

Recent advancements in unmanned aerial vehicle (UAV), computing, and sensor technologies have allowed the use of UAVs for various earth observation applications. Many UAV systems are equipped with multiple cameras today, as cameras provide reasonable and relatively reliable information about the surrounding scene in the form of multiple images or image pairs. Such image pairs can be taken by different cameras, at different viewpoints, in different modalities, or at different resolutions. In such situations, the same objects or the same features might appear at different coordinates in each image, and therefore an image alignment (registration) step is needed before applying many other image-based computer vision applications such as image fusion, object detection, segmentation, or object tracking as in [43], [44], and [45].

The infrared spectrum and the visible spectrum may reflect different properties of the same scene. Consequently, images taken in those modalities typically differ from each other. On many digital cameras, the visible spectrum is captured and stored in the form of a red-green-blue (RGB) image model, and a typical visible spectrum camera captures visible light ranging from approximately 400 to 700 nm in wavelength [6], [35]. Infrared cameras, on the other hand, capture wavelengths longer than those of visible light, falling between 700 and 10 000 nm [9]. Infrared images can be further categorized into different wavelength ranges as near-infrared (NIR), mid-infrared (MIR), and far-infrared (FIR), capturing different types of information in the spectrum [9], [10], [15], [16].

Image alignment is, essentially, the process of mapping pixel coordinates from different coordinate system(s) into one common coordinate system. This problem is studied under different names, including image registration and image alignment; we use the terms alignment and registration interchangeably in this article. Typically, alignment is done on image pairs, mapping from one image (source) onto the other (target) [18]. Image alignment is a common problem in many image-based applications, where the target and source images can be acquired by sensors using the same modality or different modalities. There is a wide range of applications of image alignment in many fields, including medical imaging [1], [21], UAV applications [22], [36], image stitching [8], and remote-sensing applications [5], [29], [30], [37].

Image alignment, in many cases, can be reduced to the problem of estimating the parameters of the perspective transformation between two images acquired by two separate cameras, where we assume that the cameras are located on the same UAV system. Fig. 1 summarizes such an image alignment process where the input consists of a higher resolution RGB image (e.g., 192×192 pixels) and a lower-resolution IR image (e.g., 128×128 pixels visualized in pseudocolors in the figure). The output of the registration algorithm is the registered (aligned) IR image on the RGB image’s coordinate system. As perspective transformation [20] is typically enough for UAV setups containing nearby onboard cameras, our registration process uses a registration function based on the Homography (H) matrix. H contains eight unknown (projection) parameters and the goal of the registration process is to predict those eight unknown parameters, directly or indirectly.

Fig. 1. Overview of the image alignment process. On the left, the input RGB image, $I_{\text{RGB}}$ ($192 \times 192$ pixels), and the IR image, $I_{\text{IR}}$ ($128 \times 128$ pixels, shown in pseudocolors), are shown. Both images are given as input to the registration stage, where the transformation parameters represented by the homography matrix ($H$) are predicted. After the registration process, $I_{\text{IR}}$ is transformed (warped) onto the $I_{\text{RGB}}$ space by locating the positions of $c_1$, $c_2$, $c_3$, and $c_4$ as $c_1'$, $c_2'$, $c_3'$, and $c_4'$. The warped $I_{\text{IR}}$ is overlaid (with $\alpha = 0.4$) on $I_{\text{RGB}}$.

In the relevant literature, registering RGB and IR image pairs is done both with classical techniques (such as the scale-invariant feature transform (SIFT) [33] along with the random sample consensus (RANSAC) [14] algorithm, as in [3]) and with more recent deep-learning-based techniques, as in [7], [34], [52]. Classical techniques include feature-based [40], [50] and intensity-based [39] methods. Feature-based [40], [50] methods essentially find correspondences between salient features detected in the images [47]. Salient features are computed using approaches such as SIFT [32], speeded-up robust features (SURF) [4], the Harris corner detector [19], and the Shi–Tomasi corner detector [24] in each image. The features from both images are then matched to find correspondences, as in [41], [42], and [46], and to compute the transformation parameters in the form of a homography matrix. The RANSAC [46] algorithm is commonly used in the literature to compute the homography matrix that minimizes the total number of outliers. Intensity-based [39] methods compare intensity patterns in images via similarity metrics. By estimating the movement of each pixel, optical flow is computed and used to represent the overall motion parameters. The methods in [2] and [13] use Lucas–Kanade (LK)-based algorithms that take initial parameters and iteratively estimate a small change in the parameters to minimize the error. A typical intensity-based registration technique essentially uses a form of similarity as its metric or registration criterion, including mean squared error (MSE) [17], cross correlation [28], the structural similarity index (SSIM), and peak signal-to-noise ratio (PSNR) [51]. Such metrics are not sufficient when the source image and target image are acquired by different modalities, which can yield poor performance when such intensity-based methods are used.

Overall, such major classical approaches, typically, are based on finding and matching similar salient keypoints in image pairs, and therefore, they can yield unsatisfactory results in various multimodal registration applications.
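For reference, a minimal OpenCV sketch of the classical feature-based pipeline described above (SIFT keypoints, ratio-test matching, and RANSAC homography estimation) might look as follows. This is purely illustrative and not the paper's code; the ratio threshold and RANSAC reprojection threshold are assumed values.

```python
# Illustrative classical registration baseline: SIFT + ratio test + RANSAC.
import cv2
import numpy as np

def classical_register(source_gray, target_gray):
    sift = cv2.SIFT_create()
    kp_s, des_s = sift.detectAndCompute(source_gray, None)
    kp_t, des_t = sift.detectAndCompute(target_gray, None)

    # Lowe's ratio test on brute-force matches
    matches = cv2.BFMatcher().knnMatch(des_s, des_t, k=2)
    good = [m for m, n in matches if m.distance < 0.75 * n.distance]
    if len(good) < 4:
        return None  # not enough correspondences to estimate H

    src = np.float32([kp_s[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([kp_t[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)  # minimize outliers
    return H
```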

Relevant deep alignment approaches use a form of keypoint matching, template matching, or LK-based approaches as in [7] and [27]. Those techniques typically consider multiple points or important regions in images to compute the homography matrix H, which contains the transformation parameters. However, having four matching points represented by their corresponding 2-D coordinates $(x_i, y_i)$, where $i = 1, 2, 3, 4$, is sufficient to estimate H. Therefore, if found accurately, four matching image-corner points between the IR and RGB images would be enough to perform accurate registration between the IR and RGB images. While many techniques based on keypoint extraction can be employed to find matching keypoints between the images, we argue that the corner points on the borders of one image can also be considered as keypoints, and by using those corners of the image, we do not need to utilize any keypoint extraction step.
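To make the four-point argument concrete, the short sketch below (an illustration, not the authors' code) computes $H$ exactly from the four IR image corners and hypothetical corner locations on the RGB image:

```python
# Four point correspondences are enough to determine the homography H.
import cv2
import numpy as np

n = 128  # the IR image is assumed to be n x n pixels
ir_corners = np.float32([[0, 0], [n - 1, 0], [n - 1, n - 1], [0, n - 1]])
# Hypothetical predicted corner locations on the 192 x 192 RGB image
rgb_corners = np.float32([[30, 25], [170, 20], [175, 165], [28, 160]])

H = cv2.getPerspectiveTransform(ir_corners, rgb_corners)  # exact 4-point solution
print(H)
```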

In this article, we propose a novel deep approach for registering IR and RGB image pairs, where instead of predicting the homography matrix directly, we predict the location of the four corner points of the entire image directly. This approach removes the additional iterative steps introduced by LK-based algorithms and eliminates the steps of computing and finding important keypoints. Our main contributions can be listed as follows.

  1. We introduce a novel deep approach for alignment problems of IR images onto RGB images taken by UAVs, where the resolutions of the input images differ from each other.

  2. We introduce a novel two-branch-based deep solution for registration without relying on Lucas–Kanade-based iterative methods.

  3. Instead of predicting the homography matrix directly, we predict the corresponding coordinates of the four corner points of the smaller image on the larger image.

  4. We study and report the performance of our approach on multiple aerial datasets and present state-of-the-art results.

SECTION II.

Related Work

Many recent techniques performing image alignment rely on deep learning. Convolutional neural networks (CNNs) form a pipeline of convolutional layers whose filters learn distinct features at different levels of the network. For example, DeTone et al. [12] proposed a deep image homography estimation network (DHN) that uses CNNs to learn meaningful features in both images and directly predicts the eight transformation parameters of the homography matrix. Later, Le et al. [26] proposed using a series of networks to regress the homography parameters; the later networks in their architecture aim to gradually improve the performance of the earlier ones. Their method builds on top of DHN [12]. Another work in [7] proposed incorporating the LK algorithm into the deep-learning pipeline.

Zhao et al. [52] used a CNN-based network and introduced a learning-based LK block. In their work, they designed modality-specific pipelines for the source and template images, respectively. At the end of each block, there is a unique feature construction function: instead of using the output feature maps directly, they constructed features based on the eigenvalues and eigenvectors of the output feature maps. The features constructed from the source and template network channels have a similar learned representation. Transformation parameters found at a lower scale are given as input to the next level, and the LK algorithm iterates until a certain threshold is reached. In another work, Deng et al. [11] utilized disentangled convolutional sparse coding to separate domain-specific and shared features of multimodal images for improved registration accuracy. Multiscale generative adversarial networks (GANs) have also been used to estimate homography parameters, as in [34].

The architectural comparisons of the above-mentioned networks are provided in Fig. 2. In DHN [12], the image to be transformed (denoted $I_{\text{IR}}$ in the figure) is padded to have the same dimensions as the target image ($I_{\text{RGB}}$), and the two are concatenated channel-wise. The concatenated images are given to the deep homography network (DHN) for direct regression of the eight values of the homography matrix. On the other hand, multiscale homography estimation (MHN) [26] uses a series of networks ($\text{Net}_i$). The inputs for $\text{Net}_2$ are a concatenation of $I_{\text{IR}}$ and $I_{\text{RGB}}$. For the succeeding levels, the warping function first performs the projective inverse warping operation on the infrared image ($I_{\text{IR}}$) via the homography matrix predicted at the previous level. The resulting image ($I_{\text{IR}}^{i}$) is concatenated with $I_{\text{RGB}}$ and then given as input to $\text{Net}_i$. For the following levels, the current matrix and the previously predicted matrices are multiplied to form the final prediction. In this way, MHN aims to learn to correct mistakes made at earlier levels. The cascaded LK network (CLKN) [7] uses separate networks for each modality. It uses levels of different scales in the form of feature pyramid networks and performs registration from the smallest scale to the largest; the homography matrix from an earlier LK layer is given as input to the next. Deep LK feature maps (DLKFM) [52] also performs coarse-to-fine registration, as shown in Fig. 2. It uses a special feature construction block (fcb). The fcb takes in the feature maps and transforms them into new features based on the eigenvectors and covariance matrix. The constructed features capture principal information, and the registration is performed on the constructed feature maps, which aims to increase the accuracy of the LK layer. Our approach uses separate feature embedding blocks to process each modality separately. It is trained to extract modality-specific features so that the output feature maps of different modalities can have similar feature representations.

Fig. 2. Architectures of various recently proposed deep alignment algorithms, including DHN [12], MHN [26], CLKN [7], and DLKFM [52]. While DHN and MHN predict the homography parameters $H$, CLKN and DLKFM rely on an LK-based iterative approach and use feature maps at different resolutions. By doing so, they predict the homography in steps $H_i$, where each step aims to correct the previous prediction.

SECTION III.

Proposed Approach: VisIRNet

In our proposed approach, we aim to perform accurate single- and multimodal image registration that is free of the iterative nature of LK-based algorithms. We name our network VisIRNet, and we aim to predict the locations of the corners of the input image on the target image directly, since having four matching points is sufficient to compute the homography parameters. In our proposed architecture, we assume that there are two input images with different resolutions. An overview of our architecture is given in Fig. 3. Our approach first processes the two inputs separately by passing them through their respective feature embedding blocks and extracts representative features. Those features are then combined and given to the regression block as input. The goal of the regression block is to compute the transformation parameters accurately. The output of the regression block is eight-dimensional (representing either the eight homography parameters or the coordinates of the four corner points of the source image on the target image).

Fig. 3. Overview of our proposed network architecture. Two parallel branches, the RGB branch and the IR branch (feature embedding blocks), extract the salient features for the RGB and IR images, respectively. Those features are then channel-wise concatenated and fed into the regression block for direct (Model B) or indirect (Model A) homography prediction; that is, the model can be trained to learn the homography matrix (Model B) or to regress the corresponding coordinates of the four corners of the input IR image on the RGB image (Model A). The output is an eight-dimensional vector: the entries of $H$ if Model B is used, or the $(x, y)$ coordinates of the four corners of the IR image if Model A is used. The details of the feature embedding block are given in the top corner of the figure (also see Table I). The details of the regression block are given in the lower right corner of the figure (also see Table II).

A. Preliminaries

1) Perspective Transformation:

Here, by perspective transformation, we mean a linear transformation in the homogeneous coordinate system which, in some sense, warps the source image onto the target image. The homography matrix consists of the transformation parameters needed for the perspective transformation. The elements of the $3 \times 3$ homography matrix represent the amount of rotation, translation, scaling, and skewing. The homography matrix $H$ is defined as follows:

$$H = \begin{bmatrix} p_1 & p_2 & p_3 \\ p_4 & p_5 & p_6 \\ p_7 & p_8 & 1 \end{bmatrix} \tag{1}$$

where the last element ($p_9$) is set to 1 to ensure the validity of the conversion from homogeneous to Cartesian coordinates. The warping function maps a set of coordinates $\{(x_i, y_i), \ldots\}$ to another coordinate system via $H$. Let $c_i = (x_i, y_i)$ be the location of a point in the coordinate set $C$ of the source image, and let $W(c, P)$ be the warping function that warps a given coordinate $c$ with the parameter set $P$ of $H$ onto the target image:

$$c_i' = W(c_i, P). \tag{2}$$

The warping process is a linear transformation in a homogeneous coordinate system. Therefore, the Cartesian coordinates are first transformed into the homogeneous coordinate system by adding an extra $z$ dimension to the 2-D Cartesian pixel coordinates. Let $c_i$ be the pixel with coordinates $(x_i, y_i)$; its homogeneous coordinate is obtained by setting the $z$-axis value to 1, that is, $c_i^{h} = (x_i, y_i, 1)$. Once we have the homography matrix, we warp any given $i$th pixel location $c_i$, represented by $(x_i, y_i)$, to its warped version $c_i^{\text{warped}}$ in the other image's Cartesian coordinates as follows:

$$c_i^{\text{warped},h} = W(c_i, P): \quad \begin{bmatrix} x_i' \\ y_i' \\ z_i' \end{bmatrix} = \begin{bmatrix} p_1 & p_2 & p_3 \\ p_4 & p_5 & p_6 \\ p_7 & p_8 & 1 \end{bmatrix} \begin{bmatrix} x_i \\ y_i \\ 1 \end{bmatrix} \tag{3}$$

where $x_i'$, $y_i'$, $z_i'$ are the warped homogeneous coordinates of $c_i^{\text{warped}}$, which can be converted to Cartesian coordinates by simply dividing by the $z_i'$ value. Therefore, we obtain the final warped 2-D pixel coordinates in Cartesian coordinates as $c_i' = (x_i^{\text{warped}}, y_i^{\text{warped}})$, where

$$x_i^{\text{warped}} = \frac{x_i'}{z_i'} = \frac{p_1 x_i + p_2 y_i + p_3}{p_7 x_i + p_8 y_i + 1} \tag{4}$$

$$y_i^{\text{warped}} = \frac{y_i'}{z_i'} = \frac{p_4 x_i + p_5 y_i + p_6}{p_7 x_i + p_8 y_i + 1}. \tag{5}$$
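As a concrete illustration of (3)-(5), the following NumPy sketch warps a single pixel coordinate with a homography; the matrix values are arbitrary examples, not values from the paper.

```python
import numpy as np

def warp_point(H, x, y):
    """Warp a 2-D pixel coordinate (x, y) with homography H (Eqs. (3)-(5))."""
    p = H @ np.array([x, y, 1.0])      # homogeneous warp
    return p[0] / p[2], p[1] / p[2]    # divide by z' to get Cartesian coordinates

# Arbitrary example homography (last element fixed to 1)
H = np.array([[1.2, 0.05, 30.0],
              [0.02, 1.1, 25.0],
              [1e-4, 2e-4, 1.0]])
print(warp_point(H, 127.0, 127.0))
```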

B. Network Structure

Our proposed network is composed of a multimodal feature embedding block (MMFEB) and a regression block (see Fig. 3). The regression block is responsible for predicting the eight homography parameters directly or indirectly. In this article, we study the performance of two variants of our proposed model, which we call Model A and Model B. Model A predicts the coordinates of the corner points, while its variant, Model B, predicts the homography parameters directly. In Model A, four corners are enough to find the homography matrix. Therefore, the last layer has eight neurons: the four $(x, y)$ corner components for Model A, or the eight unknown homography parameters for Model B.

1) Multimodal Feature Embedding Backbone:

The MMFEB is responsible for producing a combined representative feature set formed of fine-level features for both input images. The network then uses that combined representative feature set to transform the source image onto the target image. We adapt the idea of giving the RGB and infrared modalities separate branches, as in [52]. We use two identical networks (branches) with the same structure but different parameters for the RGB and infrared images, respectively. Therefore, the multimodal feature embedding block has two parallel branches with identical architectures (however, they do not share parameters), namely the RGB branch and the infrared branch. We first train the multimodal feature embedding backbone by using the average similarity loss $L_{\text{sim}}$ (see (6)). To compute the similarity loss, we first generate a $128 \times 128$ rectilinear grid representing locations in the infrared coordinate system, as in spatial transformers [23]. Then, we use the ground-truth homography matrix to warp the grid onto the RGB coordinate system, resulting in a warped curvilinear grid representing the projected locations. We use bilinear interpolation [25], [38] to sample those warped locations on the RGB feature maps ($f_{\text{RGB}}$). After that, we can compute the similarity loss between the IR feature maps and the resampled RGB feature maps. Algorithm 1 provides the algorithmic details of calculating the similarity loss for the feature embedding block.

Algorithm 1 Training Steps of the MMFEB

Inputs: I*_RGB, I*_IR (* indicates the whole training set)

for e ← 0 to epochs do
  for batch ← 0 to datasetSize/batchSize do
    I_RGB ← I*_RGB[batch]
    I_IR ← I*_IR[batch]
    H ← groundTruthHomography
    f_RGB ← RGBbranch(I_RGB)
    f_IR ← IRbranch(I_IR)
    simLoss ← L_sim(f_IR, f_RGB, H)
    Backprop(simLoss) using AdamOptimizer
  end for
end for

The MMFEB is trained using $L_{\text{sim}}$ (see (6)), which is detailed in Algorithm 3; the steps for training the MMFEB are given in Algorithm 1. The regression block is trained with the homography loss ($L_2^H$) in combination with the average corner error ($L_{\text{Ace}}$) (see the "Average Corner Error" section for the definition of $L_{\text{Ace}}$), yielding the total loss $L$ used to train our model. Table I summarizes the structure of our MMFEB.

TABLE I Layer-by-Layer Details of the Feature Embedding Block. There Are Also Skip Connections Between the Layers in This Architecture, as Shown in Fig. 3

Algorithm 3 Computing the L_sim Loss

f_RGB ← RGBbranch(I_RGB)
f_IR ← IRbranch(I_IR)
H ← groundTruthHomography
grid_{n×n} ← 2-D rectilinear grid with I_IR dimensions
Ensure: warpedGrid = warpGrid(grid_{n×n}, H^-1)
f'_RGB = BilinearSampler(f_RGB, warpedGrid)
L_sim ← 0
for i = 0; i < n×n do
  Ir_i = f_IR[i]
  Rgb_i = f'_RGB[i]
  P_diff ← Ir_i − Rgb_i
  L_sim ← L_sim + P_diff²
end for
L_sim ← L_sim / (n×n)
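A rough PyTorch-style sketch of Algorithm 3 is given below. The tensor shapes, the assumed direction of the ground-truth homography (IR pixel coordinates to RGB pixel coordinates), and the use of `grid_sample` for the bilinear resampling are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def similarity_loss(f_rgb, f_ir, H_ir_to_rgb):
    """Sketch of L_sim (Eq. (6) / Algorithm 3).

    f_rgb       : (B, C, 192, 192) RGB-branch feature maps
    f_ir        : (B, C, 128, 128) IR-branch feature maps
    H_ir_to_rgb : (B, 3, 3) ground-truth homography taking IR pixel
                  coordinates to RGB pixel coordinates (assumed direction)
    """
    B, _, ir_h, ir_w = f_ir.shape
    rgb_h, rgb_w = f_rgb.shape[-2:]
    device = f_ir.device

    # Rectilinear grid of IR pixel coordinates in homogeneous form (x, y, 1)
    ys, xs = torch.meshgrid(torch.arange(ir_h, dtype=torch.float32, device=device),
                            torch.arange(ir_w, dtype=torch.float32, device=device),
                            indexing="ij")
    grid = torch.stack([xs, ys, torch.ones_like(xs)], dim=-1).reshape(-1, 3)

    # Warp the grid into the RGB coordinate system (curvilinear grid)
    warped = torch.matmul(grid, H_ir_to_rgb.transpose(1, 2))   # (B, HW, 3)
    warped = warped[..., :2] / warped[..., 2:3]                # to Cartesian

    # Normalize to [-1, 1] as required by grid_sample, then resample bilinearly
    norm = torch.tensor([rgb_w - 1.0, rgb_h - 1.0], device=device)
    warped = warped / norm * 2.0 - 1.0
    warped = warped.reshape(B, ir_h, ir_w, 2)
    f_rgb_resampled = F.grid_sample(f_rgb, warped, mode="bilinear",
                                    align_corners=True)

    # Mean squared difference between resampled RGB and IR feature maps
    return torch.mean((f_rgb_resampled - f_ir) ** 2)
```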

2) Regression Block:

The second main stage of our pipeline is the regression block, which is responsible for making the final prediction: the four corner locations for Model A, or the unknown parameters of the homography matrix for Model B. $f_{\text{RGB}}$ and $f_{\text{IR}}$ are the feature maps extracted by passing the RGB image and the infrared image through their respective feature embedding blocks. Note that $f_{\text{RGB}}$ and $f_{\text{IR}}$ have different dimensions. Therefore, we apply zero-padding to the lower-dimensional feature maps ($f_{\text{IR}}$) to bring them to the dimensions of $f_{\text{RGB}}$, resulting in $f_{\text{IR}}^{\text{padded}}$. We concatenate $f_{\text{IR}}^{\text{padded}}$ channel-wise with the $f_{\text{RGB}}$ feature maps and use the result as the input to the regression block.

The architecture of the regression block is further divided into two subparts, as shown in Fig. 3. The first part is composed of six levels. Apart from the last level, each level is composed of two sublevels followed by a max-pooling layer. A sublevel is a convolution layer followed by a batch normalization layer and a ReLU activation function. Sublevels m and n of a level l are identical in terms of the filters used, kernel size, stride, and padding for that level. The sixth level does not have a max-pooling layer. The second part has two 1024-unit dense layers with ReLU activations, followed by a dropout layer and an eight-unit dense output layer for the eight parameters of the homography matrix or the corner components. Feature maps from the first part are flattened and given to the second part, where the homography matrix parameters or corner components are predicted according to the model used. Table II gives detailed information for the first and second parts of the regression head.
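The sketch below outlines this regression-block structure in PyTorch. The filter widths, kernel size, and dropout rate are placeholder assumptions; Table II gives the actual values used in the paper.

```python
import torch.nn as nn

def conv_bn_relu(in_ch, out_ch):
    # One sublevel: Conv2D -> BatchNorm -> ReLU
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class RegressionBlock(nn.Module):
    """Hedged sketch of the regression block: six levels of two identical
    sublevels each, max-pooling after all but the last level, then two
    1024-unit dense layers, dropout, and an 8-unit output."""
    def __init__(self, in_ch=128, widths=(64, 128, 128, 256, 256, 512),
                 n_outputs=8, dropout=0.5):
        super().__init__()
        levels, ch = [], in_ch
        for i, w in enumerate(widths):
            levels += [conv_bn_relu(ch, w), conv_bn_relu(w, w)]
            if i < len(widths) - 1:          # the sixth level has no max-pooling
                levels.append(nn.MaxPool2d(2))
            ch = w
        self.features = nn.Sequential(*levels)
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(1024), nn.ReLU(inplace=True),
            nn.Linear(1024, 1024), nn.ReLU(inplace=True),
            nn.Dropout(dropout),
            nn.Linear(1024, n_outputs),      # corners (Model A) or H (Model B)
        )

    def forward(self, x):
        return self.head(self.features(x))
```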

TABLE II Layer-by-Layer Details of the Regression Block as Shown in Fig. 3. Levels Indicated by L Are Groups of Conv2D + BatchNormalization + ReLU. The Conv2D Layers in Each Level Have the Same Characteristics and Filter Dimensions. The Number of Filters Used Increases as We Get Deeper in the Architecture

C. Loss

While the MMFEB uses the similarity loss, we use two loss terms for the regression head, based on the corner error and the homography.

1) Similarity Loss:

The similarity loss is used to train MMFEB and is defined as follows:

$$L_{\text{sim}} = \frac{1}{xy}\sum_{x=0}^{n}\sum_{y=0}^{n}\left(f'_{\text{RGB}}(x,y) - f_{\text{IR}}(x,y)\right)^{2} \tag{6}$$

where $f_{\text{IR}}(x,y)$ is the value at location $(x,y)$ of the infrared feature maps and $f'_{\text{RGB}}(x,y)$ is the value at location $(x,y)$ of the resampled RGB feature maps. Note that $(x,y)$ is a location in the coordinate system constrained by the infrared image height and width. The algorithmic details of the similarity loss are provided in Algorithm 3.

Algorithm 2 Training Step of the Regression Block

Let M be the RegressionBlock

Inputs: I*_RGB, I*_IR (* indicates the whole training set)

for e ← 0 to epochs do
  for batch ← 0 to datasetSize/batchSize do
    I_RGB ← I*_RGB[batch]
    I_IR ← I*_IR[batch]
    H ← groundTruthHomography
    f_RGB ← RGBbranch(I_RGB)
    f_IR ← IRbranch(I_IR)
    Ensure: f_RGB.shape = 192×192×64
    Ensure: f_IR.shape = 128×128×64
    f_IR^padded = zeroPad(f_IR)
    f_RGB_IR = concat(f_RGB, f_IR^padded)
    Ĥ = M(f_RGB_IR)
    Loss = L(Ĥ, H)
    Backprop(Loss) using AdamOptimizer
  end for
end for
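Putting the pieces together, a hedged sketch of one regression-block training step (Algorithm 2) could look as follows, reusing the RegressionBlock sketch above. The shapes follow the paper (192×192×64 RGB features, 128×128×64 IR features); the corner-loss form shown corresponds to Model A, and the optimizer handling is an assumption.

```python
import torch
import torch.nn.functional as F

def training_step(rgb_branch, ir_branch, regression_block, optimizer,
                  I_rgb, I_ir, corners_gt):
    f_rgb = rgb_branch(I_rgb)                      # (B, 64, 192, 192)
    f_ir = ir_branch(I_ir)                         # (B, 64, 128, 128)

    # Zero-pad IR features to the RGB spatial size, then concatenate channel-wise
    pad = f_rgb.shape[-1] - f_ir.shape[-1]         # 64 pixels
    f_ir_padded = F.pad(f_ir, (0, pad, 0, pad))    # (B, 64, 192, 192)
    x = torch.cat([f_rgb, f_ir_padded], dim=1)     # (B, 128, 192, 192)

    pred = regression_block(x)                     # (B, 8): corner coordinates
    loss = F.mse_loss(pred, corners_gt.reshape(pred.shape))  # L_Ace for Model A

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```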

2) L2 Homography Loss Term:

Model B is trained to predict the values of the elements of the homography matrix. Therefore, its output is the eight free elements of a $3 \times 3$ matrix (where the ninth element is set to 1). The homography-based loss term $L_2^H$ is defined as follows: let $[p_i$ (for $i = 1, 2, \ldots, 8$), $1]$ be the elements of the $3 \times 3$ ground-truth homography matrix $H$. Similarly, let $[\hat{p}_i$ (for $i = 1, 2, \ldots, 8$), $1]$ be the elements of the $3 \times 3$ predicted homography matrix $\hat{H}$. Then $L_2^H = \frac{1}{8}\sum_{i=1}^{8}(p_i - \hat{p}_i)^2$, where $L_2^H$ represents the homography loss based on the $L_2$ distance.

3) Average Corner Error:

$L_{\text{Ace}}$ is computed as the average of the squared differences between the predicted and ground-truth locations of the corner points. For Model B, we use the predicted homography matrix to transform the four corners of the infrared image onto the coordinate system of the RGB image, and together with the ground-truth locations we compute $L_{\text{Ace}}$. Let $e_i$ be a corner at the $(x_i, y_i)$ coordinates of the infrared image and let $e_i'$ be its warped equivalent in the RGB coordinate space, such that $e_i' = W(e_i, P)$, where $W$ is the warping function. Then

$$L_{\text{Ace}} = \frac{1}{4}\sum_{i=1}^{4} D(e_i, e_i')^{2} = \frac{1}{4}\sum_{i=1}^{4}\left(W(e_i, P) - W(e_i, \hat{P})\right)^{2} \tag{7}$$

where $D$ is defined as $D(e_i, e_i') = W(e_i, P) - W(e_i, \hat{P})$, and $P$ and $\hat{P}$ are the ground-truth and predicted vectorized homography matrices, respectively. The total loss for Model B is then computed as $L = L_2^H + \gamma L_{\text{Ace}}$, where $\gamma$ is a weight factor (a hyperparameter).

In Model A, we predict the $x$ and $y$ locations of the four corner points instead of computing the homography matrix. This makes it possible for the network to learn to predict exact locations (landmarks) instead of focusing on one solution. As shown in our experiments (see Fig. 4 for qualitative and Fig. 5 for quantitative results), Model A converges faster and yields better results while minimizing outliers. We use a slightly modified version of $L_{\text{Ace}}$ for Model A, such that $\hat{e}_i$ becomes the ground-truth corner coordinate in the RGB coordinate space. For Model A, $L_{\text{Ace}}$ is defined as follows:

$$L_{\text{Ace}} = \frac{1}{4}\sum_{i=1}^{4}\left(e_{i}' - \hat{e}_{i}\right)^{2}. \tag{8}$$
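The following NumPy/OpenCV sketch illustrates the loss terms defined above: the $L_2^H$ homography loss, the ACE of (7) for Model B (corners warped with the ground-truth and predicted homographies), and the Model A form of (8). The helper names, the 128-pixel corner coordinates, and the weight $\gamma$ are assumptions for illustration, not the authors' code.

```python
import cv2
import numpy as np

IR_CORNERS = np.float32([[0, 0], [127, 0], [127, 127], [0, 127]]).reshape(-1, 1, 2)

def homography_l2(H_gt, H_pred):
    # L_2^H: mean squared difference over the eight free parameters
    return float(np.mean((H_gt.ravel()[:8] - H_pred.ravel()[:8]) ** 2))

def ace_model_b(H_gt, H_pred):
    # Eq. (7): warp the four IR corners with both homographies and average
    # the squared corner distances
    c_gt = cv2.perspectiveTransform(IR_CORNERS, H_gt)
    c_pred = cv2.perspectiveTransform(IR_CORNERS, H_pred)
    return float(np.mean(np.sum((c_gt - c_pred) ** 2, axis=-1)))

def ace_model_a(corners_gt, corners_pred):
    # Eq. (8): Model A predicts the corner coordinates directly
    return float(np.mean(np.sum((corners_gt - corners_pred) ** 2, axis=-1)))

def total_loss_model_b(H_gt, H_pred, gamma=1.0):
    # L = L_2^H + gamma * L_Ace (gamma is a hyperparameter)
    return homography_l2(H_gt, H_pred) + gamma * ace_model_b(H_gt, H_pred)
```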

Fig. 4. Qualitative results on sample image pairs taken from different datasets. The first two columns show the input image pairs for the algorithms: the target image is $192 \times 192$ pixels and the source image is $128 \times 128$ pixels (covering a scene that is a subset of the target image). The third column shows the ground-truth version ($192 \times 192$ pixels) of the source image warped onto the coordinate system of the target image. The fourth column shows the ground-truth (warped) source image overlaid on the target image ($192 \times 192$ pixels). The remaining six columns show the overlaid results ($192 \times 192$ pixels) after registration with SIFT, DHN, MHN, CLKN, DLKFM, and our approach, respectively. Visually, each algorithm's result can be compared to the image in the fourth column.

Fig. 5. ACE distribution versus count of image pairs for different models, shown on the test set of SkyData.

In addition to these loss functions, we also used additional loss functions in the MMFEB block during our ablation study, namely $L_{\text{MAE}}$ and $L_{\text{SSIM}}$. They are briefly defined as

$$L_{\text{MAE}} = \frac{1}{xy}\sum_{x=0}^{n}\sum_{y=0}^{n}\left|f'_{\text{RGB}}(x,y) - f_{\text{IR}}(x,y)\right| \tag{9}$$

$$L_{\text{SSIM}} = 1 - \text{SSIM}\left(f'_{\text{RGB}}(x,y), f_{\text{IR}}(x,y)\right) \tag{10}$$

where SSIM is used as defined in [43].
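For completeness, a sketch of the two ablation losses is given below; scikit-image's `structural_similarity` is used here as a stand-in for the SSIM of [43], and the channels-last layout and data-range choice are assumptions.

```python
import numpy as np
from skimage.metrics import structural_similarity

def l_mae(f_rgb_resampled, f_ir):
    # Eq. (9): mean absolute difference between resampled RGB and IR feature maps
    return float(np.mean(np.abs(f_rgb_resampled - f_ir)))

def l_ssim(f_rgb_resampled, f_ir):
    # Eq. (10): 1 - SSIM over the feature maps (channels-last assumed)
    data_range = float(f_ir.max() - f_ir.min())
    ssim = structural_similarity(f_rgb_resampled, f_ir,
                                 channel_axis=-1, data_range=data_range)
    return 1.0 - float(ssim)
```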

SECTION IV.

Experiments

In this section, we describe our experimental procedures, the datasets used, and our metrics.

A. Datasets

In our experiments, we use SkyData (containing RGB and IR image pairs), MSCOCO [31], Google Maps and Google Earth (as taken from DLKFM [52]), and VEDAI [48]. Refer to Table III for more details about the datasets used in our experiments. SkyData is originally a video-based dataset that provides each frame of its videos in image format.

TABLE III Summary of the Datasets Used in Our Experiments. The Training and Test Sets Are Generated as Explained in Section IV

B. Generating the Training and Test Sets

To train the algorithms, we need unregistered and registered (ground truth) image pairs. For SkyData, we randomly select m frame pairs for each video sequence.

For each dataset that we use, we generate the training and test sets as follows.

  1. Select a registered image pair at higher resolutions.

  2. Sample (crop) regions around the center of the image to get smaller patches of 192×192 pixels. This process is done in parallel for visible and infrared images.

  3. If the extracted patches are not sufficiently aligned, manually align them.

  4. For each pair, select a subset of the IR image, by randomly selecting four distinct locations on the image.

  5. Find the perspective transformation parameters that map those randomly chosen points to the following fixed locations: $(0, 0)$, $(n-1, 0)$, $(n-1, n-1)$, $(0, n-1)$, so that they correspond to the corners of the unregistered IR image patch, where we assume that the unregistered $I_{\text{IR}}$ is $n \times n$ dimensional (in our experiments, $n$ is set to 128). This process creates an unregistered infrared patch (from the already registered ground truth) that needs to be placed back at its true position.

  6. Use those four initially selected points as the ground-truth corners for the registered image.

  7. Repeat process k times to create k different image pairs. This newly created dataset is then split into training and test sets.

  8. The RGB images are used as the target set and the transformed infrared patches are used as the source set (for both training and testing).

This process is done on randomly selected registered pairs for each dataset. Fig. 6 also illustrates this process on a pair of RGB and IR images. The list of all the used datasets and their details are summarized in Table III.
Fig. 6. How to select the initial corner points on the registered image pairs and how to generate the training data. First, a random image patch is taken from the originally registered IR image. Then, the random corners of that patch are transformed into fixed coordinates, and the $H$ matrix (and its inverse) performing that transformation is computed.
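A hedged sketch of the patch-generation procedure above (steps 4-7) is shown below: four random points are selected on a registered IR image, the homography mapping them to the fixed corners of an $n \times n$ patch is computed, and the patch is warped out to serve as the unregistered source image. The sampling strategy and helper names are illustrative assumptions, not the authors' code.

```python
import cv2
import numpy as np

def make_training_pair(registered_ir, n=128, rng=np.random.default_rng()):
    h, w = registered_ir.shape[:2]
    # Four randomly selected, reasonably spread locations on the registered image
    src_pts = np.float32([
        [rng.uniform(0, w // 2), rng.uniform(0, h // 2)],      # top-left region
        [rng.uniform(w // 2, w), rng.uniform(0, h // 2)],      # top-right region
        [rng.uniform(w // 2, w), rng.uniform(h // 2, h)],      # bottom-right region
        [rng.uniform(0, w // 2), rng.uniform(h // 2, h)],      # bottom-left region
    ])
    # Fixed corners of the unregistered n x n IR patch
    dst_pts = np.float32([[0, 0], [n - 1, 0], [n - 1, n - 1], [0, n - 1]])

    H = cv2.getPerspectiveTransform(src_pts, dst_pts)          # registered -> patch
    unregistered_patch = cv2.warpPerspective(registered_ir, H, (n, n))
    # src_pts are the ground-truth corner locations to be recovered at training time
    return unregistered_patch, src_pts, np.linalg.inv(H)
```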

C. Evaluation Metrics

As shown in Table IV, we quantitatively evaluate the performance of our models using ACE and the homography error. We compute each algorithm's result distribution in terms of quartiles, mean, standard deviation, and min-max values for a given test set. Quartiles are descriptive statistics that summarize the central tendency and variability of data [49]; they are a specific type of quantile that divides the data into four equal parts. The three quartiles are denoted Q1, Q2 (also known as the median), and Q3. The 25th (Q1), 50th (Q2), and 75th (Q3) percentiles indicate the values below which 25%, 50%, and 75% of the data fall, respectively (the bottom-right illustration in Fig. 7 also depicts these terms). To find the quartiles, we first sort the data in ascending order. The first quartile is the value of the element at position datasetSize/4 in the sorted data; likewise, the second quartile corresponds to the element at position datasetSize/2, and the third quartile to the element at position 3·datasetSize/4. Samples that fall outside (Q1 − 1.5·IQR, Q3 + 1.5·IQR), where IQR is the interquartile range, are considered outliers. The box plot in Fig. 7 illustrates the above description visually.
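The quartile and outlier bookkeeping described above can be summarized in a few lines of NumPy (an illustrative sketch, not the authors' evaluation code):

```python
import numpy as np

def box_plot_stats(errors):
    """Quartiles, IQR, whiskers, and outlier count for a set of ACE values."""
    errors = np.asarray(errors, dtype=float)
    q1, q2, q3 = np.percentile(errors, [25, 50, 75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = errors[(errors < lower) | (errors > upper)]
    return dict(Q1=q1, median=q2, Q3=q3, IQR=iqr,
                whiskers=(lower, upper), n_outliers=len(outliers))
```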

TABLE IV Comparative Results of the Algorithms on Each Dataset in Terms of ACE. Best Results Are Shown in Bold. The Results Illustrate That Traditional SIFT Performed Reasonably on Datasets of Single or Close Modalities; in Cases Where Enough Matching Pairs Were Not Found, SIFT Is Unable to Estimate the Homography Matrix, and We Assign a Constant Error of 10000.0. Learning-Based Algorithms Such as DHN and MHN, Which Directly Predict the Homography Matrix Without Learning a Common Representation, Also Suffer, Especially on Datasets Such as SkyData and Google Maps; This Is Due to the Models Being Unable to Create Meaningful Correspondences Between the Input and Target Images Because of the Modality Difference. Note That Our Approach Has a Small Standard Deviation as Opposed to the LK-Based Approaches; LK Techniques Often Deviate Significantly From the Solution Depending on the Number of Iterations They Are Run and the Initial Parameters They Receive
Fig. 7. Each plot shows the ACE for the algorithms, including MHN, DHN, DLKFM, CLKN, SIFT, and ours, for a different dataset. The legends used in the plots are given in the lower right corner of the figure.

Table V shows an ablation study on using different loss functions in each block of our architecture. The metric used in the table is ACE, and the best values are shown in bold. The loss functions in each row are used to train the MMFEB block, including $L_{\text{sim}}$, $L_{\text{MAE}}$, and $L_{\text{SSIM}}$. The loss functions used for the regression block are $L_{\text{Ace}}$ and $L_2^H$. In the table, the last column shows the average error over both $L_{\text{Ace}}$ and $L_2^H$ (over two datasets, SkyData and VEDAI) for each of the loss functions used in the MMFEB block.

TABLE V Ablation Study on Using Different Combinations of Loss Functions on Two Different Datasets. The Loss Functions Shown in Each Row Are Used for the MMFEB Block, and the Loss Functions Shown in Each Column ($L_{\text{Ace}}$ and $L_2^H$) Are Used for the Regression Block in Our Model. Best Results Are Shown in Bold. ACE Is the Metric Used to Compute the Results for Each Loss Function Combination. The Last Column Shows the Average ACE Value for Each Loss Function Used in the MMFEB Block. On Average, $L_{\text{sim}}$ Yielded the Best Results

Next, we provide experimental results on the effect of the hyperparameters that we studied for both Model A and Model B. Table VI summarizes those results. In particular, we studied the effect of using different loss functions ($L_1$, $L_2$, and $L_{\text{Ace}}$) and different batch sizes for both models. All of these experiments were done on the SkyData dataset. The best results are shown in bold. Overall, Model A showed promising results, achieving better registration results than Model B. Therefore, for the rest of our experiments, we kept using Model A only.

TABLE VI Comparison of the Results of Model A and Model B for Various Hyperparameters, Including Batch Sizes and Loss Functions. The Top Table (a) Shows the Homography Error, While the Bottom Table (b) Shows the Results as ACE. Note That a Low Homography Error Does Not Necessarily Imply a Small Registration Error. We Study the Effect of Batch Sizes and Loss Functions. Directly Predicting the Homography Matrix Works, but in Our Experiments It Does Not Minimize the Registration Error as Well as Directly Predicting the Corners. These Results Are Obtained on the SkyData Dataset

Fig. 7 uses a box plot, also known as a box-and-whisker plot, to display the distribution of ACE for different datasets and different models. It provides a summary of key statistical measures such as the minimum, first quartile (Q1), median (Q2), third quartile (Q3), and maximum. The length of the box indicates the spread of the middle 50% of the data, and the line inside the box represents the median (Q2). The whiskers extend from the box and represent the variability of the data beyond the quartiles; in our case, they represent Q1 − 1.5·IQR and Q3 + 1.5·IQR. Individual data points that lie outside the whiskers are considered outliers and are plotted with diamonds. The figure compares the results of six algorithms on five different datasets.

Fig. 5 shows the performance of six methods (SIFT, DHN, MHN, CLKN, DLKFM, and ours) on the SkyDataV1 dataset in terms of ACE. SkyData contains RGB and infrared image pairs. In this figure, we aim to show that feature-based registration techniques such as SIFT perform poorly, whereas methods that leverage neural networks and learn representations are superior.

Fig. 4 gives detailed qualitative results of our experiments. Each row represents a sample taken from a different dataset, and the columns represent the inputs and results for the different approaches. The target ($192 \times 192$, first column) and the source ($128 \times 128$, second column) are the input image pairs. Warped (third column) is the ground-truth projection of the source onto the coordinate system of the target image, and Registered (fourth column) is the warped image overlaid on the target image. Columns 5–10 show the registered and overlaid results of SIFT, DHN, MHN, CLKN, DLKFM, and ours (Model A) for the given input pair. While almost all algorithms perform relatively well on the Google Earth pair (which provides similar modalities for both target and source images), when the modalities are significantly different, as in the SkyData, Google Maps, and VEDAI pairs, the figure shows that the SIFT, CLKN, MHN, DHN, and DLKFM algorithms can struggle to align them and may not converge to any useful result near the ground truth (see the SIFT and CLKN results), while our approach converges to the ground truth, yielding a small ACE error for each of those sample pairs.

Table IV illustrates the results of the different approaches for each dataset separately. For the MSCOCO results in Table IV(e), a single-modality dataset, SIFT performs relatively well, but there are cases where the algorithm could not find a homography due to insufficient matching pairs. Google Earth in (c) also has RGB image pairs, but from different seasons; the SIFT algorithm is still able to pick enough salient features, so its performance remains reasonable. (d) Google Maps, (a) SkyData, and (b) VEDAI have pairs with significant modality differences. Deep-learning-based approaches were able to perform registration, often with a high number of outliers. Our approach was able to perform registration on both single and multimodal image pairs; specifically, we were able to keep the maximum error at a minimum, as opposed to the LK-based approaches.

SECTION V.

Conclusion and Discussion

In this article, we introduce a novel image alignment algorithm that we call VisIRNet. VisIRNet has two branches and does not have any stage that computes keypoints. Our experimental results show that our proposed algorithm achieves state-of-the-art results when compared to LK-based deep approaches.

Our method’s main advantages can be listed as follows.

  1. Number of iterations during inference: The above-mentioned LK-based methods (after the training stage) also iterate a number of times during the inference stage, and at each iteration they try to minimize the loss. However, those methods are not guaranteed to converge to the optimal solution, and the number of iterations, chosen as a hyperparameter, is often an arbitrary number during the inference stage. Such iterative approaches introduce uncertainty in the processing time, as convergence can happen after the first iteration in some situations and only after the last iteration in others. Such uncertainty also affects the real-time processing of images, as it can introduce varying frames-per-second values. Our method uses a single pass during inference, making it more applicable to real-time applications.

  2. Dependence on the initial H estimate: In addition to the above-mentioned difference, the LK-based algorithms require an initial estimate of the homography matrix, and their performance (and the number of iterations required for convergence) depends directly on that initial estimate of H, which is therefore typically given as input (a hyperparameter). While we also initialize the weights in our architecture, we do not need an initial estimate of the homography matrix as input to the architecture.

Image alignment on image pairs taken by different onboard cameras on UAVs is a challenging and important topic for various applications. When the images to be aligned are acquired by different modalities, the classic approaches, such as SIFT and RANSAC combination, can yield insufficient results. Deep-learning techniques can be more reliable in such situations as our results demonstrate. LK-based deep techniques have recently shown promise, however, we demonstrate with our approach (VisIRNet) that without designing any LK-based block, and by focusing only on the four corner points, we can sufficiently train deep architectures for image alignment.

ACKNOWLEDGMENT

This article has been produced benefiting from the 2232 International Fellowship for Outstanding Researchers Program of TUBITAK (Project No: 118C356). However, the entire responsibility of the article belongs to the owner of the article.

 
