Detection of SAR Image Multiscale Ship Targets in Complex Inshore Scenes Based on Improved YOLOv5
Synthetic aperture radar (SAR) can operate around the clock and in all weather, and therefore high-resolution SAR images have been widely applied to ship detection. However, current ship target detection and identification methods have limited detection accuracy and miss small targets due to speckle noise, caused by the imaging principle of SAR, and complex nearshore interference.
Therefore, this article proposes an improved YOLOv5 method to address the problem of low accuracy in multiship target detection tasks in complex scenes. The developed scheme enhances the ship target detection performance while reducing the number of parameters. Specifically,
- First, we increase the resolution of the input SAR images and optimize the anchor frames of the ship targets in the training dataset to locate small target ships more accurately.
- Then, the asymmetric pyramidal nonlocal block and the SimAM attention mechanism are introduced to reduce nearshore background interference.
- Additionally, to make the C3 module output richer in feature information, channel shuffling is performed after the C3 output to enhance the information exchange between channels.
- Finally, to reduce the number of parameters and computational cost during model training, the normal convolution in the neck part is replaced with Ghost convolution.
The F1 index of the proposed method reaches the highest values of 91.3% and 95.8% on the high-resolution SAR images dataset and the SAR ship detection dataset, respectively. The MAP (0.5:0.95) scores on both datasets are also the highest, at least 2% higher than those of the suboptimal method.
In selected inshore scenes, the ship detection performance of the proposed method outperforms current advanced methods for multiscale ships. The results show that the proposed method can extract ship features effectively in complex scenes, and its effectiveness is further validated on the large-scene AIR-SARShip-1 dataset.
Introduction
With the development of synthetic aperture radar (SAR) technology, the volume of SAR image data has exploded, and each SAR image contains more information. Therefore, SAR image target detection requires further in-depth and detailed research. In recent years, the development of deep-learning theory has led to a worldwide research boom in SAR image detection technology. Deep-learning methods can integrate the detection steps by automatically extracting important features from different targets, reducing the time required to select features and classifiers. Deep learning based on convolutional neural networks (CNNs) is considered one of the most comprehensive approaches to SAR image classification and detection. Deep CNNs extract SAR image target features with powerful feature representation capability, avoiding the limitations of handcrafted target features and significantly reducing the workload. Thus, deep CNN-based methods can effectively complete target recognition tasks in various scenes, and deep-learning methods are therefore applied to SAR image target detection. They offer the advantages of low workload, high flexibility, strong generalization ability, and high accuracy. These features reduce the difficulty of SAR image target detection and hold important research significance and practical value for agricultural and forestry management, military target monitoring, and disaster prevention.
CNN-based target detection models can be classified into two types as follows.
Two-stage detectors, such as faster R-CNN, which first generate candidate regions and then perform classification and bounding box regression on them.
One-stage detectors: YOLOv3 [1], YOLOX [2], YOLOv4 [3], YOLOv5 [4], fully convolutional one-stage object detection (FCOS) [5], and detection transformer (DETR) [6]. Among them, anchor-based detectors include YOLOv4 and YOLOv5, and anchor-free detectors include FCOS and DETR.
The above CNN target detection models comprise four main parts: the input; the backbone, which is used for image feature extraction; the neck, which enhances the utilization of the backbone-extracted features; and the head, which detects the object's class and bounding box. In the field of SAR ship detection, many algorithms have been proposed to improve the detection performance of the model. Many deep-learning-based SAR ship detection models involve attention mechanisms, which are often used for multiscale feature extraction to detect ship targets of different sizes. Yang et al. [9] proposed a robust one-stage detector (ROSD) to mitigate the interference of complex backgrounds. ROSD introduced coordinate attention and receptive field enlargement modules to locate and distinguish ship targets accurately. Moreover, Sun et al. [10] and Yang et al. [11] improved target localization performance in complex scenarios by developing category-position modules and multilayer feature attention mechanisms based on FCOS. Based on YOLOv5, Zhu et al. [12] added a prediction head to acquire more low-level features in the last part. In addition, the original prediction head was replaced by a transformer prediction head (TPH) with a self-attention mechanism to improve multiscale target detection and further strengthen the model's detection capability. A novel ship target detector called squeeze and excitation rank (SER) faster R-CNN was proposed in [13]. A multiscale feature map concatenation strategy was used to improve the quality of the shared feature maps extracted by the CNN, with detection performance further enhanced by a squeeze and excitation mechanism. In ship detection, contextual information can help the model better understand the location, size, and shape of objects. Most of the above methods capture global context information by adding attention mechanisms, thereby obtaining richer feature representations to improve detection accuracy.
Most of the deep-learning-based SAR ship detection models incorporate a feature pyramid network (FPN), which can effectively improve the ability to handle multiscale targets. A full-level context-squeezed excitation region of interest extractor was proposed in [14] to extract feature subsets at each level of the FPN to retain multiscale features. A dense attention pyramid network for SAR images was proposed in [15], where the convolutional block attention module is introduced into each branch of the pyramidal network to obtain more semantic information for multiscale ship detection. A feature aggregation enhancement pyramid network and a new method called attention perception pyramid network were proposed in [16] and [17] to improve multiscale ship detection performance in SAR images. In [18], a feature fusion network based on a taskwise attention FPN was designed to enhance multiscale feature representations. Through feature fusion, the model can simultaneously capture high-level semantic information and low-level detailed information to improve the detection of multiscale targets. Besides FPN and its improvements, there are other multifeature fusion methods. Xiao et al. [19] proposed a power transform and feature alignment bootstrap network to strengthen feature fusion, thereby improving multiscale detection capabilities. In [20], a CNN ship detection algorithm based on multiscale rotation-invariant Haar-like feature integration was proposed; this method was used for ship detection in multitarget environments in SAR images to improve detection accuracy and performance. In [21], a novel YOLO-based arbitrary-oriented SAR ship detector using bidirectional feature fusion and angular classification was proposed. The model improved information interaction in the feature maps, which is helpful for detecting multiscale ships. A saliency-guided single shot multibox detector for target detection in SAR images was designed in [22]; its dense connection structure integrates lower-level and higher-level features, introducing more context information. Multiscale fusion feature maps were used in the fully convolutional detection subnetwork in [23] to realize the fusion and comprehensive utilization of features, thereby enhancing the expression of multiscale features.
In SAR image ship detection, the complexity of the scene causes an imbalance between positive and negative samples. During training, the model more easily learns to detect large targets while ignoring small targets, increasing the missed detection rate and degrading detection performance. An efficient YOLOX-based ship detection model was developed in [24] to solve the problem of the high missed detection rate of small targets. Moreover, Zhang et al. [25] suggested a novel quad-FPN for SAR ship detection to enhance the feature extraction of small-sized ship targets. In [26], a deep ship detection method that learns from scratch was proposed, with better performance in detecting small dense ships. A large-scale SAR image ship detection method with an SSE attention module was proposed in [27] to avoid missing small targets and extract stronger semantic features. In [28], a CFAR-guided CNN was proposed to reduce the missed detection of small ship targets. In [29], a contextual-region-based CNN with multilayer fusion was proposed to improve the detection performance of small ships. In [30], a local and global context fusion module was designed to retain more shallow features to improve the detection performance of small targets.
Most deep CNN-based ship target detectors focus on detection performance and ignore computational complexity. Thus, Zhou et al. [31] proposed a lightweight anchor-free ship detection network for SAR images. A novel target detection method was proposed in [32] to reduce the time and space complexity of SAR ship detection. The 3S-YOLO network was proposed in [33] to improve the real-time performance of model detection; 3S-YOLO is a lightweight feature extraction and fusion network that preserves the model's detection accuracy. In [34], the authors augmented the YOLOv4 backbone with a lightweight module and a coordinate attention mechanism to improve detection performance. A SAR ship detection model named mask efficient adaptive network was proposed in [35]; its lightweight network structure effectively reduced the number of parameters and improved detection accuracy. Improved YOLO models were proposed in [36], [37], and [38] to achieve model compression while improving accuracy. A lightweight backbone network based on a deep dense simple attention module (SimAM) was introduced in [39]. Results demonstrate that this algorithm performs well in terms of speed and accuracy and has better robustness and real-time performance than similar detection algorithms.
Due to SAR imaging characteristics, complex ship motion may defocus the image in azimuth, which makes it difficult to detect ship targets accurately. The inshore background presents numerous interference factors, such as small and scattered buildings, containers, vehicles, short plank roads, cranes, and hatch covers, which lower the accuracy of ship target detection. Furthermore, in such complex inshore areas, ship targets of multiple sizes, namely multiscale targets, further increase the difficulty of detecting ships effectively. Meanwhile, some dense small targets are easily ignored and missed. These factors make ship detection more complex and challenging. To adapt to multiscale ships, algorithms must have rich feature representation capabilities to handle multiscale target detection. At the same time, for scenes with small ship targets, detection algorithms need to be more sensitive to small targets. In addition, the large number of parameters in the network structure consumes considerable computing resources, making such networks unsuitable for practical scenarios. Although current deep-learning-based SAR ship detection methods have achieved impressive results, problems and challenges remain in achieving high detection accuracy in such complex scenes. Multiscale feature extraction methods may not integrate feature information at different levels effectively; during the actual fusion process, features at different scales cannot be strictly aligned. Moreover, they cannot alleviate the concealment of the details of small targets by high-level semantic information. The small ship target detection methods above may not be valid for multiscale targets in the same or different scenes. In addition, the balance between model computational complexity and performance remains to be investigated.
Aiming at the above problems, this article develops an improved YOLOv5s method to address the problem of low accuracy of ship target detection in complex inshore scenes. The main contributions of this article can be summarized as follows.
- In order to locate small ship targets more accurately and reduce missed detections, we modify the sizes of the input SAR images and optimize the anchor frames of the ship targets in the training dataset.
- The asymmetric pyramidal nonlocal block and the SimAM attention mechanism are introduced to reduce nearshore background interference. Channel shuffling is performed after the C3 output to enhance the information exchange between channels, which helps to obtain more feature information, strengthens the model's multiscale target detection capability, and further improves detection accuracy.
- Considering the number of parameters and computational cost during model training, the normal convolution in the neck part is replaced with Ghost convolution to make the model structure lighter.
- Experimental results on the high-resolution SAR images dataset (HRSID) and the SAR ship detection dataset (SSDD) show that ship targets can be effectively detected in different scenarios; in particular, the detection performance for multiple targets in complex scenes is superior to other advanced methods. Moreover, the generalization of the proposed method is verified on the large-scene AIR-SARShip-1 dataset.
The rest of this article is organized as follows. Section II presents the YOLOv5 network, and Section III introduces the improved YOLOv5 framework. Section IV evaluates the proposed method on the HRSID, SSDD, and AIR-SARShip-1 datasets, demonstrating its effectiveness for ship target recognition in SAR images. Section V presents discussions. Finally, Section VI concludes this article.
Related Work
A. YOLOv5
YOLOv5 is a real-time target detection algorithm that inherits the advantages of the YOLO series and optimizes them in many ways. Its main advantages include real-time performance, high accuracy, scalability, ease of use, automatic data augmentation, and multiscale prediction. Therefore, we employ YOLOv5s as an example for a detailed introduction. The structure of YOLOv5s is illustrated in Fig. 1 and comprises four parts. The first part is the input, where the image is input and preprocessed. The second part is the backbone, used for image feature extraction. The third part is the neck, which enhances the utilization of the backbone-extracted features. The last part is the head, which predicts the class and bounding box of the object.
The input of YOLOv5 is optimized with Mosaic data augmentation, adaptive anchor frame calculation, and adaptive image scaling. Mosaic data augmentation effectively enriches the dataset and reduces GPU usage by stitching four images with random scaling, random cropping, and random arrangement. YOLOv5 reduces training time and speeds up inference by adaptively computing the optimal anchor frames for the SAR image dataset and adaptively adding minimal black borders to unify the input image size.
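To make the stitching step concrete, the following is a minimal Python sketch of Mosaic composition; the function name mosaic4, the gray fill value 114, and the per-quadrant resize are illustrative assumptions, and the label/box remapping that a real training pipeline also performs is omitted.

```python
import random
import cv2
import numpy as np

def mosaic4(images, out_size=640):
    """Stitch four HxWx3 uint8 images around a random center on one canvas."""
    canvas = np.full((out_size, out_size, 3), 114, dtype=np.uint8)  # gray fill
    cx = random.randint(out_size // 4, 3 * out_size // 4)           # random center x
    cy = random.randint(out_size // 4, 3 * out_size // 4)           # random center y
    # Target regions for the four quadrants: (x1, y1, x2, y2).
    regions = [(0, 0, cx, cy), (cx, 0, out_size, cy),
               (0, cy, cx, out_size), (cx, cy, out_size, out_size)]
    for img, (x1, y1, x2, y2) in zip(images, regions):
        # On-the-fly scaling of each source image into its quadrant.
        canvas[y1:y2, x1:x2] = cv2.resize(img, (x2 - x1, y2 - y1))
    return canvas
```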
The backbone comprises focus, C3, and spatial pyramid pooling (SPP) modules. The input image is sliced by the focus module, with the slicing operation presented in Fig. 2. The focus structure slices the input image, performs double downsampling and quadruple channel expansion, and produces the final feature map by convolution.
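A minimal PyTorch sketch of this slicing, modeled on the public YOLOv5 focus design (the kernel size and SiLU activation are assumed defaults, not taken from the paper):

```python
import torch
import torch.nn as nn

class Focus(nn.Module):
    """Slice the input into four pixel-interleaved sub-images (2x spatial
    downsampling, 4x channel expansion), concatenate, and fuse by convolution."""
    def __init__(self, c_in, c_out, k=3):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(4 * c_in, c_out, k, padding=k // 2, bias=False),
            nn.BatchNorm2d(c_out), nn.SiLU())

    def forward(self, x):  # x: (B, C, H, W) -> (B, c_out, H/2, W/2)
        return self.conv(torch.cat([x[..., ::2, ::2], x[..., 1::2, ::2],
                                    x[..., ::2, 1::2], x[..., 1::2, 1::2]], dim=1))
```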
The C3 structure is widely used in the backbone and neck, with its structure illustrated in Fig. 3. Compared with the traditional residual module, the C3 structure has significant advantages: it is divided into two parts, a stack of base modules and a feature mapping branch, which reduces the computational effort during training and enables richer gradient combinations.
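The following sketch mirrors this two-branch design, loosely following the public YOLOv5 implementation; channel widths, activation, and bottleneck count are assumed defaults rather than the paper's configuration.

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    # Base module of C3: 1x1 conv then 3x3 conv, with an optional shortcut.
    def __init__(self, c, shortcut=True):
        super().__init__()
        self.cv1 = nn.Sequential(nn.Conv2d(c, c, 1, bias=False),
                                 nn.BatchNorm2d(c), nn.SiLU())
        self.cv2 = nn.Sequential(nn.Conv2d(c, c, 3, padding=1, bias=False),
                                 nn.BatchNorm2d(c), nn.SiLU())
        self.add = shortcut

    def forward(self, x):
        y = self.cv2(self.cv1(x))
        return x + y if self.add else y

class C3(nn.Module):
    # Two parallel branches -- a bottleneck stack and a plain 1x1 mapping --
    # concatenated and fused: the richer gradient combination described above.
    def __init__(self, c_in, c_out, n=1):
        super().__init__()
        c_half = c_out // 2
        self.cv1 = nn.Sequential(nn.Conv2d(c_in, c_half, 1, bias=False),
                                 nn.BatchNorm2d(c_half), nn.SiLU())
        self.cv2 = nn.Sequential(nn.Conv2d(c_in, c_half, 1, bias=False),
                                 nn.BatchNorm2d(c_half), nn.SiLU())
        self.m = nn.Sequential(*(Bottleneck(c_half) for _ in range(n)))
        self.cv3 = nn.Sequential(nn.Conv2d(2 * c_half, c_out, 1, bias=False),
                                 nn.BatchNorm2d(c_out), nn.SiLU())

    def forward(self, x):
        return self.cv3(torch.cat((self.m(self.cv1(x)), self.cv2(x)), dim=1))
```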
Fig. 4 depicts the SPP module in backbone. First, the backbone outputs the feature maps through four maximum pooling layers at different scales to obtain rich contextual features and multiscale information. Then, these pooled feature maps are stitched through concatenation to form a more expressive and comprehensive feature map. This structure assists in capturing the target's information at different scales and improves the model's performance in target detection tasks.
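In the public YOLOv5 SPP, the parallel paths are typically three stride-1 max-pooling branches (kernels 5, 9, and 13) plus the identity branch; a minimal sketch under that assumption:

```python
import torch
import torch.nn as nn

class SPP(nn.Module):
    """Parallel max pooling at several kernel sizes (stride 1, 'same' padding
    keeps the spatial size), concatenated with the input branch and fused."""
    def __init__(self, c_in, c_out, kernels=(5, 9, 13)):
        super().__init__()
        c_half = c_in // 2
        self.cv1 = nn.Sequential(nn.Conv2d(c_in, c_half, 1, bias=False),
                                 nn.BatchNorm2d(c_half), nn.SiLU())
        self.pools = nn.ModuleList(
            nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2) for k in kernels)
        self.cv2 = nn.Sequential(
            nn.Conv2d(c_half * (len(kernels) + 1), c_out, 1, bias=False),
            nn.BatchNorm2d(c_out), nn.SiLU())

    def forward(self, x):
        x = self.cv1(x)
        # Identity branch plus the pooled branches, stitched by concatenation.
        return self.cv2(torch.cat([x] + [p(x) for p in self.pools], dim=1))
```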
The neck structure of YOLOv5 comprises a path aggregation network (PAN) and an FPN. The neck aims to fully use the features extracted by the backbone network to improve the target detection performance. The neck's structure is depicted in Fig. 5, highlighting that the FPN structure in the neck is combined with the PAN structure to enhance feature fusion capability. FPN upsamples and fuses feature images at different scales by a top-down approach to deliver rich semantic features, which helps the detection network to handle targets of different sizes. At the same time, PAN obtains more accurate target location information in the high-level feature map by passing the localization information of the image from the bottom upwards, allowing the network to obtain more accurate target location information, which helps improve detection accuracy, especially for small and dense targets. Thus, YOLOv5 attains higher accuracy and robustness when dealing with targets of different sizes and densities.
The head section in YOLOv5 is responsible for the final target detection and localization tasks, using a mechanism similar to anchor boxes to associate each prediction box with a predefined size and scale. The prediction box outputs the category probability of the target, the bounding box coordinates, and the confidence level of a target's presence. These predictions are threshold-filtered and nonmaximum suppressed to determine the detected targets, locations, and classes. The loss function of YOLOv5 consists of two parts: the categorical loss function and the bounding box regression loss function. The binary cross-entropy loss function (BCELoss) is used as the YOLOv5 categorical loss, and the CIoU loss is used for bounding box regression. The BCELoss is defined as
$$L_{\mathrm{BCE}} = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)\right]$$
where $N$ is the number of samples, $y_i$ is the ground-truth label, and $\hat{y}_i$ is the predicted probability.
The CIoU loss function is written as
$$L_{\mathrm{CIoU}} = 1 - \mathrm{IoU} + \frac{\rho^{2}\left(b, b^{gt}\right)}{c^{2}} + \alpha v$$
where $\mathrm{IoU}$ is the intersection over union of the predicted box $b$ and the ground-truth box $b^{gt}$, $\rho(\cdot)$ is the Euclidean distance between their center points, $c$ is the diagonal length of the smallest box enclosing both, $v = \frac{4}{\pi^{2}}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^{2}$ measures the consistency of aspect ratios, and $\alpha = \frac{v}{(1 - \mathrm{IoU}) + v}$ is a trade-off coefficient.
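A PyTorch sketch of this loss for corner-format (x1, y1, x2, y2) boxes follows; the function name ciou_loss and the eps stabilizer are illustrative, not the paper's code.

```python
import math
import torch

def ciou_loss(box1, box2, eps=1e-7):
    """CIoU loss per the formula above. box1, box2: tensors of shape (N, 4)."""
    # Intersection and union.
    ix1, iy1 = torch.max(box1[:, 0], box2[:, 0]), torch.max(box1[:, 1], box2[:, 1])
    ix2, iy2 = torch.min(box1[:, 2], box2[:, 2]), torch.min(box1[:, 3], box2[:, 3])
    inter = (ix2 - ix1).clamp(0) * (iy2 - iy1).clamp(0)
    w1, h1 = box1[:, 2] - box1[:, 0], box1[:, 3] - box1[:, 1]
    w2, h2 = box2[:, 2] - box2[:, 0], box2[:, 3] - box2[:, 1]
    union = w1 * h1 + w2 * h2 - inter + eps
    iou = inter / union
    # Squared center distance rho^2 and enclosing-box diagonal c^2.
    rho2 = ((box1[:, 0] + box1[:, 2] - box2[:, 0] - box2[:, 2]) ** 2 +
            (box1[:, 1] + box1[:, 3] - box2[:, 1] - box2[:, 3]) ** 2) / 4
    cw = torch.max(box1[:, 2], box2[:, 2]) - torch.min(box1[:, 0], box2[:, 0])
    ch = torch.max(box1[:, 3], box2[:, 3]) - torch.min(box1[:, 1], box2[:, 1])
    c2 = cw ** 2 + ch ** 2 + eps
    # Aspect-ratio consistency term v and trade-off coefficient alpha.
    v = (4 / math.pi ** 2) * (torch.atan(w2 / (h2 + eps)) -
                              torch.atan(w1 / (h1 + eps))) ** 2
    alpha = v / (1 - iou + v + eps)
    return 1 - iou + rho2 / c2 + alpha * v
```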
B. Selection of the YOLOv5 Model
In YOLOv5, the number of parameters is scaled up or down via the depth-multiple and width-multiple controls, leading to four different YOLOv5 versions: YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x. The depth-multiple and width-multiple values are reported in Table I.
In order to obtain the best detection performance on SAR image ship targets, these four YOLOv5 variants were experimentally compared and analyzed on HRSID. The corresponding experimental results are reported in Table II, where P represents precision, R is recall, AP0.5 denotes the average precision at an IOU threshold of 0.5, and MAP (0.5:0.95) represents the average AP over IOU thresholds from 0.5 to 0.95 in steps of 0.05.
Table II suggests that as the width and depth of the YOLOv5 model increase, the number of parameters and network layers increases rapidly while the detection of SAR image ship targets is affected much less. The differences in P and MAP among YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x are insignificant, but YOLOv5s attains the highest R and AP0.5 of the four models. Moreover, YOLOv5s has the lowest number of parameters, reducing the network's training time. Therefore, this article performs network optimization based on the YOLOv5s model.
Improved YOLOv5
A. Proposed Method
Current ship target detection and identification methods suffer from reduced detection accuracy and missed detection of small target ships due to speckle noise and coastline interference caused by the imaging principle of SAR images. Therefore, this article proposes an improved YOLOv5s method to address the problem of low accuracy of multiship target detection in complex scenes. Specifically, first, by increasing the resolution of the input SAR images and optimizing the anchor frames of the ship targets in the training dataset, the detection performance of the model for small targets is improved, locating small target ships more accurately and reducing the missed detection rate. Then, the asymmetric pyramidal nonlocal block (APNB) and SimAM attention mechanisms are introduced to reduce shoreline background interference and locate ship targets more accurately; these two attention mechanisms help the model better focus on the target area in complex backgrounds, further improving detection accuracy. In addition, to make the output of the C3 module richer in feature information, the C3 output is channel-shuffled to enhance the exchange of information between channels; channel shuffling allows the model to use features from multiple channels, improving feature representation and thus helping the model detect multiscale and diverse targets. Finally, to reduce the number of parameters and computational cost during model training, the normal convolution in the neck part is replaced with Ghost convolution, which effectively reduces the model's number of parameters and computational complexity while maintaining high performance; hence, the model can achieve good detection results even with limited computational resources. The improved YOLOv5s model proposed in this article is depicted in Fig. 6. By improving the resolution, optimizing the anchor frames, introducing the attention mechanisms, adding channel shuffling, and adopting Ghost convolution, the model is comprehensively improved in terms of detection accuracy, missed detection rate, feature expression capability, and computational cost.
B. Image Preprocessing Before Input
In target detection tasks, large feature maps are usually used to detect small targets and vice versa. Therefore, on a large feature map the anchor values are small to detect small targets, and on a small feature map they are large to detect large targets. The initial anchors in the YOLOv5 model are those designed for the COCO dataset. However, the COCO dataset contains targets with large size differences, so the anchors generated from it are unsuitable for detecting SAR ship targets, whose size differences are small. Hence, to better detect and identify SAR ship targets, the ship target anchor sizes in HRSID must be optimized. The detection performance of the YOLOv5 algorithm is improved by applying k-means clustering to all labeled ship targets in HRSID to obtain the most suitable anchors for ship targets. The anchors before and after optimization are illustrated in Fig. 7.
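A minimal sketch of this anchor fitting is shown below, assuming the 1 − IoU distance commonly used for YOLO anchor clustering; the function name kmeans_anchors, k = 9 clusters, and the co-centered IoU are illustrative assumptions.

```python
import numpy as np

def kmeans_anchors(wh, k=9, iters=100, seed=0):
    """k-means over (width, height) of all labeled boxes, assigning each box
    to the center with the highest IoU (boxes treated as co-centered).
    wh: array of shape (N, 2) in pixels; returns k anchors sorted by area."""
    rng = np.random.default_rng(seed)
    centers = wh[rng.choice(len(wh), k, replace=False)]
    for _ in range(iters):
        # Pairwise IoU between every box and every center.
        inter = np.minimum(wh[:, None, 0], centers[None, :, 0]) * \
                np.minimum(wh[:, None, 1], centers[None, :, 1])
        union = wh[:, 0:1] * wh[:, 1:2] + centers[:, 0] * centers[:, 1] - inter
        assign = np.argmax(inter / union, axis=1)          # nearest center by IoU
        new = np.array([wh[assign == i].mean(axis=0) if np.any(assign == i)
                        else centers[i] for i in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return centers[np.argsort(centers.prod(axis=1))]       # small -> large
```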
The backbone and neck parts of the network are responsible for feature extraction in YOLOv5. The backbone extracts features through 3 × 3 convolutions, and its output features pass through the SPP structure for feature extraction with different receptive fields. Since the backbone uses 3 × 3 convolutions with a fixed receptive field, problems such as small target loss and difficult position localization arise easily when the input image resolution is too small. When a SAR image containing a small target is resized to 640 × 640 and downsampled 32 times by convolution operations, the position information of the small target in the original image may be lost, and only its abstract information is retained. If the input SAR image is enlarged beyond 640 × 640, the position information of the small target can still be perceived by YOLOv5 after multiple convolutional downsampling steps, better locating the small target. Therefore, increasing the input image resolution enlarges small targets, alleviates the loss of small target location information, and improves the detection accuracy of small targets.
C. Attentional Mechanisms
The background of high-resolution SAR images is very complex, and for ship targets near the shore there remains a high number of missed detections and false alarms during ship detection. Therefore, to improve the detection capability of YOLOv5 near the shore, we introduce the APNB and SimAM attention mechanisms [40]. A convolution kernel extracts features only in a local region, while APNB obtains the weight relationship between the target to be recognized and other regions globally.
Therefore, APNB is added to the last layer of the backbone part to improve the model's feature extraction capability for targets in complex backgrounds. In the complex nearshore context, APNB is more conducive to the model to identify ship targets and nearshore false targets, reducing the number of missed detections and improving the accuracy of detection and identification. The detailed block diagram of APNB is depicted in Fig. 8.
Therefore, the general formula for nonlocal operations is written as
$$y_i = \frac{1}{\mathcal{C}(x)} \sum_{\forall j} f\left(x_i, x_j\right) g\left(x_j\right)$$
where $x$ is the input feature map, $i$ is the index of the output position, $j$ enumerates all positions, $f(\cdot, \cdot)$ computes the similarity between positions $i$ and $j$, $g(\cdot)$ computes a representation of the feature at position $j$, and $\mathcal{C}(x)$ is a normalization factor.
To convert the nonlocal operation to APNB, the key and value positions are subsampled by pyramid pooling so that only a small set $S$ of anchor points participates:
$$y_i = \frac{1}{\mathcal{C}(x)} \sum_{\forall j \in S} f\left(x_i, \hat{x}_j\right) g\left(\hat{x}_j\right)$$
where $\hat{x}$ denotes the features sampled from $x$ by pyramid pooling, which reduces the computational complexity from $O(N^2)$ to $O(NS)$ with $S \ll N$.
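A hedged PyTorch sketch of such an asymmetric nonlocal block follows; the pool sizes (1, 3, 6, 8), the channel reduction, and the residual fusion are common choices from the asymmetric nonlocal literature, not necessarily the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class APNB(nn.Module):
    """Query keeps full resolution; key/value are subsampled to a few anchor
    points by pyramid pooling, so attention costs O(N*S) instead of O(N^2)."""
    def __init__(self, c, pool_sizes=(1, 3, 6, 8)):
        super().__init__()
        c_half = c // 2
        self.q = nn.Conv2d(c, c_half, 1)
        self.k = nn.Conv2d(c, c_half, 1)
        self.v = nn.Conv2d(c, c_half, 1)
        self.out = nn.Conv2d(c_half, c, 1)
        self.pool_sizes = pool_sizes

    def _pyramid_sample(self, x):
        # Sample S = sum(s*s) anchor points with adaptive average pooling.
        b, c = x.shape[:2]
        feats = [F.adaptive_avg_pool2d(x, s).reshape(b, c, -1)
                 for s in self.pool_sizes]
        return torch.cat(feats, dim=2)                       # (B, C', S)

    def forward(self, x):
        b, _, h, w = x.shape
        q = self.q(x).reshape(b, -1, h * w).transpose(1, 2)  # (B, N, C')
        k = self._pyramid_sample(self.k(x))                  # (B, C', S)
        v = self._pyramid_sample(self.v(x)).transpose(1, 2)  # (B, S, C')
        attn = torch.softmax(q @ k, dim=-1)                  # (B, N, S)
        y = (attn @ v).transpose(1, 2).reshape(b, -1, h, w)  # (B, C', H, W)
        return x + self.out(y)                               # residual fusion
```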
Nevertheless, the currently used modules, namely channel attention, spatial attention, and hybrid attention, suffer from two problems. First, channel and spatial attention can obtain better features only along one dimension, channel or space, and the spatial variation of the two dimensions in hybrid attention lacks flexibility. Second, these modules add a certain amount of computation when incorporated into the model for training. Therefore, we introduce the SimAM attention mechanism into the C3 module of YOLOv5. SimAM infers three-dimensional (3-D) attention weights for the feature maps without increasing the network parameters, obtaining better features. The detailed block diagram of SimAM is presented in Fig. 9.
SimAM defines an energy function for each neuron as
$$e_t\left(w_t, b_t, \mathbf{y}, x_i\right) = \left(y_t - \hat{t}\right)^2 + \frac{1}{M-1} \sum_{i=1}^{M-1}\left(y_o - \hat{x}_i\right)^2$$
where $t$ is the target neuron, $x_i$ are the other neurons in the same channel, $\hat{t} = w_t t + b_t$ and $\hat{x}_i = w_t x_i + b_t$ are their linear transforms, $M$ is the number of neurons in the channel, and $y_t$ and $y_o$ are two different labels (taken as 1 and −1). The equations for $w_t$ and $b_t$ admit a closed-form solution in terms of the channel mean $\hat{\mu}$ and variance $\hat{\sigma}^2$. After substituting this solution back, the minimal energy becomes
$$e_t^{*} = \frac{4\left(\hat{\sigma}^2 + \lambda\right)}{\left(t - \hat{\mu}\right)^2 + 2\hat{\sigma}^2 + 2\lambda}$$
where $\lambda$ is a regularization coefficient. The lower the value of $e_t^{*}$, the more the neuron $t$ is distinguished from the surrounding neurons and the more important it is; the feature map is therefore refined as $\tilde{X} = \operatorname{sigmoid}\left(1/E\right) \odot X$, where $E$ groups all $e_t^{*}$ over the channel and spatial dimensions.
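This refinement is compact enough to state in code; the following parameter-free sketch follows the published SimAM formulation, with λ = 1e-4 as an assumed default.

```python
import torch
import torch.nn as nn

class SimAM(nn.Module):
    """Compute 1/e_t* per neuron from the channel-wise mean and variance
    (formula above) and rescale the feature map by its sigmoid."""
    def __init__(self, lam=1e-4):
        super().__init__()
        self.lam = lam

    def forward(self, x):                       # x: (B, C, H, W)
        b, c, h, w = x.shape
        n = h * w - 1                           # number of "other" neurons, M - 1
        d = (x - x.mean(dim=(2, 3), keepdim=True)).pow(2)
        var = d.sum(dim=(2, 3), keepdim=True) / n
        # Inverse energy up to constants that cancel inside the sigmoid scaling.
        inv_energy = d / (4 * (var + self.lam)) + 0.5
        return x * torch.sigmoid(inv_energy)
```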
SimAM is added to C3, the most used module in YOLOv5. Also, to make the feature information in the C3 output richer, we shuffle the C3 output channels to enhance the information flow between channels. The flow chart of channel shuffling is illustrated in Fig. 10. First, the channels of the feature map are divided into N groups; the groups are then transposed after a reshape operation transforms the dimensions, and finally the transposed channels are flattened. The improved C3 module is depicted in Fig. 11.
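A minimal sketch of the shuffle itself (reshape, transpose, flatten), as used in ShuffleNet-style designs:

```python
import torch

def channel_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
    """Split channels into groups, transpose the group axis with the per-group
    channel axis, then flatten, exchanging information across channel groups."""
    b, c, h, w = x.shape
    assert c % groups == 0
    x = x.reshape(b, groups, c // groups, h, w)   # (B, G, C/G, H, W)
    x = x.transpose(1, 2).contiguous()            # (B, C/G, G, H, W)
    return x.reshape(b, c, h, w)                  # flatten back to (B, C, H, W)
```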
D. Neck Lightweight
Although the improved feature extraction module improves the detection performance for small targets in SAR images, the number of parameters and the computational cost of model training also increase. Therefore, we replace the normal convolution in the neck part with Ghost convolution to extract better deep features while reducing the number of parameters and computational cost. The Ghost convolution module [42] was proposed by Huawei Noah's Ark Lab and published at CVPR 2020; its principle is presented in Fig. 12. Compared with a general convolution operation, the Ghost module first uses a 1 × 1 convolution to compress the channels of the incoming feature map and then uses cheap linear operations to generate more feature maps. Finally, the results of the 1 × 1 convolution and the linear operations are stacked to obtain a new feature map.
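A hedged sketch of such a Ghost module follows, with the depthwise kernel size and the 50/50 split between intrinsic and ghost maps as assumed defaults; the depthwise convolution plays the role of the cheap linear operation.

```python
import torch
import torch.nn as nn

class GhostConv(nn.Module):
    """1x1 conv produces a compressed set of intrinsic feature maps, a cheap
    depthwise conv generates the 'ghost' maps, and the two are concatenated."""
    def __init__(self, c_in, c_out, dw_k=5):
        super().__init__()
        c_half = c_out // 2
        self.primary = nn.Sequential(nn.Conv2d(c_in, c_half, 1, bias=False),
                                     nn.BatchNorm2d(c_half), nn.SiLU())
        self.cheap = nn.Sequential(
            nn.Conv2d(c_half, c_half, dw_k, padding=dw_k // 2,
                      groups=c_half, bias=False),           # depthwise = cheap op
            nn.BatchNorm2d(c_half), nn.SiLU())

    def forward(self, x):
        y = self.primary(x)
        return torch.cat([y, self.cheap(y)], dim=1)         # (B, c_out, H, W)
```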
Ship Target Detection Results
A. Datasets and Evaluation Metrics
HRSID was released in January 2020 by the University of Electronic Science and Technology of China [43]. HRSID is a dataset for ship detection and instance segmentation in high-resolution SAR images. It contains 5604 images and 16 951 ships from Sentinel-1 and TerraSAR-X, cropped to 800 × 800 pixels. The image resolutions are 0.5, 1, and 3 m, and the polarizations cover HH, HV, and VV. In the experiments on HRSID, we set the batch size to 16, the number of epochs to 300, and the ratio of training set to test set to 13:7.
SSDD has 1160 images and 2456 ship targets, with an image size of approximately 600 × 600 pixels [46] and an average of 2.12 ships per image. During training, images with labels ending in 1 and 9 are used as the validation set and the rest as the training set.
The high-resolution SAR ship dataset AIR-SARShip-1 has 31 images with an image size of about 3000 × 3000 pixels. Image resolutions include 1 and 3 m, and the scene types include ports and islands, with different sea state levels on the sea surface. The targets cover more than 10 types of ships, such as transport ships, oil tankers, and fishing vessels, nearly 1000 ships in total [45].
In this article, the evaluation indexes include accuracy, precision rate (P), recall rate (R), F1, and MAP, which comprehensively evaluate the performance of the proposed method in SAR image ship target detection. The average precision AP is the area under the PR curve and can be used to evaluate the overall performance of the detector [44]. Since ship is the only target class, the average precision AP equals the class-averaged MAP; the higher the MAP value, the better the performance of the detector. AP0.5 and AP0.75 denote the AP at IOU thresholds of 0.5 and 0.75, respectively. MAP (0.5:0.95) represents the average over IOU thresholds from 0.5 to 0.95 with an interval of 0.05. Precision rate (P), recall rate (R), F1, and AP are expressed as
$$P = \frac{TP}{TP + FP}, \quad R = \frac{TP}{TP + FN}, \quad F1 = \frac{2 \times P \times R}{P + R}, \quad \mathrm{AP} = \int_{0}^{1} P(R)\, dR$$
where $TP$, $FP$, and $FN$ denote the numbers of true positives, false positives, and false negatives, respectively.
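As a small worked example of these formulas (AP itself requires the full PR curve and is omitted here), the following sketch reproduces the P, R, and F1 values reported later in Table XI; the function name detection_metrics is illustrative.

```python
def detection_metrics(tp: int, fp: int, fn: int):
    """Compute P, R, and F1 from true positives, false positives, and
    false negatives, following the formulas above."""
    p = tp / (tp + fp) if tp + fp else 0.0      # precision
    r = tp / (tp + fn) if tp + fn else 0.0      # recall
    f1 = 2 * p * r / (p + r) if p + r else 0.0  # harmonic mean of P and R
    return p, r, f1

# Example: 45 correct detections, 2 false alarms, 5 missed ships (Table XI).
print(detection_metrics(45, 2, 5))  # ~ (0.957, 0.900, 0.928)
```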
The experimental setup is presented in Table III.
B. Detection Results for Different Input Image Sizes
The experimental results based on YOLOv5s with different input image sizes are shown in Table IV. In the training results on HRSID, R and AP0.5 reach their highest values of 90.1% and 94.3%, respectively, when the input image size is 800 × 800, which are 1.9% and 0.5% higher than with the original input of 640 × 640. In the training results on SSDD, P and AP0.5 reach their highest values of 96.7% and 97.4%, respectively, at 800 × 800, which are 1.1% and 0.2% higher than with the original 640 × 640 input. When processing low-resolution SAR inputs of 480 × 480, AP0.5 and AP0.75 are clearly low for both datasets. MAP for both datasets shows an increasing trend as the input SAR image size increases from 480 × 480 to 800 × 800 but decreases evidently as the size grows from 800 × 800 to 960 × 960. The experiments show that small input image sizes easily lead to missed small targets and are not conducive to detecting more ship targets. Therefore, in the subsequent experiments, we set the input image size of the improved method to 800 × 800 to improve the detection performance for small targets.
C. Ablation Experiments
To verify the effect of each module of the proposed method on ship detection, ablation experiments were conducted on the HRSID dataset with an input size of 800 × 800; the results are shown in Table V. Compared to YOLOv5s, the recall rate of improvement 1 is reduced by 1.7%, while the other metrics improve. This can be attributed to the adaptive anchor box optimization, which generates anchor boxes more suitable for ship targets, thereby enhancing the detection capability for small-sized ships. Compared to improvement 1, improvement 2 increases the recall rate by 1.8%, a significant enhancement over the original method. By capturing global information, the introduction of APNB enables improvement 2 to effectively improve the accuracy of ship detection in complex nearshore backgrounds and reduce the number of missed ship detections. To obtain the 3-D features of the image without increasing the parameter quantity, SimAM attention is introduced on top of improvement 2 in improvement 3. Compared with improvement 2, the recall rate of improvement 3 decreases by 2.6%, but the precision rate increases by 2.3%, significantly reducing the number of falsely identified ship targets. Inspired by ShuffleNet, improvement 4 adds channel shuffling on the basis of improvement 3 to enhance the flow of channel information between the C3 output features. Compared with improvement 3, improvement 4 maintains high precision while improving the recall rate and AP. The proposed method introduces Ghost convolution into the neck part of improvement 4 and achieves the highest F1, AP0.5, AP0.75, and MAP. The reason is that the neck part of YOLOv5 fuses features at different scales, and the feature reuse mechanism of Ghost convolution may help fuse features from different levels more effectively, improving the detection of objects at different scales. At the same time, the parameter quantity decreases relative to improvement 4. In summary, the proposed method is effective in improving ship detection performance, and the interaction between the various modules further enhances the network's performance.
D. Analysis of Detection Results Based on CNN Methods
To verify the performance of the proposed method, we conducted experimental comparative analyses on two different datasets with six CNN-based methods: faster R-CNN, YOLOv4, YOLOv5s, TPH-YOLOv5, DETR, and FCOS. First, we compare the proposed method with faster R-CNN, a two-stage detector. Combined with an RPN, it performs well in small object detection tasks with high accuracy and recall, effectively detecting and locating small ship targets. Then, YOLOv4, YOLOv5s, and TPH-YOLOv5, which use a single-stage anchor mechanism to achieve a good balance between real-time performance and detection accuracy, are compared with the proposed method. TPH-YOLOv5 improves detection accuracy by adding a TPH and integrating CBAM into YOLOv5, which is especially suitable for multitarget detection tasks. DETR and FCOS are typical single-stage anchor-free models. DETR provides end-to-end detection for SAR ship detection; combined with the transformer model, it can effectively capture the global contextual information of the entire image and handles a small number of multiscale targets well. FCOS is an anchor-free target detection algorithm based on a fully convolutional network that predicts targets in a per-pixel fashion, enabling efficient and accurate target detection in SAR ship detection. We compare the proposed method with these classical and advanced algorithms to verify the performance of the improved model.
Based on HRSID, the precision, recall, AP0.5, AP0.75, MAP (0.5:0.95), and parameter size of the proposed method and the other CNN methods are shown in Table VI. Compared with the YOLOv5s network, the proposed method achieves improvements of different degrees in P, R, F1, AP0.5, AP0.75, and MAP while only slightly increasing the model parameters, improving ship target detection performance. Compared with the other CNN methods, the proposed method achieves the highest P, F1, AP0.5, AP0.75, and MAP (0.5:0.95) with the fewest parameters. The recall rate of the proposed method is 89.1%, which is 1.1% lower than that of FCOS but significantly better than the recall rates of the other CNN methods. The proposed method improves the recall value and reduces missed detections while maintaining high precision, demonstrating excellent performance in ship detection. The TPH-YOLOv5 method enhances detection performance by fusing low-level features with deep-level features and introducing CBAM to strengthen features; however, this increases the parameter volume and computational cost, making real-time implementation infeasible. The DETR and FCOS methods have lower MAP (0.5:0.95) on HRSID, corresponding to their low precision. The DETR method performs poorly in detecting ships when extracting deep-level features and inputting them into the transformer due to the single-target nature of the dataset. In addition, the lack of anchor points in FCOS results in lower precision given the complex nearshore backgrounds present in HRSID.
To verify the robustness and generalization ability of the proposed method, we conducted ship detection experiments on SSDD; the results are shown in Table VII. The precision, recall rate, and MAP of the improved method reach 95.2%, 96.4%, and 66.9%, respectively, which are significantly better than those of the other CNN methods. This indicates that our method can fully extract ship target features and reduce missed alarms. However, the proposed method's AP0.5 and AP0.75 values are slightly lower than those of TPH-YOLOv5, mainly because the feature extraction capability of the proposed method is slightly weaker than that of TPH-YOLOv5, which has more network parameters, on the smaller number of images in SSDD. DETR and FCOS have significantly lower recall rates than the other methods, resulting in lower MAP. This is because both DETR and FCOS adopt an anchor-free design; although this can improve positioning accuracy in some cases, it may struggle to capture all targets in complex scenes.
To further demonstrate the advantages of the proposed method, we compared the detection results of different methods on the HRSID and SSDD datasets. We selected various types of targets from HRSID and SSDD, including dense targets in complex nearshore backgrounds, single targets in complex nearshore backgrounds, and multiple targets in simple backgrounds, as shown in Fig. 13. Images 1–4 are from HRSID, with detection results shown in Figs. 14–17, respectively. Images 5–7 are from SSDD, with detection results shown in Figs. 18–20, respectively. The blue boxes represent the actual label positions, while the red boxes represent the positions detected by the CNN-based methods.
As shown in Figs. 14–17, several competitor methods produce missed detections or false alarms when detecting multiple ship targets in complex scenes. In contrast, the proposed method can accurately detect targets in some complex scenarios. In simple scenes, multiple methods can effectively detect ship targets. The specific detection results of ship targets under the different methods are reported in Table VIII.
Note that the analysis of the results is based on HRSID for the multitargets in the complex shore background of image 1, the multitargets in the complex shore background of image 2, and the single target in the complex shore background of image 3. For images 1–3, faster R-CNN, YOLOv4, and DETR have poor multitarget detection capabilities in complex scenes. In image 1, all ship targets are missed by YOLOv4, while faster R-CNN misses 11. The reason is that they use the anchor settings of the COCO dataset and lack data augmentation strategies for small targets; furthermore, complex backgrounds and dense target distributions affect the target detection performance of these two methods. As the basic structure of DETR, the transformer has superior performance in some scenarios, but it may not be effective for ship target detection in the complex backgrounds of high-resolution SAR images. The main reason is that the transformer may not fully capture the close relationships between targets when dealing with complex backgrounds, resulting in lost relationships between targets. Compared to the other methods, YOLOv5 and FCOS perform better, but some false positives remain. Both the proposed method and TPH-YOLOv5 perform well in image 1.
However, our method has better overall performance. In image 2, all ship targets are accurately detected by the proposed method, while TPH-YOLOv5 produces false alarms and missed detections of ship targets. TPH-YOLOv5 has large width and depth, allowing it to extract more ship target information in complex backgrounds. The SimAM and APNB attention mechanisms introduced in the improved method enhance the recognition of ship targets and reduce the number of false targets. In image 1, three ship targets are still missed by the proposed method, mainly because the image's background is too complex and the ship targets are too dense.
Next, the detection results for the simple sea background multitarget image 4 from HRSID were analyzed. Compared with complex scenes, the various methods show improved performance in simple scenes. Faster R-CNN, YOLOv4, and DETR performed well in image 4, but some duplicate or false targets remain. In the simple background, all ship targets are accurately detected by the proposed method, YOLOv5s, TPH-YOLOv5, and FCOS. In summary, the multitarget detection capabilities of faster R-CNN and YOLOv4 are poor in complex scenes because they suffer from large interference when dealing with complex backgrounds and dense targets. The proposed method and TPH-YOLOv5 perform better in these complex scenarios than the competitor methods, owing to the attention mechanisms they introduce and their deeper network structures. Most methods perform well in simple scenarios, especially the proposed method, YOLOv5s, TPH-YOLOv5, and FCOS.
Compared with HRSID, the background of the SSDD dataset is relatively simple. Therefore, as depicted in Figs. 18–20, the detection performance of the compared methods improves when detecting multiple ship targets in complex backgrounds. In the simple background, multiple methods can effectively detect ship targets. The detection results of ship targets based on the CNN methods are reported in Table IX.
The proposed method, YOLOv5s, and TPH-YOLOv5 perform better on the complex nearshore background multitarget image 5 and single-target image 6. These methods use deeper and wider network models to extract depth features that benefit target identification, so they achieve high detection accuracy for ship targets. DETR, faster R-CNN, YOLOv4, and FCOS have relatively poor detection performance in complex scenes because their network structures and feature extraction capabilities are insufficient to deal with complex backgrounds. In the simple background multitarget detection of image 7, the proposed method, TPH-YOLOv5, and YOLOv5s performed well, accurately detecting all ship targets; these methods adapt well to simple backgrounds and achieve high detection accuracy. However, the detection performance of faster R-CNN, YOLOv4, and DETR remains unsatisfactory even in simple backgrounds, especially when faster R-CNN identifies a single ship target as multiple overlapping targets. The feature extraction and target discrimination capabilities of these methods still need further optimization.
In summary, the proposed method, YOLOv5s, and TPH-YOLOv5 demonstrate high detection accuracy in different scenarios and show good adaptability when dealing with complex backgrounds. These methods have significant advantages over the other CNN methods, mainly due to their deeper and wider network models and outstanding feature extraction and object discrimination capabilities.
E. Comparative Analysis With State-of-the-Art Methods
Table X reports the experimental results of the proposed method and state-of-the-art methods on the HRSID and SSDD datasets. Table X highlights that, compared to the other methods, the proposed method achieves optimal results in P and AP0.5 on the HRSID dataset. This is mainly attributed to our detection algorithm's ability to capture ship target features, significantly improving detection accuracy and precision. At the same time, the proposed method is slightly inferior to DB-YOLO in R, but the overall performance is still appealing. On SSDD, the proposed method's AP0.5 reaches 97.3%, the best performance, and its P and R also perform well. This indicates that our method has higher accuracy and recall in detecting ships under complex inshore backgrounds and can accurately identify ship targets in highly complex backgrounds. The DB-YOLO method's high recall and low precision arise because DB-FPN enhances semantic and spatial information fusion, which helps capture small-scale targets and improves the recall rate, whereas its single-stage network with CSP blocks that reduce redundant parameters may weaken the feature representation capability, resulting in a lower precision rate. The proposed method has only 7.6M parameters, fewer than the other detection methods. Thus, the proposed method is more practical in scenarios with limited computational resources and achieves lower computational costs while maintaining high performance.
Compared to various advanced methods, the proposed method obtains optimal or near-optimal results on the HRSID and SSDD datasets, achieving better precision, recall, and average precision with relatively few parameters. This means the proposed method has low computational complexity while maintaining high performance. Hence, the proposed method has high practical value and wide potential application prospects in various scenarios.
F. High-Resolution Complex Large-Scale SAR Image Verification
In order to verify the generalization ability of the trained model, ship detection was performed on two large-scale SAR ship images based on the high-resolution SAR ship dataset AIR-SARShip-1. The results of ship detection in high-resolution SAR far-sea and nearshore scenes are illustrated in Figs. 21 and 22, respectively. The blue box indicates missed ship targets, while the green box indicates false ones.
Tables XI and XII report the detailed detection results of the proposed method and the four representative CNN methods presented in Section IV-D. Table XI reveals that 47 ship targets were detected by the proposed method in the high-resolution SAR distant-sea large-scene multitarget ship detection, including 45 real ship targets and 2 false ship targets, while five ship targets were missed. The proposed method thus performs optimally regarding the P, R, and F1 composite indexes. Specifically, its precision rate is 95.7%, significantly better than the other methods, indicating that the proposed method generates fewer false alarms when detecting vessels. The R metric reaches 90.0%, on par with TPH-YOLOv5, indicating that the proposed method detects most real vessels. The F1 is 92.8%, and the combined evaluation of precision and recall indicates that the proposed method outperforms the other methods in overall performance. Fig. 21 suggests that the proposed method localizes some ship target positions poorly. The reason is that we use the model weights trained on HRSID for testing, where the ship targets are relatively homogeneous, while the AIR-SARShip-1 targets cover more than 10 categories, such as transport ships, oil tankers, and fishing vessels, resulting in poor localization of ships with large shape differences. Table XII shows that 10 ship targets were detected by the proposed method in the high-resolution SAR nearshore large-scene multitarget ship detection, including 7 real ship targets and 3 false ship targets, while two ship targets were missed. According to Table XII, the proposed method performs better regarding the precision rate (P), recall rate (R), and F1 composite index. Specifically, the proposed method has a precision rate of 70%, significantly better than the other methods, because it has been effectively optimized in terms of feature extraction and target localization, which improves detection accuracy and produces fewer false alarms when detecting vessels. The proposed method achieves a recall rate of 77.8%, second only to FCOS, indicating that it performs well in detecting real vessels; this is due to its adaptability to the input scales and anchor frames. The F1 is 73.7%, indicating that the proposed method outperforms the other methods in overall performance. It is worth noting that although the recall rate of FCOS reaches 100%, its high false alarm rate leads to a low precision rate, affecting F1. On the one hand, FCOS abandons anchor boxes and directly regresses object bounding boxes; although this simplifies training, the lack of prior information provided by anchor boxes may lead to inaccurate bounding boxes and false alarms in complex backgrounds. On the other hand, for background areas with uneven distributions, FCOS may not distinguish targets from backgrounds well, resulting in false alarms. Therefore, the proposed method performs better in ship detection tasks from the three perspectives of P, R, and F1 due to its improved network design, feature fusion, and optimization strategies.
Discussion
In this section, the improved YOLOv5 is analyzed through visualization experiments to evaluate the detection performance of the proposed method, and the experiments conducted on different scenes are discussed comprehensively.
A. Analysis of Visual Feature Map Results
In order to intuitively demonstrate the role of our improved modules, we conduct feature map visualization experiments. Fig. 23 shows the heat maps of the improved modules. Fig. 23(a) and (b) shows the original image and the visual feature maps of the original YOLOv5 model, respectively; the target information is not clear, and detection is severely affected. Fig. 23(c) is the visual feature map after adding the APNB module. Since APNB obtains the weight relationship between the target to be identified and other regions globally, it helps the model better focus on the target region in the complex background. The ship features extracted with APNB are clearly more complete and distinct than those of the traditional YOLOv5 under complex background interference, which greatly improves the accuracy of target detection in complex coastal scenes and effectively highlights the target area of the image. Building on (c), Fig. 23(d) shows the visual feature map with SimAM added; the SimAM attention mechanism is introduced to mitigate nearshore background interference and locate the ship target more accurately. In Fig. 23(d), the attention area of the right ship target is significantly larger, and the influence of the nearshore scene is reduced. The effect of adding channel shuffling after the C3 output is shown in Fig. 23(e); the detected target regions all show stronger highlighting, which allows the model to make full use of the multichannel features and enhances feature expression. Building on Fig. 23(e), Fig. 23(f) shows the visual feature map with Ghost convolution added. Because of the feature reuse mechanism of Ghost convolution, the model can integrate features from different levels in the neck part more effectively, which improves the ability to detect targets of different sizes. In Fig. 23(f), the characteristics of the ship are effectively highlighted while the background characteristics are significantly suppressed, and the target area of the ship is more clearly defined. These visualization experiments verify the necessity and importance of our improved modules.
B. Analysis of the Experimental Results
In order to demonstrate the superiority of the improved model, we performed experimental detection on the HRSID and SSDD datasets. Compared with the other CNN methods, the proposed method based on HRSID has a recall rate of 89.1%, 1.1% lower than that of FCOS but significantly better than the other CNN methods. The precision and F1 reach 93.4% and 91.3%, respectively, and AP0.5, AP0.75, and MAP (0.5:0.95) are significantly improved to 94.6%, 93.9%, and 73.6%, respectively. While maintaining high precision, the proposed method improves the recall rate, reduces the number of missed detections, and achieves better ship detection performance. In comparison with the other CNNs, the proposed method based on SSDD achieves recall, F1, and MAP (0.5:0.95) of 96.4%, 95.8%, and 66.9%, respectively, significantly better than the other CNN methods, indicating that this method is more effective at extracting target features and reducing the false alarm rate. To verify the generalization of our method, ship detection was performed on two large-scale SAR ship images from the high-resolution SAR ship dataset AIR-SARShip-1. According to the test results in Figs. 21 and 22, our method performs well on large-scene SAR images, correctly detecting most of the apparent ship targets. This shows that our method has high generalization performance and can give satisfactory results in different SAR scenarios.
Conclusion
The proposed method significantly improves ship target detection in SAR images. First, we optimized the resolution and anchor boxes of the input SAR images to improve the ability to locate small targets. Second, nonlocal and SimAM attention mechanisms were introduced to alleviate nearshore background interference and achieve accurate target positioning. Furthermore, the C3 module was improved by adopting a channel shuffling strategy to enrich feature information. Finally, to reduce training costs, the ordinary convolutions in the neck part were replaced with Ghost convolutions. The experimental results on the HRSID and SSDD datasets show that the proposed method significantly improves ship target detection performance with fewer parameters. Especially on the SSDD dataset, the proposed method's precision, recall rate, and MAP outperform all other CNN methods, demonstrating its robustness and generalization ability. Meanwhile, multitarget detection has been achieved in high-resolution SAR large scenes using AIR-SARShip-1 data; although the detection performance still needs improvement in nearshore scenarios, it performs well in offshore scenarios. In summary, despite some shortcomings, the proposed method makes significant progress in ship target detection tasks. Future work will further optimize the network structure and interpretability to improve recognition accuracy while keeping the number of parameters as small as possible.