Wednesday, April 24, 2024

Smart Roads Get Better Eyesight - IEEE Spectrum

Article Summary

The article discusses a new approach to "roadside perception" that fuses camera and radar data to track vehicles at distances up to 500 meters. This method, developed by researchers at the University of Science and Technology of China (USTC), aims to make smart roads more cost-effective by reducing the number of sensors needed. Key points:

  • Traditional camera and radar systems struggle to track vehicles beyond 100 meters.
  • The USTC team discovered that projecting 3D radar data onto 2D images results in lower location errors at longer ranges compared to mapping image data onto radar data.
  • The new technique boosts the average precision of tracking at shorter distances by 32 percent compared to previous approaches.
  • The researchers built a self-calibration capability into their system to reduce the need for manual recalibration of sensors over time.
  • This roadside perception technology could provide future self-driving cars with valuable situational awareness, extending their perceptual range.

The researchers believe that their approach is practical for real-world deployment and could play a crucial role in future intelligent transportation systems.


Smart Roads Get Better Eyesight

A new way of fusing camera and radar data helps track vehicles at greater distances


Car tracking data from a radar (green), camera (blue) and a fusion of the two (yellow) captured by the USTC researchers on an expressway in Hefei, China

Yao Li

This article is part of our exclusive IEEE Journal Watch series in partnership with IEEE Xplore.

Smart roads with advanced vehicle sensing capabilities could be the linchpin of future intelligent transportation systems and could even help extend driverless cars' perceptual range. A new approach that fuses camera and radar data can now track vehicles precisely at distances of up to 500 meters.

Real-time data on the flow and density of traffic can help city managers avoid congestion and prevent accidents. So-called “roadside perception”, which uses sensors and cameras to track vehicles, can help create smart roads that continually gather this information and relay it to control rooms.


Installing large numbers of road-side sensors can be expensive, though, as well as time-consuming to maintain, says Yanyong Zhang, a professor of computer science at the University of Science and Technology of China (USTC) in Hefei, China. For smart roads to be cost-effective you need to use as few sensors as possible, she says, which means sensors need to be able to track vehicles at significant distances.

Using a new approach to fuse data from a high-definition camera and a millimeter-wave radar, her team has created a system that can pinpoint vehicle locations to within 1.3 m at ranges of up to 500 m. The results were outlined in a recent paper in IEEE Robotics and Automation Letters.

Y. Li et al., "FARFusion: A Practical Roadside Radar-Camera Fusion System for Far-Range Perception," in IEEE Robotics and Automation Letters, doi: 10.1109/LRA.2024.3387700.

Abstract: Far-range perception through roadside sensors is crucial to the effectiveness of intelligent transportation systems. The main challenge of far-range perception is due to the difficulty of performing accurate object detection and tracking under far distances (e.g., > 150m) at a low cost. To cope with such challenges, deploying both millimeter wave Radars and high-definition (HD) cameras, and fusing their data for joint perception has become a common practice. The key to this solution, however, is the precise association between the two types of data, which are captured from different perspectives and have different degrees of measurement noises. 

Towards this goal, the first question is which plane to conduct the association, i.e., the 2D image plane or the BEV plane. We argue that the former is more suitable because the magnitude of location errors in the perspective projection points is smaller at far distances on the 2D plane and can lead to more accurate association. Thus, we first project the Radar-based target locations (on the BEV plane) to the 2D plane and then associate them with the camera-based object locations that are modeled as a point on each object. Subsequently, we map the camera-based object locations to the BEV plane through inverse projection mapping (IPM) with the corresponding depth information from the Radar data. 

Finally, we engage a BEV tracking module to generate target trajectories for traffic monitoring. Since our approach involves transformation between the 2D plane and BEV plane, we also devise a transformation parameters refining approach based on a depth scaling technique, utilising the above fusion process without requiring any additional devices such as GPS. We have deployed an actual testbed on an urban expressway and conducted extensive experiments to evaluate the effectiveness of our system. The results show that our system can improve AP$_{\text{BEV}}$ by 32%, and reduce the location error by 0.56m. Our system is capable of achieving an average location accuracy of 1.3m when we extend the detection range up to 500m. We thus believe that our proposed method offers a viable approach to efficient roadside far-range perception.

keywords: {Radar;Cameras;Sensors;Radar tracking;Radar imaging;Millimeter wave radar;Sensor fusion;object detection;calibration and identification},
URL: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10496834&isnumber=7339444


 

“If you can extend the range as far as possible, then you can reduce the number of sensing devices you need to deploy,” says Zhang. “This is the first work that offers a practical solution that combines these two types of data and works in real world deployments and with really challenging distances.”

Where camera-radar fusion becomes necessary

Cameras and radars are both good low-cost options for vehicle tracking, says Zhang, but individually they struggle at distances much beyond 100 meters. Fusing radar and camera data can significantly extend that range, but doing so means overcoming several challenges, because the two sensors generate completely different kinds of data. While the camera captures a simple 2D image, the radar output is inherently 3D and can be processed to generate a bird's-eye view. Most approaches to camera-radar fusion to date have simply projected the camera data onto the radar's bird's-eye view, says Zhang, but the researchers discovered that this was far from optimal.

In order to better understand the problem, the USTC team installed a radar and a camera on a pole at the end of a straight stretch of expressway close to the university. They also installed a LIDAR on the pole to take ground truth vehicle location measurements, and two vehicles with high precision GPS units were driven up and down the road to help calibrate the sensors.


The researchers installed a camera, radar and LIDAR to track vehicles on an expressway in Hefei, China

Yao Li

One of Zhang’s PhD students, Yao Li, then carried out experiments with the data collected by the sensors. He discovered that projecting 3D radar data onto the 2D images resulted in considerably lower location errors at longer ranges, compared to the standard approach in which image data is mapped onto the radar data. This led them to the conclusion that it would make more sense to fuse the data in the 2D images, before projecting it back to a bird's-eye view for vehicle tracking.
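The fuse-in-the-image-then-return-to-bird's-eye-view idea can be sketched in a few lines. This is only an illustration under stated assumptions, not the researchers' implementation: it assumes an ideal pinhole camera, radar targets already expressed in the camera frame, and a greedy nearest-neighbour matcher. The function names, the focal length, the principal point and all coordinates are hypothetical.

```python
import math

def project_to_image(point_xyz, focal_px, cx, cy):
    """Pinhole projection of a camera-frame point (x right, y down, z forward)."""
    x, y, z = point_xyz
    return (focal_px * x / z + cx, focal_px * y / z + cy)

def associate(radar_pixels, camera_pixels, max_dist_px):
    """Greedily match each projected radar target to the nearest unused camera detection."""
    pairs, used = [], set()
    for r_idx, (ru, rv) in enumerate(radar_pixels):
        best, best_d = None, max_dist_px
        for c_idx, (cu, cv) in enumerate(camera_pixels):
            if c_idx in used:
                continue
            d = math.hypot(ru - cu, rv - cv)
            if d <= best_d:
                best, best_d = c_idx, d
        if best is not None:
            used.add(best)
            pairs.append((r_idx, best))
    return pairs

def back_project_x(u, depth, focal_px, cx):
    """Inverse projection of a pixel column to a lateral position, given a depth."""
    return (u - cx) * depth / focal_px

# Hypothetical camera intrinsics and measurements (not from the article).
FOCAL, CX, CY = 2000.0, 960.0, 540.0
radar_targets = [(-2.0, 5.0, 100.0), (2.0, 5.0, 400.0)]   # metres, camera frame
camera_pixels = [(972.0, 566.0), (918.0, 642.0)]           # detected vehicle points

radar_pixels = [project_to_image(p, FOCAL, CX, CY) for p in radar_targets]
pairs = associate(radar_pixels, camera_pixels, max_dist_px=20.0)

# Map each matched camera detection back to the ground plane using the radar depth.
bev_points = [
    (back_project_x(camera_pixels[c][0], radar_targets[r][2], FOCAL, CX), radar_targets[r][2])
    for r, c in pairs
]
print(radar_pixels, pairs, bev_points)
```

Under this simplified pinhole model, a 1 m lateral offset spans 20 pixels at 100 m but only 5 pixels at 400 m, which gives one intuition for why matching in the image plane can tolerate position noise at long range. That intuition is an illustration of the model, not a restatement of the paper's analysis.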

As well as allowing precise localization at distances of up to 500 m, the researchers showed that the new technique also boosted the average precision of tracking at shorter distances by 32 percent compared to previous approaches. While the researchers have only tested the approach offline on previously collected datasets, Zhang said the underlying calculations are relatively simple and should be possible to implement in real-time on standard processors.

Using more than one sensor also entails careful synchronization, to ensure that their data streams match up. Over time, environmental disturbances inevitably cause the sensors to drift apart, and they have to be recalibrated. This involves driving the GPS-equipped vehicle up and down the expressway to collect ground truth location measurements that can be used to tune the sensors.

This is time-consuming and costly, so the researchers also built a self-calibration capability into their system. The process of projecting the radar data onto the 2D image is governed by a transformation matrix based on the sensors’ parameters and physical measurements done during the calibration process. Once the data has been projected, an algorithm then tries to match up radar data points with the corresponding image pixels.

If the distance between these data points starts to increase, that suggests the transformation matrix is becoming increasingly inaccurate as the sensors move. By carefully tracking this drift, the researchers are able to automatically adjust the transformation matrix to account for the error. This only works up to a point, says Zhang, but it could still significantly reduce the number of full-blown calibrations required.
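A rough sketch of the drift-monitoring idea follows. It is not the method from the paper, which refines the transformation parameters with a depth scaling technique. As a stand-in, this example tracks the mean pixel distance between matched radar projections and image points, and fits a single least-squares scale factor about the principal point. All function names, coordinates and the principal point are hypothetical.

```python
import math

def mean_residual(radar_px, camera_px):
    """Average pixel distance between matched radar projections and image points."""
    total = sum(math.hypot(ru - cu, rv - cv)
                for (ru, rv), (cu, cv) in zip(radar_px, camera_px))
    return total / len(radar_px)

def fit_scale(radar_px, camera_px, center):
    """Least-squares scale s minimising sum |s * (r - c0) - (c - c0)|^2."""
    cx, cy = center
    num = den = 0.0
    for (ru, rv), (cu, cv) in zip(radar_px, camera_px):
        rx, ry = ru - cx, rv - cy
        num += rx * (cu - cx) + ry * (cv - cy)
        den += rx * rx + ry * ry
    return num / den

def apply_scale(radar_px, center, scale):
    """Rescale projected radar points about the principal point."""
    cx, cy = center
    return [(cx + scale * (u - cx), cy + scale * (v - cy)) for u, v in radar_px]

# Hypothetical matched pairs after the sensors have drifted.
CENTER = (960.0, 540.0)
camera_px = [(1060.0, 640.0), (860.0, 590.0), (1000.0, 560.0)]
radar_px = [(1040.0, 620.0), (880.0, 580.0), (992.0, 556.0)]

before = mean_residual(radar_px, camera_px)
scale = fit_scale(radar_px, camera_px, CENTER)
after = mean_residual(apply_scale(radar_px, CENTER, scale), camera_px)
print(before, scale, after)
```

A deployed system would presumably watch this residual over time and only recompute the correction once it exceeds a tolerance; that threshold is left out here because the article does not specify one.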

Altogether, Zhang says this makes their approach practical to deploy in the real world. As well as providing better data for intelligent transport systems, she thinks this kind of roadside perception could also provide future self-driving cars with valuable situational awareness.

“It’s a little futuristic, but let’s say there is something happening a few 100 meters away and the car is not aware of it, because it’s congested, and its sensing range couldn’t reach that far,” she says. “Sensors along the highway can disseminate this information to the cars that are coming into the area, so that they can be more cautious or select a different route.”

Bi-LRFusion: Bi-Directional LiDAR-Radar Fusion for 3D Dynamic Object Detection


 Y. Wang et al., "Bi-LRFusion: Bi-Directional LiDAR-Radar Fusion for 3D Dynamic Object Detection," 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 2023, pp. 13394-13403, doi: 10.1109/CVPR52729.2023.01287.

Abstract: LiDAR and Radar are two complementary sensing approaches in that LiDAR specializes in capturing an object's 3D shape while Radar provides longer detection ranges as well as velocity hints. Though seemingly natural, how to efficiently combine them for improved feature representation is still unclear. The main challenge arises from that Radar data are extremely sparse and lack height information. Therefore, directly integrating Radar features into LiDAR-centric detection networks is not optimal. 

In this work, we introduce a bi-directional LiDAR-Radar fusion framework, termed Bi-LRFusion, to tackle the challenges and improve 3D detection for dynamic objects. Technically, Bi-LRFusion involves two steps: first, it enriches Radar's local features by learning important details from the LiDAR branch to alleviate the problems caused by the absence of height information and extreme sparsity; second, it combines LiDAR features with the enhanced Radar features in a unified bird's-eye-view representation. We conduct extensive experiments on nuScenes and ORR datasets, and show that our Bi-LRFusion achieves state-of-the-art performance for detecting dynamic objects. Notably, Radar data in these two datasets have different formats, which demonstrates the generalizability of our method. Codes will be published. 

keywords: {Laser radar;Three-dimensional displays;Radar measurements;Shape;Pipelines;Radar detection;Bidirectional control;3D from multi-view and sensors},

URL: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10204871&isnumber=10203050

SECTION 1. Introduction

LiDAR has been considered as the primary sensor in the perception subsystem of most autonomous vehicles (AVs) due to its capability of providing accurate position measurements [9], [16], [32]. However, in addition to object positions, AVs are also in an urgent need for estimating the motion state information (e.g., velocity), especially for dynamic objects. Such information cannot be measured by LiDAR sensors since they are insensitive to motion. As a result, millimeter-wave Radar (referred to as Radar in this paper) sensors are engaged because they are able to infer the object's relative radial velocity [21] based upon the Doppler effect [28]. Besides, on-vehicle Radar usually offers longer detection range than LiDAR [36], which is particularly useful on highways and expressways. In the exploration of combining LiDAR and Radar data for ameliorating 3D dynamic object detection, the existing approaches [22], [25], [36] follow the common mechanism of uni-directional fusion, as shown in Figure 1 (a). Specifically, these approaches directly utilize the Radar data/feature to enhance the LiDAR-centric detection network without first improving the quality of the feature representation of the former.

Figure 1. - An illustration of (a) Uni-directional LiDAR-radar fusion mechanism, (b) Our proposed bi-directional LiDAR-radar fusion mechanism, and (c) The average precision gain (%) of unidirectional fusion method RadarNet* against the LiDAR-centric baseline centerpoint [40] over categories with different average height (m). We use * to indicate it is re-produced by us on the centerpoint. The improvement by involving radar data is not consistent for objects with different height, i.e., taller objects like truck, bus and trailer do not enjoy as much performance gain. Note that all height values are transformed to the LiDAR coordinate system.

However, independently extracted Radar features are not enough for refining LiDAR features, since Radar data are extremely sparse and lack height information. Specifically, taking the data from the nuScenes dataset [4] as an example, the 32-beam LiDAR sensor produces approximately 30,000 points, while the Radar sensor only captures about 200 points for the same scene. The resulting Radar bird's eye view (BEV) feature hardly attains valid local information after being processed by local operators (e.g., the neighbors are most likely empty when a non-empty Radar BEV pixel is convolved by convolutional kernels). Besides, on-vehicle Radar antennas are commonly arranged horizontally, hence missing the height information in the vertical direction. In previous works, the height values of the Radar points are simply set as the ego Radar sensor's height. Therefore, when features from Radar are used for enhancing the feature of LiDAR, the problematic height information of Radar leads to unstable improvements for objects with different heights, as Figure 1 (c) illustrates. The representative method RadarNet falls short in the detection performance for tall objects: the truck class even experiences 0.5% AP degradation after fusing the Radar data.
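The sparsity argument can be made concrete with a toy simulation. The sketch below assumes points scattered uniformly at random over a hypothetical 200 × 200 BEV grid, which real point clouds are not (they cluster on objects), and measures how many occupied cells have at least one occupied cell among their eight neighbours. The grid size, seed and function names are illustrative choices only; the point counts are the approximate nuScenes figures quoted above.

```python
import random

def occupied_cells(num_points, grid_size, rng):
    """Drop points uniformly at random and return the set of occupied (i, j) cells."""
    return {(rng.randrange(grid_size), rng.randrange(grid_size))
            for _ in range(num_points)}

def fraction_with_neighbor(cells):
    """Fraction of occupied cells with at least one occupied 8-neighbour."""
    offsets = [(di, dj) for di in (-1, 0, 1) for dj in (-1, 0, 1) if (di, dj) != (0, 0)]
    hits = sum(
        1 for (i, j) in cells
        if any((i + di, j + dj) in cells for di, dj in offsets)
    )
    return hits / len(cells)

rng = random.Random(0)   # fixed seed so the run is repeatable
GRID = 200               # hypothetical BEV grid of 200 x 200 cells

radar_frac = fraction_with_neighbor(occupied_cells(200, GRID, rng))
lidar_frac = fraction_with_neighbor(occupied_cells(30000, GRID, rng))
print(f"radar: {radar_frac:.2%} of occupied cells have an occupied neighbour")
print(f"lidar: {lidar_frac:.2%} of occupied cells have an occupied neighbour")
```

In this toy setting a 3 × 3 kernel centred on a Radar cell almost always sees only empty neighbours, while nearly every LiDAR cell has populated neighbours.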

Footnote: Radar height information is missing because the Radar antennas are arranged along the x and y dimensions, with z missing. In these datasets, the z values of the Radar points are all set to the same value: the Radar sensor's height.

In order to better harvest the benefit of LiDAR and Radar fusion, our viewpoint is that Radar features need to be more powerful before being fused. Therefore, we first enrich the Radar features – with the help of LiDAR data – and then integrate the enriched Radar features into the LiDAR processing branch for more effective fusion. As depicted in Figure 1 (b), we refer to this scheme as bi-directional fusion. In this work, we introduce a framework, Bi-LRFusion, to achieve this goal. Specifically, Bi-LRFusion first encodes BEV features for each modality individually. Next, it engages the query-based LiDAR-to-Radar (L2R) height feature fusion and query-based L2R BEV feature fusion, in which we query and group LiDAR points and LiDAR BEV features that are close to the location of each non-empty grid cell on the Radar feature map, respectively. The grouped LiDAR raw points are aggregated to formulate pseudo-Radar height features, and the grouped LiDAR BEV features are aggregated to produce pseudo-Radar BEV features. The generated pseudo-Radar height and BEV features are fused to the Radar BEV features through concatenation. After enriching the Radar features, Bi-LRFusion then performs the Radar-to-LiDAR (R2L) fusion in a unified BEV representation. Finally, a BEV detection network consisting of a BEV backbone network and a detection head is applied to output 3D object detection results.

We validate the merits of bi-directional LiDAR-Radar fusion via evaluating our Bi-LRFusion on nuScenes and Oxford Radar RobotCar (ORR) [1] datasets. On nuScenes dataset, Bi-LRFusion improves the mAP(↑) by 2.7% and reduces the mAVE(↓) by 5.3% against the LiDAR-centric baseline CenterPoint [40], and remarkably outperforms the strongest counterpart, i.e., RadarNet, in terms of AP by absolutely 2.0% for cars and 6.3% for motorcycles. Moreover, Bi-LRFusion generalizes well on the ORR dataset, which has a different Radar data format and achieves 1.3% AP improvements for vehicle detection.

In summary, we make the following contributions:

  • We propose a bi-directional fusion framework, namely Bi-LRFusion, to combine LiDAR and Radar features for improving 3D dynamic object detection.

  • We devise the query-based L2R height feature fusion and query-based L2R BEV feature fusion to enrich Radar features with the help of LiDAR data.

  • We conduct extensive experiments to validate the merits of our method and show considerably improved results on two different datasets.

SECTION 2. Related Work

2.1 3D Object Detection with Only LiDAR Data

In the literature, 3D object detectors built on LiDAR point clouds are categorized into two directions, i.e., point-based methods and voxel-based methods.

Point-Based Methods

Point-based methods maintain the precise position measurement of raw points, extracting point features by point networks [23], [24] or graph networks [33], [34]. The early work PointRCNN [27] follows the two-stage pipeline to first produce 3D proposals from preassigned anchor boxes over sampled key points, and then refines these coarse proposals with region-wise features. STD [39] proposes the sparse-to-dense strategy for better proposal refinement. A more recent work 3DSSD [38] further introduces feature based key point sampling strategy as a complement of previous distance based one, and develops a one-stage object detector operating on raw points.

Voxel-Based Methods

Voxel-based methods first divide points into regular grids, and then leverage convolutional neural networks [8], [9], [15], [26], [35], [41] and Transformers [11], [20] for feature extraction and bounding box prediction. SECOND [35] reduces the computational overhead of dense 3D CNNs by applying sparse convolution. PointPillars [15] introduces a pillar representation (a particular form of the voxel) so that features can be processed with 2D convolutions. To simplify and improve previous 3D detection pipelines, CenterPoint [40] designs an anchor-free one-stage detector, which extracts BEV features from voxelized point clouds to find object centers and regress to 3D bounding boxes. Furthermore, CenterPoint introduces a velocity head to predict the object's velocity between consecutive frames. In this work, we exploit CenterPoint as our LiDAR-only baseline.

Figure 2. - An overview of the proposed Bi-LRFusion framework. Bi-LRFusion includes five main components: (a) A LiDAR feature stream to encode LiDAR BEV features from LiDAR data, (b) A radar feature stream to encode radar BEV features from radar data, (c) A LiDAR-to-radar (L2R) fusion module composed of a query-based height feature fusion block and a query-based bev feature fusion block, in which we enhance radar features from LiDAR raw points and LiDAR features, (d) A radar-to-LiDAR (R2L) fusion module to fuse back the enhanced radar features to the LiDAR-centric detection network, and (e) A BEV detection network that uses the features from the R2L fusion module to predict 3D bounding boxes for dynamic objects.

2.2 3D Object Detection with Radar Fusion

Providing larger detection range and additional velocity hints, Radar data show great potential in 3D object detection. However, since they are too sparse to be solely used [4], Radar data are generally explored as the complement of RGB images or LiDAR point clouds. The approaches that fuse Radar data for improving 3D object detectors can be summarized into two categories: one is input-level fusion, and the other is feature-level fusion.

Input-Level Radar Fusion

RVF-Net [22] develops an early fusion scheme to treat raw Radar points as additional input besides LiDAR point clouds, ignoring the differences in data property between Radar and LiDAR. As studied in [7], these input-level fusion methods directly incorporate raw Radar information into the LiDAR branch, which makes them sensitive to even slight changes in the input data and unable to fully utilize the multi-modal features.

Feature-Level Radar Fusion

The early work GRIF [14] proposes to extract region-wise features from Radar and camera branches, and to combine them together for robust 3D detection. Recent works generally transform image features to the BEV plane for feature fusion [12], [13], [17]. MVD-Net [25] encodes Radar points' intensity into a BEV feature map, and fuses it with LiDAR features to facilitate vehicle detection under foggy weather. The representative work RadarNet [36] first extracts LiDAR and Radar features via modality-specific branches, and then fuses them on the shared BEV perspective. The existing feature-level Radar fusion methods commonly ignore the problems caused by the height missing and extreme sparsity of Radar data, as well as overlooking the information intensity gap when fusing the multi-modal features.

Although both input-level and feature-level methods have improved 3D dynamic object detection via Radar fusion, they follow the uni-directional fusion scheme. In contrast, our Bi-LRFusion, for the first time, treats LiDAR-Radar fusion in a bi-directional way. We enhance the Radar feature with the help of LiDAR data to alleviate issues caused by the absence of height information and extreme sparsity, and then fuse it to the LiDAR-centric network to achieve a further performance boost.

SECTION 3. Methodology

In this work, we present Bi-LRFusion, a bi-directional LiDAR-Radar fusion framework for 3D dynamic object detection. As illustrated in Figure 2, the input LiDAR and Radar points are fed into the sibling LiDAR feature stream and Radar feature stream to produce their BEV features. Next, we involve a LiDAR-to-Radar (L2R) fusion step to enhance the extremely sparse Radar features that lack discriminative details. Specifically, for each valid (i.e., non-empty) grid cell on the Radar feature map, we query and group the nearby LiDAR data (including both raw points and BEV features) to obtain more detailed Radar features. Here, we focus on the height information that is completely missing in the Radar data and the local BEV features that are scarce in the Radar data. Through two query-based feature fusion blocks, we can transfer the knowledge from LiDAR data to Radar features, leading to much-enriched Radar features. Subsequently, we perform Radar-to-LiDAR (R2L) fusion by integrating the enriched Radar features to the LiDAR features in a unified BEV representation. Finally, a BEV detection network composed of a BEV backbone network and a detection head outputs 3D detection results. Below we describe these steps one by one.

3.1. Modality-Specific Feature Encoding

LiDAR Feature Encoding

LiDAR feature encoding consists of the following steps. Firstly, we divide LiDAR points into 3D regular grids and encode the feature of each grid by a multi-layer perceptron (MLP) followed by max-pooling. Here, a grid is known as a voxel in the discrete 3D space, and the encoded feature is the voxel feature. After obtaining voxel features, we follow the common practice to exploit a 3D voxel backbone network composed of 3D sparse convolutional layers and 3D sub-manifold convolutional layers [35] for LiDAR feature extraction. Then, the output feature volume is stacked along the Z-axis, producing a LiDAR BEV feature map $M^{L} \in \mathbb{R}^{C_1 \times H \times W}$, where $(H, W)$ indicate the height and width.

Radar Feature Encoding

A Radar point contains the 2D coordinate $(x, y)$, the Radar cross-section rcs, and the timestamp $t$. In addition, we may also have the radial velocity of the object in the X and Y directions $(v_X, v_Y)$, the dynamic property dynProp, the cluster validity state invalid_state, and the false alarm probability pdh0. Please note that the Radar point's value on the Z-axis is set to be the Radar sensor's height by default. Here, we exploit the pillar [15] representation to encode Radar features, which directly converts the Radar input to a pseudo image in the bird's eye view (BEV). Then we extract Radar features with a pillar feature network, obtaining a Radar BEV feature map $M^{R} \in \mathbb{R}^{C_2 \times H \times W}$.
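As a rough illustration of how sparse Radar points become a BEV pseudo image, the sketch below bins points into grid cells and max-pools their raw attributes per cell. It is a simplification under stated assumptions: a real pillar feature network applies learned layers before pooling, and the function name, grid origin, cell size and point values here are hypothetical.

```python
import math

def pillarize(points, x_min, y_min, cell_size):
    """Bin (x, y, rcs, t) points into BEV cells and max-pool [rcs, t] per cell."""
    pillars = {}
    for x, y, rcs, t in points:
        idx = (math.floor((x - x_min) / cell_size),
               math.floor((y - y_min) / cell_size))
        feat = [rcs, t]
        if idx in pillars:
            # Element-wise max over all points that fall in the same pillar.
            pillars[idx] = [max(a, b) for a, b in zip(pillars[idx], feat)]
        else:
            pillars[idx] = feat
    return pillars

# Hypothetical Radar points: (x, y, rcs, timestamp offset).
radar_points = [
    (1.2, 3.4, 5.0, 0.0),
    (1.4, 3.1, 7.5, 0.1),
    (10.0, -2.0, 2.0, 0.0),
]
pillars = pillarize(radar_points, x_min=0.0, y_min=-5.0, cell_size=1.0)
print(pillars)
```

Only the non-empty cells are stored, which mirrors how few cells of the Radar BEV map carry any information.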

3.2. LiDAR-to-Radar Fusion

LiDAR-to-Radar (L2R) fusion is the core step in our proposed Bi-LRFusion. It involves two L2R feature fusion blocks, in which we generate pseudo height features, as well as the pseudo local BEV features, by suitably querying the LiDAR features. These pseudo features are then fused into the Radar features to enhance their quality.

Figure 3. - An illustration of query-based L2R height feature fusion (QHF) block. QHF involves the following steps: (a) Lifting the non-empty grid cell on the radar feature map to a pillar, and equally dividing the pillar into segments of different heights, (b) Querying and grouping neighboring LiDAR points based on the location of each segment's center, (c) Aggregating grouped LiDAR points to get the local height feature of each segment, and (d) Merging the segments' feature together to produce pseudo height feature $\boldsymbol{\eta}_{H}$ for the corresponding radar grid cell.

Query-Based L2R Height Feature Fusion

As illustrated in Figure 1 (c), directly fusing the Radar features to LiDAR-based 3D detection networks may lead to unsatisfactory results, especially for objects that are taller than the Radar sensor. This is caused by the fact that Radar points do not contain height measurements. To address this insufficiency, we design the query-based L2R height feature fusion (QHF) block to transfer the height distributions from LiDAR raw points to Radar features.

The pipeline of QHF is depicted in Figure 3. The core innovation of QHF is the height feature querying mechanism. Given the $l$-th valid grid cell on the Radar feature map centered at $(x, y)$, we first “lift” it from the BEV plane to the 3D space, which results in a pillar of height $h$. Then, we evenly divide the pillar into $M$ segments along the height and assign a query point at each segment's center. Specifically, let us denote the grid size of the Radar BEV feature map as $r \times r$. In order to query LiDAR points without overlap, the radius of the ball query is set to $r/2$, and the number of segments $M$ is set to $h/(2r)$. For the query point in the $s$-th segment shown in Figure 3, the $(x, y)$ coordinates are from the central point of the given grid cell, which can be calculated with the grid indices and grid sizes, together with the boundaries of the Radar points. Further, the height value $z_s$ of the query point is calculated as:

$$z_s = z_M + r \times (2s - 1), \tag{1}$$

where $z_M$ is the minimum height value among all the LiDAR points. After establishing the query points, we then apply ball query [24] followed by a PointNet module [23] to aggregate the local height feature $F_s$ from the grouped LiDAR points. The calculation can be formulated as:

$$F_s = \max_{k=1,2,\dots,K} \{\Psi(n_s^k)\}, \tag{2}$$

where $n_s^k$ denotes the $k$-th grouped LiDAR point in the $s$-th ball query segment, $K$ is the number of grouped points from ball query, $\Psi(\cdot)$ indicates an MLP, and $\max(\cdot)$ is the max pooling operation. After obtaining the local height feature of each segment, we concatenate them together and feed the concatenated feature into an MLP to make the output channels have the same dimensions as the Radar BEV feature map. Finally, the output pseudo height feature $\boldsymbol{\eta}_H^l$ for the $l$-th grid cell on the Radar feature map produced by QHF is computed as:

$$\boldsymbol{\eta}_H^l = \mathrm{MLP}(\mathrm{Concat}(\{F_s\}_{s=0}^{M})). \tag{3}$$
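The QHF querying steps can be sketched as follows. This is a minimal illustration, not the authors' code. It assumes segments are indexed $s = 1, \dots, M$ so that each query point of Eq. (1) sits at a segment center, replaces the learned MLP $\Psi$ with the identity on raw point coordinates, caps the ball query at the first `max_points` hits, and omits the final MLP of Eq. (3). All names and values are hypothetical.

```python
import math

def query_heights(z_min, r, num_segments):
    """Heights z_s = z_M + r * (2s - 1) for s = 1..M, following Eq. (1)."""
    return [z_min + r * (2 * s - 1) for s in range(1, num_segments + 1)]

def ball_query(center, points, radius, max_points):
    """Return up to max_points LiDAR points within `radius` of `center`."""
    return [p for p in points if math.dist(center, p) <= radius][:max_points]

def max_pool(points, dims=3):
    """Element-wise max over grouped points; zeros when the group is empty."""
    if not points:
        return [0.0] * dims
    return [max(p[d] for p in points) for d in range(dims)]

def pseudo_height_feature(cell_xy, lidar_points, r, pillar_height, max_points=8):
    """Concatenate the pooled feature of every height segment of one Radar cell."""
    z_min = min(p[2] for p in lidar_points)   # z_M in Eq. (1)
    m = int(pillar_height / (2 * r))          # M = h / (2r)
    feature = []
    for z in query_heights(z_min, r, m):
        center = (cell_xy[0], cell_xy[1], z)
        grouped = ball_query(center, lidar_points, r / 2, max_points)
        feature.extend(max_pool(grouped))     # identity stands in for the MLP
    return feature

# Hypothetical LiDAR points (x, y, z) around a Radar grid cell centered at (0, 0).
lidar_points = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.5), (0.0, 0.1, 1.6), (5.0, 5.0, 2.5)]
heights = query_heights(0.0, 0.5, 3)
feature = pseudo_height_feature((0.0, 0.0), lidar_points, r=0.5, pillar_height=3.0)
print(heights, feature)
```

Segments with no nearby LiDAR points contribute zeros here; how the real network handles empty groups is not stated in this excerpt.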

Figure 4. - An illustration of query-based L2R BEV feature fusion (QBF) block with the following steps: (a) Collapsing LiDAR 3D features to the BEV grids, (b) Querying and grouping LiDAR grids that are close to the radar query grid based on their indices on the BEV, and (c) Aggregating the features of the grouped LiDAR grids to generate the pseudo BEV feature $\boldsymbol{\eta}_{B}$ for the corresponding radar grid cell.

Query-Based L2R BEV Feature Fusion

When extremely sparse Radar BEV pixels are convolved by convolutional kernels, the resulting Radar BEV feature barely retains valid local information since most of the neighboring pixels are empty. To alleviate this problem, we design a query-based L2R BEV feature fusion block (QBF) that generates a pseudo BEV feature, more detailed than the original Radar BEV feature, by querying and grouping the corresponding fine-grained LiDAR BEV features.

The pipeline of QBF is depicted in Figure 4. The core innovation of QBF is the local BEV feature querying mechanism that we describe below. We first collapse LiDAR 3D features to the BEV grid plane [15], forming a set of non-empty LiDAR grids (with the same grid size as the Radar grid cell in Figure 4). Given the l-th non-empty grid cell on the Radar feature map with indices (i, j), we query and group LiDAR grid features that are close to the Radar grid cell on the BEV plane. To this end, we propose a local BEV feature query inspired by Voxel R-CNN [8], which finds all LiDAR grid features that are within a certain distance from the querying grid based on their grid indices on the BEV plane. Specifically, we use the Manhattan distance metric and sample up to K non-empty LiDAR grids within a specific distance threshold on the BEV plane. The Manhattan distance D(α,β) between the indices of LiDAR grids $\alpha=\{i_{\alpha}, j_{\alpha}\}$ and $\beta=\{i_{\beta}, j_{\beta}\}$ can be calculated as:

$$D(\alpha,\beta) = |i_{\alpha} - i_{\beta}| + |j_{\alpha} - j_{\beta}|, \tag{4}$$
where i and j are the indices of the LiDAR grid along the X and Y axes. After the l-th BEV feature query, we obtain the corresponding features of the grouped LiDAR grids. Finally, we apply a PointNet module [23] to aggregate the pseudo Radar BEV feature $\boldsymbol{\eta}_{B}^{l}$, which can be summarized as:
$$\boldsymbol{\eta}_{B}^{l} = \max_{k=1,2,\ldots,K}\left\{\Psi\left(F_{k}^{l}\right)\right\}, \tag{5}$$
where $F_{k}^{l}$ denotes the k-th grouped LiDAR grid feature in the l-th BEV query.
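
The query-and-aggregate step of Eqs. (4) and (5) can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy grid indices and features, the ReLU stand-in for the learned MLP Ψ, and the choice to keep the first K hits (the text only says "up to K") are all assumptions made for illustration.

```python
def manhattan_query(query_idx, lidar_idx, max_dist, k):
    """Return positions in lidar_idx of up to k grids within max_dist of query_idx."""
    qi, qj = query_idx
    hits = []
    for pos, (i, j) in enumerate(lidar_idx):
        # Eq. (4): Manhattan distance on integer BEV grid indices.
        if abs(i - qi) + abs(j - qj) <= max_dist:
            hits.append(pos)
    return hits[:k]  # assumption: keep the first k hits


def aggregate(features, psi):
    """Eq. (5): apply a shared function psi to each feature, then max-pool per channel."""
    if not features:
        return []  # no LiDAR grid fell inside the query range
    mapped = [psi(f) for f in features]
    return [max(channel) for channel in zip(*mapped)]


def relu(feature):
    # Hypothetical stand-in for the learned MLP Psi.
    return [max(v, 0.0) for v in feature]


# Toy data: BEV indices of non-empty LiDAR grids and their 2-channel features.
lidar_idx = [(10, 10), (11, 10), (12, 13), (10, 12), (30, 30)]
lidar_feat = [[0.0, -1.0], [4.0, 5.0], [9.0, 9.0], [2.0, 7.0], [8.0, 8.0]]

group = manhattan_query((10, 11), lidar_idx, max_dist=2, k=3)
pseudo_bev = aggregate([lidar_feat[p] for p in group], relu)
print(group, pseudo_bev)
```

Because the distance is computed on integer grid indices rather than metric coordinates, the query needs only additions and comparisons, which is presumably part of the appeal of the Manhattan metric here.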

3.3. Radar-to-LiDAR Fusion

After enriching the Radar BEV features with the pseudo height feature $\boldsymbol{\eta}_{H}$ and the pseudo BEV feature $\boldsymbol{\eta}_{B}$, we obtain the enhanced Radar BEV features, which have 96 channels.

In this step, we fuse the enhanced Radar BEV features into the LiDAR-based 3D detection pipeline so as to incorporate valuable clues such as velocity information. Specifically, we concatenate the two BEV features channel-wise, following the practice of [19], [36]. Before forwarding the result to the BEV detection network, we also apply a convolution-based BEV encoder to help curb the effect of misalignment between the multi-modal BEV features. The BEV encoder adjusts the fused BEV feature to 512 channels through three 2D convolution blocks.
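
As a rough illustration of the channel-wise fusion, the sketch below concatenates two toy BEV maps and projects the result to 512 channels. Several things here are assumptions rather than details from the paper: the 8-channel LiDAR map and the 2×2 grid are placeholders, the weights are random, and a single 1×1 projection stands in for the three 2D convolution blocks of the actual BEV encoder.

```python
import random


def concat_channels(lidar_bev, radar_bev):
    """Channel-wise concatenation of two BEV maps stored as [channel][row][col]."""
    lidar_grid = (len(lidar_bev[0]), len(lidar_bev[0][0]))
    radar_grid = (len(radar_bev[0]), len(radar_bev[0][0]))
    if lidar_grid != radar_grid:
        raise ValueError("BEV maps must share the same grid size")
    return lidar_bev + radar_bev


def pointwise_conv(bev, weight):
    """1x1 convolution: each output channel is a weighted sum of input channels per cell."""
    rows, cols = len(bev[0]), len(bev[0][0])
    return [
        [[sum(w * bev[c][r][q] for c, w in enumerate(w_out)) for q in range(cols)]
         for r in range(rows)]
        for w_out in weight
    ]


rng = random.Random(0)


def random_bev(channels, rows, cols):
    return [[[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]
            for _ in range(channels)]


lidar_bev = random_bev(8, 2, 2)    # channel count is a placeholder, not from the paper
radar_bev = random_bev(96, 2, 2)   # 96 channels, as stated for the enhanced Radar feature
fused = concat_channels(lidar_bev, radar_bev)

# Hypothetical projection weights: 512 output channels over all fused input channels.
weight = [[rng.uniform(-0.1, 0.1) for _ in range(len(fused))] for _ in range(512)]
encoded = pointwise_conv(fused, weight)
print(len(fused), len(encoded))
```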

3.4. BEV Detection Network

Finally, the combined LiDAR and Radar BEV features are fed to the BEV detection network to output the results. The BEV detection network consists of a BEV network and a detection head. The BEV network is composed of several 2D convolution blocks, which generate center features that are forwarded to the detection head. We use a class-specific center heatmap head to predict the center location of all dynamic objects, and a few regression heads to estimate the object size, rotation, and velocity based on the center features. We combine all heatmap and regression losses into one common objective and jointly optimize them, following the baseline CenterPoint [40].
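
The text does not spell out how object centers are read off the heatmap. A common choice in center-based detectors is to keep cells that are local maxima above a score threshold; the sketch below assumes that rule, with a hypothetical 3×3 neighbourhood, threshold, and toy heatmap.

```python
def heatmap_peaks(heatmap, threshold):
    """Return (row, col) cells above threshold that are maxima of their 3x3 neighbourhood."""
    rows, cols = len(heatmap), len(heatmap[0])
    peaks = []
    for r in range(rows):
        for c in range(cols):
            value = heatmap[r][c]
            if value <= threshold:
                continue
            # Neighbourhood is clipped at the borders of the map.
            neighbours = [
                heatmap[rr][cc]
                for rr in range(max(r - 1, 0), min(r + 2, rows))
                for cc in range(max(c - 1, 0), min(c + 2, cols))
            ]
            if value >= max(neighbours):
                peaks.append((r, c))
    return peaks


# Toy single-class heatmap: two clear centers, one suppressed neighbour, one weak response.
heatmap = [[0.0] * 5 for _ in range(5)]
heatmap[1][1] = 0.9
heatmap[1][2] = 0.6   # next to the stronger 0.9 peak, so it is suppressed
heatmap[3][4] = 0.7
heatmap[4][0] = 0.2   # below the threshold

centers = heatmap_peaks(heatmap, threshold=0.3)
print(centers)
```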

Table 1. Comparison with other methods on the nuScenes validation set. We add “*” to indicate a version reproduced based on CenterPoint [40]. We group the dynamic targets into (1) similar-height objects and (2) tall objects according to the radar sensor's height.

SECTION 4. Experiments

We evaluate Bi-LRFusion on both the nuScenes and the Oxford Radar RobotCar (ORR) datasets and conduct ablation studies to verify our proposed fusion modules. We further show the advantages of Bi-LRFusion on objects with different heights/velocities.

4.1. Datasets and Evaluation Metrics

NuScenes Dataset

NuScenes [4] is a large-scale dataset for 3D detection that includes camera images, LiDAR points, and Radar points. We mainly adopt two metrics in our evaluation. The first is the average precision (AP), with matches determined by a threshold on the 2D center distance on the ground plane. The second is AVE, the absolute velocity error in m/s; a lower value indicates more accurate velocity estimation. We average the two metrics over all 7 dynamic classes (mAP, mAVE), following the official evaluation.
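
To make the two metrics concrete, the sketch below matches predictions to ground truth by 2D center distance and then averages the velocity error over the matched pairs. It is a simplified illustration, not the official evaluation code: the greedy matching order, the 2.0 m threshold, and the toy boxes are assumptions, and the full AP computation (sweeping score thresholds and integrating precision over recall) is omitted.

```python
import math


def match_by_center_distance(pred_xy, gt_xy, threshold):
    """Greedily match predictions (assumed sorted by descending score) to the
    closest unmatched ground-truth center within `threshold` meters."""
    taken = set()
    matches = []
    for p, (px, py) in enumerate(pred_xy):
        candidates = sorted(
            (math.hypot(px - gx, py - gy), g)
            for g, (gx, gy) in enumerate(gt_xy)
            if g not in taken
        )
        if candidates and candidates[0][0] <= threshold:
            g = candidates[0][1]
            taken.add(g)
            matches.append((p, g))
    return matches


def mean_velocity_error(pred_vel, gt_vel, matches):
    """Mean L2 norm of the velocity difference (m/s) over matched pairs."""
    errors = [
        math.hypot(pred_vel[p][0] - gt_vel[g][0], pred_vel[p][1] - gt_vel[g][1])
        for p, g in matches
    ]
    return sum(errors) / len(errors) if errors else float("nan")


# Toy example: two true positives and one false positive far from any object.
pred_xy = [(0.0, 0.0), (10.0, 0.0), (50.0, 50.0)]
gt_xy = [(0.5, 0.0), (10.0, 1.5)]
pred_vel = [(1.0, 0.0), (3.0, 4.0), (0.0, 0.0)]
gt_vel = [(1.0, 0.0), (3.0, 3.0)]

matches = match_by_center_distance(pred_xy, gt_xy, threshold=2.0)
ave = mean_velocity_error(pred_vel, gt_vel, matches)
print(matches, ave)
```

Only matched (true-positive) pairs contribute to the velocity error, so a detector cannot lower it by simply predicting more boxes.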

ORR Dataset

The ORR dataset, designed mainly for localization and mapping tasks, is a challenging dataset that includes camera images, LiDAR points, Radar scans, and GPS and INS ground truth. We split the first data record into 7,064 frames for training and 1,760 frames for validation. As this dataset does not provide ground-truth object annotations, we follow MVDet [25] to generate 3D boxes of vehicles. For evaluation, we use the AP of oriented bounding boxes in BEV to validate vehicle detection performance via the COCO evaluation framework [18].

4.2. Implementation Details

LiDAR Input

As allowed by the nuScenes submission rules, we accumulate 10 LiDAR sweeps to form a denser point cloud. We set the detection range to (−54.0, 54.0) m for the X and Y axes and (−5.0, 3.0) m for the Z axis, with a voxel size of (0.075, 0.075, 0.2) m. For the ORR dataset, we set the point cloud range to (−69.12, 69.12) m for the X and Y axes and (−5.0, 2.0) m for the Z axis, with a voxel size of (0.32, 0.32, 7.0) m.
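
The ranges and voxel sizes above imply the size of the voxel grid. The short sketch below works out that arithmetic; the helper name is hypothetical, and rounding is used because the divisions are not exact in floating point.

```python
def grid_shape(ranges, voxel_size):
    """Number of voxels along each axis for (min, max) ranges and per-axis voxel sizes."""
    return tuple(round((hi - lo) / size) for (lo, hi), size in zip(ranges, voxel_size))


nuscenes_lidar = grid_shape(
    [(-54.0, 54.0), (-54.0, 54.0), (-5.0, 3.0)], (0.075, 0.075, 0.2)
)
orr_lidar = grid_shape(
    [(-69.12, 69.12), (-69.12, 69.12), (-5.0, 2.0)], (0.32, 0.32, 7.0)
)
print(nuscenes_lidar)  # (1440, 1440, 40)
print(orr_lidar)       # (432, 432, 1)
```

The ORR setting uses a single voxel along Z, so each voxel there is effectively a pillar spanning the full height range.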

Radar Input in NuScenes

Radar data are collected from 5 long-range Radar sensors and stored as BIN files, the same format as the LiDAR point clouds. We stack the points captured by the five Radar sensors into full-view Radar point clouds and accumulate 6 sweeps to form denser Radar points. The detection range of the Radar data is consistent with the LiDAR range. The voxel size is (0.6, 0.6, 8.0) m for nuScenes and (0.32, 0.32, 7.0) m for the ORR dataset. We transform the 2D position and velocity of Radar points from the Radar coordinates to the LiDAR coordinates.
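
The coordinate change for a Radar point can be sketched as a planar rigid transform: positions are rotated and translated, while velocities are only rotated. This is a simplified illustration under stated assumptions; the yaw angle and translation below are made-up values, and a real sensor calibration is generally a full 3D transform rather than the 2D one shown here.

```python
import math


def radar_to_lidar(points, yaw, translation):
    """Map (x, y, vx, vy) Radar points into the LiDAR frame.

    `yaw` is the rotation (radians) from the Radar frame to the LiDAR frame and
    `translation` is the Radar origin expressed in the LiDAR frame.
    """
    c, s = math.cos(yaw), math.sin(yaw)
    tx, ty = translation
    out = []
    for x, y, vx, vy in points:
        out.append((
            c * x - s * y + tx,   # position: rotate, then translate
            s * x + c * y + ty,
            c * vx - s * vy,      # velocity: rotate only, no translation
            s * vx + c * vy,
        ))
    return out


# Hypothetical calibration: Radar rotated by 90 degrees and offset 1 m along LiDAR X.
transformed = radar_to_lidar([(2.0, 0.0, 1.0, 0.0)], yaw=math.pi / 2, translation=(1.0, 0.0))
print(transformed)
```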

Radar Input in ORR

The Radar data are collected by a vehicle equipped with only one spinning Millimetre-Wave FMCW Radar sensor and are saved as PNG files. Therefore, we need to convert the Radar images to Radar point clouds. First, we use cen2019 [5] as a feature extractor to extract feature points from Radar images. Attributes of every extracted point include x,y,z,rcs,t, where rcs is represented by gray-scale values on the Radar image, and z is set to 0. To decrease the noise points and ghost points due to multi-path effects, we apply a geometry-probabilistic filter [10] and the number of points is reduced to around 1,000 per frame. Second, since Radar's considerable scanning delay could cause the lack of synchronization with the LiDAR, we compensate for the ego-motion via SLAM [30]. Please refer to supplementary materials for more details.

4.3. Comparison on nuScenes Dataset

We evaluate our Bi-LRFusion with AP (%) and AVE (m/s) of 7 dynamic classes on both validation and test sets.

First, we compare our Bi-LRFusion with a few top-ranked methods on the nuScenes validation set, including the vanilla CenterPoint [40] as our baseline, LiDAR-Radar fusion models, and other LiDAR-only models on the nuScenes benchmark. For a fair comparison, we use settings consistent with the original papers and reproduce the results ourselves. Table 1 summarizes the results. Overall, our results show solid performance improvements. The SOTA LiDAR-Radar fusion method, RadarNet [36], mainly focused on improving the AP values for cars and motorcycles. Compared to RadarNet, we further improve the AP by +2.0% for cars and +6.3% for motorcycles. In addition, when we consider all dynamic object categories, Bi-LRFusion increases the mAP(↑) by +2.7% and improves the mAVE(↓) by −5.3% compared to CenterPoint. Meanwhile, our approach surpasses the reproduced RadarNet* by +1.6% in mAP and −1.2% in mAVE.

Table 2. Comparison with other methods on the nuScenes test set. We add “*” to indicate that we reproduce and submit the result based on CenterPoint [40]. We group the dynamic targets to (1) similar-height objects and (2) tall objects according to the radar sensor's height.
Table 3. Comparison with other methods of vehicle detection accuracy on the ORR dataset.

We also compare our method with several top-performing models on the test set. The detailed results are listed in Table 2. The results show that Bi-LRFusion gives the best average results in terms of both mAP and mAVE. It has the best AP for six out of the seven individual object categories. Even for the only exception (pedestrians), its AP is the second best, only 0.1% lower than the best. Overall, the results on the validation and test sets consistently demonstrate the effectiveness of Bi-LRFusion.

Furthermore, we qualitatively compare CenterPoint and Bi-LRFusion on the nuScenes dataset. Figure 5 shows a few visualized detection results, which demonstrate that the enhanced Radar data from the L2R feature fusion module can indeed reduce missed detections.

4.4. Comparison on ORR Dataset

To further validate the effectiveness of our proposed Bi-LRFusion, we conduct experiments on the challenging ORR dataset. Due to the lack of ground truth annotations, we only report the AP for cars with IoU thresholds of 0.5 and 0.8, following the common COCO protocol.

We compare our Bi-LRFusion with several LiDAR-only and LiDAR-Radar fusion methods on the ORR dataset. From Table 3, Bi-LRFusion achieves the best AP results with IoU threshold of 0.5, and second best with IoU threshold of 0.8. For IoU threshold of 0.8, its AP is only 0.2% lower than the best, and much higher than other schemes. Table 3 also shows that Bi-LRFusion is effective in handling different Radar data formats.

Table 4. The effect of each proposed component in Bi-LRFusion. It shows that all components contribute to the overall detection performance.

4.5. Effect of Different Bi-LRFusion Modules

To understand how each module in Bi-LRFusion affects the detection performance, we conduct ablation studies on the nuScenes validation set and report its mAP and mAVE of all dynamic classes in Table 4.

  • Method (a) is our LiDAR-only baseline CenterPoint, which achieves mAP of 59.3% and mAVE of 30.3%.

  • Method (b) extends (a) by simply fusing the Radar feature via R2L fusion, which improves mAP by 1.1% and mAVE by −4.1%. This indicates that integrating the Radar feature is effective for improving 3D dynamic object detection.

  • Method (c) extends (b) by utilizing the raw LiDAR points to enhance the Radar features via query-based L2R height feature fusion (QHF), which leads to an improvement of 2.0% mAP and −5.2% mAVE.

  • Method (d) extends (b) by taking advantage of the detailed LiDAR features on the BEV plane to enhance Radar features via query-based L2R BEV feature fusion (QBF), improving mAP by 2.6% and mAVE by −4.8%.

  • Method (e) is our Bi-LRFusion. By combining all the components, it achieves a gain of 2.7% for mAP and −5.3% for mAVE compared to CenterPoint. By enhancing the Radar features before fusing them into the detection network, our bi-directional LiDAR-Radar fusion framework can considerably improve the detection of moving objects.

4.6. Effect of Object Parameters

We also evaluate the performance gain of Bi-LRFusion for objects with different heights/velocities compared with the LiDAR-centric baseline CenterPoint.

Table 5. mAP and mAVE results for the LiDAR-only CenterPoint [40] and Bi-LRFusion for objects with different velocities.
Table 6. The percentage (%) of objects with different velocities in nuScenes dataset, i.e., stationary (0 m/s–0.5 m/s), low-velocity (0.5 m/s–5.0 m/s), medium-velocity (5.0 m/s–10.0 m/s), high-velocity (≥ 10.0 m/s). These groups are divided according to [4].

Effect of Object Height

We group the dynamic objects into two groups: (1) regular-height objects, including cars, motorcycles, bicycles, and pedestrians, and (2) tall objects, including trucks, buses, and trailers. Note that since millimeter wave is not sensitive to non-rigid targets, the AP gain for pedestrians is small or even nonexistent. From Table 1, RadarNet* shows on average +1.9% AP for group (1) and +0.0% AP for group (2) over CenterPoint, while Bi-LRFusion shows on average +3.5% AP for group (1) and +1.2% AP for group (2). In summary, RadarNet* improves the AP values for group (1) objects, but not for group (2). The missing height information in each Radar feature makes it difficult to detect tall objects. In contrast, our Bi-LRFusion effectively avoids this problem and improves the AP for both groups (1) and (2) over the baseline CenterPoint, regardless of the object's height.

Effect of Object Velocity

Next, we look at the advantage of Bi-LRFusion for objects that move at different speeds. Following [4], we consider objects with speed > 0.5 m/s as moving objects, and use 5 m/s and 10 m/s as the boundaries between the low-, medium-, and high-velocity ranges. We also show the percentages of cars and motorcycles within the different velocity ranges in the nuScenes dataset in Table 6: around 70% of cars and motorcycles are stationary, only around 4% are moving faster than 10 m/s, and the rest are uniformly distributed over the low and medium velocity ranges. Tables 5(a) and 5(b) show the mAP and mAVE of the LiDAR-only detector and Bi-LRFusion for cars and motorcycles that move at different speeds. The results show that the performance improvement is more pronounced for objects with high velocity. This demonstrates the effectiveness of Radar in detecting dynamic objects, especially those moving at higher speeds.
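
The velocity grouping used in Tables 5 and 6 can be written as a small helper. The thresholds come from the text; the function name and the handling of values that fall exactly on a boundary (for example, whether 5.0 m/s counts as low or medium) are assumptions.

```python
def velocity_group(speed):
    """Assign a speed in m/s to one of the four velocity groups."""
    if speed < 0:
        raise ValueError("speed must be non-negative")
    if speed <= 0.5:       # objects with speed > 0.5 m/s count as moving
        return "stationary"
    if speed < 5.0:
        return "low"
    if speed < 10.0:
        return "medium"
    return "high"          # >= 10.0 m/s


speeds = [0.0, 0.5, 3.0, 5.0, 9.9, 10.0, 25.0]
groups = [velocity_group(s) for s in speeds]
print(groups)
```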

Figure 5. - Qualitative comparison between the LiDAR-only method CenterPoint [40] and our Bi-LRFusion. The grey dots and red lines are LiDAR points and radar points with velocities. We visualize the prediction and ground truth (green) boxes. Specifically, blue circles mark detections that CenterPoint misses but that Bi-LRFusion recovers by integrating radar data.

SECTION 5. Conclusion

In this paper, we introduce Bi-LRFusion, a bi-directional fusion framework that fuses the complementary LiDAR and Radar data to improve 3D dynamic object detection. Unlike existing LiDAR-Radar fusion schemes that directly integrate Radar features into the LiDAR-centric pipeline, we first make Radar features more discriminative by transferring knowledge from LiDAR data (including both raw points and BEV features) to the Radar features, which alleviates two problems with Radar data: the lack of object height measurements and extreme sparsity. With our bi-directional fusion framework, Bi-LRFusion outperforms earlier schemes for detecting dynamic objects on both the nuScenes and ORR datasets.

ACKNOWLEDGEMENTS

We thank Dequan Wang for his help. This work was done during her internship at Shanghai AI Laboratory. This work was supported by the Chinese Academy of Sciences Frontier Science Key Research Project ZDBS-LY-JSC001, the Australian Medical Research Future Fund MRFAI000085, the National Key R&D Program of China(No.2022ZD0160100), and Shanghai Committee of Science and Technology(No.21DZ1100100).

 
