Thursday, August 8, 2024

Sub-Resolution mmWave FMCW Radar-based Touch Localization using Deep Learning



Summary

This document presents research on using millimeter wave (mmWave) radar sensors for touch localization on large displays, as an alternative to expensive capacitive touchscreens. Here are the key points:

1. Problem: Capacitive touchscreens become prohibitively expensive for large displays (>15 inches).

2. Proposed solution: A system using four mmWave FMCW (Frequency Modulated Continuous Wave) radar sensors placed at the corners of the display.

3. Challenge: The inherent range resolution limitations of mmWave radars make it difficult to achieve accurate positioning.

4. Approach: The researchers use a deep learning-based method, training a Convolutional Neural Network (CNN) on range-FFT (Fast Fourier Transform) features against ground truth positions obtained from a capacitive touchscreen.

5. Experimental setup:
   - A metal finger mounted on a robotic arm is used as the target.
   - Data is collected on a 15.6-inch display with a grid of touch points.
   - The system uses four mmWave radar sensors operating at 60 GHz.

6. Results:
   - The CNN-based approach achieves sub-resolution accuracy, with a 90th percentile error of 1.6 cm.
   - This is a 2-3x improvement over conventional signal processing methods.
   - The CNN model has a small size (~350 KB) and fast inference time (~2 ms), making it suitable for real-time implementation.

7. Significance: This approach could enable low-cost touch interaction for large displays, potentially extending to applications like air handwriting on non-touchscreen displays.

The research demonstrates the potential of mmWave radar technology combined with machine learning to provide accurate touch localization at a lower cost than traditional capacitive touchscreens for large displays.

Robot v. Human Finger

 The paper does not explicitly describe a comprehensive process for ensuring equivalence between the robot finger and a human finger. However, the authors do provide some rationale for their approach and mention certain characteristics that make their setup relevant for testing human finger touch localization. Here are the key points:

1. Metal finger as target:
   The researchers use a metal finger mounted on a metallic robot arm as the target. They state that this setup is designed to "emulate the clutter characteristics encountered in radar-based positioning of human fingers."

2. Distributed target:
   The authors mention that the finger and the rest of the hand often appear as a single target to each radar sensor, resulting in a range error dictated by the radar's range resolution. The metal finger on the robot arm is intended to mimic this "distributed" nature of a human hand and finger.

3. Range resolution limitations:
   The paper notes that the main factor limiting accurate finger localization is the limited range resolution of the radar. This limitation applies to both the robot setup and human fingers.

4. Lack of direct comparison:
   The paper does not present a direct comparison between the robot finger and human finger performance, which would have been a more robust way to demonstrate equivalence.

5. Focus on sub-resolution accuracy:
   The researchers emphasize achieving sub-resolution accuracy, which is relevant for both robot and human finger localization given the radar's inherent limitations.

6. Generalization assumption:
   There seems to be an implicit assumption that the machine learning model trained on the robot finger data would generalize to human finger touch events, but this is not explicitly tested or discussed in the paper.

Limitations in ensuring equivalence:

1. No human trials:
   The paper doesn't mention any tests with actual human subjects to validate the equivalence of the robot finger to a human finger.

2. Lack of biomechanical considerations:
   The study doesn't address potential differences in the biomechanics of human finger movement compared to the robot arm's movements.

3. Absence of skin and tissue effects:
   The metal finger doesn't account for the potential effects of human skin and tissue on radar signal reflection and scattering.

4. No discussion of variability:
   Human fingers vary in size, shape, and how they interact with surfaces. The robot finger likely doesn't capture this variability.

In conclusion, while the authors have made efforts to create a setup that mimics some characteristics of human finger touch events, the paper lacks a comprehensive validation of the equivalence between the robot finger and human finger for testing purposes. This is a limitation of the study and an area where future work could provide more robust evidence for the applicability of the results to real-world human-computer interaction scenarios.

Multi-Contact Touch

Based on the information provided in the paper, there is no explicit mention or description of multiple contact testing. The research appears to focus exclusively on single-point touch localization. Here's a breakdown of why this conclusion can be drawn:

1. Experimental setup:
   The paper describes a robot-mounted metal finger touching the screen at individual points on a grid. There's no mention of the robot or system being capable of simulating multiple simultaneous touch points.

2. Data collection:
   The data collection process is described as the robot touching the screen at single points in a grid pattern. There's no indication of collecting data for multiple simultaneous touches.

3. Neural network architecture:
   The CNN model described is designed to output a single 2D position estimate (x̂_t,ML, ŷ_t,ML). This suggests it is trained and optimized for single-point touch localization.

4. Performance metrics:
   All the reported performance metrics (position error distributions, RMSE heatmaps) are for single-point touch localization.

5. Problem statement:
   The paper frames the problem in terms of localizing a single touch point as an alternative to capacitive touchscreens for large displays.

6. Future work:
   While the authors mention potential future applications like handwriting recognition, they don't specifically call out multiple contact testing as an area for future research.

The lack of multiple contact testing is a limitation of this study, particularly considering that multi-touch capability is a common feature in modern touchscreens. This could be an important area for future research to extend the applicability of the mmWave radar-based touch localization system. Multiple contact testing would be crucial for:

1. Gesture recognition: Many touch-based gestures involve multiple fingers.
2. Multi-user interactions: Large displays often need to support multiple users touching the screen simultaneously.
3. Complex interactions: Some applications require tracking multiple touch points for more sophisticated input methods.

In conclusion, based on the information provided in the paper, there does not appear to be any multiple contact testing conducted in this study. The research focuses solely on single-point touch localization, which is a limitation in terms of its direct applicability to more complex touch-based interactions.

Artifacts

Based on the information provided in the paper, there are several key artifacts and details that would be important for independent replication of the work. However, it's worth noting that the paper does not explicitly mention making these artifacts publicly available. Here's a summary of the key elements:

1. Experimental Setup:
   - A 15.6-inch display equipped with a capacitive touchscreen
   - Four mmWave FMCW radar sensors placed at the corners of the screen
   - A metal finger mounted on a programmable robot arm
   - Radar sensor parameters (provided in Table I of the paper)

2. Data Collection:
   - Training dataset: 50 sessions, 31x16 grid, resulting in 24,799 data points
   - Validation and Test datasets: 15 sessions, 30x15 grid, with a 0.5 cm offset from training points
   - Total dimensions: Training (24799, 61, 110, 4), Validation (3600, 61, 110, 4), Test (3150, 61, 110, 4)

3. Radar Signal Processing Pipeline:
   - Detailed steps for pre-processing radar signals (equations provided in the paper)

4. Neural Network Architecture:
   - CNN model architecture (described and illustrated in Figure 4)
   - Model size: ~90,000 parameters, ~350 KB total size

5. Conventional Signal Processing (CSP) Method:
   - Details on range calibration and position estimation using nonlinear least squares

6. Performance Metrics:
   - Position error distributions and RMSE heatmaps for both CNN and CSP methods

For full replication, the following would be necessary but are not explicitly mentioned as being available:

1. Raw Data:
   - The collected radar data and corresponding ground truth positions

2. Code:
   - Implementation of the radar signal processing pipeline
   - Code for the CNN model and training process
   - Implementation of the CSP method

3. Trained Model:
   - The weights of the trained CNN model

4. Detailed Hardware Specifications:
   - Exact models of the radar sensors, robot arm, and other hardware components

5. Experiment Protocol:
   - Detailed procedures for data collection and processing

While the paper provides a comprehensive description of the methods and results, it does not mention a public repository or dataset that would allow for direct replication. Researchers interested in replicating this work would likely need to contact the authors for access to the raw data, code, and potentially the trained model. The lack of explicitly mentioned publicly available artifacts is a limitation for independent replication based solely on the information in this paper.

Figures & Tables

The paper includes several figures and tables that illustrate the experimental setup, methodology, and results. Here's a summary of each:

Figure 1: (a) Schematic of the robot-based data collection setup. (b) mmWave radar network-based positioning setup, showing the Radar Coordinate System (RaCS) and Touchscreen Coordinate System (TCS).

Figure 2: Schematic of the touch locations on the display, showing the grid pattern used for data collection.

Figure 3: Flowchart of the radar signal processing pipeline implemented on each radar sensor. (Detailed explanation below)

Figure 4: The CNN architecture, showing the input (four-channel heatmaps from the radars) and the network layers.

Figure 5: Visualization of how touch points are partitioned into train/validation/test datasets, and their locations relative to the radar sensors.

Figure 6: Distribution of position error for the Conventional Signal Processing (CSP) and Machine Learning (ML) based approaches.

Figure 7: Comparison of RMSE position error heatmaps for (a) CSP and (b) CNN approaches.

Figure 8: Distribution of CNN inference time compared to half the radar frame repetition interval.

Table I: Radar Sensor Parameters, including waveform type, chirp bandwidth, range resolution, frame rate, etc.

Table II: Position Error Performance Comparison on the Test Dataset, showing metrics for both CNN-based and CSP-based approaches.

Detailed explanation of Figure 3:

Figure 3 illustrates the radar signal processing pipeline, which is crucial for generating the features used in both the CSP and ML approaches. Here's how it works:

1. Input: The pipeline starts with the raw Intermediate Frequency (IF) signal x_IF(f, s, c, j, i), where f is the frame index, s the IF sample index, c the chirp index, j the RX antenna index, and i the sensor index.

2. Zero Padding: The IF signal is zero-padded (x_IF,zp) to shrink the effective range-bin width (the range resolution itself is unchanged).

3. Windowing: A window function (w_IF) is applied to the zero-padded signal to reduce sidelobes in the frequency domain.

4. Range-FFT: A Fast Fourier Transform is applied to the windowed signal to obtain the range profile (x_r).

5. Clutter Removal: An IIR moving target indication (MTI) filter removes returns from static objects, yielding x_MTI.

6. Averaging and Beamforming: The signal is averaged across chirps and boresight beamforming is performed to obtain the final feature x_bf(f, r, i).

This pipeline is applied to the signals from each of the four radar sensors. The resulting features (x_bf) are then used as input to the CNN model or for range estimation in the CSP approach.

The pipeline effectively transforms the raw radar signals into a form that highlights the position of the target (finger) while suppressing unwanted clutter and noise. This processed data enables more accurate position estimation in both the CSP and ML approaches.
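
For readers who want to see how these steps compose, here is a minimal NumPy sketch of the per-sensor chain. The array shapes, the Hann window choice, and the filter constant beta are illustrative assumptions; the paper specifies the operations but not these implementation details.

```python
import numpy as np

N_IF, N_OS = 64, 8        # IF samples per chirp and range oversampling factor (Table I / paper)
BETA = 0.05               # IIR MTI filter parameter, 0 < beta < 1 (assumed value)

def process_frame(x_if, clutter):
    """Process one radar frame from one sensor.

    x_if    : (n_chirps, n_rx, N_IF) real-valued IF samples
    clutter : (n_chirps, n_rx, N_OS * N_IF // 2) running clutter estimate from previous frames
    Returns the beamformed range profile x_bf (length N_OS * N_IF // 2) and the updated clutter.
    """
    n_fft = N_OS * N_IF
    # Steps 2-3: window the IF samples, then zero-pad out to the oversampled FFT length.
    x_w = np.zeros(x_if.shape[:-1] + (n_fft,))
    x_w[..., :N_IF] = x_if * np.hanning(N_IF)
    # Step 4: range-FFT; the IF signal is real, so keep only the first half of the spectrum.
    x_r = np.fft.fft(x_w, axis=-1)[..., : n_fft // 2]
    # Step 5: IIR moving-target-indication (MTI) filter removes static clutter.
    clutter = BETA * x_r + (1 - BETA) * clutter
    x_mti = x_r - clutter
    # Step 6: average across chirps and RX antennas (boresight beamforming for a uniform array).
    x_bf = x_mti.mean(axis=(0, 1))
    return x_bf, clutter
```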

Authors

Based on the information provided in the document, here are the details about the authors, their associated institutions, and some related previous work:

Authors and Institutions:
  • Raghunandan M. Rao - Amazon Lab126, Sunnyvale, CA, USA
  • Amit Kachroo - Amazon Web Services (AWS), Santa Clara, CA, USA
  • Koushik A. Manjunatha - Amazon Lab126, Sunnyvale, CA, USA
  • Morris Hsu - Amazon Lab126, Sunnyvale, CA, USA
  • Rohit Kumar - Amazon Lab126, Sunnyvale, CA, USA

All authors except Amit Kachroo are affiliated with Amazon Lab126, while Kachroo is with Amazon Web Services. This suggests the research is likely an Amazon-led project, combining expertise from their consumer electronics division (Lab126) and cloud computing division (AWS). Amazon Lab126 is an inventive San Francisco Bay Area research and development team that designs and engineers high-profile consumer electronic devices such as Fire tablets, Kindle e-readers, Amazon Fire TV, and Amazon Echo.

Previous Related Work:

The paper mentions several previous studies related to alternative technologies for touch or motion tracking:

1. Ultrasound: Yun et al. (2017) designed a device-free system using ultrasound signals to track human finger motion, achieving a median tracking error of 1 cm.

2. WiFi: Wu et al. (2020) proposed a sub-wavelength finger motion tracking system using WiFi signals, achieving a 90th percentile tracking error of 6 cm.

3. Radio Frequency Identification (RFID):
   - Wang et al. (2014) demonstrated an RFID-based system for air handwriting with 90th percentile tracking errors of 9.7 cm (Line-of-Sight) and 10.5 cm (Non-Line-of-Sight).
   - Shangguan and Jamieson (2016) proposed an air handwriting system using differentially polarized antennas, achieving a 90th percentile position error of 11 cm.

4. RF Backscattering: Xiao et al. (2019) developed a system to track handwriting traces using a stylus with an embedded RFID tag, achieving a median tracking error of 0.49 cm at a writing speed of 30 cm/s.

5. Millimeter Wave (mmWave): Wei and Zhang (2015) designed a 60 GHz radio-based passive tracking system, achieving 90th percentile position errors of 0.8/5/15 cm for pen/pencil/marker respectively.

6. Ultra-Wideband (UWB): Cao et al. (2021) proposed an IMU-UWB radar fusion-based tracking approach for a stylus-aided handwriting use-case, achieving a median position error of 0.49 cm (with IMU) and a 90th percentile error of 6 cm (without IMU).

These previous works highlight the ongoing research in the field of alternative touch and motion tracking technologies, providing context for the current study's contribution in using mmWave radar with deep learning for sub-resolution accuracy in touch localization.
 
Abstract: Touchscreen-based interaction on display devices is ubiquitous nowadays. However, capacitive touch screens, the core technology that enables their widespread use, are prohibitively expensive to be used in large displays because the cost increases proportionally with the screen area.

In this paper, we propose a millimeter wave (mmWave) radar-based solution to achieve subresolution error performance using a network of four mmWave radar sensors. Unfortunately, achieving this is non-trivial due to inherent range resolution limitations of mmWave radars, since the target (human hand, finger etc.) is 'distributed' in space. We overcome this using a deep learning-based approach, wherein we train a deep convolutional neural network (CNN) on range-FFT (range vs power profile)-based features against ground truth (GT) positions obtained using a capacitive touch screen. To emulate the clutter characteristics encountered in radar-based positioning of human fingers, we use a metallic finger mounted on a metallic robot arm as the target. Using this setup, we demonstrate subresolution position error performance.

Compared to conventional signal processing (CSP)-based approaches, we achieve a 2-3x reduction in positioning error using the CNN. Furthermore, we observe that the inference time performance and CNN model size support real-time integration of our approach on general purpose processor-based computing platforms.
Comments: 7 pages, 9 figures and 2 tables. To appear in the 100th Vehicular Technology Conference (VTC-Fall 2024)
Subjects: Signal Processing (eess.SP)
Cite as: arXiv:2408.03485 [eess.SP]
  (or arXiv:2408.03485v1 [eess.SP] for this version)
  https://doi.org/10.48550/arXiv.2408.03485

Submission history

From: Raghunandan M Rao [view email]
[v1] Wed, 7 Aug 2024 00:33:56 UTC (1,913 KB)

 Sub-Resolution mmWave FMCW Radar-based Touch Localization using Deep Learning 

  • †R. M. Rao, K. A. Manjunatha, M. Hsu, and R. Kumar are with Amazon Lab126, Sunnyvale, CA, USA, 94089 (email:{raghmrao, koushiam}@amazon.com, mhsu@lab126.com, rrohk@amazon.com). 

    †A. Kachroo is with Amazon Web Services (AWS), Santa Clara, CA, 95054 (email: amkachro@amazon.com).

     Raghunandan M. Rao, Amit Kachroo, Koushik A. Manjunatha, Morris Hsu, Rohit Kumar

    Abstract

    Touchscreen-based interaction on display devices is ubiquitous nowadays. However, capacitive touch screens, the core technology that enables their widespread use, are prohibitively expensive to be used in large displays because the cost increases proportionally with the screen area. In this paper, we propose a millimeter wave (mmWave) radar-based solution to achieve sub-resolution error performance using a network of four mmWave radar sensors. Unfortunately, achieving this is non-trivial due to inherent range resolution limitations of mmWave radars, since the target (human hand, finger etc.) is ‘distributed’ in space. We overcome this using a deep learning-based approach, wherein we train a deep convolutional neural network (CNN) on range-FFT (range vs. power profile)-based features against ground truth (GT) positions obtained using a capacitive touch screen. To emulate the clutter characteristics encountered in radar-based positioning of human fingers, we use a metallic finger mounted on a metallic robot arm as the target. Using this setup, we demonstrate sub-resolution position error performance. Compared to conventional signal processing (CSP)-based approaches, we achieve a 2-3× reduction in positioning error using the CNN. Furthermore, we observe that the inference time performance and CNN model size support real-time integration of our approach on general purpose processor-based computing platforms.

    Index Terms:
    mmWave radar, deep neural network, sub-resolution touch localization, large displays.

    I Introduction

    Modern displays use capacitive touchscreens for enabling touch-based interaction with the device, wherein touch localization is performed by processing the changes in electrical properties of the touchscreen layers across the display [1]. In general, the touchscreen cost scales linearly with the area of the display covered by the touchscreen. As a result, it becomes prohibitively expensive to use capacitive touchscreens in large displays (for instance, display size >15 inch). Furthermore, since the size of interactive elements (icons, sliders, buttons, etc.) tends to be large on a large screen, the positioning error requirement can often be relaxed from the typical mm-level accuracy to a few cm, without significantly impacting the user experience (UX). This work is focused on enabling accurate touch positioning in the latter regime.

    I-A Related Work

    To reduce the cost while providing accurate positioning performance, there is significant interest in using alternative technologies such as Ultrasound [2], WiFi [3], Radio Frequency [4, 5], mmWave [6, 7] and Ultrawideband (UWB) [8, 9, 10] radar. Yun et al. [2] designed a device-free system using ultrasound signals to track human finger motion. Their algorithm is based on processing the channel impulse response (CIR) to estimate the absolute distance and the distance displacement using multiple CIRs, resulting in a median tracking error δr_track,50 = 1 cm. Wu et al. [3] proposed a sub-wavelength finger motion tracking system using one WiFi transmitter and two WiFi receivers. Their approach used a channel quotient-based feature to detect minute changes in the channel state information (CSI) due to finger movement, resulting in δr_track,90 = 6 cm. Wang et al. [4] propose a Radio Frequency Identification (RFID) sensor worn on the finger to demonstrate precise tracking of air handwriting gestures. The authors demonstrated tracking errors of δr_track,90 = 9.7 cm and 10.5 cm in Line-of-Sight (LoS) and Non-LoS (NLoS) conditions respectively. Shangguan [5] proposed an air handwriting system based on two differentially polarized antennas to track the orientation and position of an RFID-tagged pen, achieving a 90th percentile position error (δr_pos,90) of 11 cm. Xiao et al. [7] proposed an RF backscattering-based system to track handwriting traces performed using a stylus in which an RFID tag is embedded. The authors demonstrated δr_track,50 = 0.49 cm at a writing speed of 30 cm/s. Wei [6] designed a 60 GHz radio-based passive tracking system to position different writing objects such as pen/pencil/marker, obtaining δr_pos,90 = 0.8/5/15 cm respectively. Their approach relies on passive backscattering of a single-carrier 60 GHz signal, using which the initial location is acquired; the low tracking error is obtained by tracking its phase over time. In [8], the authors propose an Inertial Measurement Unit (IMU)-UWB radar fusion-based tracking approach to implement a stylus-aided handwriting use-case that achieves δr_pos,50 = 0.49 cm. However, in the absence of the IMU, the authors report δr_pos,90 = 6 cm.

    Even though the works [2, 3, 4, 7] report a low tracking error, their position error is high. While this trade-off is acceptable in finger tracking applications such as handwriting recognition, it is unacceptable for on-screen interaction, where the performance is dictated by the position error, not by the accuracy of the reconstructed trajectory. On the other hand, works such as [8] that achieve cm-level position accuracy necessitate the use of additional IMU sensors, which adds system/computational complexity and friction to the UX.

    I-B Contributions

    In this work, we bridge this gap by proposing a mmWave radar sensor network-based positioning framework that uses a Deep Convolutional Neural Network (CNN) to achieve sub-resolution position accuracy. We build a robot-based testbed for characterizing positioning performance, in which we use a robot-mounted metal finger as the distributed target. We collect data for multiple runs to capture the metal finger’s signature for each radar sensor at different locations on the screen, as well as the corresponding ground truth position using a capacitive touchscreen. The conventional signal processing (CSP)-based approach, which uses range calibration coupled with the nonlinear least squares (NLS) algorithm [11], results in δr_pos,90 = 3.7 cm. On the other hand, inspired by the LeNet model [12], we design a CNN that outperforms the CSP-based approach and achieves sub-resolution accuracy, with δr_pos,90 = 1.6 cm. Finally, we demonstrate that the small model size and CNN inference time make real-time implementation feasible on general purpose processor-based computing platforms.

    Figure 1: Schematic of (a) the robot-based data collection setup, and (b) the mmWave radar network-based positioning setup, and description of the Radar Coordinate System (RaCS) and Touchscreen Coordinate System (TCS).
    Figure 2: Schematic of the touch locations on the display. In our setup, d_l = 34.3 cm, d_w = 17.8 cm, and δ_x = δ_y = 1 cm.

    II System Design

    II-A Working Principle

    In contemporary consumer electronic devices, touch localization on capacitive touchscreen displays relies on changes in the electrical properties of carefully arranged material layers when a finger touches the screen. In essence, the location is estimated by determining the ‘touch cell’ where there is maximum variation in the capacitance [13]. In contrast, contactless methods can also be used if accurate distance [14, 15, 16] and/or angle information [17] of the finger is known relative to sensors with known positions. In this work, we design a low-cost touch positioning system that uses multiple Frequency Modulated Continuous Wave (FMCW) mmWave radar sensors to locate the “finger” (target) on the screen. FMCW mmWave radar sensors are attractive for short range sensing applications because they can be operated at low power and manufactured at low cost. This is in part due to the low bandwidth (1 MHz) of the baseband signal processing chain, despite the large sweep bandwidth (>1 GHz) [18].

    However, the main factor that limits accurate finger localization is the limited range resolution (Δr_res) of the radar, given by Δr_res = c/(2 f_BW), where c is the free-space velocity of light, and f_BW the chirp bandwidth of the FMCW radar [19]. For example, the radar will not be able to distinguish objects that are closer than 3 cm (in the radial direction) for f_BW = 5 GHz. As a result, the finger and the rest of the hand often appear as a single target to each radar sensor, thus resulting in a range error that is dictated by Δr_res.
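
    As a quick numerical check of this relation, the waveform in Table I gives a range resolution of roughly 3.08 cm; the snippet below is a trivial illustration, not code from the paper.

```python
c = 3e8                     # free-space speed of light (m/s)
f_bw = 4.874e9              # chirp bandwidth from Table I (Hz)
delta_r_res = c / (2 * f_bw)
print(f"Range resolution: {delta_r_res * 1e2:.3f} cm")  # ~3.08 cm (Table I lists 3.075 cm)
```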

    II-B Experimental Testbed Setup

    The main focus of this work is to evaluate the position error of a mmWave radar-based touch solution. To undertake this, we built a testbed whose schematic is shown in Fig. 1(a). The setup is based on a 15.6 inch display, on which N_rad = 4 radar sensors are placed at the corners of the screen using 3D printed fixtures, as shown in Fig. 1(b). The display is equipped with a capacitive touchscreen, which is interfaced with a dedicated computer to obtain the ground truth (position and time) for each touch event. A metal finger is used as the target which is to be localized by the radar sensor network. The metal finger is mounted on a programmable robot to achieve precise control over its trajectory during the data collection session, as shown in Fig. 1(a). The robot is controlled by another dedicated computer, and is programmed to touch the display on a grid of points, as shown in Fig. 2. The spacing between each point on the grid is approximately 1 cm along both vertical and horizontal directions. It is important to note that localization performance in this setup is typically limited by Δr_res, since the metal finger (target, analogous to the human finger) and the metallic robotic arm (analogous to the rest of the hand) on which it is mounted will appear as a single target to the radar. The radar configuration used is shown in Table I. It is worthwhile to note that this waveform is compliant with the latest FCC final rule [20].

    TABLE I: Radar Sensor Parameters
    Parameter | Value
    Waveform type | FMCW
    Chirp Bandwidth (f_BW) | 4.874 GHz
    Range Resolution (Δr) | 3.075 cm
    Frame Rate (f_r) | 120 Hz
    Number of IF samples/chirp (N_IF) | 64
    Number of chirps/frame (N_ch) | 8
    Number of RX antennas/sensor (N_rx) | 3
    Number of radar sensors (N_rad) | 4
    Figure 3: Flowchart of the radar signal processing pipeline implemented on each radar sensor.

    II-C Radar Signal Pre-Processing

    The signal processing pipeline for generating the feature is shown in Fig. 3. Let f, s, r, c, j and i denote the frame index, IF sample index, range bin, chirp index, RX antenna index, and the sensor index respectively. Each radar sensor transmits the FMCW waveform with parameters shown in Table I. For each frame f, the received waveform is then down-converted to get the intermediate frequency (IF) signal x_IF(f, s, c, j, i), which corresponds to the radar return. From this, the range information corresponding to all scattering objects in the radar’s field of view (FoV) is obtained by computing the beamformed ‘range-FFT’ x_r(f, r, c, j, i) using the following sequence of steps.



    $$x_{\mathrm{IF,zp}}(f,s,c,j,i) = \begin{cases} x_{\mathrm{IF}}(f,s,c,j,i) & \text{for } s < N_{\mathrm{IF}} \\ 0 & \text{for } N_{\mathrm{IF}} \le s \le N_{\mathrm{os}}N_{\mathrm{IF}}-1, \end{cases} \qquad (1)$$


    $$x_{w}(f,s,c,j,i) = x_{\mathrm{IF,zp}}(f,s,c,j,i)\, w_{\mathrm{IF}}(s), \qquad (2)$$


    $$x_{r}(f,r,c,j,i) = \sum_{s=0}^{N_{\mathrm{os}}N_{\mathrm{IF}}-1} x_{w}(f,s,c,j,i)\, e^{-j\frac{2\pi s r}{N_{\mathrm{os}}N_{\mathrm{IF}}}}, \qquad (3)$$

    for r = 0, 1, …, (N_os N_IF/2 − 1). Here, zero-padding is performed in (1) to shrink the effective range-bin width to Δr_os = Δr/N_os, where N_os = 8 is the oversampling factor. (Note that while this shrinks the range-bin width, the range resolution, i.e. the minimum radial distance between two targets such that they appear as two distinct targets, is unchanged; oversampling in the range domain minimizes the contribution of range quantization error in the system.) The zero-padded signal is then used to compute the range-FFT in (3) after a windowing operation w_IF(·). The purpose of the latter is to trade off the range-FFT sidelobe level with the mainlobe width. It is worthwhile to note that the IF signal contains only the in-phase component and hence is a real-valued signal. Thus, the range-FFT is symmetric about r = N_IF N_os/2. Clutter removal is then used to eliminate the scattered returns from static objects using an IIR moving target indication (MTI) filter to get the post-MTI range-FFT signal x_MTI(f, r, c, j, i), given by


    $$x_{c}(f,r,c,j,i) = \beta\, x_{r}(f,r,c,j,i) + (1-\beta)\, x_{c}(f-1,r,c,j,i),$$
    $$x_{\mathrm{MTI}}(f,r,c,j,i) = x_{r}(f,r,c,j,i) - x_{c}(f,r,c,j,i), \qquad (4)$$

    where 0 < β < 1 is the IIR filter response parameter and x_c(f, ·) is the clutter estimate for the fth frame. Finally, to keep the feature dimension manageable for real-time implementation, averaging across chirps and boresight beamforming are performed to get the beamformed range-FFT feature x_bf(f, r, i). (Since the feature is obtained through linear operations on the raw IF signal, averaging across chirps and boresight beamforming can equivalently be performed on the IF signal prior to the range-FFT as well.) The feature is computed using


    $$x_{\mathrm{bf}}(f,r,i) = \frac{1}{N_{\mathrm{rx}}N_{\mathrm{ch}}} \sum_{j=0}^{N_{\mathrm{rx}}-1} \sum_{c=0}^{N_{\mathrm{ch}}-1} x_{\mathrm{MTI}}(f,r,c,j,i). \qquad (5)$$

    Note that for uniform linear/planar arrays, boresight beamforming is equivalent to signal averaging across RX antennas.

    II-D Ground Truth

    For each touch event, the capacitive touchscreen-based ground truth (GT) information is composed of the relative location (p_t,x, p_t,y) and touch timestamp (t_GT), such that 0 ≤ p_t,x ≤ 1 and 0 ≤ p_t,y ≤ 1. The relative locations are converted to locations in the radar coordinate system r_GT = (x_t,GT, y_t,GT) using knowledge of the reference radar location w.r.t. the touch area. We use the sign convention shown in Fig. 1(b), using which the GT coordinates are calculated using

    $$\bm{r}_{t,\mathrm{GT}} = \left(p_{t,x}\, d_l + d_{x,0},\;\; p_{t,y}\, d_w + d_{y,0}\right). \qquad (6)$$
    Figure 4: The four-channel input x̄_bf(F_i(n), R) consists of heatmaps from the four mmWave radars. The deep convolutional neural network consists of three Convolution2D + MaxPooling2D layer pairs followed by one dense layer; the final dense layer outputs the 2D position estimate r̂_t,ML of the touch event on the display.

    III Conventional Signal Processing (CSP)-based Positioning

    To improve the accuracy of the conventional signal processing-based estimates, we use range estimates from the beamformed signal, as well as the per-RX signals. Firstly, the post-MTI range-FFT from each RX antenna is averaged across chirps using


    $$x_{\mathrm{MTI},c}(f,r,j,i) = \frac{1}{N_{\mathrm{ch}}} \sum_{c=0}^{N_{\mathrm{ch}}-1} x_{\mathrm{MTI}}(f,r,c,j,i). \qquad (7)$$

    Then, range estimates corresponding to the per-RX (r̂_ij(f)) as well as beamformed signals (r̂_bf,i(f)) are estimated using


    $$\hat{r}_{ij}(f) = \Delta r_{\mathrm{os}} \, \underset{r}{\arg\max}\; |x_{\mathrm{MTI},c}(f,r,j,i)|^2, \qquad (8)$$

    $$\hat{r}_{\mathrm{bf},i}(f) = \Delta r_{\mathrm{os}} \, \underset{r}{\arg\max}\; |x_{\mathrm{bf}}(f,r,i)|^2. \qquad (9)$$

    To have reliable ranging performance in the presence of low SNR conditions and strong clutter regions (e.g. portion of the hand excluding the finger such as shoulder, torso, palm etc.), we invalidate the range estimate when there is no consensus among the different per-RX range estimates. The range estimate from the ith sensor is computed using


    $$\hat{r}_{i}(f) = \begin{cases} \hat{r}_{\mathrm{bf},i}(f) - r_{\mathrm{cal},i} & \text{if } |\hat{r}_{ij}(f) - \hat{r}_{ik}(f)| \le \Delta r_{\mathrm{th}} \;\; \forall\, j \ne k \\ \mathrm{nan} & \text{otherwise}, \end{cases} \qquad (10)$$

    where Δr_th is the range consensus tolerance, and r_cal,i is the range offset for the ith sensor, which is obtained using a one-time calibration of the localization environment. Let f_n be the radar frame index corresponding to the nth touch event. Then, the range estimates are averaged over a window of N_w = 5 frames, to mitigate the unavailability of range estimates, resulting in a range estimate r̄_i(f_n) = (1/N_val) Σ_{m=0}^{N_w−1} r̂_i(f_n − m) · 𝟙[r̂_i(f_n − m) ≠ nan], where N_val = |{m : r̂_i(f_n − m) ≠ nan}|.

    Finally, the target’s position estimate (r̂_t,CSP(f)) is obtained by solving the nonlinear least squares (NLS) problem [11]


    $$\hat{\bm{r}}_{t,\mathrm{CSP}}(f_n) = \underset{\bm{r}}{\arg\min} \sum_{i=0}^{N_s-1} \left( \|\bm{r}_i - \bm{r}\|_2 - \bar{r}_i(f_n) \right)^2. \qquad (11)$$
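
    A minimal sketch of this position solver is shown below, assuming the averaged, calibrated per-sensor ranges of (10) are already available; the use of scipy.optimize.least_squares and the screen-center initial guess are implementation assumptions rather than details taken from the paper.

```python
import numpy as np
from scipy.optimize import least_squares

# Radar sensor positions from Fig. 5 (cm).
SENSORS = np.array([[0.0, 0.0], [35.8, 0.9], [36.4, 19.8], [0.6, 20.2]])

def csp_position(r_bar, sensors=SENSORS, x0=(17.0, 9.0)):
    """Solve the NLS problem (11) for one touch event, given averaged ranges r_bar in cm."""
    valid = ~np.isnan(r_bar)               # sensors invalidated by the consensus check are dropped

    def residuals(p):
        dists = np.linalg.norm(sensors[valid] - p, axis=1)
        return dists - r_bar[valid]

    sol = least_squares(residuals, x0=np.asarray(x0, dtype=float))
    return sol.x                           # estimated (x, y) position in the radar coordinate system

# Hypothetical example: one of the four range estimates failed the consensus check.
print(csp_position(np.array([20.1, 22.4, np.nan, 18.0])))
```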
    Figure 5: Partitioning of touch points into train/validation/test datasets, and their locations relative to the radar sensors. In our testbed, the radar locations are r_0 = (0, 0) cm, r_1 = (35.8, 0.9) cm, r_2 = (36.4, 19.8) cm, and r_3 = (0.6, 20.2) cm.

    IV Deep Neural Network-based Positioning

    IV-A Feature Generation

    The datastream from each radar sensor and the capacitive touchscreen are collected independently without explicit synchronization. The relatively high sampling rate of the radar (120 Hz) and the touchscreen (90 Hz) w.r.t. the finger movement speed eliminates the need for explicit sensor synchronization. The radar frame indices corresponding to each touch event are found using the GT touch time t_GT. Suppose the ith radar frame index corresponding to the touch event is f_GT,i. For each touch event, the feature contains the beamformed range-FFT of all radar sensors for the frames F_i = {f_GT,i − 25, f_GT,i − 24, …, f_GT,i + 25} and the range bin indices R = {0, 1, …, R_max}, where R_max = ⌈N_os √(d_l² + d_w²)/Δr⌉ is the maximum possible target distance (in range bins) on the display. In our setup, R_max = 110. The feature for the nth touch event is the tensor x̄_bf(F_i(n), R) ∈ ℝ^(61×110×4), comprising four heatmaps, each corresponding to one of the radar sensors.
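
    In code, the feature assembly amounts to slicing a 61-frame window around the touch frame of each sensor and stacking the four resulting heatmaps; the buffer layout and the per-sensor frame index argument below are assumptions made for illustration.

```python
import numpy as np

N_HALF = 25    # frames on each side of the touch frame -> 61 frames in total
R_MAX = 110    # number of range bins retained for the 15.6-inch display

def build_feature(range_profiles, touch_frames):
    """Assemble the CNN input tensor for one touch event.

    range_profiles : list of 4 arrays, one per sensor, each (n_frames, n_range_bins)
                     holding |x_bf(f, r, i)| magnitudes
    touch_frames   : list of 4 per-sensor frame indices f_GT,i derived from the GT timestamp
    Returns a (61, 110, 4) tensor.
    """
    feature = np.zeros((2 * N_HALF + 1, R_MAX, len(range_profiles)), dtype=np.float32)
    for i, (x_bf, f_gt) in enumerate(zip(range_profiles, touch_frames)):
        feature[:, :, i] = x_bf[f_gt - N_HALF : f_gt + N_HALF + 1, :R_MAX]
    return feature
```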

    IV-B Machine Learning Model Architecture

    The proposed CNN-based architecture used in this work is shown in Fig. 4. It is similar to the LeNet-5 architecture [12], except that it uses three convolution layers with three max-pooling layers, rather than the three convolution layers with two average-pooling layers of the original architecture. We use three cascaded 2D-Convolution + 2D-Maxpooling layers with 32, 32, and 32 filters respectively, each with a kernel size of 3×3 and ReLU activation. The max-pooling layer after each convolution is used not only to reduce the feature size but also to reduce over-fitting. This improves generalization and also reduces the memory required to host the model on the device. After the convolution and pooling operations, the output is flattened, followed by a dense layer with 16 units and a final dense layer with 2 units that generates the 2D position coordinates r̂_t,ML = (x̂_t,ML, ŷ_t,ML).
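
    A Keras sketch of a model matching this description is given below. The 'same' padding, 2×2 pooling, optimizer, and loss are assumptions; the paper specifies the layer types, filter counts, kernel size, and dense-layer widths, and reports a total of roughly 9×10⁴ parameters (the exact count depends on such padding choices).

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_cnn(input_shape=(61, 110, 4)):
    """Three Conv2D(32, 3x3, ReLU) + MaxPooling2D blocks, then Dense(16) and Dense(2)."""
    return models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(32, (3, 3), activation="relu", padding="same"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(32, (3, 3), activation="relu", padding="same"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(32, (3, 3), activation="relu", padding="same"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(16, activation="relu"),
        layers.Dense(2),                          # (x_hat, y_hat) touch coordinates
    ])

model = build_cnn()
model.compile(optimizer="adam", loss="mse")       # squared position error as the regression loss (assumed)
model.summary()
```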

    V Experimental Results

    V-A Data Collection

    The effective touch area on the 15.6 inch display is a rectangular area of length d_l = 34.3 cm and width d_w = 17.8 cm. In a single session, the data is collected across a point grid with an arbitrary offset from Radar 0 (r_0), such that consecutive touch points are separated by 1 cm along both the axes, as shown in Fig. 2. Data is collected in two stages:

    1. Training Dataset: In this stage, we collected data for 50 sessions. In each session, the robot touched the screen across a 31×16 grid. After accumulating data across all sessions, the training data has the dimension (24799, 61, 110, 4).

    2. Validation and Test Datasets: In this stage, we collected data for 15 sessions. In each session, the robot touched the screen across a 30×15 grid. The grid pattern was designed such that the validation/test touchpoints have a position offset of δr = (0.5 cm, 0.5 cm) relative to the training touchpoints. This offset is introduced to test the generalization performance of the machine learning model on unseen data. Finally, touchpoints in the odd/even rows are allocated to the validation/test datasets, as shown in Fig. 5 (a small sketch of this grid construction follows the list). Hence, the validation and test dataset dimensions are (3600, 61, 110, 4) and (3150, 61, 110, 4) respectively.
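
    As referenced above, the sketch below illustrates how the two grids and the odd/even row split could be generated; the per-session grid origin is arbitrary in the paper, so the offset used here is a placeholder.

```python
import numpy as np

def touch_grids(origin=(1.0, 1.0)):
    """Training grid (31x16, 1 cm pitch) plus the (0.5, 0.5) cm-offset validation/test grids."""
    ox, oy = origin                                       # per-session offset from Radar 0 (placeholder)
    train = np.array([(ox + x, oy + y) for y in range(16) for x in range(31)], dtype=float)

    val, test = [], []
    for row in range(1, 16):                              # 15 rows of the 30x15 offset grid
        points = [(ox + 0.5 + x, oy + 0.5 + (row - 1)) for x in range(30)]
        (val if row % 2 == 1 else test).extend(points)    # odd rows -> validation, even rows -> test
    return train, np.array(val), np.array(test)

train, val, test = touch_grids()
print(train.shape, val.shape, test.shape)   # (496, 2), (240, 2), (210, 2) per session
# Over 50 training and 15 validation/test sessions this matches the reported
# dataset sizes of roughly 24799, 3600, and 3150 touch events.
```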

    V-B Range Calibration

    For the conventional signal processing approach, the range calibration offset for each radar sensor (r_cal,i) is estimated using the training dataset. For the nth touchpoint in the training set, the range estimate corresponding to each radar sensor is computed using (10), and is compared to the GT distance r_t,GT,i(n) = ‖r_t,GT(n) − r_i‖₂. Suppose the corresponding range error is Δr_i(n) = r̄_i(f_n) − r_t,GT,i(n). Then, the range calibration offset is estimated by computing the empirical average of the range errors for each sensor, i.e. r_cal,i = (1/N_train) Σ_{n=0}^{N_train−1} Δr_i(n), where N_train = 24799. Note that this is the least-squares (LS) solution of the range-bias estimation problem under the model r̂(n) = r(n) + ε for n = 0, 1, …, N_train − 1, where ε is the range bias.
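
    In code, the per-sensor calibration offset is just the empirical mean of the range errors over the training set; a minimal sketch (with nan-robust averaging to skip invalidated estimates, an assumption) follows.

```python
import numpy as np

def range_calibration(r_est, r_gt):
    """Per-sensor range calibration offsets r_cal,i.

    r_est, r_gt : arrays of shape (N_train, N_sensors) holding the averaged range estimates
                  and the ground-truth radar-to-touchpoint distances; estimates that failed
                  the consensus check may be nan and are ignored.
    """
    # Least-squares estimate of the range bias under the model r_hat = r + eps.
    return np.nanmean(r_est - r_gt, axis=0)
```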

    Figure 6: Distribution of position error for the Conventional Signal Processing and ML-based approaches.
    TABLE II: Position Error Performance Comparison on the Test Dataset
    Performance Metric | CNN-based | CSP-based
    Median Pointwise RMSE | 0.84 cm | 1.92 cm
    90%ile Pointwise RMSE | 1.6 cm | 3.85 cm
    Median Error (all points) | 0.82 cm | 1.75 cm
    90%ile Error (all points) | 1.62 cm | 3.7 cm
    Figure 7: Comparison of the root mean square error (RMSE) position heatmaps (colorbar unit is cm) when using (a) conventional signal processing (after range calibration), and (b) our proposed CNN model (from Fig. 4).

    V-C Position Error Performance Comparison

    Fig. 6 shows the marginal distribution of position error (marginalized across the entire test dataset) for (a) training/validation/test datasets (CNN-based approach), and (b) the test dataset (CSP-based approach). First, we observe that there is more than a 2× improvement in the validation/test position error performance in the median and 90th percentile value, when using our CNN-based approach. Furthermore, we observe that these values are well within the range resolution of the radar waveform (Δr_res = 3.075 cm). On the other hand, we observe that while range calibration significantly improves the position error performance, the range-calibrated CSP-based approach achieves a 90th percentile position error of 3.7 cm, which is 20% higher than Δr_res. In alignment with our understanding of the physical limitations imposed by the radar waveform, the CSP-based method is unable to achieve sub-resolution accuracy. The key performance statistics are summarized in Table II.

    Fig. 7 shows the heatmaps of RMSE position error for the CNN and CSP-based methods, for different touch regions on the display. These heatmaps are computed for the test dataset, visualized in Fig. 5. We observe that the CNN-based approach results in more than a 3× improvement in the worst-case (maximum) RMSE, compared to that of the CSP-based approach. In general, there is more than a 2× improvement in RMSE position error for the CNN-based approach relative to that of CSP, when comparing the point-wise median and 90th percentile RMSE, as shown in Table II.

    Figure 8: Distribution of the CNN inference time (t_CNN,inf, shown in blue) versus half the radar frame repetition interval (1/(2 f_r), shown in red).

    V-D Feasibility of Real-Time Implementation

    Inference execution time and model size are two important aspects of the CNN model that determine whether it is suitable for real-time implementation. Our model has about 9×10⁴ parameters, with a total size of about 350 KB. Thus, the memory required to store the model is quite small, and the model can fit on any standard system-on-chip (SoC).

    For integrating any ML-based algorithm into a real-time localization system, it is important that the inference time (t_CNN,inf) be smaller than the radar frame repetition interval (1/f_r). To evaluate feasibility, we used a computer with an Intel i7-1185G7 processor, 16 GB RAM, and no GPU. Fig. 8 shows the distribution of t_CNN,inf characterized on the test dataset. We observe that the median as well as the 90th percentile inference time is about 2 ms, which is considerably smaller than the radar frame interval (8.33 ms in our system). Thus, the small model size and inference time indicate that our proposed CNN-based approach is well-suited for real-time implementation on general purpose processor-based platforms.
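
    This kind of measurement is straightforward to reproduce; the sketch below times single-sample inference with Keras predict() and compares the latency percentiles to the 8.33 ms frame budget. The trial count and the use of time.perf_counter are assumptions, and the absolute numbers will depend on the host CPU.

```python
import time
import numpy as np

def time_inference(model, n_trials=500):
    """Measure single-sample CNN inference latency against the radar frame budget."""
    x = np.random.rand(1, 61, 110, 4).astype("float32")
    model.predict(x, verbose=0)                       # warm-up call
    latencies = []
    for _ in range(n_trials):
        t0 = time.perf_counter()
        model.predict(x, verbose=0)
        latencies.append(time.perf_counter() - t0)
    med, p90 = np.percentile(latencies, [50, 90])
    budget = 1.0 / 120.0                              # frame repetition interval for f_r = 120 Hz
    print(f"median: {med * 1e3:.2f} ms, 90th pct: {p90 * 1e3:.2f} ms, budget: {budget * 1e3:.2f} ms")
```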

    VI Conclusion

    In this paper, we proposed a mmWave FMCW radar-based touch localization system, wherein a deep neural network was trained to accurately localize a robot-mounted metal finger. We demonstrated that the CNN-based approach achieved sub-resolution position error and significantly outperformed conventional signal processing-based algorithms. Finally, we discussed the feasibility of implementing our proposed approach in real time: the small (a) CNN model size and (b) inference time on general purpose computing platforms (relative to the radar frame interval) point towards strong feasibility for implementation in a real-time localization system.

    In this work, we have focused on accurate localization of robot-mounted targets. In general, extending this work to design localization systems for small targets, such as accurate touch localization of the human finger and enabling handwriting on non-touchscreen displays, is a worthwhile direction to enable low-cost technologies for human-screen interaction.

    References


    [1] C.-L. Lin, C.-S. Li, Y.-M. Chang, T.-C. Lin, J.-F. Chen, and U.-C. Lin, “Pressure Sensitive Stylus and Algorithm for Touchscreen Panel,” Journal of Display Technology, vol. 9, no. 1, pp. 17–23, 2013.
    [2] S. Yun, Y.-C. Chen, H. Zheng, L. Qiu, and W. Mao, “Strata: Fine-Grained Acoustic-based Device-Free Tracking,” in Proceedings of the 15th Annual International Conference on Mobile Systems, Applications, and Services, 2017, pp. 15–28.
    [3] D. Wu, R. Gao, Y. Zeng, J. Liu, L. Wang, T. Gu, and D. Zhang, “FingerDraw: Sub-Wavelength Level Finger Motion Tracking with WiFi Signals,” Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, vol. 4, no. 1, pp. 1–27, 2020.
    [4] J. Wang, D. Vasisht, and D. Katabi, “RF-IDraw: Virtual Touch Screen in the Air using RF Signals,” ACM SIGCOMM Computer Communication Review, vol. 44, no. 4, pp. 235–246, 2014.
    [5] L. Shangguan and K. Jamieson, “Leveraging Electromagnetic Polarization in a Two-Antenna Whiteboard in the Air,” in Proceedings of the 12th International Conference on emerging Networking EXperiments and Technologies, 2016, pp. 443–456.
    [6] T. Wei and X. Zhang, “mTrack: High-Precision Passive Tracking using Millimeter Wave Radios,” in Proceedings of the 21st Annual International Conference on Mobile Computing and Networking, 2015, pp. 117–129.
    [7] N. Xiao, P. Yang, X.-Y. Li, Y. Zhang, Y. Yan, and H. Zhou, “MilliBack: Real-Time Plug-n-Play Millimeter Level Tracking using Wireless Backscattering,” Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, vol. 3, no. 3, pp. 1–23, 2019.
    [8] Y. Cao, A. Dhekne, and M. Ammar, “ITrackU: Tracking a Pen-like Instrument via UWB-IMU Fusion,” in Proceedings of the 19th Annual International Conference on Mobile Systems, Applications, and Services, 2021, pp. 453–466.
    [9] N. Hendy, H. M. Fayek, and A. Al-Hourani, “Deep Learning Approaches for Air-Writing Using Single UWB Radar,” IEEE Sensors Journal, vol. 22, no. 12, pp. 11989–12001, 2022.
    [10] F. Khan, S. K. Leem, and S. H. Cho, “In-Air Continuous Writing Using UWB Impulse Radar Sensors,” IEEE Access, vol. 8, pp. 99302–99311, 2020.
    [11] R. Zekavat and R. M. Buehrer, Handbook of Position Location: Theory, Practice and Advances. Wiley-IEEE Press, 2019.
    [12] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based Learning Applied to Document Recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
    [13] Z. Shen, S. Li, X. Zhao, and J. Zou, “CT-Auth: Capacitive Touchscreen-Based Continuous Authentication on Smartphones,” IEEE Transactions on Knowledge and Data Engineering, pp. 1–16, 2023.
    [14] J. Yan, C. C. J. M. Tiberius, G. J. M. Janssen, P. J. G. Teunissen, and G. Bellusci, “Review of Range-based Positioning Algorithms,” IEEE Aerospace and Electronic Systems Magazine, vol. 28, no. 8, pp. 2–27, 2013.
    [15] R. M. Rao, A. V. Padaki, B. L. Ng, Y. Yang, M.-S. Kang, and V. Marojevic, “ToA-Based Localization of Far-Away Targets: Equi-DOP Surfaces, Asymptotic Bounds, and Dimension Adaptation,” IEEE Transactions on Vehicular Technology, vol. 70, no. 10, pp. 11089–11094, 2021.
    [16] R. M. Rao and D.-R. Emenonye, “Iterative RNDOP-Optimal Anchor Placement for Beyond Convex Hull ToA-Based Localization: Performance Bounds and Heuristic Algorithms,” IEEE Transactions on Vehicular Technology, vol. 73, no. 5, pp. 7287–7303, 2024.
    [17] L. Badriasl and K. Dogancay, “Three-Dimensional Target Motion Analysis using Azimuth/Elevation Angles,” IEEE Transactions on Aerospace and Electronic Systems, vol. 50, no. 4, pp. 3178–3194, 2014.
    [18] A. Santra and S. Hazra, Deep Learning Applications of Short-Range Radars. Artech House, 2020.
    [19] F. Uysal, “Phase-Coded FMCW Automotive Radar: System Design and Interference Mitigation,” IEEE Transactions on Vehicular Technology, vol. 69, no. 1, pp. 270–281, 2020.
    [20] FCC, “FCC Empowers Short-Range Radars in the 60 GHz Band,” Federal Communications Commission, Final Rule, July 2023. [Online]. Available: https://www.govinfo.gov/content/pkg/FR-2023-07-24/pdf/2023-15367.pdf

 
