diff --git a/_data/navigation.yml b/_data/navigation.yml
index 6600d2a4..90a0f195 100644
--- a/_data/navigation.yml
+++ b/_data/navigation.yml
@@ -139,6 +139,8 @@ wiki:
         url: /wiki/sensing/azure-block-detection/
       - title: DWM1001 UltraWideband Positioning System
         url: /wiki/sensing/ultrawideband-beacon-positioning.md
+      - title: Perception via Thermal Imaging
+        url: /wiki/sensing/thermal-perception/
   - title: Controls & Actuation
     url: /wiki/actuation/
     children:
diff --git a/assets/images/Moge_relative_thermal.png b/assets/images/Moge_relative_thermal.png
new file mode 100644
index 00000000..ddc2d30f
Binary files /dev/null and b/assets/images/Moge_relative_thermal.png differ
diff --git a/assets/images/foundation_stereo.png b/assets/images/foundation_stereo.png
new file mode 100644
index 00000000..1e3c90d7
Binary files /dev/null and b/assets/images/foundation_stereo.png differ
diff --git a/wiki/sensing/thermal-perception.md b/wiki/sensing/thermal-perception.md
new file mode 100644
index 00000000..a1f95890
--- /dev/null
+++ b/wiki/sensing/thermal-perception.md
@@ -0,0 +1,255 @@
---
# Jekyll 'Front Matter' goes here. Most are set by default, and should NOT be
# overwritten except in special circumstances.
# You should set the date the article was last updated like this:
date: 2025-04-29 # YYYY-MM-DD
# This will be displayed at the bottom of the article
# You should set the article's title:
title: Perception via Thermal Imaging
# The 'title' is automatically displayed at the top of the page
# and used in other parts of the site.
---

In this article, we discuss strategies for implementing the key steps of a robotic perception pipeline using thermal cameras. Specifically, we discuss the conditions under which a thermal camera provides more utility than an RGB camera, followed by implementation details for camera calibration, dense depth estimation, and odometry with thermal cameras.

## Why Thermal Cameras?

Thermal cameras are useful in key situations where normal RGB cameras fail, most notably under perceptual degradation such as smoke and darkness. Furthermore, unlike LiDAR and RADAR, thermal cameras do not emit any detectable radiation. If your robot is expected to operate in darkness or smoke-filled areas, thermal cameras are a means for your robot to perceive the environment in nearly the same way visual cameras would in ideal conditions.

## Why Depth is Hard in Thermal

Depth perception, i.e., inferring the 3D structure of a scene, generally relies on texture-rich, high-contrast inputs. Thermal imagery tends to violate these assumptions:

- **Low Texture**: Stereo matching algorithms depend on local patches with distinctive features. Thermal scenes often lack these.
- **High Noise**: Infrared sensors may introduce non-Gaussian noise, which confuses pixel-level correspondence.
- **Limited Resolution**: Consumer-grade thermal cameras are often below 640×480, constraining disparity accuracy.
- **Spectral Domain Shift**: Models trained on RGB datasets fail to generalize directly to the thermal domain.

---

## Calibration

Calibration is the process by which we estimate the internal and external parameters of a camera. The camera intrinsics matrix usually contains the following numbers:

- **fx, fy**: the focal length of the camera in the x and y directions **in the camera's frame**, in px/distance_unit.
- **cx, cy** (or **px, py**): the principal point, i.e., the optical center of the image.
- **Distortion coefficients**: 2 to 6 numbers, depending on the distortion model used.

Additionally, we must also estimate the camera extrinsics, which is the pose of the camera relative to another sensor: either the body frame of the robot (usually defined to coincide with the IMU) or another camera in the case of a multi-camera system.

- These come as a series of 12 numbers: 9 for the rotation matrix and 3 for the translation vector.
- *NOTE*: BE VERY CAREFUL WITH COORDINATE FRAMES.
- If you are using more than one sensor, good time synchronization will help you.

Calibrating thermal cameras is quite similar to calibrating any other RGB sensor. To accomplish this, you must have a checkerboard pattern, an ArUco grid, or some other calibration target.

- A plain square checkerboard is not ideal because it is symmetric, making it hard for the algorithm to estimate whether the orientation of the board has changed.
- An ArUco grid gives precise orientation and is the most reliable option, but it is not strictly necessary.

General tips:

- For a thermal camera, you will need a target with distinct hot and cold edges, e.g., a heated thermal checkerboard.
- Ensure that the edges on the checkerboard are visible and not fuzzy. If they are, adjust the focus, wipe the lens, and check whether any blurring is being applied.
- Ensure the hot parts of the checkerboard are the hottest things in the picture. This makes the checkerboard easier to detect.
- Thermal cameras give 16-bit output by default. You will need to convert this to an 8-bit grayscale image (see the sketch below).
- Other than the checkerboard, the fewer things visible in the image, the better your calibration will be.
- If possible, preprocess your images so that other distracting features are ignored.
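As a starting point for the conversion and preprocessing tips above, here is a minimal sketch using OpenCV and NumPy. It is illustrative only: the percentile bounds, CLAHE parameters, and file name are assumptions to tune for your own sensor.

```python
import cv2
import numpy as np

def thermal16_to_8bit(img16, low_pct=1.0, high_pct=99.0):
    """Convert a 16-bit thermal image to an 8-bit grayscale image.

    Clipping to percentiles (rather than the raw min/max) keeps a few
    very hot or very cold pixels from washing out the rest of the image.
    """
    lo, hi = np.percentile(img16, [low_pct, high_pct])
    img = np.clip(img16.astype(np.float32), lo, hi)
    img = (img - lo) / max(hi - lo, 1e-6)  # normalize to [0, 1]
    return (img * 255.0).astype(np.uint8)

# Example usage on a raw 16-bit frame loaded from disk:
raw = cv2.imread("thermal_frame.png", cv2.IMREAD_UNCHANGED)  # preserves 16-bit depth
gray8 = thermal16_to_8bit(raw)

# Optional: boost local contrast so checkerboard edges come out crisper.
clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
gray8 = clahe.apply(gray8)
```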
### Camera Intrinsics

- Calibrating thermal camera intrinsics will give you fx, fy, cx, cy, and the respective distortion coefficients.

1. Heat up the checkerboard.
2. Record a rosbag with the necessary topics.
3. Preprocess your images.
4. Run them through OpenCV or Kalibr. There are plenty of good resources online, and a minimal OpenCV route is sketched after the example output below.

Example output from Kalibr:

```text
cam0:
  cam_overlaps: []
  camera_model: pinhole
  distortion_coeffs: [-0.3418843277284295, 0.09554844659447544, 0.0006766728551819399, 0.00013250437150091342]
  distortion_model: radtan
  intrinsics: [404.9842534577856, 405.0992911907136, 313.1521147858522, 237.73982476898445]
  resolution: [640, 512]
  rostopic: /thermal_left/image
```
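For the OpenCV route, a minimal sketch follows. The checkerboard geometry (9×6 inner corners, 40 mm squares) and the image glob are hypothetical placeholders, and the frames are assumed to already be preprocessed 8-bit images as described above.

```python
import glob
import cv2
import numpy as np

BOARD = (9, 6)   # inner corners per row/column (hypothetical target)
SQUARE = 0.04    # square size in meters (hypothetical target)

# 3D corner positions of the board in its own frame (z = 0 plane).
objp = np.zeros((BOARD[0] * BOARD[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:BOARD[0], 0:BOARD[1]].T.reshape(-1, 2) * SQUARE

obj_pts, img_pts = [], []
for path in sorted(glob.glob("calib/*.png")):  # preprocessed 8-bit frames
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(gray, BOARD)
    if not found:
        continue  # skip frames where the board was not detected
    corners = cv2.cornerSubPix(
        gray, corners, (11, 11), (-1, -1),
        (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 1e-3))
    obj_pts.append(objp)
    img_pts.append(corners)

# K is the 3x3 intrinsics matrix; dist holds the distortion coefficients.
rms, K, dist, _, _ = cv2.calibrateCamera(
    obj_pts, img_pts, gray.shape[::-1], None, None)
print("RMS reprojection error (px):", rms)  # aim for well under 1 px
```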
### Thermal Camera Peculiarities

- Thermal cameras are extremely noisy. There are ways you can reduce this noise.
- **Camera gain calibration**: The gain values on the camera are used to reduce or increase the intensity of the noise in the image.
  - The noise is increased if you are trying to estimate the static noise pattern and remove it from the image (see FFC below).
- **Flat Field Correction (FFC)**: FFC is used to remove lens effects in the image, such as vignetting and fixed thermal patterns.
  - FFC is carried out by placing a uniform object in front of the camera and taking a picture.
  - The noise patterns and vignetting effects are then estimated and removed from subsequent images.
  - FLIR thermal cameras constantly "click"; this is the camera placing a shutter in front of the sensor, taking a picture, and correcting for any noise.
  - The FLIR documentation describes Supplemental FFC (SFFC), which is the user performing FFC manually. It is recommended that this be performed when the cameras are in their operating conditions.

### Camera Extrinsics

- The relative camera pose is necessary to perform depth estimation. Kalibr calls this a camchain.
- Camera-IMU calibration is necessary to perform sensor fusion and integrate both sensors together. This can also be estimated from CAD.
- Time synchronization is extremely important here, because the sensor readings need to be taken at the exact same time for the algorithm to effectively estimate poses.
- While performing extrinsics calibration, ensure that all axes (up-down, left-right, forward-back, roll, pitch, yaw) are sufficiently excited. Move slowly enough that there is no motion blur on the calibration target, but fast enough to excite the axes.

---

## Our Depth Estimation Pipeline Evolution

### 1. **Stereo Block Matching**

We started with classical stereo techniques. Given left and right images $I_L, I_R$, stereo block matching computes the disparity $d(x, y)$ using a sliding window that minimizes a similarity cost (e.g., the sum of absolute differences):

$d(x, y) = \arg\min_d \; \mathrm{Cost}(x, y, d)$

In broad strokes, this brute-force approach compares blocks from $I_L$ and $I_R$. For each block it computes a cost based on pixel-to-pixel similarity (generally a loss between feature descriptors). Once a block match is found, the disparity is obtained by checking how far each pixel has moved in the x direction.

As you can imagine, this approach is simple and lightweight. However, it is sensitive to the noise in your images and to the contrast separation, and it struggles to find accurate matches on textureless, colorless inputs (like a wall in a thermal image). The algorithm performed better than expected, but we chose not to go ahead with it. A minimal OpenCV version is sketched below.
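The sketch uses OpenCV's semi-global block matcher on a rectified, preprocessed 8-bit pair; the parameter values are illustrative rather than tuned recommendations.

```python
import cv2
import numpy as np

# Assumes the pair is already stereo-rectified and converted to 8-bit.
left = cv2.imread("left_8bit.png", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("right_8bit.png", cv2.IMREAD_GRAYSCALE)

matcher = cv2.StereoSGBM_create(
    minDisparity=0,
    numDisparities=128,      # search range; must be divisible by 16
    blockSize=7,             # matching window size
    P1=8 * 7 * 7,            # smoothness penalty for small disparity changes
    P2=32 * 7 * 7,           # smoothness penalty for large disparity changes
    uniquenessRatio=10,      # reject ambiguous matches (common in low-texture thermal)
    speckleWindowSize=100,   # filter small isolated disparity blobs
    speckleRange=2,
)

# StereoSGBM returns fixed-point disparities scaled by 16.
disp = matcher.compute(left, right).astype(np.float32) / 16.0
disp[disp <= 0] = np.nan  # mark unmatched / invalid pixels
```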
This meant that +the metric depth that we recovered would keep changing significantly across frames, resulting in pointclouds of +different sizes and distances across timesteps. This method, while sound in theory, did not work out ver well in +practise. + +--- + +### 4. **Monocular Metric Depth Predictors** + +We also tested monocular models trained to output metric depth directly. This problem would be the most ill-posed +problem as you would definitely overfit to the baseline of your training data and the approach would fail to generalize +to other baselines. These treat depth as a regression problem from single input $I$: + +$z(x, y) = f(I(x, y))$ + +Thermal's lack of depth cues and color made the problem even harder, and the models performed poorly. + +--- + +### 5. **Stereo Networks Trained on RGB (e.g., MS2, KITTI)** + +Alternatively, when a dual camera setup is used, we call it a stereo approach. This inherently is a much simpler problem +to solve as you have two rays that intersect at the point of capture. I encourage looking at the following set of videos +to understand epipolar geometry and the formualation behind the stereo camera +setup [Link](https://www.youtube.com/watch?v=6kpBqfgSPRc). + +We evaluated multiple pretrained stereo disparity networks. However, there were a lot of differences between the +datasets used for pretraining and our data distribution. These models failed to generalize due to: + +- Domain mismatch (RGB → thermal) +- Texture reliance +- Exposure to only outdoor content +- Reduced exposure + +--- + +## Final Approach: FoundationStereo + +Our final and most successful solution was [FoundationStereo](https://github.com/NVlabs/FoundationStereo), a foundation +model for depth estimation that generalizes to unseen domains without retraining. It is trained on large-scale synthetic +stereo data and supports robust zero-shot inference. + +### Why It Works: + +- **Zero-shot Generalization**: No need for thermal-specific fine-tuning. +- **Strong Priors**: Learned over large datasets of scenes with varied geometry and lighting. (These variations helped + overcome RGB to thermal domain shifts and textureless cues) +- **Robust Matching**: Confidence estimation allows the model to ignore uncertain matches rathern than hallucinate. +- **Formulation**: Formulating the problem as dense depth matching problem also served well. This allowed generalization + to any baseline by constraining the output to the pixel space. + +Stereo rectified thermal image pairs are given to FoundationStereo, which gives us clean disparity maps (image space). +We +recover metric depth using the intrinsics of the camera and the baseline. Finally, we can reproject this into the 3D +space to get consistent point clouds: + +$ +z = \frac{f \cdot B}{d} +$ + +Where: + +- $f$ = focal length, +- $B$ = baseline between cameras, +- $d$ = disparity at pixel. + +An example output is given below (thermal preprocessed on the top left, disparity is middle left, and the metric +pointcloud is on the right). + +![Metric Depth using Foundation Models](/assets/images/foundation_stereo.png) + +## Lessons Learned + +1. **Texture matters**: Thermal's low detail forces the need for models that use global context. +2. **Don’t trust pretrained RGB models**: They often don’t generalize without retraining. +3. **Stereo > Monocular for thermal**: Even noisy stereo is better than ill-posed monocular predictions. +4. 
---

### 3. **MADPose Solver for Metric Recovery**

To determine the global scale and shift, we incorporated a stereo setup and inferred relative depth from both views. We then used the MADPose solver to find the scale and shift of both relative depth maps that make them align, i.e., both depth maps, after being made metric, should describe the same 3D structure. This optimizer also estimates other properties such as the extrinsics between the cameras, solving for more unknowns than necessary. Additionally, no temporal constraint is imposed (you are looking at mostly the same things between timesteps $T$ and $T+1$). This meant that the metric depth we recovered kept changing significantly across frames, resulting in pointclouds of different sizes and distances across timesteps. This method, while sound in theory, did not work out very well in practice.

---

### 4. **Monocular Metric Depth Predictors**

We also tested monocular models trained to output metric depth directly. This is the most ill-posed formulation: the model will almost certainly overfit to the baseline of its training data, and the approach fails to generalize to other baselines. These models treat depth as a regression problem from a single input $I$:

$z(x, y) = f(I(x, y))$

Thermal's lack of depth cues and color made the problem even harder, and the models performed poorly.

---

### 5. **Stereo Networks Trained on RGB (e.g., MS2, KITTI)**

Alternatively, when a dual-camera setup is used, we call it a stereo approach. This is inherently a much simpler problem to solve, as you have two rays that intersect at the captured point. We encourage watching the following set of videos to understand epipolar geometry and the formulation behind the stereo camera setup: [Link](https://www.youtube.com/watch?v=6kpBqfgSPRc).

We evaluated multiple pretrained stereo disparity networks. However, there were many differences between the datasets used for pretraining and our data distribution. These models failed to generalize due to:

- Domain mismatch (RGB → thermal)
- Reliance on texture
- Exposure to only outdoor content
- Reduced exposure

---

## Final Approach: FoundationStereo

Our final and most successful solution was [FoundationStereo](https://github.com/NVlabs/FoundationStereo), a foundation model for depth estimation that generalizes to unseen domains without retraining. It is trained on large-scale synthetic stereo data and supports robust zero-shot inference.

### Why It Works

- **Zero-shot Generalization**: No need for thermal-specific fine-tuning.
- **Strong Priors**: Learned over large datasets of scenes with varied geometry and lighting. (These variations helped overcome the RGB-to-thermal domain shift and the lack of texture cues.)
- **Robust Matching**: Confidence estimation allows the model to ignore uncertain matches rather than hallucinate.
- **Formulation**: Framing the task as a dense disparity matching problem also served well. This allows generalization to any baseline by constraining the output to pixel space.

Stereo-rectified thermal image pairs are given to FoundationStereo, which gives us clean disparity maps (in image space). We recover metric depth using the intrinsics of the camera and the baseline, and can then reproject into 3D space to get consistent point clouds (a back-projection sketch follows the list below):

$z = \frac{f \cdot B}{d}$

Where:

- $f$ = focal length,
- $B$ = baseline between the cameras,
- $d$ = disparity at the pixel.
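Converting a disparity map into a metric point cloud is a direct application of this formula plus pinhole back-projection. A minimal sketch, assuming a rectified pair with focal length `f` (px), baseline `B` (m), and principal point `(cx, cy)` taken from your intrinsics calibration:

```python
import numpy as np

def disparity_to_pointcloud(disp, f, B, cx, cy):
    """Back-project a disparity map into a 3D point cloud in the camera frame.

    disp: [H, W] disparity in pixels (invalid pixels are NaN or <= 0)
    f, B: focal length in pixels and stereo baseline in meters
    """
    H, W = disp.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    valid = np.isfinite(disp) & (disp > 0)

    disp_safe = np.where(valid, disp, 1.0)          # avoid divide-by-zero warnings
    z = np.where(valid, f * B / disp_safe, np.nan)  # z = f * B / d
    x = (u - cx) * z / f                            # pinhole back-projection
    y = (v - cy) * z / f

    pts = np.stack([x, y, z], axis=-1)              # [H, W, 3]
    return pts[valid]                               # [N, 3] valid points only

# Hypothetical usage with intrinsics in the style of the Kalibr output above:
# pts = disparity_to_pointcloud(disp, f=405.0, B=0.12, cx=313.2, cy=237.7)
```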

An example output is given below (preprocessed thermal on the top left, disparity on the middle left, and the metric pointcloud on the right).

![Metric Depth using Foundation Models](/assets/images/foundation_stereo.png)

## Lessons Learned

1. **Texture matters**: Thermal's low detail forces the need for models that use global context.
2. **Don't trust pretrained RGB models**: They often don't generalize without retraining.
3. **Stereo > Monocular for thermal**: Even noisy stereo is better than ill-posed monocular predictions.
4. **Foundation models are promising**: Large-scale pretrained vision backbones like FoundationStereo are surprisingly effective out of the box.

## Conclusion

Recovering depth from thermal imagery is hard, but not impossible. While classical and RGB-trained methods struggled, modern foundation stereo models overcame the domain gap with minimal effort. Our experience suggests that for any team facing depth recovery in non-traditional modalities, foundation models are a compelling place to start.

## See Also

- The [Thermal Cameras wiki page](https://roboticsknowledgebase.com/wiki/sensing/thermal-cameras/) goes into more depth about how thermal cameras function.
