Kshitij Goel Robotics Researcher

Zero-Shot Metric Depth Estimation via Monocular Visual-Inertial Rescaling for Autonomous Aerial Navigation

ICRA · 2026

Direct metric depth estimation from a monocular RGB image is error-prone in out-of-distribution scenarios (e.g., dusty environments). Can we leverage an IMU to increase the accuracy of metric depth estimation at inference time?

This paper presents a methodology to predict metric depth from monocular RGB images and an inertial measurement unit (IMU). To enable collision avoidance during autonomous flight, prior works either leverage heavy sensors (e.g., LiDARs or stereo cameras) or rely on data-intensive, domain-specific fine-tuning of monocular metric depth estimation methods. In contrast, we propose several lightweight zero-shot rescaling strategies that obtain metric depth from relative depth estimates using the sparse 3D feature map created by a visual-inertial navigation system. These strategies are compared for accuracy in diverse simulation environments. The best-performing approach, which leverages monotonic spline fitting, is deployed in the real world on a compute-constrained quadrotor. We obtain on-board metric depth estimates at 15 Hz and demonstrate successful collision avoidance after integrating the proposed method with a motion-primitive-based planner.
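As a rough illustration of what a lightweight rescaling strategy can look like, the sketch below fits a global scale and shift by least squares at the pixels where sparse metric features are available. This is an assumed baseline for illustration only: the page does not list the individual strategies compared in the paper, and the function names (`fit_scale_shift`, `rescale`), the affine model, and the synthetic data are all hypothetical.

```python
import numpy as np

def fit_scale_shift(rel_at_features, metric_at_features):
    """Least-squares fit of metric ~= s * rel + t at sparse feature pixels.

    Assumption: relative and metric depth are related by a single affine map.
    """
    rel = np.asarray(rel_at_features, dtype=float)
    met = np.asarray(metric_at_features, dtype=float)
    # Design matrix with one column for the scale and one for the shift.
    A = np.stack([rel, np.ones_like(rel)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, met, rcond=None)
    return s, t

def rescale(rel_depth, s, t):
    """Apply the fitted scale and shift to a full relative depth image."""
    return s * np.asarray(rel_depth, dtype=float) + t

# Synthetic data: a small "relative depth image" and its hidden metric version.
rng = np.random.default_rng(0)
rel_image = rng.uniform(0.1, 1.0, size=(4, 6))
true_metric = 8.0 * rel_image + 0.5

# Hypothetical pixel locations of tracked features with known metric depth.
rows = np.array([0, 1, 2, 3, 3])
cols = np.array([0, 2, 4, 1, 5])

s, t = fit_scale_shift(rel_image[rows, cols], true_metric[rows, cols])
metric_image = rescale(rel_image, s, t)
```

A two-parameter fit like this needs only a handful of features, but it cannot correct nonlinear distortion in the relative depth, which is one plausible reason to prefer a more flexible monotonic fit.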

Figures

Hardware Collision Avoidance
Images and data from one hardware experiment demonstrating collision avoidance during autonomous navigation. Data from a monocular camera and IMU are used to rescale relative depth measurements from a monocular depth estimation (MDE) network and obtain metric depth. (a) illustrates the quadrotor aerial robot navigating in the industrial tunnel environment. (b) illustrates the trajectory for the entire flight trial, plotted in red on top of the environment reconstructed from survey-grade FARO scans. The robot uses the proposed approach to select actions that avoid the two pillars in the environment. (c) shows the features tracked in the forward-facing camera, which are used to rescale the predicted image. (d) plots the point cloud generated using our approach in colors ranging from red (closer) to purple (further away), as well as the colorized point cloud generated by a RealSense sensor using active stereo.
Rescaling Approach Overview
Overview of the approach to rescale predicted depth from an MDE network using a metrically accurate sparse 3D feature map from a visual-inertial navigation (VIN) system. The RGB camera image is used by an MDE network to predict a depth image consisting of relative depth estimates. The RGB camera images and IMU data are also used to produce a sparse set of metrically accurate 3D features. We leverage a monotonic spline to rescale the relative depth estimates so that they are metrically accurate. The resulting rescaled depth image is used for navigation.
Ablation Study Images
Examples of images used for the ablation study, derived from the photo-realistic Flightmare [25] simulator. The environments strike a balance between confined spaces (see (a)–(b)) and open spaces where the sky may be seen at a distance (see (c)).
Simulated Navigation Performance
Performance comparison of autonomous navigation in the simulated sewer environments using our rescaled metric depth estimation approach (shown in grey) and the depth camera data (shown in blue). The proposed approach suffers minor degradation compared to the depth camera image, which is expected as scale is estimated using the fused monocular camera and IMU data.
Simulated Evaluation Scenes
Representative simulated scenes used to evaluate the proposed approach. The colors ranging from red (closer) to purple (further away) represent the metric depth estimates after rescaling, relative to the robot's position (shown as a red quadrotor). The predicted depth values closely align with the ground truth, demonstrating the accuracy of the methodology.
Industrial Tunnel Experiment
Images and data from one of the hardware experiments. (a) provides a third-person view of the robot navigating a dusty industrial tunnel environment. (b) illustrates the corner detections. (c) illustrates how dust affects the active stereo depth image from the RealSense. (d) provides the results from our proposed method.
Environmental Detail Comparison
Comparison of environmental details during hardware experiments. (a) shows the RGB image from the RealSense color camera. (b) provides a view of the depth image from the RealSense. (c) provides the results from our proposed method.
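The rescaling step described in the Rescaling Approach Overview caption can be sketched as a monotone mapping from relative depth to metric depth, fitted at the sparse feature locations and then applied to every pixel. The paper's exact monotonic spline formulation is not given on this page, so the sketch below substitutes a simpler stand-in: isotonic regression (pool-adjacent-violators) followed by piecewise-linear interpolation, which is a degree-1 monotone spline. All function names and sample values are hypothetical, and the sketch assumes the relative depths at the features are distinct.

```python
import numpy as np

def pava(y):
    """Pool-adjacent-violators: least-squares non-decreasing fit to y."""
    blocks = []  # each block stores [sum of values, number of values]
    for v in y:
        blocks.append([float(v), 1])
        # Merge backwards while the previous block's mean exceeds the latest one.
        while (len(blocks) > 1 and
               blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    return np.concatenate([np.full(c, t / c) for t, c in blocks])

def fit_monotone_map(rel_at_features, metric_at_features):
    """Return knots (x, y) of a non-decreasing relative-to-metric mapping."""
    rel = np.asarray(rel_at_features, dtype=float)
    met = np.asarray(metric_at_features, dtype=float)
    order = np.argsort(rel)  # sort features by relative depth
    return rel[order], pava(met[order])

def apply_monotone_map(rel_depth, knots_x, knots_y):
    """Rescale a relative depth image; values outside the knots are clamped."""
    rel_depth = np.asarray(rel_depth, dtype=float)
    flat = np.interp(rel_depth.ravel(), knots_x, knots_y)
    return flat.reshape(rel_depth.shape)

# Hypothetical sparse features: relative depth from the MDE network paired
# with noisy metric depth from the VIN feature map (note 2.5 > 2.0 violation).
rel_feat = np.array([0.3, 0.1, 0.5, 0.2, 0.4])
met_feat = np.array([2.0, 1.0, 5.0, 2.5, 4.0])

knots_x, knots_y = fit_monotone_map(rel_feat, met_feat)
rel_image = np.array([[0.10, 0.25], [0.45, 0.90]])
metric_image = apply_monotone_map(rel_image, knots_x, knots_y)
```

Enforcing monotonicity keeps the depth ordering predicted by the network intact, so a pixel predicted as farther away never ends up closer after rescaling, even when individual feature depths are noisy.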

Acknowledgments

The authors would like to thank Jonathan Lee for valuable discussions and insights. This material is based upon work supported in part by the Army Research Laboratory and the Army Research Office under contract/grant number W911NF-25-2-0153.

BibTeX

@inproceedings{zero-shot-depth-rescaling-2026,
  title={Zero-Shot Metric Depth Estimation via Monocular Visual-Inertial Rescaling for Autonomous Aerial Navigation},
  author={Steven Yang and Xiaoyu Tian and Kshitij Goel and Wennie Tabib},
  booktitle={IEEE International Conference on Robotics and Automation (ICRA)},
  year={2026}
}