Quadrotor Navigation using Reinforcement Learning with Privileged Information

ICRA · 2026

How can an end-to-end deep learning approach reduce the failures that conventional modular pipelines cause in quadrotor navigation?

Jonathan Lee, Abhishek Rathod, Kshitij Goel, John Stecklein, Wennie Tabib

This paper presents a reinforcement learning-based quadrotor navigation method that leverages efficient differentiable simulation, novel loss functions, and privileged information to navigate around large obstacles. Prior learning-based methods perform well in scenes that exhibit narrow obstacles, but struggle when the goal location is blocked by large walls or terrain. In contrast, the proposed method utilizes time-of-arrival (ToA) maps as privileged information and a yaw alignment loss to guide the robot around large obstacles. The policy is evaluated in photo-realistic simulation environments containing large obstacles, sharp corners, and dead-ends. Our approach achieves an 86% success rate and outperforms baseline strategies by 34%. We deploy the policy onboard a custom quadrotor in outdoor cluttered environments both during the day and night. The policy is validated across 20 flights, covering 589 meters without collisions at speeds up to 4 m/s.

Figures

**Method Teaser** This paper develops and deploys an end-to-end policy to navigate in challenging environments. The approach outperforms the state of the art by 34%. An example trajectory is captured using long-exposure photography.

**Differentiable Dynamics Training** Differentiable dynamics enables direct policy updates by performing gradient descent on the loss function.
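To make the idea of updating a policy by differentiating through the dynamics concrete, here is a minimal sketch. It is not the paper's simulator or loss: it assumes a 1D point mass, a two-parameter PD "policy" with gains `kp` and `kd`, explicit Euler integration, and a squared position error loss, all chosen for illustration. The gradient is obtained by propagating state sensitivities alongside the rollout (forward-mode differentiation), then used for plain gradient descent.

```python
def rollout(gains, p0=0.0, goal=1.0, dt=0.05, steps=100):
    """Simulate a 1D point mass under PD feedback; return (loss, dloss/dgains)."""
    kp, kd = gains
    p, v = p0, 0.0
    dp = [0.0, 0.0]  # sensitivity of position to (kp, kd)
    dv = [0.0, 0.0]  # sensitivity of velocity to (kp, kd)
    loss, grad = 0.0, [0.0, 0.0]
    for _ in range(steps):
        err = p - goal
        # Chain rule through the action a = -kp * err - kd * v.
        da = [-err - kp * dp[0] - kd * dv[0],
              -v - kp * dp[1] - kd * dv[1]]
        a = -kp * err - kd * v
        # Explicit Euler step of the dynamics and of their sensitivities.
        p, v = p + v * dt, v + a * dt
        dp, dv = ([dp[i] + dv[i] * dt for i in range(2)],
                  [dv[i] + da[i] * dt for i in range(2)])
        # Accumulate the squared position error and its gradient.
        loss += (p - goal) ** 2 * dt
        for i in range(2):
            grad[i] += 2.0 * (p - goal) * dp[i] * dt
    return loss, grad


def train(gains, lr=0.2, iterations=50):
    """Gradient descent on the rollout loss (hypothetical hyperparameters)."""
    gains = list(gains)
    for _ in range(iterations):
        _, grad = rollout(gains)
        gains = [g - lr * dg for g, dg in zip(gains, grad)]
    return gains


initial = [1.0, 1.0]
trained = train(initial)
print("loss before:", rollout(initial)[0], "loss after:", rollout(trained)[0])
```

In practice an automatic differentiation framework would compute these gradients for a full quadrotor model and a neural network policy; the hand-written sensitivities above only show where the gradient signal comes from.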

**Policy Architecture** The end-to-end planning and control architecture is trained as a single neural network. Feature extractors process each input; the resulting features are flattened and summed. A GRUCell helps maintain consistent action predictions over time.
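The data flow described in this caption can be sketched in a few lines of dependency-free Python. Everything specific here is an assumption for illustration: the class name `TinyPolicy`, the two input streams (a flattened depth image and a state vector), the layer sizes, the linear extractors, and the random untrained weights. The recurrent update follows the standard GRU gate equations with biases omitted for brevity.

```python
import math
import random


def linear(weights, x):
    """Matrix-vector product; `weights` is a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def random_matrix(rng, rows, cols, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class TinyPolicy:
    """Per-input feature extractors -> elementwise sum -> GRU cell -> action head."""

    def __init__(self, input_sizes, hidden=8, actions=4, seed=0):
        rng = random.Random(seed)
        # One (hypothetical) linear extractor per input stream.
        self.extractors = [random_matrix(rng, hidden, n) for n in input_sizes]
        # GRU weights for the reset (r), update (z), and candidate (n) gates.
        self.w_x = {g: random_matrix(rng, hidden, hidden) for g in "rzn"}
        self.w_h = {g: random_matrix(rng, hidden, hidden) for g in "rzn"}
        self.head = random_matrix(rng, actions, hidden)
        self.h = [0.0] * hidden  # recurrent state carried between steps

    def step(self, inputs):
        # Extract features from each flattened input, then sum them elementwise.
        feats = [linear(w, x) for w, x in zip(self.extractors, inputs)]
        x = [sum(col) for col in zip(*feats)]
        gx = {g: linear(self.w_x[g], x) for g in "rzn"}
        gh = {g: linear(self.w_h[g], self.h) for g in "rzn"}
        r = [sigmoid(a + b) for a, b in zip(gx["r"], gh["r"])]
        z = [sigmoid(a + b) for a, b in zip(gx["z"], gh["z"])]
        n = [math.tanh(a + ri * b) for a, ri, b in zip(gx["n"], r, gh["n"])]
        # The update gate blends the previous hidden state with the candidate.
        self.h = [(1.0 - zi) * ni + zi * hi for zi, ni, hi in zip(z, n, self.h)]
        return linear(self.head, self.h)


policy = TinyPolicy(input_sizes=[16, 6])
depth = [0.5] * 16                       # stand-in for a flattened depth image
state = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0]   # stand-in for a vehicle state vector
first = policy.step([depth, state])
second = policy.step([depth, state])
```

Because the hidden state persists across calls to `step`, the second action depends on the first observation as well as the second, which is the mechanism that lets a recurrent cell smooth predictions over time.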

**Time-of-Arrival Guidance** (a) Heatmap of time-of-arrival (ToA) computed using the fast marching method (FMM), with the gradient field overlaid. (b) Shortest paths along the ToA gradient from starting points (green dots) to the target (red dot) guide the robot around concave obstacle regions.
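The following sketch illustrates why descending a ToA map steers around a concave obstacle instead of into it. It deliberately swaps the fast marching method for Dijkstra's algorithm on a 4-connected grid, which is simpler to write but measures grid (Manhattan) travel time rather than the smoother continuous arrival time FMM produces. The grid layout, function names, and unit speed are assumptions made for the example.

```python
import heapq
import math

STEPS = ((1, 0), (-1, 0), (0, 1), (0, -1))


def time_of_arrival(grid, goal, speed=1.0):
    """Arrival time from every free cell to `goal`; grid[r][c] is True for obstacles."""
    rows, cols = len(grid), len(grid[0])
    toa = [[math.inf] * cols for _ in range(rows)]
    toa[goal[0]][goal[1]] = 0.0
    frontier = [(0.0, goal)]
    while frontier:
        t, (r, c) = heapq.heappop(frontier)
        if t > toa[r][c]:
            continue  # stale queue entry
        for dr, dc in STEPS:
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and not grid[nr][nc]:
                nt = t + 1.0 / speed  # one cell of travel at unit spacing
                if nt < toa[nr][nc]:
                    toa[nr][nc] = nt
                    heapq.heappush(frontier, (nt, (nr, nc)))
    return toa


def descend(toa, start):
    """Follow the steepest decrease in arrival time from `start` to the goal."""
    r, c = start
    if math.isinf(toa[r][c]):
        raise ValueError("start cannot reach the goal")
    path = [start]
    while toa[r][c] > 0.0:
        neighbors = [(r + dr, c + dc) for dr, dc in STEPS
                     if 0 <= r + dr < len(toa) and 0 <= c + dc < len(toa[0])]
        r, c = min(neighbors, key=lambda rc: toa[rc[0]][rc[1]])
        path.append((r, c))
    return path


# A U-shaped obstacle whose pocket opens toward the start on the left.
layout = [
    ".......",
    ".####..",
    "....#..",
    ".####..",
    ".......",
]
grid = [[ch == "#" for ch in row] for row in layout]
toa = time_of_arrival(grid, goal=(2, 6))
path = descend(toa, start=(2, 0))
```

Cells inside the pocket have a higher arrival time than the start cell even though they are closer to the goal in straight-line distance, so the descent path leaves the start by going around the obstacle rather than entering the dead end.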

**Training Environments** Top-down view of two cylinder-shaped training environments with random primitive obstacles and starting points at a fixed radius from the goal (blue). Trajectories illustrate paths following the time-of-arrival map (yellow to blue).

**Attitude Control Ablation** Comparison of attitude control performance without (a) and with (b) the derivative feedback term, ωd, in the attitude controller.

**Planner Success Rates** Planner success rate and failure modes across 11 diverse environments. The proposed method (Ours) achieves the highest success rate and lowest collision rate compared to the baseline [2] and the ablated policy trained without privileged information (yaw w/o ToA). The Mine environment features a maze-like corridor which results in poor performance from all planners.

**Trajectory Comparison** Trajectories overlaid on ground-truth point clouds, with a cross-section shown for clarity. Trajectories are colored by speed (red to yellow, up to vmax = 3 m/s) with body frames every 2 s. Start and goal are marked in blue and magenta. Successful trials are marked with a ✓ in Industry 2 (top) and Cave 2 (bottom) environments. ToA maps serve as an inductive bias during training only and are not available during simulation or hardware evaluation.

**Gravity Randomization Ablation** Hardware ablation comparing policies trained without (a) and with (b) gravity randomization. Top row: flight snapshots with goal setpoint in red. In (a), the start and goal positions coincide, but upon entering autonomous mode, the vehicle quickly loses altitude. With randomization (b), the robot gains altitude and reaches the goal setpoint. Middle row: plot of z-height from motion capture showing significantly reduced altitude error with the gravity randomized policy. Bottom row: plot of normalized thrust predicted by the policy, where the gravity randomized policy initially outputs 1.3g, substantially higher than the policy without gravity randomization, compensating for modeling inaccuracies.
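One way to read this ablation is that a mismatch between modeled and real thrust behaves like a change in effective gravity, so a policy that has only seen nominal gravity commands too little thrust. The sketch below illustrates that reading with a 1D vertical model. The ±30% sampling range, the function names, and the constant-thrust rollout are assumptions for illustration, not the paper's training configuration.

```python
import random

G_NOMINAL = 9.81  # m/s^2


def sample_gravity(rng, spread=0.3):
    """Draw a per-episode gravity value within +/- `spread` of nominal (assumed range)."""
    return G_NOMINAL * rng.uniform(1.0 - spread, 1.0 + spread)


def altitude_after(thrust_in_g, gravity, seconds=1.0, dt=0.01):
    """Integrate 1D vertical motion under a constant normalized thrust command."""
    z, vz = 0.0, 0.0
    for _ in range(round(seconds / dt)):
        az = thrust_in_g * G_NOMINAL - gravity  # net vertical acceleration
        vz += az * dt
        z += vz * dt
    return z


rng = random.Random(0)
episode_gravity = sample_gravity(rng)  # would be redrawn at the start of each episode

matched = altitude_after(1.0, G_NOMINAL)            # nominal hover command, nominal world
mismatched = altitude_after(1.0, 1.3 * G_NOMINAL)   # same command, heavier effective gravity
compensated = altitude_after(1.3, 1.3 * G_NOMINAL)  # higher command restores hover
```

A command of 1.0 g holds altitude only when the world matches the model; under a heavier effective gravity the same command sinks, while a policy exposed to randomized gravity during training has reason to learn the larger command.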

**Forest Flight Test** Outdoor obstacle avoidance test under tree canopy (Table III Forest Flight 4). The policy predicts up to 30° in yaw to navigate through dense underbrush with speeds up to 3.8 m/s. From top to bottom: VINS trajectory overlaid on terrain map, onboard RGB images, policy depth input (inverted and max pooled), and velocity profile.

**Night Flight Tests** Outdoor obstacle avoidance tests at night using LED illumination and long-exposure photography.

Acknowledgments

The authors would like to thank Ankit Khandelwal for contributions to the codebase and Edsel Burkholder for field testing support. This material is based upon work supported in part by the Army Research Laboratory and the Army Research Office under contract/grant number W911NF-25-2-0153.

BibTeX

@inproceedings{quadrotor-navigation-rl-2026,
  title={Quadrotor Navigation using Reinforcement Learning with Privileged Information},
  author={Jonathan Lee and Abhishek Rathod and Kshitij Goel and John Stecklein and Wennie Tabib},
  booktitle={IEEE International Conference on Robotics and Automation (ICRA)},
  year={2026}
}