Recent advancements in robot navigation, particularly with end-to-end learning approaches such as reinforcement learning (RL), have demonstrated strong performance. However, successful navigation still depends on two key capabilities: mapping and planning (explicitly or implicitly). Classical approaches rely on explicit mapping pipelines to register egocentric observations into a coherent map. In contrast, end-to-end learning often achieves this implicitly through recurrent neural networks (RNNs) that fuse current and historical observations into a latent space for planning.
While existing architectures, such as LSTM and GRU, can capture temporal dependencies, our findings reveal a critical limitation: their inability to effectively perform spatial memorization. This capability is essential for integrating sequential observations from varying perspectives to build spatial representations that support planning. To address this, we propose Spatially-Enhanced Recurrent Units (SRUs), a simple yet effective modification to existing RNNs that enhances spatial memorization.
We further introduce an attention-based network architecture integrated with SRUs, enabling long-range mapless navigation using a single forward-facing stereo camera. We also employ regularization techniques to facilitate robust end-to-end recurrent training via RL. Experimental results show a 23.5% overall improvement in long-range navigation compared to existing RNNs. With SRU memory, our method outperforms RL baselines by 29.6% and 105.0%, respectively, across diverse environments requiring long-horizon mapping and memorization. Finally, we address the sim-to-real gap by leveraging large-scale pretraining on synthetic depth data, enabling zero-shot transfer for deployment across diverse and complex real-world environments.
We reveal that standard RNNs (LSTM, GRU, S4, Mamba-SSM), while excelling at temporal dependencies, fundamentally struggle with spatial memorization from varying perspectives.
A simple yet effective modification using element-wise multiplication ("star operation") that enables implicit spatial transformation learning from egocentric observations.
Two-stage spatial attention (self + cross-attention) compresses high-dimensional visual features into task-relevant spatial cues for efficient recurrent memory.
Large-scale pretraining with parallelized depth-noise augmentation enables deployment across offices, terraces, and forests without fine-tuning.
While standard RNNs (LSTM, GRU) excel at capturing temporal dependencies, our findings reveal a critical limitation: they struggle to effectively perform spatial memorization, the ability to transform and integrate sequential observations from varying perspectives into coherent spatial representations.
(a) Temporal memorization: Both LSTM and SRU converge quickly on temporal sequences.
(b) Spatial memorization: SRU converges effectively while LSTM struggles to learn spatial transformations.
(c) LSTM spatial mapping: Fails to accurately register and memorize landmark positions from earlier observations.
(d) SRU spatial mapping: Successfully transforms and memorizes landmark positions across varying perspectives.
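A minimal sketch of such a spatial-memorization probe, assuming a single 2D landmark and translation-only ego-motion (a simplification for illustration, not the exact benchmark behind the figure):

```python
import torch

def make_spatial_memorization_batch(batch: int = 32, steps: int = 10):
    """Toy spatial-memorization probe (simplified, translation-only version).

    A 2D landmark is observed only at the first step, in the initial egocentric
    frame. After a sequence of known ego-motions, the network must output the
    landmark position in the *current* frame, so it has to transform the
    memorized observation rather than merely store it.
    """
    landmark = torch.randn(batch, 2) * 5.0          # landmark in the initial frame
    motions = torch.randn(batch, steps, 2) * 0.5    # per-step ego translation
    obs = torch.cat([landmark.unsqueeze(1),         # observed once...
                     torch.zeros(batch, steps - 1, 2)], dim=1)  # ...then unseen
    inputs = torch.cat([obs, motions], dim=-1)      # (batch, steps, 4): obs + ego-motion
    target = landmark - motions.sum(dim=1)          # landmark expressed in the final frame
    return inputs, target
```

A recurrent network is trained to map the input sequence to the final-frame landmark position. A purely temporal memory can store the first observation, but answering correctly still requires composing it with the accumulated ego-motion, which is exactly where standard gates fall short.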
We propose a simple yet effective modification to standard LSTM and GRU architectures by introducing an additional spatial transformation operation. Inspired by homogeneous transformations and the "star operation" (element-wise multiplication), SRUs enable the network to implicitly learn spatial transformations from sequences of egocentric observations.
An additional learnable transformation term aligns and memorizes observations from varying perspectives while preserving temporal memorization capabilities.
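A minimal PyTorch sketch of how such a spatially-enhanced recurrent cell can be written is shown below; the placement of the star operation and the use of a single linear transform are illustrative assumptions rather than the released SRU implementation.

```python
import torch
import torch.nn as nn

class SpatialGRUCell(nn.Module):
    """Illustrative spatially-enhanced GRU cell (a sketch, not the released SRU code)."""

    def __init__(self, input_size: int, hidden_size: int):
        super().__init__()
        self.gru = nn.GRUCell(input_size, hidden_size)
        # Learnable transform predicted from the current egocentric observation.
        self.transform = nn.Linear(input_size, hidden_size)

    def forward(self, x: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
        # Star operation: element-wise multiplication re-aligns the memory in h
        # to the current perspective, loosely mimicking a homogeneous transform.
        h_aligned = h * self.transform(x)
        # The standard GRU update then fuses the aligned memory with the new
        # input, preserving the usual temporal memorization path.
        return self.gru(x, h_aligned)


# Rolling the cell over a sequence of egocentric feature vectors:
cell = SpatialGRUCell(input_size=64, hidden_size=128)
h = torch.zeros(1, 128)
for x_t in torch.randn(10, 1, 64):   # 10 timesteps, batch size 1
    h = cell(x_t, h)
```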
Self-attention enriches visual features with global context, while cross-attention (queried by robot state and goal) compresses high-dimensional features into task-relevant spatial cues, enabling end-to-end navigation learning directly from ego-centric observations.
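A minimal sketch of this two-stage compression, assuming flattened encoder features as tokens and queries built from the robot state and goal (dimensions and the number of queries are placeholders, not the released architecture):

```python
import torch
import torch.nn as nn

class TwoStageSpatialAttention(nn.Module):
    """Sketch of the self- + cross-attention compression stage (sizes are placeholders)."""

    def __init__(self, feat_dim: int = 256, state_dim: int = 8,
                 num_heads: int = 4, num_queries: int = 1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)
        # Robot state and goal are projected into the queries for cross-attention.
        self.query_proj = nn.Linear(state_dim, feat_dim * num_queries)
        self.num_queries, self.feat_dim = num_queries, feat_dim

    def forward(self, visual_tokens: torch.Tensor, state_goal: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (B, N, feat_dim) flattened spatial features from the depth encoder.
        # Stage 1: self-attention enriches each location with global context.
        ctx, _ = self.self_attn(visual_tokens, visual_tokens, visual_tokens)
        # Stage 2: state/goal-conditioned queries pull out task-relevant spatial cues.
        q = self.query_proj(state_goal).view(-1, self.num_queries, self.feat_dim)
        cues, _ = self.cross_attn(q, ctx, ctx)
        return cues.flatten(1)   # compact vector written into the recurrent memory
```

Because only this compact vector is written into the recurrent state, the memory stays small even though each depth image is high-dimensional.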
A RegNet + FPN backbone pretrained on large-scale synthetic depth data (TartanAir) with a parallelized depth-noise model enables end-to-end navigation from ego-centric depth alone, achieving robust zero-shot sim-to-real transfer without explicit mapping.
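The snippet below sketches what a batched, GPU-parallel depth-noise augmentation could look like during pretraining; the particular noise components (range-dependent Gaussian noise and random holes) and their magnitudes are assumptions for illustration, not the exact sensor model used here.

```python
import torch

def augment_depth(depth: torch.Tensor,
                  noise_std: float = 0.02,
                  hole_prob: float = 0.01,
                  max_range: float = 10.0) -> torch.Tensor:
    """Corrupt a batch of clean synthetic depth images (B, 1, H, W), entirely on GPU."""
    # Range-dependent Gaussian noise: stereo depth error grows with distance.
    noisy = depth + torch.randn_like(depth) * noise_std * depth.clamp(min=0.0)
    # Randomly invalidated pixels ("holes"), as produced by stereo matching failures.
    holes = torch.rand_like(depth) < hole_prob
    noisy = noisy.masked_fill(holes, 0.0)
    # Clip to the sensor's usable range.
    return noisy.clamp(0.0, max_range)
```

Applying such corruption to every batch during pretraining exposes the encoder to realistic sensor artifacts before it ever sees real stereo depth.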
End-to-end architecture: Pretrained depth encoder → Self-attention → Cross-attention → SRU → MLP with TC-Dropout → Velocity commands
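Putting the pieces together, the policy wiring can be sketched as follows (reusing the illustrative modules above; the layer sizes, the 3-D velocity parameterization, and the use of plain dropout in place of TC-Dropout are assumptions):

```python
import torch
import torch.nn as nn

class NavigationPolicy(nn.Module):
    """Sketch of the end-to-end policy wiring (module names and sizes are placeholders)."""

    def __init__(self, encoder: nn.Module, attention: nn.Module, sru: nn.Module,
                 hidden_size: int = 128):
        super().__init__()
        self.encoder = encoder        # pretrained RegNet + FPN depth encoder
        self.attention = attention    # two-stage spatial attention
        self.sru = sru                # spatially-enhanced recurrent unit
        self.head = nn.Sequential(    # MLP head; plain dropout stands in for TC-Dropout
            nn.Linear(hidden_size, 128), nn.ELU(), nn.Dropout(0.1),
            nn.Linear(128, 3),        # velocity command, e.g. (v_x, v_y, yaw rate)
        )

    def forward(self, depth, state_goal, h):
        tokens = self.encoder(depth)               # (B, N, feat_dim) spatial tokens
        cues = self.attention(tokens, state_goal)  # compressed task-relevant cues
        h = self.sru(cues, h)                      # spatial memory update
        return self.head(h), h                     # action and carried recurrent state
```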
Our cross-attention mechanism learns to focus on task-relevant spatial features based on the robot's current state, without any explicit supervision. This emergent behavior demonstrates the policy's ability to selectively attend to obstacles and navigable space in real time.
When the robot's movement direction changes, the attention weights automatically shift to emphasize spatial regions relevant to the new heading. Different colors represent distinct attention heads.
Left Turn: Attention highlights the left region, focusing on obstacles in the new heading direction.
Straight: Attention emphasizes the central region, capturing obstacles directly ahead.
Right Turn: Attention focuses on the right region, concentrating on obstacles in the turn direction.
The learned attention mechanism generalizes to diverse real-world scenarios without any fine-tuning, dynamically highlighting relevant spatial features for navigation.
Attention focuses on furniture, walls, and doorways for indoor navigation.
Attention adapts to stairs, railings, and varied terrain structures.
Attention identifies trees, vegetation, and natural obstacles for long-range navigation.
vs. Standard RNNs
(LSTM, GRU)
vs. Explicit Mapping
(EMHP baseline)
vs. Stacked Frames
(GTRL baseline)
| Method | Maze | Pillars | Stairs | Pits | Overall |
|---|---|---|---|---|---|
| LSTM | 70.3% | 78.2% | 33.1% | 72.7% | 63.5% |
| GRU | 68.1% | 73.6% | 35.7% | 66.7% | 61.0% |
| SRU (Ours) | 76.0% | 81.0% | 82.8% | 75.6% | 78.9% |
SRU achieves a 2.5x improvement in challenging stair environments requiring precise 3D spatial memory.
SRU implicit memory outperforms both historical observation stacking and explicit mapping approaches.
Simply replacing stacked observations with SRU implicit memory yields a 73% improvement, showing that the SRU is the key differentiator for long-range navigation.
Visual comparison of navigation policies with SRU vs LSTM memory across four different environment types.
SRU Navigation Trajectories
Maze: SRU successfully reroutes from dead-ends with spatial memory.
Pillars: SRU takes efficient, shorter paths by memorizing obstacle locations.
Stairs: SRU navigates 3D structures smoothly with precise spatial awareness.
Pits: SRU recalls and avoids hazards throughout navigation.
LSTM Navigation Trajectories
Maze: LSTM loops in dead-ends, struggling to backtrack effectively.
Pillars: LSTM navigates successfully but takes longer, less efficient paths.
Stairs: LSTM oscillates and struggles with 3D spatial memory.
Pits: LSTM fails to spatially register pit locations from varying perspectives.
Unlike explicit memory with fixed horizons (~20m), SRU's implicit recurrent memory maintains >80% success rate for 50m+ distances and >70% for 120m trajectories.
Deployed on Unitree B2W with ZedX camera across indoor offices, outdoor terraces, and forests; no fine-tuning required.
SRU policies successfully backtrack through dead-end corridors and adapt to dynamic environment changes, whereas standard LSTM policies loop indefinitely.
Our mapless navigation policy achieves zero-shot transfer from simulation to diverse real-world environments, relying solely on ego-centric depth perception from a single forward-facing stereo camera. No explicit maps, no fine-tuning on real-world data. Deployed on Unitree B2W with ZedX stereo camera and NVIDIA Jetson AGX Orin.
Tested across four diverse real-world environments demonstrating robust zero-shot generalization.
Unitree B2W wheeled-legged robot (50 Hz locomotion)
ZedX stereo camera (105° FoV, 10m range)
NVIDIA Jetson AGX Orin (5 Hz policy)
LiDAR-based state estimation (DLIO)
Direct deployment from simulation after pretraining on 100K+ synthetic environments, with no real-world fine-tuning
70m+ goals (2.3x the training range) and 100m+ traversal in forests
Forward-facing depth only; no maps or 360° sensing
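As a rough illustration of the on-robot loop, the sketch below runs the policy at 5 Hz from depth and odometry to velocity commands; the helper stubs for the camera, state estimator, and command interface are hypothetical placeholders for the actual ROS2-based deployment code.

```python
import time
import torch

# Hypothetical stand-ins for the real ZedX / DLIO / Unitree interfaces.
def get_depth_image():        # latest depth frame from the forward-facing stereo camera
    return torch.zeros(1, 1, 240, 424)

def get_state_and_goal():     # odometry from DLIO plus the goal in the robot frame
    return torch.zeros(1, 8)

def send_velocity_command(cmd):   # forwarded to the 50 Hz locomotion controller
    pass

# Assumed: a TorchScript export of the trained policy with signature
# (depth, state_goal, h) -> (cmd, h).
policy = torch.jit.load("navigation_policy.pt").eval()
h = torch.zeros(1, 128)           # recurrent SRU state carried across steps
POLICY_DT = 1.0 / 5.0             # the policy runs at 5 Hz on the Jetson AGX Orin

with torch.no_grad():
    while True:
        t0 = time.monotonic()
        cmd, h = policy(get_depth_image(), get_state_and_goal(), h)
        send_velocity_command(cmd)
        time.sleep(max(0.0, POLICY_DT - (time.monotonic() - t0)))
```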
The complete SRU implementation is organized into modular repositories, enabling researchers and practitioners to use components independently or together. All repositories are actively maintained and fully documented.
PyTorch implementation of Spatially-Enhanced Recurrent Units (SRU) modules with spatial memory mechanisms.
End-to-end RL training framework for navigation networks with SRU+attention architecture. Includes hyperparameters and regularization techniques.
IsaacLab task extension with diverse navigation environments, depth observation with pretrained encoding, sparse reward settings, and terrain variations.
Large-scale self-supervised depth perception pretraining on 100K+ synthetic environments. Includes pre-trained models and sensor preprocessing pipelines.
Real robot deployment code for Unitree B2W with ZedX camera. Includes ROS2 integration, Gazebo simulation, sim-to-real transfer techniques, inference pipelines, and hardware setup guides.
Use the complete training pipeline with IsaacLab simulation and RL framework to reproduce results or extend the method.
Deploy pre-trained models on real robots with minimal setup using our deployment code.
Extend SRU capabilities by integrating the core module into your own projects.
Our code release leverages two powerful open-source frameworks for simulation and reinforcement learning:
A modular and extensible simulation framework for robotics research built on NVIDIA Omniverse. We use IsaacLab for fast, physically accurate simulation of navigation environments, with support for dynamic obstacle configurations and diverse terrain variations.
An open-source reinforcement learning library from the Robotic Systems Lab (RSL) at ETH Zurich, optimized for legged locomotion and navigation tasks. We use rsl_rl for efficient PPO training with GPU acceleration and advanced feature normalization.