Recent advancements in robot navigation, particularly with end-to-end learning approaches such as reinforcement learning (RL), have demonstrated strong performance. However, successful navigation still depends on two key capabilities: mapping and planning (explicitly or implicitly). Classical approaches rely on explicit mapping pipelines to register egocentric observations into a coherent map. In contrast, end-to-end learning often achieves this implicitly through recurrent neural networks (RNNs) that fuse current and historical observations into a latent space for planning.
While existing architectures, such as LSTM and GRU, can capture temporal dependencies, our findings reveal a critical limitation: their inability to effectively perform spatial memorization. This capability is essential for integrating sequential observations from varying perspectives to build spatial representations that support planning. To address this, we propose Spatially-Enhanced Recurrent Units (SRUs), a simple yet effective modification to existing RNNs that enhances spatial memorization.
We further introduce an attention-based network architecture integrated with SRUs, enabling long-range mapless navigation using a single forward-facing stereo camera. We also employ regularization techniques to facilitate robust end-to-end recurrent training via RL. Experimental results show a 23.5% overall improvement in long-range navigation compared to existing RNNs. With SRU memory, our method outperforms RL baselines by 29.6% and 105.0%, respectively, across diverse environments requiring long-horizon mapping and memorization. Finally, we address the sim-to-real gap by leveraging large-scale pretraining on synthetic depth data, enabling zero-shot transfer for deployment across diverse and complex real-world environments.
We reveal that standard RNNs (LSTM, GRU, S4, Mamba-SSM), while excelling at temporal dependencies, fundamentally struggle with spatial memorization from varying perspectives.
A simple yet effective modification using element-wise multiplication ("star operation") that enables implicit spatial transformation learning from egocentric observations.
Two-stage spatial attention (self + cross-attention) compresses high-dimensional visual features into task-relevant spatial cues for efficient recurrent memory.
Large-scale pretraining with parallelized depth-noise augmentation enables deployment across offices, terraces, and forests without fine-tuning.
While standard RNNs (LSTM, GRU) excel at capturing temporal dependencies, our findings reveal a critical limitation: they struggle to effectively perform spatial memorization, the ability to transform and integrate sequential observations from varying perspectives into coherent spatial representations.
(a) Temporal memorization: Both LSTM and SRU converge quickly on temporal sequences.
(b) Spatial memorization: SRU converges effectively while LSTM struggles to learn spatial transformations.
(c) LSTM spatial mapping: Fails to accurately register and memorize landmark positions from earlier observations.
(d) SRU spatial mapping: Successfully transforms and memorizes landmark positions across varying perspectives.
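A minimal sketch of such a spatial-memorization probe, assuming a single 2D landmark and translation-only ego-motion (a simplification for illustration, not the exact benchmark behind the figure):

```python
import torch

def make_spatial_memorization_batch(batch: int = 32, steps: int = 10):
    """Toy spatial-memorization probe (simplified, translation-only version).

    A 2D landmark is observed only at the first step, in the initial egocentric
    frame. After a sequence of known ego-motions, the network must output the
    landmark position in the *current* frame, so it has to transform the
    memorized observation rather than merely store it.
    """
    landmark = torch.randn(batch, 2) * 5.0          # landmark in the initial frame
    motions = torch.randn(batch, steps, 2) * 0.5    # per-step ego translation
    obs = torch.cat([landmark.unsqueeze(1),         # observed once...
                     torch.zeros(batch, steps - 1, 2)], dim=1)  # ...then unseen
    inputs = torch.cat([obs, motions], dim=-1)      # (batch, steps, 4): obs + ego-motion
    target = landmark - motions.sum(dim=1)          # landmark expressed in the final frame
    return inputs, target
```

A recurrent network is trained to map the input sequence to the final-frame landmark position. A purely temporal memory can store the first observation, but answering correctly still requires composing it with the accumulated ego-motion, which is exactly where standard gates fall short.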
We propose a simple yet effective modification to standard LSTM and GRU architectures by introducing an additional spatial transformation operation. Inspired by homogeneous transformations and the "star operation" (element-wise multiplication), SRUs enable the network to implicitly learn spatial transformations from sequences of egocentric observations.
An additional learnable transformation term aligns and memorizes observations from varying perspectives while preserving temporal memorization capabilities.
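A minimal PyTorch sketch of how such a spatially-enhanced recurrent cell can be written is shown below; the placement of the star operation and the use of a single linear transform are illustrative assumptions rather than the released SRU implementation.

```python
import torch
import torch.nn as nn

class SpatialGRUCell(nn.Module):
    """Illustrative spatially-enhanced GRU cell (a sketch, not the released SRU code)."""

    def __init__(self, input_size: int, hidden_size: int):
        super().__init__()
        self.gru = nn.GRUCell(input_size, hidden_size)
        # Learnable transform predicted from the current egocentric observation.
        self.transform = nn.Linear(input_size, hidden_size)

    def forward(self, x: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
        # Star operation: element-wise multiplication re-aligns the memory in h
        # to the current perspective, loosely mimicking a homogeneous transform.
        h_aligned = h * self.transform(x)
        # The standard GRU update then fuses the aligned memory with the new
        # input, preserving the usual temporal memorization path.
        return self.gru(x, h_aligned)


# Rolling the cell over a sequence of egocentric feature vectors:
cell = SpatialGRUCell(input_size=64, hidden_size=128)
h = torch.zeros(1, 128)
for x_t in torch.randn(10, 1, 64):   # 10 timesteps, batch size 1
    h = cell(x_t, h)
```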
Self-attention enriches visual features with global context, while cross-attention (queried by robot state and goal) compresses high-dimensional features into task-relevant spatial cues, enabling end-to-end navigation learning directly from ego-centric observations.
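A minimal sketch of this two-stage compression, assuming flattened encoder features as tokens and queries built from the robot state and goal (dimensions and the number of queries are placeholders, not the released architecture):

```python
import torch
import torch.nn as nn

class TwoStageSpatialAttention(nn.Module):
    """Sketch of the self- + cross-attention compression stage (sizes are placeholders)."""

    def __init__(self, feat_dim: int = 256, state_dim: int = 8,
                 num_heads: int = 4, num_queries: int = 1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)
        # Robot state and goal are projected into the queries for cross-attention.
        self.query_proj = nn.Linear(state_dim, feat_dim * num_queries)
        self.num_queries, self.feat_dim = num_queries, feat_dim

    def forward(self, visual_tokens: torch.Tensor, state_goal: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (B, N, feat_dim) flattened spatial features from the depth encoder.
        # Stage 1: self-attention enriches each location with global context.
        ctx, _ = self.self_attn(visual_tokens, visual_tokens, visual_tokens)
        # Stage 2: state/goal-conditioned queries pull out task-relevant spatial cues.
        q = self.query_proj(state_goal).view(-1, self.num_queries, self.feat_dim)
        cues, _ = self.cross_attn(q, ctx, ctx)
        return cues.flatten(1)   # compact vector written into the recurrent memory
```

Because only this compact vector is written into the recurrent state, the memory stays small even though each depth image is high-dimensional.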
A RegNet + FPN backbone pretrained on large-scale synthetic depth data (TartanAir) with a parallelized depth-noise model enables end-to-end navigation from ego-centric depth alone, achieving robust zero-shot sim-to-real transfer without explicit mapping.
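The snippet below sketches what a batched, GPU-parallel depth-noise augmentation could look like during pretraining; the particular noise components (range-dependent Gaussian noise and random holes) and their magnitudes are assumptions for illustration, not the exact sensor model used here.

```python
import torch

def augment_depth(depth: torch.Tensor,
                  noise_std: float = 0.02,
                  hole_prob: float = 0.01,
                  max_range: float = 10.0) -> torch.Tensor:
    """Corrupt a batch of clean synthetic depth images (B, 1, H, W), entirely on GPU."""
    # Range-dependent Gaussian noise: stereo depth error grows with distance.
    noisy = depth + torch.randn_like(depth) * noise_std * depth.clamp(min=0.0)
    # Randomly invalidated pixels ("holes"), as produced by stereo matching failures.
    holes = torch.rand_like(depth) < hole_prob
    noisy = noisy.masked_fill(holes, 0.0)
    # Clip to the sensor's usable range.
    return noisy.clamp(0.0, max_range)
```

Applying such corruption to every batch during pretraining exposes the encoder to realistic sensor artifacts before it ever sees real stereo depth.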
End-to-end architecture: Pretrained depth encoder → Self-attention → Cross-attention → SRU → MLP with TC-Dropout → Velocity commands
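Putting the pieces together, the policy wiring can be sketched as follows (reusing the illustrative modules above; the layer sizes, the 3-D velocity parameterization, and the use of plain dropout in place of TC-Dropout are assumptions):

```python
import torch
import torch.nn as nn

class NavigationPolicy(nn.Module):
    """Sketch of the end-to-end policy wiring (module names and sizes are placeholders)."""

    def __init__(self, encoder: nn.Module, attention: nn.Module, sru: nn.Module,
                 hidden_size: int = 128):
        super().__init__()
        self.encoder = encoder        # pretrained RegNet + FPN depth encoder
        self.attention = attention    # two-stage spatial attention
        self.sru = sru                # spatially-enhanced recurrent unit
        self.head = nn.Sequential(    # MLP head; plain dropout stands in for TC-Dropout
            nn.Linear(hidden_size, 128), nn.ELU(), nn.Dropout(0.1),
            nn.Linear(128, 3),        # velocity command, e.g. (v_x, v_y, yaw rate)
        )

    def forward(self, depth, state_goal, h):
        tokens = self.encoder(depth)               # (B, N, feat_dim) spatial tokens
        cues = self.attention(tokens, state_goal)  # compressed task-relevant cues
        h = self.sru(cues, h)                      # spatial memory update
        return self.head(h), h                     # action and carried recurrent state
```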
Our cross-attention mechanism learns to focus on task-relevant spatial features based on the robot's current state, without any explicit supervision. This emergent behavior demonstrates the policy's ability to selectively attend to obstacles and navigable space in real time.
When the robot's movement direction changes, the attention weights automatically shift to emphasize spatial regions relevant to the new heading. Different colors represent distinct attention heads.
Left Turn: Attention highlights the left region, focusing on obstacles in the new heading direction.
Straight: Attention emphasizes the central region, capturing obstacles directly ahead.
Right Turn: Attention focuses on the right region, concentrating on obstacles in the turn direction.
The learned attention mechanism generalizes to diverse real-world scenarios without any fine-tuning, dynamically highlighting relevant spatial features for navigation.
Attention focuses on furniture, walls, and doorways for indoor navigation.
Attention adapts to stairs, railings, and varied terrain structures.
Attention identifies trees, vegetation, and natural obstacles for long-range navigation.
vs. Standard RNNs
(LSTM, GRU)
vs. Explicit Mapping
(EMHP baseline)
vs. Stacked Frames
(GTRL baseline)
| Method | Maze | Pillars | Stairs | Pits | Overall |
|---|---|---|---|---|---|
| LSTM | 70.3% | 78.2% | 33.1% | 72.7% | 63.5% |
| GRU | 68.1% | 73.6% | 35.7% | 66.7% | 61.0% |
| SRU (Ours) | 76.0% | 81.0% | 82.8% | 75.6% | 78.9% |
SRU achieves a 2.5x improvement in challenging stair environments requiring precise 3D spatial memory.
SRU implicit memory outperforms both historical observation stacking and explicit mapping approaches.
Simply replacing stacked observations with SRU implicit memory yields a 73% improvement, showing that the SRU is the key differentiator for long-range navigation.
Visual comparison of navigation policies with SRU vs LSTM memory across four different environment types.
SRU Navigation Trajectories
Maze: SRU successfully reroutes from dead-ends with spatial memory.
Pillars: SRU takes efficient, shorter paths by memorizing obstacle locations.
Stairs: SRU navigates 3D structures smoothly with precise spatial awareness.
Pits: SRU recalls and avoids hazards throughout navigation.
LSTM Navigation Trajectories
Maze: LSTM loops in dead-ends, struggling to backtrack effectively.
Pillars: LSTM navigates successfully but takes longer, less efficient paths.
Stairs: LSTM oscillates and struggles with 3D spatial memory.
Pits: LSTM fails to spatially register pit locations from varying perspectives.
Unlike explicit memory with fixed horizons (~20m), SRU's implicit recurrent memory maintains >80% success rate for 50m+ distances and >70% for 120m trajectories.
Deployed on Unitree B2W with ZedX camera across indoor offices, outdoor terraces, and forests; no fine-tuning required.
SRU policies successfully backtrack through dead-end corridors and adapt to dynamic environment changes, whereas standard LSTM policies loop indefinitely.
Our mapless navigation policy achieves zero-shot transfer from simulation to diverse real-world environments, relying solely on ego-centric depth perception from a single forward-facing stereo camera. No explicit maps, no fine-tuning on real-world data. Deployed on Unitree B2W with ZedX stereo camera and NVIDIA Jetson AGX Orin.
Tested across four diverse real-world environments demonstrating robust zero-shot generalization.
Unitree B2W wheeled-legged robot (50 Hz locomotion)
ZedX stereo camera (105° FoV, 10m range)
NVIDIA Jetson AGX Orin (5 Hz policy)
LiDAR-based state estimation (DLIO)
Direct deployment from simulation after pretraining on 100K+ synthetic environments, with no real-world fine-tuning
70m+ goals (2.3x the training range) and 100m+ traversal in forests
Forward-facing depth only; no maps or 360° sensing
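As a rough illustration of the on-robot loop, the sketch below runs the policy at 5 Hz from depth and odometry to velocity commands; the helper stubs for the camera, state estimator, and command interface are hypothetical placeholders for the actual ROS2-based deployment code.

```python
import time
import torch

# Hypothetical stand-ins for the real ZedX / DLIO / Unitree interfaces.
def get_depth_image():        # latest depth frame from the forward-facing stereo camera
    return torch.zeros(1, 1, 240, 424)

def get_state_and_goal():     # odometry from DLIO plus the goal in the robot frame
    return torch.zeros(1, 8)

def send_velocity_command(cmd):   # forwarded to the 50 Hz locomotion controller
    pass

# Assumed: a TorchScript export of the trained policy with signature
# (depth, state_goal, h) -> (cmd, h).
policy = torch.jit.load("navigation_policy.pt").eval()
h = torch.zeros(1, 128)           # recurrent SRU state carried across steps
POLICY_DT = 1.0 / 5.0             # the policy runs at 5 Hz on the Jetson AGX Orin

with torch.no_grad():
    while True:
        t0 = time.monotonic()
        cmd, h = policy(get_depth_image(), get_state_and_goal(), h)
        send_velocity_command(cmd)
        time.sleep(max(0.0, POLICY_DT - (time.monotonic() - t0)))
```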
The complete SRU implementation is organized into modular repositories, enabling researchers and practitioners to use components independently or together. All repositories are actively maintained and fully documented.
PyTorch implementation of Spatially-Enhanced Recurrent Units (SRU) modules with spatial memory mechanisms.
End-to-end RL training framework for navigation networks with SRU+attention architecture. Includes hyperparameters and regularization techniques.
IsaacLab task extension with diverse navigation environments, depth observation with pretrained encoding, sparse reward settings, and terrain variations.
Large-scale self-supervised depth perception pretraining on 100K+ synthetic environments. Includes pre-trained models and sensor preprocessing pipelines.
Real robot deployment code for Unitree B2W with ZedX camera. Includes ROS2 integration, Gazebo simulation, sim-to-real transfer techniques, inference pipelines, and hardware setup guides.
Use the complete training pipeline with IsaacLab simulation and RL framework to reproduce results or extend the method.
Deploy pre-trained models on real robots with minimal setup using our deployment code.
Extend SRU capabilities by integrating the core module into your own projects.
Our code release leverages two powerful open-source frameworks for simulation and reinforcement learning:
A modular and extensible simulation framework for robotics research built on NVIDIA Omniverse. We use IsaacLab for fast, physically accurate simulation of navigation environments, with support for dynamic obstacle configurations and diverse terrain variations.
An open-source reinforcement learning library from the Robotic Systems Lab (RSL) at ETH Zurich, optimized for legged locomotion and navigation tasks. We use rsl_rl for efficient PPO training with GPU acceleration and advanced feature normalization.