RNNs can capture time, but can they capture space?

Spatially-Enhanced Recurrent Memory for Long-Range Mapless Navigation

via End-to-End Reinforcement Learning

Fan Yang1,*, Per Frivik1, David Hoeller1, Chen Wang2, Cesar Cadena1, Marco Hutter1

1Robotic Systems Lab, ETH Zurich  •  2Spatial AI & Robotics Lab, University at Buffalo

*Corresponding Author: fanyang1@ethz.ch

International Journal of Robotics Research (IJRR)  •  arXiv:2506.05997  •  June 2025

Abstract

Recent advancements in robot navigation, particularly with end-to-end learning approaches such as reinforcement learning (RL), have demonstrated strong performance. However, successful navigation still depends on two key capabilities: mapping and planning (explicitly or implicitly). Classical approaches rely on explicit mapping pipelines to register egocentric observations into a coherent map. In contrast, end-to-end learning often achieves this implicitly through recurrent neural networks (RNNs) that fuse current and historical observations into a latent space for planning.

While existing architectures, such as LSTM and GRU, can capture temporal dependencies, our findings reveal a critical limitation: their inability to effectively perform spatial memorization. This capability is essential for integrating sequential observations from varying perspectives to build spatial representations that support planning. To address this, we propose Spatially-Enhanced Recurrent Units (SRUs), a simple yet effective modification to existing RNNs that enhances spatial memorization.

We further introduce an attention-based network architecture integrated with SRUs, enabling long-range mapless navigation using a single forward-facing stereo camera. We also employ regularization techniques to facilitate robust end-to-end recurrent training via RL. Experimental results show a 23.5% overall improvement in long-range navigation compared to existing RNNs. With SRU memory, our method outperforms RL baselines based on explicit mapping and on stacked historical observations by 29.6% and 105.0%, respectively, across diverse environments requiring long-horizon mapping and memorization. Finally, we address the sim-to-real gap by leveraging large-scale pretraining on synthetic depth data, enabling zero-shot transfer for deployment across diverse and complex real-world environments.

Key Contributions

🧠

Identifying RNN Spatial Limitations

We reveal that standard RNNs (LSTM, GRU, S4, Mamba-SSM), while excelling at temporal dependencies, fundamentally struggle with spatial memorization from varying perspectives.

🔧

Spatially-Enhanced Recurrent Units

A simple yet effective modification using element-wise multiplication ("star operation") that enables implicit spatial transformation learning from egocentric observations.

🎯

Attention-Based Architecture

Two-stage spatial attention (self + cross-attention) compresses high-dimensional visual features into task-relevant spatial cues for efficient recurrent memory.

🌍

Zero-Shot Sim-to-Real Transfer

Large-scale pretraining with parallelized depth-noise augmentation enables deployment across offices, terraces, and forests without fine-tuning.

Methodology

The Problem: RNNs Struggle with Spatial Memory

While standard RNNs (LSTM, GRU) excel at capturing temporal dependencies, our findings reveal a critical limitation: they struggle to effectively perform spatial memorization, the ability to transform and integrate sequential observations from varying perspectives into coherent spatial representations.

(a) Temporal memorization loss: both LSTM and SRU converge quickly on temporal sequences.

(b) Spatial memorization loss: SRU converges effectively while LSTM struggles to learn spatial transformations.

(c) LSTM spatial mapping: fails to accurately register and memorize landmark positions from earlier observations.

(d) SRU spatial mapping: successfully transforms and memorizes landmark positions across varying perspectives.

Our Solution: Spatially-Enhanced Recurrent Units (SRUs)

We propose a simple yet effective modification to standard LSTM and GRU architectures by introducing an additional spatial transformation operation. Inspired by homogeneous transformations and the "star operation" (element-wise multiplication), SRUs enable the network to implicitly learn spatial transformations from sequences of egocentric observations.

Implicit Spatial Transformation

An additional learnable transformation term aligns and memorizes observations from varying perspectives while preserving temporal memorization capabilities.
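
To make this concrete, below is a minimal PyTorch sketch of a spatially-enhanced GRU-style cell: a learned, observation-conditioned transform is applied element-wise (the "star operation") to the hidden state before the standard gated update. The module name, the choice of a GRU base, and all dimensions are illustrative assumptions; the paper's exact formulation may differ.

```python
import torch
import torch.nn as nn

class SpatialGRUCell(nn.Module):
    """Illustrative GRU cell with an extra multiplicative ("star") transform on the
    hidden state. A sketch of the SRU idea, not the paper's exact equations."""

    def __init__(self, input_size: int, hidden_size: int):
        super().__init__()
        self.gru = nn.GRUCell(input_size, hidden_size)
        # Learned spatial transform conditioned on the current observation
        # (which can encode ego-motion), applied element-wise to the memory.
        self.transform = nn.Sequential(nn.Linear(input_size, hidden_size), nn.Tanh())

    def forward(self, x: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
        # Element-wise multiplication re-aligns the stored spatial memory to the
        # current egocentric frame before the usual temporal update.
        h_aligned = h * self.transform(x)
        return self.gru(x, h_aligned)
```

The multiplicative term lets the cell re-express its stored spatial information relative to the current viewpoint, while the unchanged GRU gates preserve ordinary temporal memorization.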

Perception & Attention Architecture for RL-Based Navigation

Two-Stage Spatial Attention

Self-attention enriches visual features with global context, while cross-attention (queried by robot state and goal) compresses high-dimensional features into task-relevant spatial cues, enabling end-to-end navigation learning directly from ego-centric observations.
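
A minimal sketch of this two-stage attention is shown below, assuming flattened depth-feature tokens and a concatenated robot-state/goal vector as the cross-attention query; the class name, dimensions, and head count are illustrative assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn

class TwoStageSpatialAttention(nn.Module):
    """Sketch: self-attention over visual tokens, then cross-attention queried by
    the robot state + goal to compress features into a compact spatial cue."""

    def __init__(self, feat_dim: int = 256, state_dim: int = 16, n_heads: int = 4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(feat_dim, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(feat_dim, n_heads, batch_first=True)
        self.query_proj = nn.Linear(state_dim, feat_dim)

    def forward(self, visual_tokens: torch.Tensor, state_goal: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (B, N, feat_dim) flattened depth-feature map
        # state_goal:    (B, state_dim) robot state and goal in the robot frame
        ctx, _ = self.self_attn(visual_tokens, visual_tokens, visual_tokens)  # global context
        query = self.query_proj(state_goal).unsqueeze(1)                      # (B, 1, feat_dim)
        cue, _ = self.cross_attn(query, ctx, ctx)                             # task-relevant cue
        return cue.squeeze(1)                                                 # (B, feat_dim)
```

Compressing the whole feature map into a single query-conditioned token keeps the recurrent memory's input small, in line with the goal of efficient recurrent memory described above.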

Pretrained Depth Encoder

A RegNet + FPN backbone pretrained on large-scale synthetic depth data (TartanAir) with a parallelized depth-noise model enables end-to-end navigation from ego-centric depth alone, achieving robust zero-shot sim-to-real transfer without explicit mapping.
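
For illustration only, one possible wiring of such a backbone with torchvision is sketched below; the single-channel stem swap, the tapped stage names, and the 128-channel FPN output are assumptions, not the paper's exact configuration, and the TartanAir pretraining itself is not shown.

```python
import torch
from torchvision.models import regnet_y_400mf
from torchvision.models.feature_extraction import create_feature_extractor
from torchvision.ops import FeaturePyramidNetwork

# Hypothetical RegNet + FPN wiring for single-channel depth input.
backbone = regnet_y_400mf(weights=None)
# Depth images have one channel, so replace the 3-channel RGB stem convolution.
backbone.stem[0] = torch.nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1, bias=False)

# Tap the outputs of the four RegNet stages as pyramid inputs.
extractor = create_feature_extractor(
    backbone,
    return_nodes={f"trunk_output.block{i}": f"p{i}" for i in range(1, 5)},
)
with torch.no_grad():
    feats = extractor(torch.zeros(1, 1, 224, 224))  # dummy depth frame to read channel sizes

fpn = FeaturePyramidNetwork(
    in_channels_list=[f.shape[1] for f in feats.values()], out_channels=128
)
pyramid = fpn(feats)  # multi-scale depth features passed on to the attention stage
```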

Training Strategy

  • Sparse Rewards: Time-based rewards at episode end with random early checks to promote exploration without intermediate reward distractions
  • Deep Mutual Learning (DML): Two policies trained in parallel with KL-divergence distillation to prevent early overfitting and encourage robust spatial-temporal feature learning
  • Temporally Consistent Dropout: Consistent dropout masks across time steps during rollout and training for stable recurrent memory learning (a minimal sketch follows below)
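
As a concrete illustration of the last item, here is a minimal sketch of temporally consistent dropout: one mask is sampled at the start of a rollout and reused at every time step, so the recurrent memory is not perturbed differently from step to step. The interface is illustrative, not the released implementation.

```python
import torch
import torch.nn as nn

class TemporallyConsistentDropout(nn.Module):
    """Dropout that reuses a single mask for a whole rollout (sketch)."""

    def __init__(self, p: float = 0.1):
        super().__init__()
        self.p = p
        self.mask = None

    def reset(self, x: torch.Tensor) -> None:
        # Sample a fresh mask once per rollout/episode, with inverted scaling.
        keep = 1.0 - self.p
        self.mask = torch.bernoulli(torch.full_like(x, keep)) / keep

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if not self.training:
            return x
        if self.mask is None or self.mask.shape != x.shape:
            self.reset(x)  # lazily (re)sample when the rollout or batch changes
        return x * self.mask
```

Calling reset() at episode boundaries keeps the same units dropped throughout a trajectory, which is what makes the regularization compatible with recurrent training.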

Network Architecture

SRU Network Architecture

End-to-end architecture: Pretrained depth encoder → Self-attention → Cross-attention → SRU → MLP with TC-Dropout → Velocity commands
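
The sketch below wires these stages together in a single module, using a plain nn.GRUCell as a stand-in for the SRU and omitting the pretrained encoder and TC-Dropout for brevity (depth-feature tokens are assumed to be precomputed); all names and dimensions are illustrative.

```python
import torch
import torch.nn as nn

class NavigationPolicySketch(nn.Module):
    """Illustrative composition of the pipeline: depth-feature tokens ->
    self-attention -> cross-attention -> recurrent memory -> MLP -> velocity.
    A plain GRUCell stands in for the SRU; dimensions are assumptions."""

    def __init__(self, feat_dim=256, state_dim=16, hidden_dim=256, act_dim=3, heads=4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(feat_dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(feat_dim, heads, batch_first=True)
        self.query_proj = nn.Linear(state_dim, feat_dim)
        self.memory = nn.GRUCell(feat_dim + state_dim, hidden_dim)
        self.head = nn.Sequential(nn.Linear(hidden_dim, 128), nn.ELU(), nn.Linear(128, act_dim))

    def forward(self, tokens, state_goal, h):
        # tokens: (B, N, feat_dim) encoded depth features; state_goal: (B, state_dim)
        ctx, _ = self.self_attn(tokens, tokens, tokens)                      # global context
        cue, _ = self.cross_attn(self.query_proj(state_goal).unsqueeze(1), ctx, ctx)
        h = self.memory(torch.cat([cue.squeeze(1), state_goal], dim=-1), h)  # recurrent update
        return self.head(h), h                                               # velocity command, memory

# The hidden state is carried across control steps, one update per observation.
policy = NavigationPolicySketch()
h = torch.zeros(2, 256)
for _ in range(5):
    cmd, h = policy(torch.randn(2, 64, 256), torch.randn(2, 16), h)
```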

Attention Visualization

Our cross-attention mechanism learns to focus on task-relevant spatial features based on the robot's current state, without any explicit supervision. This emergent behavior demonstrates the policy's ability to selectively attend to obstacles and navigable space in real time.

Attention Shifts with Robot State

When the robot's movement direction changes, the attention weights automatically shift to emphasize spatial regions relevant to the new heading. Different colors represent distinct attention heads.

Turning Left
Attention weights when robot turns left

Left Turn: Attention highlights the left region, focusing on obstacles in the new heading direction.

Going Straight
Attention weights when robot goes straight

Straight: Attention emphasizes the central region, capturing obstacles directly ahead.

Turning Right
Attention weights when robot turns right

Right Turn: Attention focuses on the right region, concentrating on obstacles in the turn direction.

💡
Key Insight: These attention patterns emerge naturally during end-to-end reinforcement learning; no explicit attention supervision is required. The network learns to extract task-relevant spatial cues autonomously.

Zero-Shot Generalization Across Real-World Environments

The learned attention mechanism generalizes to diverse real-world scenarios without any fine-tuning, dynamically highlighting relevant spatial features for navigation.

Indoor Office
Attention weights in office environment

Attention focuses on furniture, walls, and doorways for indoor navigation.

Outdoor Terrace
Attention weights in terrace environment

Attention adapts to stairs, railings, and varied terrain structures.

Forest
Attention weights in forest environment

Attention identifies trees, vegetation, and natural obstacles for long-range navigation.

Results & Performance

+23.5%

vs. Standard RNNs
(LSTM, GRU)

+29.6%

vs. Explicit Mapping
(EMHP baseline)

+105.0%

vs. Stacked Frames
(GTRL baseline)

Navigation Success Rates by Environment

Method        Maze     Pillars   Stairs   Pits     Overall
LSTM          70.3%    78.2%     33.1%    72.7%    63.5%
GRU           68.1%    73.6%     35.7%    66.7%    61.0%
SRU (Ours)    76.0%    81.0%     82.8%    75.6%    78.9%

SRU achieves a 2.5x improvement over LSTM in the challenging stair environments, which require precise 3D spatial memory.

Comparison Against Navigation Baselines

SRU implicit memory outperforms both historical observation stacking and explicit mapping approaches.

Method    Approach                     Success Rate
GTRL      Historical observations      38.2%
EMHP      Explicit mapping + path      60.4%
GTRL*     GTRL with SRU memory         66.3%
Ours      SRU + full architecture      78.3%
Key Finding
GTRL → GTRL*: +73%

Simply replacing stacked observations with SRU implicit memory yields a 73% improvement (38.2% → 66.3%), showing that SRU memory is the key differentiator for long-range navigation.

Navigation Trajectory Comparison

Visual comparison of navigation policies with SRU vs LSTM memory across four different environment types.

SRU Navigation Trajectories across Four Environments

SRU Navigation Trajectories

Maze: SRU successfully reroutes from dead-ends with spatial memory.

Pillars: SRU takes efficient, shorter paths by memorizing obstacle locations.

Stairs: SRU navigates 3D structures smoothly with precise spatial awareness.

Pits: SRU recalls and avoids hazards throughout navigation.

LSTM Navigation Trajectories across Four Environments

LSTM Navigation Trajectories

Maze: LSTM loops in dead-ends, struggling to backtrack effectively.

Pillars: LSTM navigates successfully but takes longer, less efficient paths.

Stairs: LSTM oscillates and struggles with 3D spatial memory.

Pits: LSTM fails to spatially register pit locations from varying perspectives.

Key Advantages

Unlimited Context Window

Unlike explicit memory with fixed horizons (~20m), SRU's implicit recurrent memory maintains >80% success rate for 50m+ distances and >70% for 120m trajectories.

Zero-Shot Real-World Transfer

Deployed on Unitree B2W with ZedX camera across indoor offices, outdoor terraces, and forests; no fine-tuning required.

Robust Dead-End Recovery

SRU policies successfully backtrack through dead-end corridors and adapt to dynamic environment changes, where standard LSTM loops indefinitely.

Real-World Deployment

Our mapless navigation policy achieves zero-shot transfer from simulation to diverse real-world environments, relying solely on ego-centric depth perception from a single forward-facing stereo camera. No explicit maps, no fine-tuning on real-world data. Deployed on a Unitree B2W with a ZedX stereo camera and an NVIDIA Jetson AGX Orin.

70m+
Single Goal Distance
Beyond training range (30m max)
100m+
Single Mission Traverse
In complex forest terrain
0
Explicit Maps or Real-World Data
Pure ego-centric depth, zero-shot transfer
4+
Environment Types
Office, main hall, terrace, forest

Deployment Environments

Tested across four diverse real-world environments demonstrating robust zero-shot generalization.

🤖 Hardware & Setup

🦿
Robot Platform

Unitree B2W wheeled-legged robot (50 Hz locomotion)

👁️
Perception

ZedX stereo camera (105° FoV, 10m range)

⚑
Compute

NVIDIA Jetson AGX Orin (5 Hz policy)

📍
Localization

LiDAR-based state estimation (DLIO)
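
To illustrate how such a recurrent policy can run onboard at the stated rates, here is a minimal, hypothetical timing loop: the navigation policy is stepped at 5 Hz while carrying its memory between steps, and its commands would be handed to the separately running 50 Hz locomotion controller. The dummy module and all interfaces are placeholders, not the released deployment code.

```python
import time
import torch
import torch.nn as nn

POLICY_HZ = 5  # navigation policy rate; locomotion runs separately at 50 Hz

class DummyRecurrentPolicy(nn.Module):
    """Placeholder for the trained policy; real deployment would load exported weights."""
    def __init__(self, obs_dim=32, hidden_dim=256, act_dim=3):
        super().__init__()
        self.rnn = nn.GRUCell(obs_dim, hidden_dim)
        self.head = nn.Linear(hidden_dim, act_dim)  # e.g. (v_x, v_y, yaw rate)

    def forward(self, obs, h):
        h = self.rnn(obs, h)
        return self.head(h), h

policy = DummyRecurrentPolicy().eval()
hidden = torch.zeros(1, 256)                 # recurrent memory persists across control steps

for _ in range(10):                          # stand-in for the onboard control loop
    t0 = time.monotonic()
    obs = torch.zeros(1, 32)                 # placeholder for encoded depth + state + goal
    with torch.no_grad():
        cmd, hidden = policy(obs, hidden)
    # cmd would be forwarded here to the 50 Hz locomotion controller.
    time.sleep(max(0.0, 1.0 / POLICY_HZ - (time.monotonic() - t0)))
```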

Zero-Shot Transfer

Direct deployment from simulation after pretraining on 100K+ synthetic environments; no real-world fine-tuning

Beyond Training Range

70m+ goals (2.3x training range) and 100m+ traversal in forest

Pure Ego-Centric Navigation

Forward-facing depth only; no maps or 360° sensing

Environment-Specific Results

🏒

Indoor Office

Office Environment Navigation
  • Navigates through narrow corridors
  • Avoids furniture and dynamic obstacles
  • Recovers from dead-ends autonomously
  • Adapts to environment changes in real-time
🏛️

Campus Main Hall

Main Hall Environment Navigation
  • Large-scale multi-floor indoor navigation
  • Handles wide open spaces and long corridors
  • Navigates stairs between floors
  • Adapts to complex architectural features
🏞️

Outdoor Terrace

Terrace Environment Navigation
  • Handles stairs and elevation changes
  • Navigates uneven terrain surfaces
  • Robust to outdoor lighting variations
  • Avoids railings and structural obstacles
🎯
Key Achievement: SRU successfully navigates through dead-end corridors, backtracks, and finds alternative routes, capabilities that standard LSTM-based policies fail to achieve due to limited spatial memory.

Demonstration Video

Code & Resources

The complete SRU implementation is organized into modular repositories, enabling researchers and practitioners to use components independently or together. All repositories are actively maintained and fully documented.

🧠 Core Algorithm

sru-pytorch-spatial-learning

Core

PyTorch implementation of Spatially-Enhanced Recurrent Units (SRU) modules with spatial memory mechanisms.

PyTorch Core Module Examples

🎯 Training Pipeline

sru-navigation-learning

Training

End-to-end RL training framework for navigation networks with SRU+attention architecture. Includes hyperparameters and regularization techniques.

RL Training rsl_rl MDPO (DML)

sru-navigation-sim

Environment

IsaacLab task extension with diverse navigation environments, depth observation with pretrained encoding, sparse reward settings, and terrain variations.

IsaacLab Simulation Environments

👁️ Perception

sru-depth-pretraining

Perception

Large-scale self-supervised depth perception pretraining on 100K+ synthetic environments. Includes pre-trained models and sensor preprocessing pipelines.

Depth Perception Pre-trained Models Pretraining

🤖 Deployment

sru-robot-deployment

Deployment

Real robot deployment code for Unitree B2W with ZedX camera. Includes ROS2 integration, Gazebo simulation, sim-to-real transfer techniques, inference pipelines, and hardware setup guides.

ROS2 Gazebo Sim Unitree B2W

Getting Started

🔬 For Researchers

Use the complete training pipeline with IsaacLab simulation and RL framework to reproduce results or extend the method.

  • Install: sru-navigation-sim + sru-navigation-learning
  • Pretraining: sru-depth-pretraining
  • Train: rsl_rl framework

🚀 For Practitioners

Deploy pre-trained models on real robots with minimal setup using our deployment code.

  • Install: sru-robot-deployment
  • Load: Pre-trained models
  • Deploy: Zero-shot transfer

👨‍💻 For Developers

Extend SRU capabilities by integrating the core module into your own projects.

  • Install: pip install sru-pytorch-spatial-learning
  • Import: SRU modules
  • Integrate: Into your architectures

Framework Dependencies

Our code release leverages two powerful open-source frameworks for simulation and reinforcement learning:

🏗️ IsaacLab

A modular and extensible simulation framework for robotics research built on NVIDIA Omniverse. We use IsaacLab for fast, physically accurate navigation environment simulation with support for dynamic obstacle configurations and diverse terrain variations.

⚙️ rsl_rl

An open-source reinforcement learning library from the Robotic Systems Lab (RSL) at ETH Zurich, optimized for legged locomotion and navigation tasks. We use rsl_rl for efficient PPO training with GPU acceleration and advanced feature normalization.

Resources & Citation

📄 Paper

Read the full paper

IJRR Publication

📚 Citation

BibTeX citation for your references

Copy BibTeX

🔗 Links

Related projects and resources

GitHub Profile