SuperMap: A Spatio-Temporal SLAM System for Visual-Language Navigation

Carnegie Mellon University — AirLab

Abstract

Robotic navigation in human environments requires a spatio-temporal semantic representation that can reconcile open-vocabulary perception with long-term environmental changes. While foundation models provide strong zero-shot recognition, their predictions are intermittent and view-dependent, and naively integrating them into mapping pipelines leads to identity drift and stale semantics over time.

We present SuperMap, a 4D spatio-temporal mapping framework for language-guided navigation that integrates high-frequency geometric SLAM with asynchronous open-vocabulary perception. Our core contribution is a consistency-driven mapping engine that combines 3D-aware instance association and re-activation with a principled existence-and-label confidence update to maintain stable object identities and prune outdated map content under occlusions and scene changes.

SuperMap produces a queryable 4D scene-graph representation that interfaces naturally with Vision-Language Models by supporting compositional queries over object semantics, relations, and history. We demonstrate SuperMap on benchmarks and real robots, including dynamic scenes with object appearance, disappearance, and relocation, and provide ablations and runtime analysis. We will release the full system as open source to give the community a deployable baseline for open-vocabulary spatio-temporal mapping.

Contributions

Open-Vocabulary Spatio-Temporal SLAM

An online robotic system that builds a persistent, queryable open-vocabulary 4D scene memory suitable for downstream language-conditioned tasks — running fully onboard in real time.

Spatio-Temporal Object Tracking

An online pipeline that integrates 2D–3D association, validation, and change-aware updates to maintain instance consistency under occlusions, partial observations, label variability, and scene change.

Instance-level Scene Graph

A 4D scene graph that incorporates spatial and temporal information for each object, equipping robots with instance-level reasoning — e.g., locating moved objects, recalling past scenes.

Open-Source Framework

Full release of the change-detection benchmark, comprehensive ablations, runtime profiling, and the real-robot visual–language navigation pipeline to facilitate reproducible research.

System Architecture

Three-Layer Pipeline

1. Geometric Layer: Online 3D Reconstruction

SuperOdometry provides pose estimation and a colorized dense 3D model from synchronized RGB images, depth/LiDAR, and IMU streams. Geometric priors anchor all subsequent 2D–3D association and global map consistency checks.
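
As a rough illustration of how pose and depth can anchor 2D–3D association, the sketch below lifts the pixels of a 2D detection mask into the world frame with a pinhole camera model. It is not SuperOdometry's interface: the function names, the intrinsics tuple `(fx, fy, cx, cy)`, and the 4x4 camera-to-world matrix `T_wc` are all assumptions made for this example.

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def backproject_pixel(u: float, v: float, depth: float,
                      fx: float, fy: float, cx: float, cy: float) -> Vec3:
    """Lift pixel (u, v) with metric depth into the camera frame (pinhole model)."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def camera_to_world(p_cam: Vec3, T_wc: Sequence[Sequence[float]]) -> Vec3:
    """Apply a 4x4 homogeneous camera-to-world transform to a 3D point."""
    x, y, z = p_cam
    return tuple(
        T_wc[i][0] * x + T_wc[i][1] * y + T_wc[i][2] * z + T_wc[i][3]
        for i in range(3)
    )


def lift_mask(pixels: Sequence[Tuple[float, float]],
              depths: Sequence[float],
              intrinsics: Tuple[float, float, float, float],
              T_wc: Sequence[Sequence[float]]) -> List[Vec3]:
    """Turn the pixels of one 2D detection mask into world-frame 3D points."""
    fx, fy, cx, cy = intrinsics
    points = []
    for (u, v), d in zip(pixels, depths):
        if d <= 0.0:  # skip pixels without a valid depth reading
            continue
        p_cam = backproject_pixel(u, v, d, fx, fy, cx, cy)
        points.append(camera_to_world(p_cam, T_wc))
    return points
```

The resulting point set could then be compared against existing map objects, for example by centroid distance or volumetric overlap.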

2. Instance Layer: Spatio-Temporal Instance Association

Per-frame open-vocabulary detections (GroundingDINO + SAM2) are associated to existing 3D map objects via a hybrid 2D–3D tracker. A probabilistic geometric consistency update and Bayesian semantic fusion maintain stable object identities across long time horizons under occlusions and scene change.
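
One common way to realize an existence-and-label confidence update is a binary Bayes filter in log-odds form plus a categorical label posterior. The sketch below follows that textbook pattern and should not be read as SuperMap's actual update rule: the class `TrackedObject`, the sensor-model values `p_hit` and `p_miss`, the clamp `L_MAX`, and the uniform spread of residual confidence over the other labels are all assumptions.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, Sequence

L_MAX = 6.0  # hypothetical clamp so an object can still be revised after many updates


def logit(p: float) -> float:
    return math.log(p / (1.0 - p))


@dataclass
class TrackedObject:
    existence_logodds: float = 0.0  # prior P(exists) = 0.5
    label_logprob: Dict[str, float] = field(default_factory=dict)

    def update_existence(self, detected: bool,
                         p_hit: float = 0.7, p_miss: float = 0.4) -> None:
        """Binary Bayes update; call only when the object should be in view."""
        delta = logit(p_hit if detected else p_miss)
        self.existence_logodds = max(-L_MAX, min(L_MAX, self.existence_logodds + delta))

    def existence(self) -> float:
        """Posterior probability that the object still exists."""
        return 1.0 / (1.0 + math.exp(-self.existence_logodds))

    def fuse_label(self, label: str, confidence: float,
                   vocabulary: Sequence[str]) -> None:
        """Multiply in a categorical likelihood (needs at least two labels)."""
        confidence = min(max(confidence, 0.01), 0.99)  # avoid log(0)
        other = (1.0 - confidence) / (len(vocabulary) - 1)
        for name in vocabulary:
            likelihood = confidence if name == label else other
            self.label_logprob[name] = self.label_logprob.get(name, 0.0) + math.log(likelihood)

    def label_posterior(self) -> Dict[str, float]:
        """Normalize the accumulated log-likelihoods into probabilities."""
        if not self.label_logprob:
            return {}
        peak = max(self.label_logprob.values())
        weights = {k: math.exp(v - peak) for k, v in self.label_logprob.items()}
        total = sum(weights.values())
        return {k: w / total for k, w in weights.items()}
```

Under this scheme, an object whose existence probability falls below a chosen threshold after repeated expected-but-missed observations would be a candidate for pruning, while a single conflicting label barely moves a well-supported posterior.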

3. Topological Layer: Abstract 4D Scene Graph

The object map is abstracted into a scene graph G = (V, E_s, E_t) with spatial edges (geometric predicates: on, beside, under) and temporal edges (object trajectory history). The graph is serialized as structured text for compositional VLM queries over object semantics, spatial relations, and history.
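
To make the structure G = (V, E_s, E_t) concrete, here is a minimal sketch of a scene graph with spatial and temporal edges that serializes to structured text. The class names, field names, and JSON layout are hypothetical; the page does not specify SuperMap's actual schema or serialization format.

```python
import json
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ObjectNode:
    obj_id: int
    label: str
    # Time-stamped centroids (t, x, y, z), oldest first.
    history: List[Tuple[float, float, float, float]] = field(default_factory=list)


@dataclass
class SceneGraph:
    nodes: Dict[int, ObjectNode] = field(default_factory=dict)          # V
    spatial_edges: List[Tuple[int, str, int]] = field(default_factory=list)  # E_s

    def observe(self, obj_id: int, label: str, t: float,
                xyz: Tuple[float, float, float]) -> None:
        """Record an object's position at time t, creating the node if needed."""
        node = self.nodes.setdefault(obj_id, ObjectNode(obj_id, label))
        node.history.append((t, *xyz))

    def relate(self, subject: int, predicate: str, obj: int) -> None:
        """Add a spatial edge such as (mug, "on", table)."""
        self.spatial_edges.append((subject, predicate, obj))

    def temporal_edges(self) -> List[dict]:
        """E_t: link consecutive observations of the same object."""
        edges = []
        for node in self.nodes.values():
            for prev, curr in zip(node.history, node.history[1:]):
                edges.append({"object": node.obj_id, "from": prev, "to": curr})
        return edges

    def to_text(self) -> str:
        """Serialize the graph as JSON text that could be placed in a VLM prompt."""
        return json.dumps({
            "objects": [
                {"id": n.obj_id, "label": n.label,
                 "last_seen": n.history[-1] if n.history else None}
                for n in self.nodes.values()
            ],
            "spatial": [
                {"subject": s, "predicate": p, "object": o}
                for s, p, o in self.spatial_edges
            ],
            "temporal": self.temporal_edges(),
        }, indent=2)
```

A query like "where was the mug before it moved?" can then be answered from the `temporal` entries, while "what is on the table?" only needs the `spatial` entries.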

Interactive 3D Map Explorer — CMU Campus

Explore the merged, semantically-annotated 3D map (drag to orbit, scroll to zoom, right-drag to pan; hover a box for its label, click to pin). Hover a replay card below to light up the robot's trajectory and the region it mapped; click to play the dynamic reconstruction.

Results

Class-level Segmentation — ScanNet

SuperMap achieves the highest mIoU, f-mIoU, and accuracy among the compared object-level mapping methods while running fully online.

Method	Approach	mIoU (%)	f-mIoU (%)	Acc (%)
ConceptGraphs	object-level	21.62	24.32	31.05
HOV-SG	object-level	26.79	36.05	35.17
SuperMap (Ours)	object-level	27.42	43.50	55.48
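
For readers unfamiliar with the three columns, the sketch below computes them from a class confusion matrix. The page does not define the metrics, so this assumes the usual conventions: f-mIoU is the frequency-weighted mean IoU, and Acc is overall per-point accuracy. The function name and matrix layout are illustrative only, and values are returned as fractions rather than percentages.

```python
from typing import Dict, List


def segmentation_metrics(confusion: List[List[int]]) -> Dict[str, float]:
    """confusion[i][j] = number of points with ground-truth class i predicted as j.

    Assumes at least one labeled point; classes absent from the ground truth
    are excluded from the IoU averages.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    ious, weights = [], []
    correct = 0
    for c in range(n):
        tp = confusion[c][c]
        gt = sum(confusion[c])                          # points labeled c
        pred = sum(confusion[r][c] for r in range(n))   # points predicted c
        correct += tp
        if gt == 0:
            continue
        ious.append(tp / (gt + pred - tp))
        weights.append(gt / total)
    return {
        "mIoU": sum(ious) / len(ious),
        "f-mIoU": sum(w * i for w, i in zip(weights, ious)),
        "Acc": correct / total,
    }
```

Because f-mIoU weights each class by its share of ground-truth points, it can sit well above mIoU when large classes such as walls and floors are segmented accurately.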

Instance-level Segmentation — ScanNet (mAP₅₀)

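SuperMap outperforms both prior scene-graph methods on all five classes: by large margins on Chair, Window, and Refrigerator, and narrowly on Sofa (33.35 vs. 30.00) and Door (10.00 vs. 9.70).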

Method	Chair	Window	Refrigerator	Sofa	Door
HOV-SG	4.58	0.00	0.00	30.00	9.70
ConceptGraphs	0.00	0.00	0.00	0.00	0.00
SuperMap (Ours)	63.76	42.20	62.50	33.35	10.00

Spatio-Temporal Change Detection Recall

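SuperMap reaches perfect recall on two events (Bucket appearing, Chair disappearing) and non-zero recall on all six, with the remaining events ranging from 0.262 (Cart appearing) to 0.755 (Plant disappearing). DualMap detects only the Plant disappearance (0.310), and no values are listed for Khronos.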

Method	Appeared (Bucket)	Appeared (Cart)	Appeared (Sign)	Disappeared (Plant)	Disappeared (Trash)	Disappeared (Chair)
Khronos	—	—	—	—	—	—
DualMap	0.000	0.000	0.000	0.310	0.000	0.000
SuperMap (Ours)	1.000	0.262	0.583	0.755	0.434	1.000
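
The recall figures above could be computed along the following lines. The benchmark's actual matching protocol is not described on this page, so the sketch assumes a simple rule: a ground-truth change event counts as detected if an unused predicted event has the same kind and label and lies within a distance threshold. The `Event` tuple layout and the `radius` default are hypothetical.

```python
import math
from typing import List, Tuple

# (kind, label, xyz), e.g. ("appeared", "bucket", (1.0, 2.0, 0.0))
Event = Tuple[str, str, Tuple[float, float, float]]


def change_recall(ground_truth: List[Event], predicted: List[Event],
                  radius: float = 0.5) -> float:
    """Fraction of ground-truth change events matched by a prediction."""
    used = set()   # indices of predictions already matched
    matched = 0
    for kind, label, pos in ground_truth:
        for j, (p_kind, p_label, p_pos) in enumerate(predicted):
            if j in used or p_kind != kind or p_label != label:
                continue
            if math.dist(pos, p_pos) <= radius:
                used.add(j)
                matched += 1
                break
    return matched / len(ground_truth) if ground_truth else 0.0
```

Each prediction is matched at most once, so duplicate detections of the same change do not inflate recall.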

Citation

@inproceedings{supermap2026,
  title     = {SuperMap: A Spatio-Temporal SLAM System for Visual-Language Navigation},
  author    = {Anonymous},
  booktitle = {Robotics: Science and Systems (RSS)},
  year      = {2026}
}