Abstract
Robotic navigation in human environments requires a spatio-temporal semantic representation that can reconcile open-vocabulary perception with long-term environmental changes. While foundation models provide strong zero-shot recognition, their predictions are intermittent and view-dependent, and naively integrating them into mapping pipelines leads to identity drift and stale semantics over time.
We present SuperMap, a 4D spatio-temporal mapping framework for language-guided navigation that integrates high-frequency geometric SLAM with asynchronous open-vocabulary perception. Our core contribution is a consistency-driven mapping engine that combines 3D-aware instance association and re-activation with a principled existence-and-label confidence update to maintain stable object identities and prune outdated map content under occlusions and scene changes.
SuperMap produces a queryable 4D scene-graph representation that interfaces naturally with Vision-Language Models by supporting compositional queries over object semantics, relations, and history. We demonstrate SuperMap on benchmarks and real robots, including dynamic scenes with appearance/disappearance and relocation, and provide ablations and runtime analysis. We will release the full system as open source to give the community a deployable baseline for open-vocabulary spatio-temporal mapping.
Contributions
Open-Vocabulary Spatio-Temporal SLAM
An online robotic system that builds a persistent, queryable open-vocabulary 4D scene memory suitable for downstream language-conditioned tasks and runs fully onboard in real time.
Spatio-Temporal Object Tracking
An online pipeline that integrates 2D–3D association, validation, and change-aware updates to maintain instance consistency under occlusions, partial observations, label variability, and scene change.
Instance-level Scene Graph
A 4D scene graph that stores spatial and temporal information for each object, equipping robots with instance-level reasoning such as locating moved objects and recalling past scenes.
Open-Source Framework
Full release of the change-detection benchmark, comprehensive ablations, runtime profiling, and the real-robot visual–language navigation pipeline to facilitate reproducible research.
System Architecture
Geometric Layer — Online 3D Reconstruction
SuperOdometry provides pose estimation and a colorized dense 3D model from synchronized RGB images, depth/LiDAR, and IMU streams. Geometric priors anchor all subsequent 2D–3D association and global map consistency checks.
Instance Layer — Spatio-Temporal Instance Association
Per-frame open-vocabulary detections (GroundingDINO + SAM2) are associated to existing 3D map objects via a hybrid 2D–3D tracker. A probabilistic geometric consistency update and Bayesian semantic fusion maintain stable object identities across long time horizons under occlusions and scene change.
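To make the two update rules above concrete, here is a minimal sketch of one plausible form they could take. It is not SuperMap's implementation: the function names, the probability floor for unseen labels, and the `p_hit` / `p_miss` / `clamp` values are all illustrative assumptions. Label fusion is written as a categorical Bayes update, and existence confidence as a log-odds filter in the style of occupancy-grid mapping.

```python
import math

def fuse_label(belief, detection, floor=0.05):
    """Multiply the current label belief by a detection likelihood and renormalize.

    belief, detection: dicts mapping label -> probability. A label missing from
    either side is given `floor` (an assumed value), so a single noisy
    detection cannot drive a label's probability to zero.
    """
    labels = set(belief) | set(detection)
    unnorm = {l: belief.get(l, floor) * detection.get(l, floor) for l in labels}
    total = sum(unnorm.values())
    return {l: p / total for l, p in unnorm.items()}

def update_existence(log_odds, observed, p_hit=0.7, p_miss=0.4, clamp=4.0):
    """Log-odds existence update (assumed form and constants).

    observed=True:  the object was expected in view and a detection matched it.
    observed=False: the object was expected in view, unoccluded, and nothing matched.
    The result is clamped so that a long-stable object can still be pruned.
    """
    p = p_hit if observed else p_miss
    log_odds += math.log(p / (1.0 - p))
    return max(-clamp, min(clamp, log_odds))

def existence_probability(log_odds):
    """Convert log-odds back to a probability."""
    return 1.0 / (1.0 + math.exp(-log_odds))

# Usage: one confirming detection sharpens the label belief (chair becomes about 0.8).
belief = fuse_label({"chair": 0.5, "sofa": 0.5}, {"chair": 0.8, "sofa": 0.2})

# Usage: three consecutive misses push existence below 0.5, a hypothetical prune threshold.
log_odds = 0.0
for _ in range(3):
    log_odds = update_existence(log_odds, observed=False)
p_exists = existence_probability(log_odds)
```

Updating existence only when the object should have been visible is what separates occlusion from disappearance in this sketch: an occluded object receives no update at all, so its confidence does not decay.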
Topological Layer — Abstract 4D Scene Graph
The object map is abstracted into a scene graph G = (V, E_s, E_t) with spatial edges E_s (geometric predicates: on, beside, under) and temporal edges E_t (object trajectory history). The graph is serialized as structured text for compositional VLM queries over object semantics, spatial relations, and history.
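As an illustration of the structure described above, the following sketch stores nodes, spatial edges, and temporal edges in plain Python containers and serializes them to text. The class names, field names, and text layout are hypothetical; the source does not specify SuperMap's actual schema or serialization format.

```python
from dataclasses import dataclass, field

@dataclass
class ObjectNode:
    obj_id: int
    label: str
    position: tuple  # (x, y, z) in the map frame

@dataclass
class SceneGraph:
    nodes: dict = field(default_factory=dict)           # V: obj_id -> ObjectNode
    spatial_edges: list = field(default_factory=list)   # spatial: (subject_id, predicate, object_id)
    temporal_edges: dict = field(default_factory=dict)  # temporal: obj_id -> [(timestamp, position), ...]

    def add_object(self, node, timestamp):
        """Register a new object and start its trajectory history."""
        self.nodes[node.obj_id] = node
        self.temporal_edges.setdefault(node.obj_id, []).append((timestamp, node.position))

    def move_object(self, obj_id, position, timestamp):
        """Record a relocation: update the current pose and extend the history."""
        self.nodes[obj_id].position = position
        self.temporal_edges[obj_id].append((timestamp, position))

    def relate(self, subject_id, predicate, object_id):
        """Add a spatial predicate such as 'on', 'beside', or 'under'."""
        self.spatial_edges.append((subject_id, predicate, object_id))

    def to_text(self):
        """Serialize the graph as structured text for a VLM prompt."""
        lines = ["OBJECTS"]
        for n in self.nodes.values():
            lines.append(f"  #{n.obj_id} {n.label} at {n.position}")
        lines.append("SPATIAL")
        for s, pred, o in self.spatial_edges:
            lines.append(f"  #{s} {pred} #{o}")
        lines.append("HISTORY")
        for obj_id, track in self.temporal_edges.items():
            steps = " -> ".join(f"t={t}:{p}" for t, p in track)
            lines.append(f"  #{obj_id} {steps}")
        return "\n".join(lines)

# Usage: a cup placed on a table, then moved.
graph = SceneGraph()
graph.add_object(ObjectNode(1, "cup", (1.0, 2.0, 0.8)), timestamp=0)
graph.add_object(ObjectNode(2, "table", (1.0, 2.0, 0.4)), timestamp=0)
graph.relate(1, "on", 2)
graph.move_object(1, (3.0, 0.5, 0.8), timestamp=42)
prompt_text = graph.to_text()
```

Keeping the history as an append-only list per object means a query such as "where was the cup before it moved?" can be answered from the serialized text alone.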
Figure: Interactive 3D map, CMU Campus Segment 01. White: LiDAR point cloud (47k points). Yellow: SuperMap 3D bounding boxes.
Results
SuperMap achieves the highest mIoU, f-mIoU, and accuracy among the compared object-level mapping methods while running fully online.
| Method | Approach | mIoU (%) | f-mIoU (%) | Acc (%) |
|---|---|---|---|---|
| ConceptGraphs | object-level | 21.62 | 24.32 | 31.05 |
| HOV-SG | object-level | 26.79 | 36.05 | 35.17 |
| SuperMap (Ours) | object-level | 27.42 | 43.50 | 55.48 |
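The page does not define the metrics in the table above. Assuming the common reading, in which mIoU is the unweighted mean of per-class IoU and f-mIoU weights each class by its ground-truth frequency, both can be computed from a confusion matrix as sketched below. The function name and the choice to skip classes absent from the ground truth are assumptions, not the authors' evaluation code.

```python
def iou_metrics(confusion):
    """Return (mIoU, frequency-weighted mIoU) from a square confusion matrix.

    confusion[i][j] = number of points with ground-truth class i predicted as class j.
    Classes with no ground-truth points are skipped (an assumed convention).
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    ious, weights = [], []
    for c in range(n):
        tp = confusion[c][c]
        gt = sum(confusion[c])                         # ground-truth points of class c
        pred = sum(confusion[r][c] for r in range(n))  # points predicted as class c
        if gt == 0:
            continue
        ious.append(tp / (gt + pred - tp))
        weights.append(gt / total)
    miou = sum(ious) / len(ious)
    fmiou = sum(w * i for w, i in zip(weights, ious))
    return miou, fmiou

# Usage: a toy two-class example in which the frequent class is segmented well.
miou, fmiou = iou_metrics([[90, 10],
                           [5, 5]])
```

Under this reading, f-mIoU exceeds mIoU whenever the frequent classes are segmented better than the rare ones.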
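SuperMap outperforms HOV-SG and ConceptGraphs on instance-level detection in every listed category. The gains are largest on Chair, Window, and Refrigerator; the margins on Sofa and Door are small.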
| Method | Chair | Window | Refrigerator | Sofa | Door |
|---|---|---|---|---|---|
| HOV-SG | 4.58 | 0.00 | 0.00 | 30.00 | 9.70 |
| ConceptGraphs | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| SuperMap (Ours) | 63.76 | 42.20 | 62.50 | 33.35 | 10.00 |
SuperMap achieves nonzero recall on all six change events, with perfect recall on the Bucket appearance and the Chair disappearance. DualMap detects only the Plant disappearance, and no values are listed for Khronos.
| Method | Appeared (Bucket) | Appeared (Cart) | Appeared (Sign) | Disappeared (Plant) | Disappeared (Trash) | Disappeared (Chair) |
|---|---|---|---|---|---|---|
| Khronos | — | — | — | — | — | — |
| DualMap | 0.000 | 0.000 | 0.000 | 0.310 | 0.000 | 0.000 |
| SuperMap (Ours) | 1.000 | 0.262 | 0.583 | 0.755 | 0.434 | 1.000 |
Citation
@inproceedings{supermap2026,
title = {SuperMap: A Spatio-Temporal SLAM System for Visual-Language Navigation},
author = {Anonymous},
booktitle = {Robotics: Science and Systems (RSS)},
year = {2026}
}