SuperMap: A Spatio-Temporal SLAM System for Visual-Language Navigation
Carnegie Mellon University — AirLab
Abstract
Robotic navigation in human environments requires a spatio-temporal semantic representation that can reconcile open-vocabulary perception with long-term environmental changes. While foundation models provide strong zero-shot recognition, their predictions are intermittent and view-dependent, and naively integrating them into mapping pipelines leads to identity drift and stale semantics over time.
We present SuperMap, a 4D spatio-temporal mapping framework for language-guided navigation that integrates high-frequency geometric SLAM with asynchronous open-vocabulary perception. Our core contribution is a consistency-driven mapping engine that combines 3D-aware instance association and re-activation with a principled existence-and-label confidence update to maintain stable object identities and prune outdated map content under occlusions and scene changes.
SuperMap produces a queryable 4D scene-graph representation that interfaces naturally with Vision-Language Models by supporting compositional queries over object semantics, relations, and history. We demonstrate SuperMap on benchmarks and real robots, including dynamic scenes in which objects appear, disappear, and are relocated, and we provide ablations and runtime analysis. We will release the full system as open source to give the community a deployable baseline for open-vocabulary spatio-temporal mapping.
Contributions
System Architecture
Geometric Layer — Online 3D Reconstruction
SuperOdometry provides pose estimation and a colorized dense 3D model from synchronized RGB images, depth/LiDAR, and IMU streams. Geometric priors anchor all subsequent 2D–3D association and global map consistency checks.
Instance Layer — Spatio-Temporal Instance Association
Per-frame open-vocabulary detections (GroundingDINO + SAM2) are associated with existing 3D map objects via a hybrid 2D–3D tracker. A probabilistic geometric consistency update and Bayesian semantic fusion maintain stable object identities across long time horizons under occlusions and scene changes.
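The page does not give the update equations, so the following is only a minimal sketch of what a Bayesian label fusion and an existence-confidence update could look like. It assumes a categorical label belief updated by multiplying in per-detection class scores, and a clamped log-odds existence score in the style of occupancy mapping. The names `MapObject`, `fuse_label`, `update_existence`, and `existence_prob`, and the values of `p_hit`, `p_miss`, and `clamp`, are hypothetical and not taken from SuperMap.

```python
import math
from dataclasses import dataclass
from typing import Dict


@dataclass
class MapObject:
    """Hypothetical map object: a label belief plus an existence log-odds."""
    label_probs: Dict[str, float]
    existence_logodds: float = 0.0


def fuse_label(obj: MapObject, detection_probs: Dict[str, float],
               eps: float = 1e-6) -> None:
    """Recursive Bayes step: posterior is prior times likelihood, renormalized.

    Labels missing from either distribution get a small floor `eps` so that
    a single detection cannot zero out a class permanently.
    """
    labels = set(obj.label_probs) | set(detection_probs)
    unnormalized = {
        label: obj.label_probs.get(label, eps) * detection_probs.get(label, eps)
        for label in labels
    }
    total = sum(unnormalized.values())
    obj.label_probs = {label: v / total for label, v in unnormalized.items()}


def update_existence(obj: MapObject, observed: bool, p_hit: float = 0.7,
                     p_miss: float = 0.4, clamp: float = 5.0) -> None:
    """Add the log-odds of a hit or miss; clamp so the belief stays revisable."""
    p = p_hit if observed else p_miss
    obj.existence_logodds += math.log(p / (1.0 - p))
    obj.existence_logodds = max(-clamp, min(clamp, obj.existence_logodds))


def existence_prob(obj: MapObject) -> float:
    """Convert the log-odds score back to a probability."""
    return 1.0 / (1.0 + math.exp(-obj.existence_logodds))
```

Under these assumptions, an object that should be visible but repeatedly goes undetected sees its existence probability fall, and a pruning step could drop it once the value crosses a chosen threshold. Objects that are merely occluded would be skipped rather than counted as misses.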
Topological Layer — Abstract 4D Scene Graph
The object map is abstracted into a scene graph G = (V, E_s, E_t), where V is the set of object nodes, E_s the spatial edges (geometric predicates: on, beside, under), and E_t the temporal edges (object trajectory history). The graph is serialized as structured text for compositional VLM queries over object semantics, spatial relations, and temporal history.
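To make the graph structure concrete, here is a small sketch of a container with object nodes, spatial edges, and per-object trajectory history, plus a plain-text serializer. The class `SceneGraph`, its method names, and the `OBJECTS` / `SPATIAL` / `HISTORY` text layout are assumptions for illustration; the page does not specify SuperMap's actual serialization format.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Position = Tuple[float, float, float]


@dataclass
class SceneGraph:
    """Hypothetical 4D scene graph: nodes, spatial edges, temporal history."""
    nodes: Dict[str, str] = field(default_factory=dict)  # id -> label
    spatial_edges: List[Tuple[str, str, str]] = field(default_factory=list)
    temporal_edges: Dict[str, List[Tuple[float, Position]]] = field(
        default_factory=dict)

    def add_object(self, obj_id: str, label: str) -> None:
        self.nodes[obj_id] = label

    def relate(self, subject: str, predicate: str, target: str) -> None:
        """Record a spatial predicate such as ("cup_1", "on", "table_1")."""
        self.spatial_edges.append((subject, predicate, target))

    def observe(self, obj_id: str, t: float, position: Position) -> None:
        """Append a timestamped position to the object's trajectory."""
        self.temporal_edges.setdefault(obj_id, []).append((t, position))

    def to_text(self) -> str:
        """Serialize the graph as indented plain text for a VLM prompt."""
        lines = ["OBJECTS"]
        for obj_id, label in self.nodes.items():
            lines.append(f"  {obj_id}: {label}")
        lines.append("SPATIAL")
        for subject, predicate, target in self.spatial_edges:
            lines.append(f"  {subject} {predicate} {target}")
        lines.append("HISTORY")
        for obj_id, track in self.temporal_edges.items():
            steps = " -> ".join(
                f"t={t:g} @ ({x:.1f}, {y:.1f}, {z:.1f})"
                for t, (x, y, z) in track
            )
            lines.append(f"  {obj_id}: {steps}")
        return "\n".join(lines)
```

A query such as "where was the cup before it moved?" could then be answered by a VLM from the `HISTORY` section alone, without access to the underlying point cloud.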
3D Point Cloud Viewer
CMU campus segment 01: 3D bounding boxes from SuperMap.
Results
SuperMap is evaluated on the ScanNet benchmark against state-of-the-art semantic mapping and semantic SLAM methods, and on real-robot deployments in dynamic indoor environments with object appearance and disappearance events.
Class-level Segmentation on ScanNet
| Method | Approach | mIoU (%) | f-mIoU (%) | Acc (%) |
|---|---|---|---|---|
| ConceptGraphs | object-level | 21.62 | 24.32 | 31.05 |
| HOV-SG | object-level | 26.79 | 36.05 | 35.17 |
| SuperMap (Ours) | object-level | 27.42 | 43.50 | 55.48 |
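For readers unfamiliar with the three columns, the sketch below computes them from a confusion matrix. It assumes that f-mIoU denotes frequency-weighted mIoU (per-class IoU weighted by each class's share of ground-truth points) and that Acc is overall point accuracy; the page does not define either, so treat both as assumptions. The function name `segmentation_metrics` is hypothetical.

```python
from typing import List, Tuple


def segmentation_metrics(conf: List[List[int]]) -> Tuple[float, float, float]:
    """Return (mIoU, frequency-weighted mIoU, overall accuracy).

    conf[i][j] is the number of points whose ground-truth class is i
    and whose predicted class is j.
    """
    n = len(conf)
    total = sum(sum(row) for row in conf)
    ious, weights = [], []
    for c in range(n):
        tp = conf[c][c]
        gt = sum(conf[c])                        # points truly in class c
        pred = sum(conf[r][c] for r in range(n))  # points predicted as c
        if gt == 0:
            continue  # class absent from ground truth: excluded from the mean
        ious.append(tp / (gt + pred - tp))
        weights.append(gt / total)
    miou = sum(ious) / len(ious)
    fmiou = sum(w * iou for w, iou in zip(weights, ious))
    acc = sum(conf[c][c] for c in range(n)) / total
    return miou, fmiou, acc
```

Under these definitions, f-mIoU exceeds mIoU whenever the frequent classes are segmented better than the rare ones, which is one possible reading of the gap between the two columns in the table.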
Instance-level Segmentation on ScanNet (mAP50)
| Method | Chair | Window | Refrigerator | Sofa | Door |
|---|---|---|---|---|---|
| HOV-SG | 4.58 | 0.00 | 0.00 | 30.00 | 9.70 |
| ConceptGraphs | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| SuperMap (Ours) | 63.76 | 42.20 | 62.50 | 33.35 | 10.00 |
Spatio-Temporal Change Detection Recall
| Method | Appeared (Buc.) | Appeared (Cart) | Appeared (Sign) | Disappeared (Plant) | Disappeared (Trash) | Disappeared (Chair) |
|---|---|---|---|---|---|---|
| Khronos | — | — | — | — | — | — |
| DualMap | 0.000 | 0.000 | 0.000 | 0.310 | 0.000 | 0.000 |
| SuperMap (Ours) | 1.000 | 0.262 | 0.583 | 0.755 | 0.434 | 1.000 |
Citation
BibTeX
@inproceedings{supermap2026,
title = {SuperMap: A Spatio-Temporal SLAM System for Visual-Language Navigation},
author = {Anonymous},
booktitle = {Robotics: Science and Systems (RSS)},
year = {2026}
}