SuperMap

Open-Vocabulary 3D Semantic Mapping

SuperMap: A Spatio-Temporal SLAM System for Visual-Language Navigation

Carnegie Mellon University — AirLab

Abstract

Robotic navigation in human environments requires a spatio-temporal semantic representation that can reconcile open-vocabulary perception with long-term environmental changes. While foundation models provide strong zero-shot recognition, their predictions are intermittent and view-dependent, and naively integrating them into mapping pipelines leads to identity drift and stale semantics over time.

We present SuperMap, a 4D spatio-temporal mapping framework for language-guided navigation that integrates high-frequency geometric SLAM with asynchronous open-vocabulary perception. Our core contribution is a consistency-driven mapping engine that combines 3D-aware instance association and re-activation with a principled existence-and-label confidence update to maintain stable object identities and prune outdated map content under occlusions and scene changes.

SuperMap produces a queryable 4D scene-graph representation that interfaces naturally with Vision-Language Models by supporting compositional queries over object semantics, relations, and history. We demonstrate SuperMap on benchmarks and real robots, including dynamic scenes with appearance/disappearance and relocation, and provide ablations and runtime analysis. We will release the full system as open source to provide the community with a deployable baseline for open-vocabulary spatio-temporal mapping.

Contributions

🗺️
Open-Vocabulary Spatio-Temporal SLAM
An online robotic system that builds a persistent, queryable open-vocabulary 4D scene memory suitable for downstream language-conditioned tasks — running fully onboard in real time.
🔄
Spatio-Temporal Object Tracking
An online pipeline that integrates 2D–3D association, validation, and change-aware updates to maintain instance consistency under occlusions, partial observations, label variability, and scene change.
🕸️
Instance-level Scene Graph
A 4D scene graph that stores spatial and temporal information for each object, giving robots instance-level spatio-temporal reasoning, for example locating moved objects and recalling past scenes.
📖
Open-Source Framework
Full release of the change-detection benchmark, comprehensive ablations, runtime profiling, and the real-robot visual–language navigation pipeline to facilitate reproducible research.

System Architecture

1. Geometric Layer: Online 3D Reconstruction

SuperOdometry provides pose estimation and a colorized dense 3D model from synchronized RGB images, depth/LiDAR, and IMU streams. Geometric priors anchor all subsequent 2D–3D association and global map consistency checks.
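As a rough illustration of how a pose estimate and depth can anchor 2D–3D association, the sketch below lifts the pixels of a 2D detection mask into world-frame 3D points. It assumes a pinhole camera model and a camera-to-world pose given as a rotation matrix and translation. The `Intrinsics` type and all function names are hypothetical and are not taken from SuperMap or SuperOdometry.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


@dataclass(frozen=True)
class Intrinsics:
    """Pinhole camera intrinsics (assumed model, in pixels)."""
    fx: float
    fy: float
    cx: float
    cy: float


def backproject(u: float, v: float, depth: float, K: Intrinsics) -> Vec3:
    """Lift pixel (u, v) with metric depth into the camera frame."""
    x = (u - K.cx) * depth / K.fx
    y = (v - K.cy) * depth / K.fy
    return (x, y, depth)


def camera_to_world(p_cam: Vec3, R: Sequence[Sequence[float]], t: Vec3) -> Vec3:
    """Apply the camera-to-world pose: p_world = R @ p_cam + t."""
    return tuple(
        sum(R[i][j] * p_cam[j] for j in range(3)) + t[i] for i in range(3)
    )


def lift_mask(
    pixels: Sequence[Tuple[float, float]],
    depths: Sequence[float],
    K: Intrinsics,
    R: Sequence[Sequence[float]],
    t: Vec3,
) -> List[Vec3]:
    """Convert masked pixels to world points, skipping invalid depths."""
    points = []
    for (u, v), d in zip(pixels, depths):
        if d <= 0.0:  # non-positive depth is treated as a missing measurement
            continue
        points.append(camera_to_world(backproject(u, v, d, K), R, t))
    return points
```

The resulting world-frame points could then be compared against existing map objects, for instance by overlap or distance, which is one plausible way to realize the association described above.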

2. Instance Layer: Spatio-Temporal Instance Association

Per-frame open-vocabulary detections (GroundingDINO + SAM2) are associated to existing 3D map objects via a hybrid 2D–3D tracker. A probabilistic geometric consistency update and Bayesian semantic fusion maintain stable object identities across long time horizons under occlusions and scene change.
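The page does not give the update equations, so the following is only a minimal sketch of what a Bayesian label fusion and an existence-confidence update could look like. It assumes a categorical label posterior per object and a clamped log-odds existence score. The parameters `p_hit`, `p_miss`, `floor`, and `clamp` are illustrative assumptions, not values from SuperMap.

```python
import math
from typing import Dict


def fuse_labels(
    prior: Dict[str, float], likelihood: Dict[str, float], floor: float = 1e-3
) -> Dict[str, float]:
    """Recursive Bayes update of a categorical label distribution.

    Labels missing from either side get a small floor probability so that
    open-vocabulary labels seen for the first time are not zeroed out.
    """
    labels = set(prior) | set(likelihood)
    unnorm = {
        label: prior.get(label, floor) * likelihood.get(label, floor)
        for label in labels
    }
    z = sum(unnorm.values())
    return {label: p / z for label, p in unnorm.items()}


def update_existence(
    log_odds: float,
    observed: bool,
    p_hit: float = 0.7,
    p_miss: float = 0.4,
    clamp: float = 5.0,
) -> float:
    """Log-odds existence update, applied only when the object should be visible.

    A detection raises the score; a miss in an unoccluded view lowers it.
    Clamping keeps the score responsive to later scene changes.
    """
    p = p_hit if observed else p_miss
    log_odds += math.log(p / (1.0 - p))
    return max(-clamp, min(clamp, log_odds))


def existence_probability(log_odds: float) -> float:
    """Convert log-odds back to a probability."""
    return 1.0 / (1.0 + math.exp(-log_odds))
```

Under this scheme, an object whose existence probability falls below a chosen threshold after repeated unoccluded misses would be a candidate for pruning, while occluded frames leave the score untouched.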

3. Topological Layer: Abstract 4D Scene Graph

The object map is abstracted into a scene graph G = (V, E_s, E_t) with spatial edges E_s (geometric predicates: on, beside, under) and temporal edges E_t (object trajectory history). The graph is serialized as structured text for compositional VLM queries over object semantics, spatial relations, and temporal history.
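To make the structure of G concrete, here is a small sketch of one possible in-memory representation and text serialization. It assumes that temporal edges are stored as a per-object list of timestamped positions and spatial edges as (subject, predicate, object) triples. The class names, method names, and output format are hypothetical and do not reflect SuperMap's actual serialization.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Position = Tuple[float, float, float]


@dataclass
class ObjectNode:
    """A vertex in V: one object instance with its observation history."""
    node_id: int
    label: str
    # Temporal edges are implied by consecutive (timestamp, position) entries.
    trajectory: List[Tuple[float, Position]] = field(default_factory=list)


@dataclass
class SceneGraph:
    nodes: Dict[int, ObjectNode] = field(default_factory=dict)
    # Spatial edges as (subject id, predicate, object id) triples.
    spatial_edges: List[Tuple[int, str, int]] = field(default_factory=list)

    def add_object(self, node_id: int, label: str) -> None:
        self.nodes[node_id] = ObjectNode(node_id, label)

    def observe(self, node_id: int, timestamp: float, position: Position) -> None:
        """Append one timestamped position to an object's history."""
        self.nodes[node_id].trajectory.append((timestamp, position))

    def relate(self, subject: int, predicate: str, obj: int) -> None:
        self.spatial_edges.append((subject, predicate, obj))

    def to_text(self) -> str:
        """Serialize objects, their histories, and relations as plain text."""
        lines = []
        for node in sorted(self.nodes.values(), key=lambda n: n.node_id):
            lines.append(f"object {node.label}#{node.node_id}")
            for t, (x, y, z) in node.trajectory:
                lines.append(f"  at t={t:.1f}: ({x:.2f}, {y:.2f}, {z:.2f})")
        for subject, predicate, obj in self.spatial_edges:
            s, o = self.nodes[subject], self.nodes[obj]
            lines.append(
                f"relation {s.label}#{s.node_id} {predicate} {o.label}#{o.node_id}"
            )
        return "\n".join(lines)
```

A text dump of this kind can be placed directly in a VLM prompt, so a question such as "where was the mug before it moved?" can be answered from the listed history rather than from the latest frame alone.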

3D Point Cloud Viewer

Interactive viewer: CMU campus segment 01, 3D bounding boxes from SuperMap.

Results

SuperMap is evaluated on the ScanNet benchmark against state-of-the-art semantic mapping and semantic SLAM methods, and on real-robot deployments in dynamic indoor environments with object appearance and disappearance events.

Class-level Segmentation on ScanNet

Method            Approach      mIoU (%)  f-mIoU (%)  Acc (%)
ConceptGraphs     object-level  21.62     24.32       31.05
HOV-SG            object-level  26.79     36.05       35.17
SuperMap (Ours)   object-level  27.42     43.50       55.48
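For readers unfamiliar with the three columns, the sketch below computes them from a class confusion matrix. It assumes the common definitions: mIoU as the unweighted mean of per-class IoU, f-mIoU as the IoU weighted by ground-truth class frequency, and Acc as overall point accuracy. The page does not state the exact evaluation protocol, so these definitions and the function name are assumptions.

```python
from typing import Dict, List


def segmentation_metrics(conf: List[List[int]]) -> Dict[str, float]:
    """Compute mIoU, f-mIoU, and accuracy from a confusion matrix.

    conf[g][p] counts points with ground-truth class g predicted as class p.
    Classes that never occur in the ground truth are skipped for IoU.
    """
    n = len(conf)
    total = sum(sum(row) for row in conf)
    if total == 0:
        raise ValueError("confusion matrix is empty")

    ious, weights = [], []
    correct = 0
    for c in range(n):
        tp = conf[c][c]
        gt = sum(conf[c])                          # points labeled c
        pred = sum(conf[r][c] for r in range(n))   # points predicted as c
        correct += tp
        if gt == 0:
            continue
        ious.append(tp / (gt + pred - tp))
        weights.append(gt / total)

    return {
        "mIoU": sum(ious) / len(ious),
        "f-mIoU": sum(w * i for w, i in zip(weights, ious)),
        "Acc": correct / total,
    }
```

Because f-mIoU weights classes by how often they occur, it can differ substantially from mIoU when frequent classes such as walls and floors are segmented better or worse than rare ones.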

Instance-level Segmentation on ScanNet (mAP50)

Method            Chair   Window  Refrigerator  Sofa    Door
HOV-SG            4.58    0.00    0.00          30.00   9.70
ConceptGraphs     0.00    0.00    0.00          0.00    0.00
SuperMap (Ours)   63.76   42.20   62.50         33.35   10.00

Spatio-Temporal Change Detection Recall

Method            Appeared (Buc.)  Appeared (Cart)  Appeared (Sign)  Disappeared (Plant)  Disappeared (Trash)  Disappeared (Chair)
Khronos
DualMap           0.000            0.000            0.000            0.310                0.000                0.000
SuperMap (Ours)   1.000            0.262            0.583            0.755                0.434                1.000

Citation

BibTeX

@inproceedings{supermap2026,
  title     = {SuperMap: A Spatio-Temporal SLAM System for Visual-Language Navigation},
  author    = {Anonymous},
  booktitle = {Robotics: Science and Systems (RSS)},
  year      = {2026}
}