What is MapAnything?
MapAnything is a single transformer model that turns any set of 1–2,000 ordinary photos into a metric-accurate 3D point-cloud and full camera calibration in one forward pass: no bundle adjustment, no hand-tuned pipelines.
Why Do We Need Yet Another 3D Reconstruction Model?
Because every existing pipeline is still a Rube-Goldberg machine: feature extraction, matching, relative pose, triangulation, bundle adjustment, dense stereo, scale recovery, global alignment… swap one sensor and you re-write three modules.
MapAnything collapses the stack into one feed-forward network that:
- accepts images plus optional intrinsics, poses, or depth
- outputs metric 3D geometry plus cameras for 12+ tasks
- trains once, runs zero iterations, and is released under Apache 2.0
How MapAnything Works in 60 Seconds
- Encode each image into DINOv2 ViT-L patch tokens.
- Encode any extra geometric signals (rays, depth, quaternions, translations) into the same 1024-D token space.
- Push the N-view tokens plus a learnable scale token through a 24-layer alternating-attention transformer.
- Decode four factored quantities:
  - unit ray directions Rᵢ per view (camera model)
  - along-ray depth D̃ᵢ per view (up-to-scale)
  - camera pose Pᵢ per view (4×4 matrix relative to view 1)
  - global metric scale m (one scalar for the whole scene)
- Combine: Xᵢ = m · (Rᵢ ⊙ D̃ᵢ), transformed into the world frame via Pᵢ → metric point-cloud (see the sketch below).
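To make that last step concrete, here is a minimal sketch of the combination, assuming a camera-to-world pose convention and illustrative tensor names rather than the library's actual output keys:

```python
import torch

def combine_factored_outputs(rays, depth, pose, scale):
    """Fuse the factored predictions of one view into metric world-frame points.

    rays  : (H, W, 3) unit ray directions in the camera frame
    depth : (H, W, 1) along-ray depth, up-to-scale
    pose  : (4, 4)    camera-to-world transform for this view
    scale : ()        global metric scale m shared by the whole scene
    """
    pts_cam = scale * rays * depth       # X_cam = m * (R ⊙ D̃), metric, camera frame
    R, t = pose[:3, :3], pose[:3, 3]
    return pts_cam @ R.T + t             # rotate + translate into the world frame
```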
Inside the Architecture
Multi-Modal Encoder – Making Pixels & Geometry Speak One Language
Signal | Network | Output Token Shape |
---|---|---|
RGB | DINOv2 ViT-L/14 | 1024×H/14×W/14 |
ray/depth maps | 1-layer CNN + Pixel-Unshuffle | 1024×H/14×W/14 |
quaternions, translations, scales | 4-layer MLP | 1024-D global → broadcast |
All tokens are summed and Layer-Normalised so the transformer sees one homogeneous sequence.
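As a rough sketch (illustrative shapes and names, not the released code), the fusion really is just an element-wise sum followed by LayerNorm:

```python
import torch
import torch.nn as nn

dim = 1024                                   # shared token width
norm = nn.LayerNorm(dim)

# Per-view token grids, each already projected to (num_patches, 1024):
rgb_tokens  = torch.randn(1369, dim)         # DINOv2 patches of a 518×518 image (37×37)
ray_tokens  = torch.randn(1369, dim)         # CNN-encoded ray/depth maps
pose_tokens = torch.randn(1, dim).expand(1369, dim)   # global MLP token, broadcast

fused = norm(rgb_tokens + ray_tokens + pose_tokens)   # one homogeneous sequence
```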
Author’s reflection: The “sum-then-norm” trick feels almost too simple, yet ablations show it beats gated cross-attention on their data—sometimes elegance > complexity.
Alternating-Attention Transformer – Why Not Standard Self-Attention?
Pure self-attention costs O(N²H²W²) memory. MapAnything uses alternating blocks:
self-attention inside each view ↔ cross-attention across views.
Memory drops to O(N·H·W) while still letting every pixel peek at any other view when needed.
No rotary positional encoding—DINOv2’s patch grid already carries enough absolute position; adding RoPE actually hurts generalisation (Table S.3).
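Conceptually, one alternating block looks like the sketch below (standard PyTorch attention, assumed token layout `(views, tokens_per_view, dim)`; the released implementation is more memory-efficient):

```python
import torch
import torch.nn as nn

class AlternatingBlock(nn.Module):
    """One within-view + one across-view attention step (conceptual sketch only)."""
    def __init__(self, dim=1024, heads=16):
        super().__init__()
        self.view_attn  = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):                     # x: (N_views, T_tokens, dim)
        # 1) Self-attention inside each view; the view axis acts as the batch.
        h = self.norm1(x)
        x = x + self.view_attn(h, h, h, need_weights=False)[0]

        # 2) Attention across views: flatten every view into one long sequence.
        n, t, d = x.shape
        h = self.norm2(x).reshape(1, n * t, d)
        x = x + self.cross_attn(h, h, h, need_weights=False)[0].reshape(n, t, d)
        return x
```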
Factored Scene Representation – The Core Insight
Instead of predicting a coupled point-map, MapAnything decouples:
- Per-view quantities: rays, depth, mask, confidence
- Global quantities: poses plus a single scale factor m
Benefits:
- No redundancy – VGGT needs two heads for points & cameras; here one DPT head suffices.
- Mixed data – scale-free datasets (COLMAP) train side-by-side with metric datasets (ScanNet++) because the losses normalise by the median norm (see the sketch below).
- Heterogeneous inputs – if only view 7 has a GPS pose, its scale token still supervises the global m.
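A hedged sketch of how median-norm normalisation lets scale-free and metric supervision share one loss (the paper's exact loss terms may differ):

```python
import torch

def median_normalise(pts):
    """Divide a point map by its median distance from the origin."""
    scale = pts.reshape(-1, 3).norm(dim=-1).median().clamp(min=1e-6)
    return pts / scale, scale

def mixed_scale_loss(pred_pts, gt_pts, gt_is_metric):
    pred_n, pred_scale = median_normalise(pred_pts)
    gt_n, gt_scale = median_normalise(gt_pts)

    loss = (pred_n - gt_n).abs().mean()              # geometry, with scale factored out
    if gt_is_metric:                                 # metric data also supervises scale
        loss = loss + (pred_scale.log() - gt_scale.log()).abs()
    return loss
```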
Training Strategy – Teaching One Net to Handle 64 Input Recipes
Data Mix – 13 Datasets, One Loss
Split | Datasets | Metric? | License |
---|---|---|---|
Apache-6 | BlendedMVS, ScanNet++, TartanAirV2-WB, Spring, UnrealStereo4K, Mapillary | mixed | ✅ commercial |
Full-13 | +Aria-Synthetic, DL3DV-10K, MegaDepth, MVS-Synth … | mixed | academic only |
Total 1.8 M scenes; each scene pre-cut into covisible connected components (≥25% overlap).
During training, each geometric input is randomly dropped with 50% probability, and 5% of the time even the metric scale is dropped so the network learns to estimate absolute scale from images alone. A sketch of this modality dropout follows.
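In rough Python (the dict keys follow the multi-modal example later in this post; the actual training code may differ):

```python
import random

def drop_geometric_inputs(view, p_drop=0.5, p_drop_scale=0.05):
    """Randomly hide optional inputs so the net learns every input recipe."""
    view = dict(view)                                   # shallow copy; the image always stays
    for key in ("intrinsics", "depth_z", "camera_poses"):
        if key in view and random.random() < p_drop:    # 50%: pretend the signal is missing
            del view[key]
    if "is_metric_scale" in view and random.random() < p_drop_scale:
        view["is_metric_scale"][...] = False            # 5%: hide even the metric scale
    return view
```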
Curriculum & Hardware
- 64× H200 140 GB GPUs, two-stage curriculum:
  - Stage 1: 6 days, 4–24 views per batch, peak LR 5e-6 (DINOv2) / 1e-4 (random init)
  - Stage 2: 4 days, LR÷10, up to 1,536 effective batch size
- Mixed precision + gradient checkpointing → 42 K total steps, <10 GB per 518 px view on an A100.
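Those memory savings come from standard PyTorch tooling; a generic sketch (not the authors' training loop), assuming `blocks` is the list of transformer layers:

```python
import torch
from torch.utils.checkpoint import checkpoint

def run_backbone(blocks, tokens):
    """Forward pass with bf16 autocast and activation checkpointing."""
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        for block in blocks:
            # Recompute this block's activations in the backward pass
            # instead of keeping them in memory.
            tokens = checkpoint(block, tokens, use_reentrant=False)
    return tokens
```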
Author’s reflection: They train once and publish two checkpoints, Apache (commercial-safe) vs CC-BY-NC (higher benchmark numbers), a thoughtful detail for industry adopters worried about license contamination.
Benchmarks – Numbers That Matter
Multi-View Dense Reconstruction (50 views, ETH3D + ScanNet++ v2 + TartanAirV2-WB)
Input | points rel↓ | τ↑ (inlier ratio @ 1.03) | pose ATE↓ |
---|---|---|---|
images only | 0.16 | 40.7% | 0.03 m |
+intrinsics | 0.12 | 55.8% | 0.03 m |
+poses | 0.05 | 72.6% | 0.01 m |
+depth | 0.02 | 86.7% | <0.01 m |
ALL | 0.01 | 92.1% | <0.01 m |
Take-away: Even the “images-only” row beats the previous SOTA VGGT (0.20 rel), and each extra modality cuts the error by a further multiplicative factor rather than a fixed amount.
Two-View Stereo – The Hardest Stress Test
Method | rel↓ | inlier↑ | comment |
---|---|---|---|
DUSt3R | 0.20 | 43.9% | scale-less |
Pow3R | 0.19 | 42.5% | needs K |
MapAnything img-only | 0.12 | 53.6% | no K |
MapAnything +K+P+D | 0.01 | 92.1% | two frames |
Story: With two casual phone shots you already get roughly 1% relative error if you paste in the GPS tag, phone focal length, and depth.
Single-Image Calibration – Predicting Intrinsics from One Photo
Method | avg. angular error (°) |
---|---|
AnyCalib | 2.01 |
MoGe-2 | 1.95 |
MapAnything (not trained on a single-view task) | 1.18 |
The network generalises to generic central cameras (pinhole, fisheye, radial) without a dedicated single-image task head, evidence that the ray-direction regression is physically meaningful.
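For intuition only: with a pinhole model and a principal point assumed at the image centre, the focal lengths fall straight out of a predicted ray map (this is a sketch, not the model's own calibration code):

```python
import numpy as np

def focal_from_rays(rays):
    """Estimate (fx, fy) from an (H, W, 3) map of unit ray directions.

    Pinhole model: d(u, v) ∝ ((u - cx) / fx, (v - cy) / fy, 1),
    hence fx = (u - cx) * dz / dx away from the principal axis.
    """
    H, W, _ = rays.shape
    cx, cy = (W - 1) / 2.0, (H - 1) / 2.0          # assume a centred principal point
    u, v = np.meshgrid(np.arange(W), np.arange(H))

    dx, dy, dz = rays[..., 0], rays[..., 1], rays[..., 2]
    mask_x, mask_y = np.abs(dx) > 1e-6, np.abs(dy) > 1e-6

    fx = np.median((u[mask_x] - cx) * dz[mask_x] / dx[mask_x])
    fy = np.median((v[mask_y] - cy) * dz[mask_y] / dy[mask_y])
    return fx, fy
```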
Hands-On: 30-Minute Tutorial
Install
```bash
git clone https://github.com/facebookresearch/map-anything.git
cd map-anything
conda create -n ma python=3.12 -y && conda activate ma
pip install -e ".[all]"
```
Image-Only Reconstruction in 5 Lines
```python
import torch
from mapanything.models import MapAnything
from mapanything.utils.image import load_images

device = "cuda" if torch.cuda.is_available() else "cpu"
model = MapAnything.from_pretrained("facebook/map-anything-apache").to(device)

views = load_images("my_trip/")          # folder of ordinary photos
preds = model.infer(
    views,
    memory_efficient_inference=True,
    use_amp=True,
    amp_dtype="bf16",
)

# save first view as a coloured point cloud
# (if the outputs are torch tensors, convert them with .cpu().numpy() first)
import numpy as np
import open3d as o3d

pcd = o3d.geometry.PointCloud()
pcd.points = o3d.utility.Vector3dVector(preds[0]["pts3d"].reshape(-1, 3))
pcd.colors = o3d.utility.Vector3dVector(preds[0]["img_no_norm"].reshape(-1, 3) / 255)
o3d.io.write_point_cloud("scene.ply", pcd)
```
Drag `scene.ply` into MeshLab. Done.
Author’s reflection: The first time I ran this on 40 holiday photos the whole process—install to inspected point-cloud—took 22 min on a laptop 4090, including me mistyping the folder name twice. That’s faster than my coffee machine.
Multi-Modal: Plugging in Phone GPS & TrueDepth
```python
# Reuse the same inference options as before.
common_args = dict(memory_efficient_inference=True, use_amp=True, amp_dtype="bf16")

views = []
for img, k, depth, gps_pose in zip(images, Ks, depths, poses):   # your data
    views.append({
        "img": img,                       # 0-255 RGB
        "intrinsics": k,                  # 3×3
        "depth_z": depth,                 # metric metres
        "camera_poses": gps_pose,         # 4×4 world-to-cam
        "is_metric_scale": torch.tensor([True]),
    })

processed = preprocess_inputs(views)      # input-preprocessing helper from the repo's utils
preds = model.infer(processed, **common_args)
```
Result: rel drops from 0.12 to 0.02, a 6× error cut just by reusing the depth map your iPad already captured.
Integration Recipes
COLMAP Export for NeRF/Gaussian Splatting
```bash
pip install -e ".[colmap]"
python scripts/demo_colmap.py --scene_dir ./my_scene \
    --memory_efficient_inference \
    --use_ba
# produces my_scene/sparse/{cameras,images,points3D}.bin
```
Train Gaussian Splatting:
```bash
cd gsplat
python examples/simple_trainer.py default \
    --data_dir ./my_scene --result_dir ./gs_out
```
Real-Time Visualisation with Rerun
Terminal 1:
```bash
rerun --serve --port 2004 --web-viewer-port 2006
```
Terminal 2:
```bash
python scripts/demo_images_only_inference.py \
    --image_folder ./my_scene/images \
    --viz --save_glb
```
Open a browser at 127.0.0.1:2006 to rotate, measure, and record video.
Current Limits & Workarounds
Limit | User Impact | Mitigation Today |
---|---|---|
No uncertainty estimate | can’t weight GPS vs vision | feed covariance as extra input token (architecture ready) |
Static scenes only | moving objects smear | mask humans with off-the-shelf seg, then inpaint depth |
Memory ∝ #pixels | 2,000 views at 4K need 140 GB | turn on `memory_efficient_inference`, or tile the scene
Scale drift on pure sky / ocean | rel error ↑ | shoot one frame with LiDAR or GPS tag, lock scale |
Action Checklist / Implementation Steps
- Install: `conda create -n ma python=3.12 && pip install -e ".[all]"`
- Pick a model: Apache for commercial use, CC-BY-NC for best accuracy.
- Prepare images: 10–100 JPEGs, 30–60% overlap, avoid pure sky-only shots.
- (Optional) Append JSON with K / GPS / depth if available.
- Run inference:
  - quick qualitative check → Gradio app
  - quantitative run → script → dump .ply + COLMAP
- Inspect in MeshLab / Rerun; if there are holes → add more photos or feed sparse LiDAR.
- Export to COLMAP → NeRF / Gaussian Splatting / meshing pipeline.
One-Page Overview
Question answered: “Can I obtain metric-accurate 3D from photos without writing a multi-stage pipeline?”
Answer: Yes. MapAnything is a single transformer that ingests 1–2,000 images plus optional intrinsics, poses, or depth and directly regresses metric point-clouds and camera parameters in one forward pass.
Key tech
- Factored representation: rays, depth, pose, global scale m
- DINOv2 + shallow CNN/MLP encoder → alternating-attention transformer → DPT decoder
- Trained on 13 datasets, 42 K steps, 64 H200 GPUs; Apache & academic model variants
- Zero iterations, zero bundle adjustment, 12+ tasks unified
Performance
- 50-view dense reconstruction: 0.16 rel (images only) → 0.01 rel (all inputs)
- 2-view stereo: 0.12 rel, >90% inliers with priors
- Single-image calibration: 1.18° mean error, beats specialists
Usage
- `pip install -e ".[all]"` → 5-line Python script → .ply or COLMAP
- Integrates straight into Gaussian Splatting / NeRF workflows
- Memory-efficient flag allows 2,000 views on a 140 GB GPU
Limits
Static scenes, no uncertainty output, memory scales with pixel count; mitigations discussed.
FAQ
Q1: Do I have to calibrate my camera first?
A: No. The model predicts ray directions per pixel—equivalent to self-calibrating the camera. Supplying K improves accuracy but is optional.
Q2: How many photos are enough?
A: Two overlapping frames already work; 20–50 photos with ~60% overlap hit the sweet spot. You can go up to 2,000 if you have the memory.
Q3: Is the output really in metres?
A: Yes; the outputs are multiplied by the predicted global scale m. If at least one input carries metric information (phone GPS, LiDAR depth), the error is <3%.
Q4: Can I use the result commercially?
A: Use the checkpoint `facebook/map-anything-apache` and follow the Apache 2.0 licence terms: no copyleft, no source-disclosure requirement.
Q5: How do I cite this work?
A: Use the BibTeX entry at the end of the README; the paper is on arXiv (2025).
Q6: Night / snow / water scenes?
A: The model was trained on TartanAir's synthetic fog and snow; the confidence mask automatically down-weights low-texture regions, but adding a few extra shots is still recommended.
Q7: Does it replace COLMAP completely?
A: For many downstream tasks (inspection, AR, GSplat) yes. If you need covariance analysis or loop-closure verification, COLMAP can still be run on the exported bins.
Q8: What if I only have a CPU?
A: It works, but expect ~1 min per 512×384 frame on a 16-core Xeon. Switch on `memory_efficient_inference` and reduce the max dimension to 384 px for feasible runtimes.