UniT

Overview

We present UniT, a unified feed-forward model that brings a wide range of geometry perception capabilities into a single framework, covering diverse view configurations, modality combinations, metric-scale perception, and long-horizon scalability. It supports both online and offline inference over an arbitrary number of views, flexibly incorporates auxiliary modalities such as camera parameters and depth maps, recovers metric-scale geometry (measured in meters), and maintains bounded complexity over long horizons in in-the-wild environments.

Results

UniT is evaluated on 10 benchmark datasets, covering 7 representative geometry perception tasks, against 6 recent and representative baselines, under both scale-invariant and metric-scale settings.

Benchmarks: 7-Scenes, NRGBD, DTU, TUM-Dynamic, ScanNetV2, Sintel, KITTI, NYUv2, ETH3D, and Bonn.
Tasks: Multi-View Reconstruction, Camera Pose Estimation, Video Depth Estimation, Monocular Depth Estimation, Long-Horizon Perception, Multi-Modal Reconstruction, and Depth Completion.
Baselines: VGGT (CVPR 2025), π³ (ICLR 2026), MapAnything (3DV 2026), DepthAnything3 (ICLR 2026), CUT3R (CVPR 2025), and StreamVGGT (ICLR 2026).
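To make the difference between the scale-invariant and metric-scale settings concrete, here is a minimal sketch of a depth error computed both ways. It is not the evaluation code behind the reported tables: the choice of absolute relative error, the median-ratio alignment, and all function names are assumptions chosen for illustration, and the page does not state which alignment protocol is used.

```python
from statistics import median


def abs_rel(pred, gt):
    """Mean absolute relative error over paired depth values (gt must be > 0)."""
    return sum(abs(p - g) / g for p, g in zip(pred, gt)) / len(gt)


def median_scale(pred, gt):
    """Single scale factor aligning pred to gt by the ratio of medians (assumed protocol)."""
    return median(gt) / median(pred)


def evaluate(pred, gt, metric_scale):
    """Score depth predictions.

    metric_scale=True  : compare raw predictions to ground truth in meters.
    metric_scale=False : first remove the global scale, then compare.
    """
    if not metric_scale:
        s = median_scale(pred, gt)
        pred = [s * p for p in pred]
    return abs_rel(pred, gt)


# Toy data: the prediction has the right shape but half the true scale.
gt = [1.0, 2.0, 4.0]
pred = [0.5, 1.0, 2.0]

metric_err = evaluate(pred, gt, metric_scale=True)
scale_inv_err = evaluate(pred, gt, metric_scale=False)
print(metric_err, scale_inv_err)
```

In this toy case the scale-invariant error vanishes while the metric-scale error does not, which is why a model can rank well in one setting and poorly in the other.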

A. Multi-View Reconstruction

Table III. Multi-view reconstruction is evaluated on the scene-level real-world 7-Scenes and synthetic NRGBD datasets, as well as the object-centric DTU dataset.

B. Camera Pose Estimation

Table IV. Camera pose estimation is evaluated on the synthetic outdoor Sintel dataset and the real-world indoor TUM-Dynamic and ScanNetV2 datasets.

C. Video Depth Estimation

Table V. Video depth estimation is evaluated on Sintel and the real-world Bonn and ETH3D datasets.

D. Monocular Depth Estimation

Table VI. Monocular depth estimation is evaluated on Sintel and the widely used KITTI and NYUv2 datasets.

E. Long-Horizon Perception

F. Multi-Modal Reconstruction

Table VII. Multi-modal reconstruction covers arbitrary combinations of depth maps, camera intrinsics, and extrinsics as auxiliary inputs, evaluated on the 7-Scenes, ETH3D, and ScanNetV2 datasets.
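As a small illustration of what "arbitrary combinations" of three auxiliary modalities implies, the sketch below enumerates every possible input subset. The modality labels and the helper name are hypothetical, and the page does not state which of these subsets are actually reported in Table VII.

```python
from itertools import combinations

# Hypothetical labels for the three auxiliary modalities named in the text.
MODALITIES = ("depth", "intrinsics", "extrinsics")


def modality_subsets(modalities=MODALITIES):
    """Return every subset of the auxiliary modalities, smallest first."""
    return [
        combo
        for r in range(len(modalities) + 1)
        for combo in combinations(modalities, r)
    ]


subsets = modality_subsets()
for combo in subsets:
    # The empty tuple corresponds to image-only input.
    print(combo if combo else "(images only)")
```

Three optional modalities give 2³ = 8 input configurations, including the image-only case.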

G. Depth Completion

Table VIII. Depth completion is evaluated using raw depth maps with four sparse patterns on the Sintel, KITTI, and NYUv2 datasets.

Visualizations

Fig. 8. Qualitative results on multi-view reconstruction. All point clouds are presented in their raw form, without any alignment or filtering. Point clouds within the same row are displayed at a consistent scene scale.

Citation

@misc{wang2026unit,
      title={UniT: Unified Geometry Learning with Group Autoregressive Transformer},
      author={Haotian Wang and Yusong Huang and Zhaonian Kuang and Hongliang Lu and Xinhu Zheng and Meng Yang and Gang Hua},
      year={2026},
      eprint={2605.21131},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2605.21131},
}
