Volumetric Video

Volumetric video captures real-world performances and environments as full 3D recordings — not flat video but spatial data that viewers can observe from any angle, walk around, or integrate into virtual environments. It represents the convergence of video capture and 3D reconstruction, producing content that exists in three dimensions rather than being projected onto a flat plane.

Traditional volumetric capture uses arrays of synchronized cameras (often 50-100+) arranged in a dome or stage configuration. The multi-view footage is processed through computer vision pipelines that reconstruct per-frame 3D geometry and texture. Companies like Microsoft (Mixed Reality Capture Studios), Metastage, and Dimension Studio have built dedicated volumetric stages for film, sports, and entertainment production.
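The geometric core of these reconstruction pipelines is multi-view triangulation: given the same surface point observed by two or more calibrated cameras, its 3D position can be recovered linearly. A minimal sketch with NumPy, using two toy pinhole cameras (the camera matrices and test point are illustrative, not drawn from any real capture rig):

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one 3D point from its projections
    x1, x2 (normalized image coordinates) under 3x4 camera matrices."""
    A = np.vstack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)      # null vector of A is the homogeneous point
    X = Vt[-1]
    return X[:3] / X[3]              # dehomogenize

# Two toy cameras: one at the origin, one translated along x.
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])

X_true = np.array([0.5, 0.2, 4.0])   # illustrative ground-truth point
x1 = P1 @ np.append(X_true, 1.0); x1 = x1[:2] / x1[2]
x2 = P2 @ np.append(X_true, 1.0); x2 = x2[:2] / x2[2]

X_est = triangulate(P1, P2, x1, x2)
print(np.allclose(X_est, X_true))    # noise-free case recovers the point exactly
```

A production stage repeats this (with robust estimation and dense matching) for millions of surface points per frame, then fuses them into a textured mesh.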

The output is typically a sequence of textured 3D meshes — one per frame — that can be played back in game engines, AR/VR applications, or web viewers. The data rates are substantial: a single minute of high-quality volumetric video can occupy gigabytes of storage, creating challenges for streaming and distribution.
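A back-of-envelope calculation shows where the gigabytes come from. The per-frame figures below are illustrative assumptions, not measurements from any particular studio:

```python
# Rough storage estimate for one minute of a textured-mesh sequence.
# All per-frame figures are illustrative assumptions.
VERTS = 30_000            # vertices per frame
BYTES_PER_VERT = 12 + 8   # float32 xyz position + float32 uv coordinate
TRIS = 60_000             # triangles per frame
BYTES_PER_TRI = 12        # three uint32 vertex indices
TEXTURE = 2_048 * 2_048 * 3 // 4  # 2K RGB texture, assuming 4:1 image compression
FPS = 30

frame_bytes = VERTS * BYTES_PER_VERT + TRIS * BYTES_PER_TRI + TEXTURE
minute_gb = frame_bytes * FPS * 60 / 1e9
print(f"{frame_bytes / 1e6:.1f} MB/frame, {minute_gb:.1f} GB/minute")
# → 4.5 MB/frame, 8.0 GB/minute
```

Even with these conservative assumptions, a minute of footage lands in the multi-gigabyte range, far beyond what consumer connections can stream uncompressed.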

AI is transforming volumetric capture in several ways. Neural volumetric representations — including NeRF extensions and dynamic Gaussian splatting — can reconstruct volumetric content from far fewer cameras, potentially shifting capture from dedicated studios to handheld devices. Research systems have demonstrated convincing 3D video from as few as 4-8 synchronized cameras, and even from monocular video when combined with learned priors.


Compression is another AI frontier. Neural codecs can compress volumetric sequences into compact representations that stream efficiently — playing a role for 3D content analogous to what H.264/H.265 do for traditional 2D video. This is critical for making volumetric content practical for consumer delivery.

For spatial computing and mixed reality, volumetric video solves a fundamental content problem: how to bring real people and performances into 3D experiences. Sports broadcasts where you can choose your viewing angle, concerts experienced from within the audience, remote collaboration with life-sized 3D avatars of real people — these applications all depend on volumetric capture becoming more accessible and efficient.

The convergence with generative video models points toward a future where AI can synthesize volumetric content from 2D video or even text descriptions, creating 3D performances without physical capture at all.

Further Reading