The Internet's Hidden 3D Model of the World
Reshaping Open-Source Intelligence
A series of landmark advances in photogrammetry, neural rendering, and sparse-scene reconstruction has quietly crossed a threshold — transforming every smartphone photo ever uploaded into potential raw material for intelligence-grade 3D models of the world.
Bottom Line Up Front:
Advances in structure-from-motion, neural radiance fields, 3D Gaussian splatting, and the April 2026 MegaDepth-X dataset have, in combination, removed a longstanding barrier to planet-scale 3D reconstruction from ordinary internet photographs. A parallel and physically complementary capability — freely available satellite SAR imagery from ESA's Sentinel-1 constellation, now augmented by commercial constellations from ICEYE and Capella Space — provides all-weather, day-and-night geometric constraints and coherent change detection that optical photography cannot replicate. Researchers have already demonstrated that free Sentinel-1 data, processed with open-source tools, can detect 92.5% of conflict-induced building damage in Gaza with a 1.2% false positive rate — matching UNOSAT ground truth without any paid imagery. A January 2026 paper presents the first framework for fusing 3D SAR point clouds directly with neural surface reconstruction, combining the geometric rigor of radar with the photorealistic texture of optical imagery. The intelligence community — via IARPA's WRIVA program — has been funding precisely this multisource fusion capability since 2023. The convergence means that open-source intelligence practitioners, state actors, and eventually automated systems will be able to generate photorealistic, geometrically precise, continuously monitored 3D site models of virtually any location on Earth, in any weather, at any time of day — from free public data sources — without deploying aircraft, satellites, or ground vehicles of their own. The civil liberties, operational security, and geopolitical implications have yet to be seriously addressed by law, policy, or public discourse.
I. Rome, Rebuilt in a Day
In 2009, a team led by computer scientist Sameer Agarwal at the University of Washington downloaded roughly 150,000 tourist snapshots of Rome from Flickr and, by reverse-engineering the position and orientation of every camera in three-dimensional space, stitched them into a coherent 3D model of the Eternal City. The technique — structure-from-motion (SfM) — was not new, but the scale was startling. They called the paper "Building Rome in a Day," and the title was literal: processing took roughly 24 hours on a 500-core cluster. The paper scaled skeletal-set graph reduction and distributed bundle adjustment to city size, techniques that remain the architectural foundation of photogrammetric mapping systems, including Google Street View, today.
By 2015, a University of North Carolina team had extended the same approach to reconstruct the photographed world from the 100-million-image Yahoo Flickr dataset in approximately six days. The computer vision community had dramatically scaled the pipeline. And yet each time it tried to go further, it ran into the same wall.
That wall has a name: the long-tail problem. The photographic record of the internet is radically uneven. A handful of icons — the Eiffel Tower, the Colosseum, Times Square — are represented by millions of overlapping images from thousands of angles, providing the dense coverage these algorithms require. But everywhere else — the local fort, the suburban overpass, the contested building on the edge of a conflict zone — there may be five photographs, taken by different phones over several years, at wildly inconsistent angles and in different lighting conditions. Classical and learned 3D methods collapse under these conditions. The algorithms simply cannot stitch what they cannot match.
For OSINT practitioners, that wall has been the binding constraint for a decade. The most consequential locations — precisely those where journalists, researchers, or intelligence analysts lack physical access — are also the least densely photographed. The tail wags the dog.
A paper published in April 2026, and accepted at CVPR 2026, appears to have punched through that wall.
II. The Long-Tail Problem, Solved by Subtraction
MegaDepth-X, produced by Yuan Li, Yuanbo Xiangli, Hadar Averbuch-Elor, Noah Snavely, and Ruojin Cai at Cornell University, attacks the training-data bootstrapping problem with an elegant trick that, in retrospect, seems almost obvious. You cannot train a model to reconstruct sparse, poorly-covered scenes because there is no ground truth against which to verify correctness — nothing else can reconstruct them, so there are no answer keys. The team's insight: simulate the long tail from the data you already have.
They took the dense, well-reconstructed internet landmarks where high-quality 3D ground truth exists, and deliberately discarded most of the photographs — throwing away images to simulate the sparse, uneven, poorly-connected coverage characteristic of obscure real-world sites. The deliberately impoverished image sets were then used to train 3D foundation models, giving those models supervised learning experience on conditions that mimic the actual long tail of the internet photograph corpus.
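The degradation step is simple enough to sketch. Below is a minimal toy version in Python, assuming per-image camera poses are available from the dense ground-truth reconstruction; the bin size, keep fraction, and function names are illustrative assumptions, not the paper's actual procedure.

```python
import random
from collections import defaultdict

def simulate_long_tail(images, poses, keep_fraction=0.05, seed=0):
    """Degrade a densely photographed landmark into a sparse, uneven photo set.

    images: list of image IDs from a well-reconstructed scene
    poses:  dict image_id -> (azimuth_deg, elevation_deg) of each camera
    Returns a small subset biased toward a few viewing directions, mimicking
    how obscure real-world sites tend to be photographed.
    """
    rng = random.Random(seed)

    # Bucket cameras by coarse viewing direction (45-degree azimuth bins).
    bins = defaultdict(list)
    for img in images:
        azimuth, _ = poses[img]
        bins[int(azimuth // 45) % 8].append(img)

    # Keep only a couple of azimuth bins, simulating uneven angular coverage,
    # then keep a small random fraction of the images within those bins.
    survivors = []
    for b in rng.sample(sorted(bins), k=min(2, len(bins))):
        n = max(1, int(len(bins[b]) * keep_fraction))
        survivors.extend(rng.sample(bins[b], n))
    return survivors  # train on these; supervise with the dense ground truth
```

The crucial property is that the discarded images are not lost: the dense reconstruction built from the full set still supplies depth and camera ground truth for every surviving image.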
The pipeline combines MASt3R-SfM for initial reconstruction with Doppelgangers++ — a transformer trained to detect and resolve bilateral symmetry ambiguities, the class of error that causes architecturally symmetric buildings like Vienna's Belvedere Palace to fold in on themselves in reconstruction — and COLMAP multi-view stereo for dense depth maps. The result is a large-scale dataset of clean 3D reconstructions with dense depth, paired with a sparsity-aware sampling strategy.
Fine-tuning two leading 3D foundation models on MegaDepth-X produced striking results. On the hardest sparse-scene benchmarks, the state-of-the-art model π³ (Pi Cube) moved from 75% rotation accuracy to 86% after fine-tuning — a significant jump on a metric where every percentage point represents a class of real-world scenes that previously failed. The paper's conclusion is direct: the long-tail regime of internet photo collections, comprising the vast majority of the world's photographable surface, is now tractable.
"We believe that tackling this long-tail regime represents one of the next frontiers for 3D foundation models." — Li et al., "Long-Tail Internet Photo Reconstruction," CVPR 2026 (arXiv:2604.22714)
III. The Technological Stack Beneath the Breakthrough
MegaDepth-X did not come from nowhere. It is the capstone of a nearly two-decade arc of compounding advances, each of which expanded what was achievable from publicly available imagery.
| Year | Advance | Significance for OSINT |
|---|---|---|
| 2009 | Building Rome in a Day (Agarwal et al., UW) | Demonstrated landmark-scale 3D reconstruction from uncontrolled internet photos; established SfM toolkit |
| 2015 | Planet-scale Flickr reconstruction (UNC) | Same techniques at global scale in ~6 days; exposed long-tail limitation |
| 2018 | MegaDepth (Li & Snavely, Cornell) | Used dense reconstructions as training data for single-image depth prediction; enabled per-pixel geometry from one photo |
| 2020 | NeRF (Mildenhall et al., UCB / Google) | Neural representation of entire scenes, including lighting and view-dependent effects; set new quality ceiling |
| 2021 | NeRF in the Wild (Martin-Brualla et al., Google) | Handled uncontrolled internet photos with tourists; enabled temporal lighting disentanglement |
| 2023 | 3D Gaussian Splatting (Kerbl et al.) | 100+ FPS real-time rendering in browser; made scenes explicitly editable; replaced NeRF's computation bottleneck |
| 2024 | MegaScenes dataset (Tung, Snavely et al., Cornell/Stanford/Adobe) | 430K scenes, 8.8M images, 100K+ SfM reconstructions from Wikimedia Commons; infrastructure for foundation model training |
| 2024 | Wild Gaussians; DUSt3R / MASt3R (Leroy et al.) | Real-time browser lighting control; end-to-end feed-forward reconstruction bypassing SfM entirely |
| 2025 | VGGT (Wang et al., CVPR Best Paper) | 1.2B-parameter transformer produces cameras, depth maps, and 3D point maps in a single forward pass in seconds |
| 2025 | π³ / Pi Cube | Removed VGGT's anchor-photo dependency; robust reconstruction without a reference frame |
| 2025 | Skyfall-GS (Lee et al.) | Satellite imagery → walkable city-block 3D scenes using Gaussian splatting + diffusion gap-filling; no street-level data required |
| 2026 | MegaDepth-X (Li, Snavely et al., CVPR 2026) | Long-tail regime unlocked: sparse, noisy, uneven internet photo sets now yield coherent 3D models |
| 2014– | Sentinel-1 (ESA Copernicus), free open archive | Decade-long, all-weather C-band SAR archive of entire Earth; basis for InSAR/CCD damage detection worldwide |
| 2024– | Commercial SAR constellation: ICEYE (40+ sats), Capella (15 sats), Umbra; NRO SCE BAA contracts | Sub-25cm resolution, sub-3hr revisit; bistatic 3D collection; open data archives with no paywall |
| 2026 | SAR-Neural Surface Fusion (Li et al., arXiv:2601.22045) | First framework fusing 3D SAR point clouds with neural surface reconstruction; resolves sparse-optical ambiguity with radar geometry |
The transition from NeRF to 3D Gaussian splatting is particularly consequential for operational use. Neural radiance fields encoded an entire scene into a neural network's weights, delivering beautiful results at prohibitive computational cost — rendering a single frame required querying the network at millions of points. Gaussian splatting instead represents scenes as clouds of oriented, semi-transparent ellipsoidal primitives called Gaussians. Because GPUs can rasterize these directly, rendering rates exceeding 100 frames per second in a browser became achievable. Crucially, the scene representation is explicit and editable: once you have Gaussians, every downstream manipulation — segmentation, change detection, temporal comparison — becomes tractable in a way it never was with implicit neural networks.
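The compositing rule that makes this fast is worth seeing concretely. Here is a deliberately toy, single-ray Python sketch of front-to-back alpha compositing over depth-sorted isotropic Gaussians; the production renderer projects anisotropic Gaussians to screen space and rasterizes them in GPU tiles, but the blending arithmetic is the same idea.

```python
import numpy as np

def composite_pixel(gaussians, ray_o, ray_d):
    """Toy front-to-back alpha compositing of 3D Gaussians along one ray.

    gaussians: list of dicts with keys center (3,), scale (float, isotropic
    for brevity), opacity in [0, 1], and rgb (3,). ray_d must be unit-length.
    """
    # Sort primitives by depth along the ray, front to back.
    order = np.argsort([np.dot(g["center"] - ray_o, ray_d) for g in gaussians])

    color = np.zeros(3)
    transmittance = 1.0
    for i in order:
        g = gaussians[i]
        to_center = g["center"] - ray_o
        # Perpendicular distance from the ray to the Gaussian's center.
        perp = to_center - np.dot(to_center, ray_d) * ray_d
        # Gaussian falloff sets this primitive's effective alpha.
        alpha = g["opacity"] * np.exp(-0.5 * np.dot(perp, perp) / g["scale"] ** 2)
        color += transmittance * alpha * np.asarray(g["rgb"], dtype=float)
        transmittance *= 1.0 - alpha
        if transmittance < 1e-4:  # early ray termination, key to high frame rates
            break
    return color
```

Because every step here is a sort plus a blend, GPUs execute it natively; there is no network to query, which is exactly why the NeRF bottleneck disappeared.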
VGGT, which won the Best Paper Award at CVPR 2025 — the premier computer vision conference, with roughly 13,000 submitted papers — demonstrated that a 1.2-billion-parameter transformer trained on diverse scene datasets could take any pile of photos and return camera poses, per-image depth maps, and a unified 3D point cloud in a single forward pass, in seconds. No iterative bundle adjustment. No feature-matching pipeline. One pass through the network.
IV. The Intelligence Community's Parallel Track
The timing of IARPA's WRIVA program is instructive. On July 12, 2023 — well before MegaDepth-X and while VGGT was still in development — the Intelligence Advanced Research Projects Activity, the advanced research arm of the Office of the Director of National Intelligence, publicly launched a 42-month effort with a strikingly similar objective.
The Walk-through Rendering from Images of Varying Altitude (WRIVA) program, as described in the official ODNI press release, "seeks to produce innovations that will advance 3-D site modelling capabilities far beyond today's state of the art, giving personnel virtual 'ground truth' with unrivaled insights into locations that would be difficult, if not impossible, to view." The program's stated goal is to build photorealistic, navigable 3D site models from a highly limited corpus of imagery — ground-level photographs, traffic camera footage, drone shots, and satellite data — in precisely those areas where dense coverage does not exist.
"WRIVA's ability to help users visually see and plan a mission or activity, despite limited access to imagery, will be a game-changer for the IC and others who require a deep grasp of the physical environment they will be operating in," said WRIVA Program Manager Ashwini Deshpande, who came to IARPA from the National Geospatial-Intelligence Agency. "And while it will not be better than reality, it might mean the difference between mission failure and success."
IARPA awarded WRIVA research contracts to performer teams drawn from 34 universities, non-profits, and businesses. The test and evaluation team consists of the Johns Hopkins University Applied Physics Laboratory, the MITRE Corporation, and MIT Lincoln Laboratory — three of the organizations most deeply embedded in the intelligence community's technical infrastructure. The program's public challenge data, released through the ULTRRA workshop at WACV 2025, includes images calibrated using RTK-corrected GPS coordinates to centimeter accuracy.
The Abbottabad comparison is apt and has been drawn explicitly in public reporting. In that 2011 operation, the CIA famously built a physical scale model of bin Laden's compound for mission rehearsal. WRIVA would replace physical models with photorealistic navigable digital replicas built from sparse, heterogeneous, publicly available imagery — extending the same capability to any location on Earth that appears in photographs at any altitude.
MegaDepth-X, it should be noted, was funded in part by Korea's National Research Foundation AI Lab Project — a signal that multiple state actors are simultaneously pursuing the same capability along parallel tracks, whether through academic grants, intelligence programs, or commercial development.
V. From Orbit to the Ground: Closing the Vertical Gap
One of the persistent limitations of satellite-based site modeling has been the altitude discontinuity: satellite imagery captures rooftops and urban canopies, but facades, entrances, courtyards, and street-level geometry remain occluded or absent. Skyfall-GS, published by Lee and colleagues in October 2025, directly addresses this gap.
The system takes multi-view satellite imagery as its sole input and produces immersive, navigable city-block-scale 3D scenes by combining 3D Gaussian splatting — which provides coarse geometric scaffolding from the overhead views — with a text-to-image diffusion model that synthesizes realistic facade and street-level appearance where the satellite data provides no direct evidence. A curriculum-learning refinement strategy progressively lowers the virtual camera from near-nadir (85 degrees) to oblique (45 degrees) across five passes, generating 54 synthetic views per pass with text-guided appearance synthesis filling occluded regions.
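The camera schedule is straightforward to reproduce. The sketch below generates descending-orbit virtual camera positions using the numbers reported above; the linear elevation interpolation and circular orbit layout are my assumptions, not necessarily the paper's exact schedule.

```python
import numpy as np

def curriculum_poses(center, radius, passes=5, views_per_pass=54,
                     start_elev_deg=85.0, end_elev_deg=45.0):
    """Yield (pass_index, camera_position) descending from near-nadir to oblique.

    center: (3,) scene center; radius: orbit radius in scene units. Cameras
    are assumed to look at `center`; elevation drops linearly across passes.
    """
    for p, elev in enumerate(np.linspace(start_elev_deg, end_elev_deg, passes)):
        el = np.radians(elev)
        for k in range(views_per_pass):
            az = 2.0 * np.pi * k / views_per_pass
            position = center + radius * np.array([
                np.cos(el) * np.cos(az),
                np.cos(el) * np.sin(az),
                np.sin(el),
            ])
            # Pass 0 renders near-nadir views to pin down coarse geometry;
            # later passes sweep lower, exposing facades for the diffusion
            # model to inpaint.
            yield p, position
```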
The result is a free-flight walkable 3D scene of an urban area requiring no street-level photography, no LiDAR, and no aerial survey — only the satellite imagery that is commercially available for virtually any populated place on Earth. Skyfall-GS does not require costly 3D annotations and allows for real-time, immersive 3D exploration of the final product.
When Skyfall-GS (which attacks the coverage problem from above) is considered alongside MegaDepth-X (which attacks it from ground-level sparse photo collections), the two systems converge on the same target from opposite directions: a photorealistic, navigable, continuously updated 3D representation of the world built from whatever imagery happens to be available — satellite, tourist snapshot, social media post, security camera frame — with generative diffusion models filling the gaps.
VI. The Radar Dimension: Open SAR as the All-Weather Geometric Backbone
The photogrammetric pipeline described above shares a fundamental constraint with all optical sensing: it requires light. Clouds, darkness, smoke, and haze all degrade or destroy optical imagery. Over a conflict zone, over a denied area in persistent cloud cover, or over a facility actively obscured, the photograph corpus on which photogrammetric reconstruction depends may be sparse not merely because the site is obscure but because the sky is reliably opaque. Synthetic aperture radar removes that constraint entirely — and a growing archive of open, freely downloadable SAR data is now available to anyone with an internet connection.
SAR operates by transmitting microwave pulses from an airborne or spaceborne platform, illuminating the surface with its own energy and recording the complex backscattered return. Because the wavelengths used — centimeters for C-band, decimeters for L-band — pass through cloud cover, precipitation, and darkness with negligible attenuation, SAR produces coherent, repeatable imagery under all atmospheric conditions. More importantly for 3D reconstruction, SAR records a physically different quantity from a camera: rather than the color and texture of a surface as seen from a particular angle in particular lighting, it records the complex amplitude and phase of the radar return, encoding information about surface geometry, dielectric properties, and structural coherence that optical sensors simply cannot see.
The cornerstone of open SAR access is the European Space Agency's Copernicus Sentinel-1 constellation. The program currently operates three satellites: Sentinel-1A (launched April 2014), Sentinel-1C (December 2024), and Sentinel-1D (November 2025); Sentinel-1B was retired after a December 2021 power failure. Together they provide C-band SAR imagery at resolutions down to 5 meters with swaths up to 400 kilometers. With the refreshed constellation operational, revisit times over most of the world's land surface are measured in days. All Sentinel-1 data is free, full, and open — accessible without registration through the Copernicus Data Space Ecosystem and mirrored as a cloud-optimized GeoTIFF archive on Amazon Web Services. Since the program's inception, the former Sentinel Open Access Hub served nearly 760,000 registered users and disseminated 590 petabytes of data before transitioning to the new ecosystem in 2023. This represents an essentially inexhaustible, freely available archive of all-weather radar imagery of the entire Earth's surface, extending back more than a decade.
The commercial SAR sector has expanded dramatically in parallel. ICEYE, the Finnish firm that pioneered the smallsat SAR market, now operates the world's largest commercial SAR constellation — more than 40 satellites as of 2026, including six launched on a SpaceX Transporter-16 rideshare mission on March 30, 2026, delivering 25-centimeter resolution imagery with sub-3-hour revisit capability over key regions. In December 2024, ICEYE announced an open data initiative making portions of its archive available through Amazon Web Services with no registration and no paywall. Capella Space operates 15 satellites capable of sub-25-centimeter spotlight imagery, and has demonstrated bistatic collection — two satellites imaging the same scene simultaneously from different angles — a capability that directly enables 3D reconstruction from radar. The National Reconnaissance Office renewed Stage III contracts with Capella, ICEYE US, and Umbra in July 2024 through July 2026, confirming that the intelligence community is actively operationalizing commercial SAR alongside its own classified assets.
What SAR Adds That Photographs Cannot
SAR data contributes to the reconstruction pipeline in ways that are qualitatively distinct from optical photography. Three capabilities are particularly significant for OSINT applications.
First, SAR provides absolute geometric constraints that optical reconstruction lacks. The phase of the SAR return encodes path length with centimeter precision; when two SAR passes of the same scene are differenced interferometrically, a technique called InSAR, the resulting phase-difference map encodes both the surface topography (via the spatial baseline between the two orbits) and any surface displacement that occurred between the acquisitions. Across a city, a single interferometric pair yields a digital elevation model with meter-class accuracy. When stacked over time using persistent scatterer techniques, InSAR time series can measure building heights, subsidence, uplift, and structural deformation to millimeter precision. This is the geometric information that photogrammetry requires many overlapping photographs to approximate; SAR provides it directly from a side-looking radar's own measurements.
Second, SAR enables coherent change detection (CCD) — the detection of subtle structural changes between two acquisitions by measuring the loss of phase coherence between them. When a building is undamaged between two passes, the complex SAR signal coherence between those passes is high: the same scatterers return the same phase. When a building is damaged, collapses, or has its rubble disturbed, coherence drops sharply. This is a physically grounded, weather-independent, day-and-night-capable damage signal that optical imagery cannot replicate.
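For readers who want the mechanics: coherence is estimated from two coregistered single-look complex (SLC) images by a windowed, normalized cross-correlation, gamma = E[s1·s2*] / sqrt(E[|s1|²]·E[|s2|²]). A minimal numpy sketch follows; it assumes the images are already coregistered (the hard part, which tools like ESA's SNAP handle), and the thresholds in the closing comment are illustrative rather than calibrated.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def coherence(s1, s2, win=5):
    """Windowed coherence magnitude between two coregistered SLC SAR images.

    s1, s2: complex arrays of equal shape. Returns |gamma| in [0, 1]:
    near 1 where the scene is structurally unchanged between passes,
    sharply lower where scatterers moved (damage, rubble, construction).
    """
    cross = s1 * np.conj(s2)
    num = uniform_filter(cross.real, win) + 1j * uniform_filter(cross.imag, win)
    den = np.sqrt(uniform_filter(np.abs(s1) ** 2, win) *
                  uniform_filter(np.abs(s2) ** 2, win))
    return np.abs(num) / np.maximum(den, 1e-12)

# Coherent change detection: flag pixels that were stable before the event
# but decorrelated across it.
# damage = (coherence(pre1, pre2) > 0.6) & (coherence(pre2, post) < 0.3)
```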
Third, the temporal density of SAR archives gives it a change-detection cadence that no photograph corpus can match. Sentinel-1 acquires global coverage every few days regardless of cloud cover. A Gaussian splatting model trained on sparse optical photography provides the photorealistic texture and appearance; the SAR time series provides the continuous structural monitoring layer — flagging when anything in the scene has physically changed between the two most recent passes.
SAR-Neural Reconstruction Fusion: A January 2026 Landmark
A paper published on arXiv in January 2026 by Da Li, Chen Yao, and colleagues — "Urban Neural Surface Reconstruction from Constrained Sparse Aerial Imagery with 3D SAR Fusion" — represents the first framework to formally combine 3D SAR point clouds with aerial optical imagery for neural surface reconstruction. The paper's opening framing is directly relevant to the OSINT context: neural surface reconstruction from multi-view aerial imagery suffers from geometric ambiguity under sparse-view conditions — precisely the condition that characterizes the intelligence-relevant long tail. Their solution: inject radar-derived geometric priors directly into the neural surface reconstruction pipeline, using the SAR point cloud to guide structure-aware ray selection and adaptive sampling. The result is markedly improved reconstruction accuracy, completeness, and robustness compared with single-modality (optical only) baselines, under the highly sparse and oblique-view conditions that characterize the practical limits of available imagery.
The physical intuition is straightforward, and will be immediately familiar to anyone with radar engineering experience: building facades, edges, and window recesses exhibit strong, reproducible scattering responses in SAR imagery because urban structures are highly efficient radar reflectors. Double-bounce returns from wall-ground junctions, trihedral corner reflections from building edges, and distributed scattering from rough surfaces all produce characteristic signatures. Where a sparse optical image set may leave a building's geometry ambiguous, the SAR point cloud derived from even a single collection pass provides robust geometric anchoring for the neural surface reconstruction. The two modalities are physically complementary: SAR sees geometry through the physics of microwave scattering; optical imagery captures the photorealistic texture and color that SAR cannot provide.
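At the level of ray selection, the general idea is easy to sketch: project the SAR point cloud into each training view and spend a disproportionate share of rays near the projected scatterers, where radar says there is structure. Everything below (the 50/50 split, the jitter radius, the function name) is my own illustration of that idea, not the paper's algorithm.

```python
import numpy as np

def sar_guided_pixels(sar_points, K, R, t, hw, n_rays, rng):
    """Bias per-image ray sampling toward SAR-anchored structure.

    sar_points: (N, 3) radar-derived points in world coordinates.
    K: (3, 3) camera intrinsics; R, t: world-to-camera rotation/translation.
    hw: (H, W) image size; rng: a numpy Generator. Returns n_rays integer
    pixel coordinates: half jittered around projected SAR points, half
    sampled uniformly over the image.
    """
    H, W = hw
    cam = R @ sar_points.T + t[:, None]            # world -> camera frame
    uv = K @ cam[:, cam[2] > 0]                    # keep points in front
    uv = (uv[:2] / uv[2]).T                        # perspective divide
    visible = uv[(uv[:, 0] >= 0) & (uv[:, 0] < W) &
                 (uv[:, 1] >= 0) & (uv[:, 1] < H)]

    if len(visible) == 0:                          # no radar support in view
        return rng.uniform([0, 0], [W, H], (n_rays, 2)).astype(int)

    # Jitter around projected scatterers so nearby surface gets dense rays.
    n_struct = n_rays // 2
    idx = rng.integers(0, len(visible), n_struct)
    structured = visible[idx] + rng.normal(0.0, 2.0, (n_struct, 2))
    uniform = rng.uniform([0, 0], [W, H], (n_rays - n_struct, 2))
    pixels = np.concatenate([structured, uniform])
    return pixels.clip([0, 0], [W - 1, H - 1]).astype(int)
```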
SAR in Active Conflict: Sentinel-1 as a War Crimes Monitor
The operational precedent for open SAR data in OSINT already exists, and is substantial. Beginning with the Russian invasion of Ukraine in February 2022, researchers at Oregon State University and elsewhere demonstrated that Sentinel-1 InSAR coherent change detection could map building damage across the entire country from freely available radar imagery. The resulting paper, published in the journal Science of Remote Sensing in 2025, produced nationwide damage data with three-month latency from a purely open-source pipeline. The same team's follow-on work addressed Gaza, analyzing 321 openly accessible Sentinel-1 SAR images acquired before and during the conflict to produce more than 3,200 coherence images and track weekly damage trends across 330,079 building footprints. Their long temporal-arc CCD approach detected 92.5% of damage labels in reference data from the United Nations Satellite Centre with a false positive rate of only 1.2% — a performance level that yields legally useful evidence from data that costs nothing to access.
A parallel study applied Sentinel-1 CCD to the urban areas of Mariupol and Kharkiv specifically, using the spatial and temporal fidelity of the coherence time series to track the progression of the siege of Mariupol and the battle of Kharkiv at the level of individual city blocks, cross-referenced against cultural property identified in OpenStreetMap. These results were produced by academic researchers using free satellite data, open-source processing tools, and cloud computing platforms — tools accessible to any well-resourced OSINT practitioner, not merely national intelligence agencies.
VII. The OSINT Ecosystem That Will Use These Tools
Open-source intelligence has undergone its own revolution over roughly the same period. Bellingcat, founded by British journalist Eliot Higgins in 2014, demonstrated that rigorous analytical work combining satellite imagery, social media photographs, shadow geometry, road sign analysis, and architectural cross-referencing could geolocate military hardware, document atrocities, and identify war criminals — work previously requiring dedicated national intelligence assets. Their investigation of the MH17 shootdown reconstructed the route of the Buk missile launcher through eastern Ukraine from tourist Instagram posts and dash-cam videos.
The conflict in Ukraine accelerated the maturation of the OSINT ecosystem dramatically. Volunteer communities using platforms like GeoConfirmed geolocate and verify visual content from conflict zones in near-real time, building interactive verified-event maps with full geolocation proofs, order-of-battle diagrams, and equipment tracking. The Conflict Observatory, backed by the U.S. State Department and using geospatial AI tools, has documented Russian war crimes from commercially available satellite imagery and open social media sources for eventual use in legal proceedings. The HALO Trust has used OSINT and GIS mapping to document explosive hazard locations across Ukraine, verifying over 8,000 data points to guide demining operations.
All of these workflows currently depend on 2D analysis: geolocating a photograph means identifying what can be seen in the image and matching it to known map features, satellite imagery, or other photographs. What the new generation of 3D reconstruction tools enables is a qualitative step beyond that: the ability to synthesize disparate, sparse, uncoordinated image collections into a coherent volumetric model of a site, from which analysts can extract viewing angles, measurement data, change detection over time, and operational planning intelligence that no single photograph can provide.
An analyst examining a target compound who previously had three social media photographs, a satellite thumbnail, and a traffic camera frame — insufficient for classical reconstruction — can now potentially produce a navigable 3D site model integrating all of that imagery. The same tools that let an entertainment company reconstruct a concert venue for virtual reality will let an OSINT researcher reconstruct a detention facility from a handful of prisoner photographs posted on social media.
"The same tech shows up in a Cornell research paper, a Netflix VFX tool, and an intelligence program — usually within months of each other." — Bilawal Sidhu, "Building a 3D Model of the World from Internet Photos," Spatial Intelligence, May 2026
VIII. YouTube and the Videogrammetry Frontier
Every argument made so far about still photographs applies with equal or greater force to video — and the world's video archive dwarfs its photograph archive by orders of magnitude. YouTube alone reports more than 500 hours of video uploaded every minute. A significant fraction of that upload stream consists of travel vlogs, walking tours, drone footage, dashcam recordings, military equipment reviews, conflict documentation, and urban exploration content — precisely the kinds of footage that structure-from-motion algorithms can exploit for 3D reconstruction. The question of whether anybody has seriously looked at YouTube as a reconstruction source is easy to answer: the computer vision community has been doing it for years, and the results are now operationally relevant.
Videogrammetry — the systematic extraction of 3D geometry from video streams — traces its academic lineage through the same structure-from-motion literature as still photography reconstruction. A 2026 survey in the Journal of Imaging identified approximately 6,863 published studies on video-based 3D reconstruction through the end of 2024, covering structure-from-motion, multi-view stereo, and visual SLAM approaches applied to monocular video. The fundamental mechanics are the same as with photographs: each video frame is a photograph, and a video stream at 30 frames per second taken while the camera moves provides an extremely dense, continuous multi-view geometry problem with known temporal ordering — in many ways a cleaner input for reconstruction than a disorganized pile of tourist snapshots, because the frame-to-frame motion is smooth and predictable.
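Before any of this reaches a reconstruction pipeline, a video reduces to a keyframe-selection problem: sample frames sparsely enough to be tractable and discard the motion-blurred ones. A minimal OpenCV sketch of that preprocessing step (the stride and sharpness threshold are assumptions to tune per source):

```python
import cv2

def keyframes_for_sfm(video_path, stride=15, sharpness_thresh=100.0):
    """Sample sharp frames from a video as structure-from-motion input.

    Takes every `stride`-th frame (about 0.5 s apart at 30 fps) and keeps
    only those passing a variance-of-Laplacian sharpness test, a standard
    proxy for motion blur.
    """
    cap = cv2.VideoCapture(video_path)
    frames, i = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if i % stride == 0:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            if cv2.Laplacian(gray, cv2.CV_64F).var() > sharpness_thresh:
                frames.append(frame)  # feed to COLMAP, VGGT, etc. as stills
        i += 1
    cap.release()
    return frames
```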
The intelligence community appreciated this as a capability even before the academic literature fully crystallized it. A U.S. Army Corps of Engineers patent from the late 2010s specifically describes "a method of processing full motion video data for photogrammetric reconstruction," noting the military's goal of transforming millions of terabytes of full motion video (FMV) from unmanned vehicles, wireless cameras, and other sensors into usable 2D and 3D maps and models. That work — extracting still frames, recovering camera exterior orientation from on-board GPS and inertial data, and applying classical photogrammetric computational models — has been an active DoD program for over a decade.
What has changed is the democratization and scale. A YouTube walking tour of a city neighborhood, a drone flyover of a port facility, a dashcam recording of a road through a military installation's perimeter, a tourist's GoPro footage of a foreign capital — each of these, individually, is a structure-from-motion problem that feed-forward models like VGGT can now solve in seconds. Aggregated across thousands of videos uploaded by different people at different times, they become a time-resolved, multi-angle, continuously updated 3D model of the world with a temporal density no satellite constellation can match.
The mannequin challenge precedent noted in earlier research is telling: Cornell and Google researchers scraped roughly 2,000 "Mannequin Challenge" freeze-frame videos (a format that went viral in 2016) from YouTube specifically because the static-human, moving-camera format was a free structure-from-motion training set with humans embedded. The same logic applies to any video genre where the camera moves through a scene while the background is static — which describes the large majority of travel, urban exploration, and documentary video content. MegaScenes and MegaDepth-X were built on still photographs from Wikimedia Commons; there is no technical barrier to building an equivalent dataset from YouTube at ten times the scale, and the intelligence community almost certainly has programs doing exactly this with non-public video archives.
IX. The Corruption Vector: AI-Generated Imagery and the Integrity of the Reconstruction Pipeline
The same generative models that fill coverage gaps in Skyfall-GS and the WRIVA pipeline introduce a profound vulnerability to the broader reconstruction ecosystem: the progressive contamination of the open internet photograph corpus with imagery that has no geometric ground truth. The question is not merely whether AI-generated images look convincing to a human viewer. The question is whether they look convincing to a feature-matching algorithm — and the answer is deeply problematic, for reasons that are both well-understood and not yet solved.
AI-generated images produced by diffusion models and GANs share a characteristic set of geometric failures. A 2024 forensic analysis found that generated images frequently exhibit inconsistent vanishing point geometry — projections of parallel lines that fail to converge correctly in two-dimensional space, particularly in architectural scenes. They often lack proper parallax: the apparent displacement of objects when viewed from slightly different positions that is the fundamental signal structure-from-motion algorithms exploit. They can produce physically impossible depth cues, inconsistent shadows and lighting, anomalous or atypical object proportions, and missing occlusion relationships. A separate 2024 study demonstrated that generative models cannot reliably replicate projective geometric relationships — the very relationships that photogrammetric algorithms use as their primary matching signal.
From a structure-from-motion perspective, an AI-generated image of a building injected into a reconstruction pipeline is not a photograph of a building from a specific physical viewpoint. It is a plausible hallucination of what such a photograph might look like, generated by statistical processes that learned the appearance distribution of buildings but were not constrained to maintain physical consistency across viewpoints. Feature matching will fail unpredictably — the algorithm will attempt to find corresponding points between the AI-generated image and real photographs, and will either produce no matches (causing the image to be excluded as an outlier) or, more dangerously, produce spurious matches that inject incorrect geometric constraints into the reconstruction, silently corrupting the resulting 3D model. The second failure mode is worse than the first because it is invisible: the reconstruction appears to succeed, produces a plausible-looking 3D model, but that model encodes false geometry.
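That failure asymmetry suggests a screening step before bundle adjustment: test each candidate image pair for consistency with a single epipolar geometry, which a real rigid scene must admit and a statistically hallucinated "view" generally will not. A sketch using standard OpenCV primitives; the exclusion heuristic in the closing comment is illustrative, not a validated detector.

```python
import cv2
import numpy as np

def epipolar_inlier_ratio(img1, img2):
    """Fraction of feature matches consistent with one fundamental matrix.

    Genuine photo pairs of the same rigid scene yield a high inlier ratio
    under RANSAC; images that only statistically resemble the scene tend to
    match sparsely or inconsistently.
    """
    orb = cv2.ORB_create(4000)
    k1, d1 = orb.detectAndCompute(img1, None)
    k2, d2 = orb.detectAndCompute(img2, None)
    if d1 is None or d2 is None:
        return 0.0
    matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(d1, d2)
    if len(matches) < 8:
        return 0.0  # too few correspondences to even estimate geometry
    p1 = np.float32([k1[m.queryIdx].pt for m in matches])
    p2 = np.float32([k2[m.trainIdx].pt for m in matches])
    _, mask = cv2.findFundamentalMat(p1, p2, cv2.FM_RANSAC, 1.0, 0.999)
    return 0.0 if mask is None else float(mask.sum()) / len(matches)

# Images whose pairwise ratios sit far below the corpus median are candidates
# for exclusion before they can inject false constraints into the model.
```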
Model Collapse: A Systemic Threat to Reconstruction Training Data
The problem extends beyond individual images. A landmark 2024 paper in Nature by Ilia Shumailov, Zakhar Shumaylov, Yiren Zhao, and colleagues at Oxford, Cambridge, and DeepMind formally characterized "model collapse" — the process by which AI models trained on AI-generated data experience compounding information loss across successive generations. The mechanism is inexorable: AI-generated content overrepresents the statistical center of the training distribution and underrepresents the tails. When subsequent models train on this content, rare but valid examples disappear from the learned distribution. After multiple generations, the model converges on a narrow, homogeneous, increasingly unrealistic output distribution. The authors found this effect to be irreversible under standard training conditions: once the tails of the distribution have been lost, they cannot be recovered by further training on contaminated data.
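The mechanism fits in a dozen lines. The toy loop below fits a Gaussian to a corpus, regenerates the corpus entirely from the fitted model, and repeats; the corpus size and generation count are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(0.0, 1.0, 200)            # generation 0: "human" data

for gen in range(1, 61):
    mu, sigma = data.mean(), data.std()     # fit a model to the corpus
    data = rng.normal(mu, sigma, 200)       # next corpus: model samples only
    if gen % 20 == 0:
        print(f"generation {gen:2d}: fitted sigma = {sigma:.3f}")

# The fitted sigma follows a downward-biased multiplicative random walk:
# each refit bakes sampling noise into the model, and tail values, once
# unsampled, are never generated again. Run long enough, the learned
# distribution narrows toward a point. This is Shumailov et al.'s collapse.
```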
For 3D reconstruction specifically, the tail of the distribution is precisely what matters for OSINT. The unusual viewpoints, the partially occluded scenes, the edge-case lighting conditions, the architecturally ambiguous structures — these are the inputs where the models are most needed and most likely to fail. If the training corpus for future versions of VGGT, π³, and their successors is contaminated with AI-generated photographs that systematically smooth over these hard cases, the resulting models will be progressively worse at exactly the long-tail reconstruction problem that MegaDepth-X was designed to solve. A 2025 computational study tracking year-by-year semantic similarity in major English-language corpora found an exponential acceleration in synthetic contamination coinciding with the public releases of large language models between 2018 and 2022. The same trajectory is playing out in the image domain.
Deepfake Geography: Weaponized Corruption
Beyond accidental contamination, there is the deliberate adversarial case. A November 2025 study evaluated the detection of AI-generated satellite imagery and found that generative models including StyleGAN2 and Stable Diffusion can now produce synthetic satellite images that are visually convincing at terrain scale, exhibiting the characteristic large-area texture consistency of real satellite imagery — even though they encode fictional geography. As the National Geospatial-Intelligence Agency director noted in 2019, AI-manipulated satellite images represent a severe national security threat. The implication has sharpened: if a state actor can inject convincing synthetic satellite imagery into the open datasets that OSINT practitioners, reconstruction pipelines, and AI training workflows rely on, they can corrupt the geometric models derived from those datasets, create false confidence in phantom infrastructure, and cause intelligence failures grounded in plausible-looking but physically nonexistent evidence.
The formal research on deepfake geography, pioneered by Bo Zhao and colleagues at the University of Washington, demonstrated that GANs can learn the statistical appearance of satellite imagery from one location and apply it to a different base map — effectively fabricating a convincing satellite photograph of a place that looks different from how it actually looks, or that shows infrastructure that does not exist. Unlike photograph forgery in a narrow sense, this is a statistical manipulation of the entire texture distribution of a geographic area, producing results that defeat casual visual inspection and, more importantly, that can pass standard automated quality filters that look for obvious artifacts rather than geometric ground truth.
SAR as a Corruption Firewall
This is where the SAR data stream discussed in Section VI becomes not merely complementary but potentially indispensable as a data integrity mechanism. An AI-generated photograph has no corresponding SAR return. A diffusion model hallucinating a building in a location where no building exists will not produce the double-bounce radar signature, the persistent scatterer behavior, or the InSAR phase coherence that a real building of that size and construction would generate in Sentinel-1 data. The physical measurement — microwave backscatter from actual matter — cannot be fabricated by a model that has never seen physical matter and has only learned the statistical appearance of photographs. SAR coherence time series, which can detect structural changes at millimeter precision, provide a chronological ground truth against which both claimed changes and newly injected imagery can be validated.
This creates a testable framework for reconstruction pipeline integrity: any image claiming to depict a specific location and time should produce feature-consistent matches with the SAR data record for that location and time. Buildings that appear in photographs but produce no SAR return, or that appear at dates inconsistent with the SAR change detection record, are candidates for synthetic fabrication. The fusion of SAR geometric constraints with optical reconstruction is thus simultaneously an accuracy improvement and a corruption detection mechanism — radar physics as a cryptographic ground truth for the photographic record.
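The check itself is almost trivial to state in code: given the per-pass mean coherence over a claimed building footprint, a genuine structural change should coincide with a sharp coherence drop bracketing the claimed date. A minimal sketch with an illustrative, uncalibrated threshold:

```python
import numpy as np

def sar_corroborates(claimed_date, dates, coh_series, min_drop=0.35):
    """Test a claimed structural change against a SAR coherence time series.

    dates: ordered acquisition dates (e.g., an np.datetime64 array);
    coh_series: mean coherence over the claimed footprint for each
    consecutive acquisition pair, aligned with `dates`.
    Returns True/False, or None if no SAR passes bracket the claim.
    """
    i = int(np.searchsorted(dates, claimed_date))
    if i == 0 or i >= len(coh_series):
        return None                            # no bracketing SAR coverage
    baseline = np.median(coh_series[max(0, i - 3):i])   # pre-event stability
    return (baseline - coh_series[i]) >= min_drop

# A "new ruin" that appears in photographs but triggers no coherence drop in
# the Sentinel-1 record is a candidate for synthetic fabrication.
```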
The advances described above address static scenes. The world, of course, is not static. Researchers are now beginning to extend the same toolkit to casual handheld video, extracting four-dimensional reconstructions — geometry plus motion over time — from single phone video clips. Papers such as MoSca and Shape of Motion are demonstrating the ability to decompose a monocular video into independently moving objects and their trajectories, not merely the static scene geometry.
The intelligence implications extend beyond site modeling to activity pattern analysis. Once the 3D structure of a site has been established, recurring video from the same location — security camera footage, drone surveillance, recurring social media posts — can in principle be registered against that model to track changes in vehicle positions, construction activity, personnel movements, and equipment staging over time. The same Gaussian splatting techniques that enable "Wild Gaussians" — separating a building from its variable lighting conditions across thousands of photographs taken over years — can be adapted to track the arrival and departure of specific objects against a stable 3D background model.
The temporal dimension also works backward. Archived social media photographs taken years before an event of interest can potentially be incorporated into a retroactive reconstruction of what a site looked like at a given time, providing historical baseline models against which current imagery can be compared to detect changes.
X. Privacy, Civil Liberties, and the Absence of Law
The legal and regulatory frameworks governing these capabilities lag the technology by years, and the gap is widening. A tourist photographing a famous landmark and posting the image on Instagram has consented, at most, to the platform's terms of service — terms that do not anticipate the use of that photograph as a data point in an intelligence-grade 3D reconstruction of the surrounding area, complete with the reflections in sunglasses and the geometry of windows visible in the background.
No current statute in the United States specifically addresses the mass aggregation of publicly posted photographs for 3D scene reconstruction. The Computer Fraud and Abuse Act governs unauthorized computer access, not the analytical use of public data. The Fourth Amendment's protections against unreasonable search and seizure have been construed narrowly in the context of publicly accessible spaces. The Electronic Communications Privacy Act does not address metadata aggregation at this scale. At the European level, GDPR's provisions on biometric data and location data offer somewhat stronger protection, but enforcement has not been tested against photogrammetric reconstruction pipelines that do not themselves collect data but analyze existing public collections.
The asymmetry is significant. State and commercial actors with computational resources can now build 3D models of any photographed location on Earth from existing public image archives. Private individuals whose photographs contribute to those models have no notification, no opt-out, and no legal recourse under current frameworks. This is the photogrammetric equivalent of the mass location tracking disclosed in Carpenter v. United States (2018) — in which the Supreme Court held that the government's warrantless collection of cell-site location records violated the Fourth Amendment — but operating on imagery rather than telecommunications metadata, and currently without a comparable legal check.
XI. Operational Security Implications
The operational security implications for military and government personnel are immediate and concrete. A photograph posted from a sensitive installation — even one in which the installation itself is not visible — can contribute geometric and positional data to a reconstruction of the surrounding area. Metadata stripped from the photograph may still leave exploitable artifacts: the angle of shadows constrains the latitude, longitude, and time; the geometry of visible skylines and structures provides cross-referencing anchor points; and the aggregate of many such photographs taken by many individuals over time can yield a high-fidelity reconstruction of areas that no single photograph would reveal.
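The shadow constraint, for instance, is elementary trigonometry: an object of known height and its measured shadow length fix the solar elevation angle, which a solar-position model can then convert into a locus of possible places and times. A toy version of the first step (the solar-position lookup is omitted):

```python
import math

def solar_elevation_from_shadow(object_height_m, shadow_length_m):
    """Solar elevation angle implied by an object and its shadow in a photo.

    elevation = atan(height / shadow_length). Paired with the shadow's
    azimuth and a solar ephemeris, this constrains where and when the
    photo could have been taken, with no metadata required.
    """
    return math.degrees(math.atan2(object_height_m, shadow_length_m))

# A 3 m wall casting a 5.2 m shadow puts the sun at about 30 degrees:
print(round(solar_elevation_from_shadow(3.0, 5.2), 1))   # -> 30.0
```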
The Department of Defense has maintained guidance restricting photography on installations and prohibiting the posting of operationally sensitive imagery on social media, but enforcement is uneven and the underlying threat model has not been updated to account for photogrammetric aggregation of innocuous-seeming photographs. The same techniques that let Cornell researchers reconstruct a Viennese palace from tourist Flickr posts can reconstruct the exterior geometry of any base, facility, or compound from the holiday photographs, fitness app routes, and social media posts of the personnel who work there.
XII. Looking Forward: A Convergent World Model
The technology now in place constitutes what researchers have begun calling a "sensorium" — a continuously updated, increasingly complete, three-dimensional model of the photographed and radar-scanned world. The pieces are all assembled: a planetary archive of photographs already uploaded and searchable; an equally planetary archive of free SAR imagery extending back more than a decade through Sentinel-1 and growing daily; feed-forward 3D foundation models that can turn any collection of those photographs into a coherent geometric model in seconds; SAR-neural fusion frameworks that inject radar-derived geometric priors to constrain and complete what sparse optical imagery cannot; Gaussian splatting rendering that makes the resulting fused models interactable in real time in a browser; diffusion-based in-painting that fills coverage gaps where neither photographs nor radar provide sufficient detail; and now, with MegaDepth-X, the ability to extend photogrammetric reconstruction to the sparse, irregular, noisy image collections that characterize most of the world's surface.
The four-dimensional extension — tracking change over time from video — is underway. The fusion of ground-level, drone, and satellite imagery into unified models is operational. The intelligence community has been funding precisely this capability through WRIVA since 2023. The academic community has solved the remaining algorithmic barrier. The commercial ecosystem — in VFX, video gaming, real estate, autonomous vehicles, and augmented reality — will drive the cost of the necessary computation toward zero.
What has not kept pace is the governance framework. The photographic commons of the internet, accumulated over two decades of social media, smartphone ubiquity, and platform-enabled sharing, is about to be transformed into something its contributors did not consent to: the raw material for a god's-eye view of the world. The technology is ready. The law is not. The policy is not. And the public, whose photographs make it possible, is largely unaware that the transformation is underway.
Verified Sources & Formal Citations
- Li, Y., Xiangli, Y., Averbuch-Elor, H., Snavely, N., & Cai, R. (2026). Long-tail Internet photo reconstruction. CVPR 2026. arXiv:2604.22714. https://arxiv.org/abs/2604.22714 | Project page: https://megadepth-x.github.io/
- Wang, J., et al. (2025). VGGT: Visual Geometry Grounded Transformer. CVPR 2025 Best Paper Award. GitHub: https://github.com/facebookresearch/vggt
- Tung, J., Chou, G., Cai, R., Yang, G., Zhang, K., Wetzstein, G., Hariharan, B., & Snavely, N. (2024). MegaScenes: Scene-level view synthesis at scale. ECCV 2024. arXiv:2406.11819. https://arxiv.org/abs/2406.11819 | Dataset: https://megascenes.github.io/
- Office of the Director of National Intelligence / IARPA. (2023, July 12). IARPA launches effort to develop photorealistic site models [Press release]. https://www.dni.gov/index.php/newsroom/press-releases/press-releases-2023/3707-iarpa-launches-effort-to-develop-photorealistic-site-models
- IARPA. (2023). WRIVA program page. https://www.iarpa.gov/research-programs/wriva
- Intelligence Community News. (2023, July 13). IARPA launches WRIVA program. https://intelligencecommunitynews.com/iarpa-launches-wriva-program/
- Dodd, T. (2023, July 14). Why US intelligence wants a new way to make virtual, 3D models. Popular Science. https://www.popsci.com/technology/iarpa-virtual-models/
- Lee, J., et al. (2025). Skyfall-GS: Synthesizing immersive 3D urban scenes from satellite imagery. arXiv:2510.15869. https://arxiv.org/abs/2510.15869
- The Decoder staff. (2025, November 2). Skyfall-GS turns satellite images into walkable 3D cities. The Decoder. https://the-decoder.com/skyfall-gs-turns-satellite-images-into-walkable-3d-cities/
- Kerbl, B., Kopanas, G., Leimkühler, T., & Drettakis, G. (2023). 3D Gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics (SIGGRAPH 2023). https://repo-sam.inria.fr/fungraph/3d-gaussian-splatting/
- Martin-Brualla, R., Radwan, N., Sajjadi, M. S. M., Barron, J. T., Dosovitskiy, A., & Duckworth, D. (2021). NeRF in the Wild: Neural radiance fields for unconstrained photo collections. CVPR 2021. https://nerf-w.github.io/
- Mildenhall, B., Srinivasan, P. P., Tancik, M., Barron, J. T., Ramamoorthi, R., & Ng, R. (2020). NeRF: Representing scenes as neural radiance fields for view synthesis. ECCV 2020. https://www.matthewtancik.com/nerf
- Wang, S., et al. (2024). DUSt3R: Geometric 3D vision made easy. CVPR 2024. https://dust3r.europe.naverlabs.com/
- Leroy, V., et al. (2025). MASt3R: Matching and Stereo 3D Reconstruction. CVPR 2025. https://github.com/naver/mast3r
- Li, Z., & Snavely, N. (2018). MegaDepth: Learning single-view depth prediction from internet photos. CVPR 2018. https://www.cs.cornell.edu/projects/megadepth/
- ULTRRA Challenge. (2025). WRIVA workshop at WACV 2025. https://sites.google.com/view/ultrra-wacv-2025
- MegaDepth-X dataset (Hugging Face). https://huggingface.co/datasets/y-u-a-n-l-i/MegaDepth-X
- MegaScenes dataset (GitHub / AWS Open Data). https://github.com/MegaScenes/dataset
- Sidhu, B. (2026, May). Building a 3D model of the world from internet photos. Spatial Intelligence (Substack). https://www.spatialintelligence.ai/p/building-a-3d-model-of-the-world
- Chen, S. (2025, September 23). VGGT: How feed-forward 3D perception is redefining scene reconstruction. Medium / deMISTify. https://medium.com/demistify/vggt-how-feed-forward-3d-perception-is-redefining-scene-reconstruction
- Aira, L. S., Facciolo, G., & Ehret, T. (2024). Gaussian splatting for efficient satellite image photogrammetry. arXiv:2412.13047. https://arxiv.org/abs/2412.13047
- Aristotle University of Thessaloniki et al. (2023). Integrating Earth observation IMINT with OSINT data: A case study of the Ukraine–Russia war. Security and Defence Quarterly. https://securityanddefence.pl/…multisource,170901,0,2.html
- Esri / HALO Trust. (2022). Open-source data documents war atrocities in Ukraine. https://www.esri.com/about/newsroom/blog/ukraine-open-source-intelligence
- GeoConfirmed project page. https://geoconfirmed.org/
- Bellingcat Online Investigation Toolkit (GitBook, current version). https://bellingcat.gitbook.io/toolkit
- State of Surveillance. (2026, January 9). Geolocation OSINT: How investigators find where photos were taken. https://stateofsurveillance.org/articles/technical/geolocation-osint-photo-location-tracking/
- CVPR 2026 poster listing. Long-tail Internet photo reconstruction. https://cvpr.thecvf.com/virtual/2026/poster/37828
- Carpenter v. United States, 585 U.S. 296 (2018). [Cell-site location data / Fourth Amendment precedent.] https://www.supremecourt.gov/opinions/17pdf/16-402_h315.pdf
- Li, D., Yao, C., Mao, T., Bao, J., & Sun, H. (2026). Urban neural surface reconstruction from constrained sparse aerial imagery with 3D SAR fusion. arXiv:2601.22045. https://arxiv.org/abs/2601.22045
- Scher, C., & Van Den Hoek, J. (2025). Nationwide conflict damage mapping with interferometric synthetic aperture radar: A study of the 2022 Russia–Ukraine conflict. Science of Remote Sensing, 11, 100217. https://doi.org/10.1016/j.srs.2025.100217
- Scher, C., & Van Den Hoek, J. (2025). Active InSAR monitoring of building damage in Gaza during the Israel-Hamas War. arXiv:2506.14730. https://arxiv.org/abs/2506.14730
- Building Damage Assessment Portal — open-access Sentinel-1 InSAR CCD data for Ukraine and Gaza. https://rccd-damage-portal.netlify.app/
- Ballinger, O. (2024). Open access battle damage detection via pixel-wise T-Test on Sentinel-1 imagery. arXiv:2405.06323. Published in Remote Sensing of Environment, 2025. https://www.sciencedirect.com/science/article/pii/S0034425725004298
- Mavroulis, S., et al. (2024). Cultural heritage in times of crisis: Damage assessment in urban areas of Ukraine using Sentinel-1 SAR data. ISPRS International Journal of Geo-Information, 13(9), 319. https://www.mdpi.com/2220-9964/13/9/319
- ESA Copernicus Sentinel-1 constellation overview. https://sentinels.copernicus.eu/copernicus/sentinel-1
- Copernicus Data Space Ecosystem — Sentinel-1 free access portal. https://dataspace.copernicus.eu/data-collections/copernicus-sentinel-missions/sentinel-1
- Sentinel-1 AWS Open Data Registry (Element 84). https://registry.opendata.aws/sentinel-1/
- ICEYE Open Data Initiative (no-paywall SAR archive on AWS). https://www.iceye.com/open-data-initiative
- Janes. (2026, April 30). ICEYE launches six new SAR satellites [SpaceX Transporter-16 rideshare, March 30, 2026]. https://www.janes.com/osint-insights/defence-news/air/update-iceye-launches-six-new-sar-satellites
- Synthetic Aperture Radar news. (2024, December 17). NRO extends SAR contracts to Capella, ICEYE, and Umbra, Stage III SCE BAA, July 2024–July 2026. https://syntheticapertureradar.com/nro-extends-sar-contracts-to-capella-iceye-and-umbra-advancing-commercial-radar-strategy/
- Atlas Institute for International Affairs. (2025, August 6). Open eyes in the high north: OSINT capabilities including Sentinel-1, ICEYE, and Capella Space. https://atlasinstitute.org/open-eyes-in-the-high-north-open-source-intelligence-capabilities-and-constraints/
- Capella Space. (2025, April 7). How SAR is reshaping the Earth observation industry in 2025. https://www.capellaspace.com/blog/how-sar-is-reshaping-the-earth-observation-industry-in-2025
- Afrosheh, S., & Askari, M. (2024). Geospatial data fusion: Combining LiDAR, SAR, and optical imagery with AI for enhanced urban mapping. arXiv:2412.18994. https://arxiv.org/abs/2412.18994
- Mouget, A., et al. (2026). Video-based 3D reconstruction: A review of photogrammetry and visual SLAM approaches. Journal of Imaging, 12(3), 128. https://www.mdpi.com/2313-433X/12/3/128
- U.S. Army Corps of Engineers / USPTO. Method of processing full motion video data for photogrammetric reconstruction. US Patent 10,553,022. https://image-ppubs.uspto.gov/dirsearch-public/print/downloadPdf/10553022
- Shumailov, I., Shumaylov, Z., Zhao, Y., Papernot, N., Anderson, R., & Gal, Y. (2024). AI models collapse when trained on recursively generated data. Nature, 631, 755–759. https://doi.org/10.1038/s41586-024-07566-y
- Arabi, M., & Jalili, M. (2025). Future of AI models: A computational perspective on model collapse. arXiv:2511.05535. https://arxiv.org/abs/2511.05535
- Sarkar, S., et al. (2024). Qualitative failures of image generation models and their application in detecting deepfakes. arXiv:2304.06470. https://arxiv.org/html/2304.06470v6 [vanishing point, parallax, and projective geometry failures in AI-generated images]
- Okumura, R., Shiohara, K., & Yamasaki, T. (2024). ControlVP: Interactive geometric refinement of AI-generated images with consistent vanishing points. arXiv:2512.07504. https://arxiv.org/abs/2512.07504 [documents systematic vanishing point inconsistency in Sora, Stable Diffusion outputs]
- Yerzhanuly, M. (2025). Deepfake geography: Detecting AI-generated satellite images. arXiv:2511.17766. https://arxiv.org/abs/2511.17766 [ViT detectors achieve 95.11% accuracy vs. CNN; terrain-level inconsistencies as detection signals]
- Zhao, B., et al. (2021). A growing problem of 'deepfake geography': How AI falsifies satellite images. Cartography and Geographic Information Science. University of Washington news release: https://www.washington.edu/news/2021/04/21/a-growing-problem-of-deepfake-geography-how-ai-falsifies-satellite-images/
- Global Investigative Journalism Network. (2025, December 17). Deepfake geography: How AI can now falsify satellite images. https://gijn.org/stories/deepfake-geography-how-ai-can-now-falsify-satellite-images/