I. Rome, Rebuilt in a Day
In 2009, a team led by computer scientist Sameer
Agarwal at the University of Washington downloaded thousands of tourist
snapshots of Rome off Flickr and, by reverse-engineering the position
and orientation of every camera in three-dimensional space, stitched
them into a coherent 3D model of the Eternal City. The technique —
structure-from-motion (SfM) — was not new, but the scale was startling.
They called the paper "Building Rome in a Day," and the title was
literal: processing took roughly 24 hours on a cluster. The paper
applied skeletal sets and bundle adjustment at city scale,
techniques that remain the architectural foundation of photogrammetric
mapping systems, including Google Street View, today.
By 2015, a University of North Carolina team had extended the same
approach to reconstruct the entire photographed planet from Flickr
imagery in approximately six days. The computer vision community had
dramatically scaled the pipeline. And yet it had run, with increasing
precision, into the same wall every time it tried to go further.
That wall has a name: the long-tail problem. The photographic record
of the internet is radically uneven. A handful of icons — the Eiffel
Tower, the Colosseum, Times Square — are represented by millions of
overlapping images from thousands of angles, providing the dense
coverage these algorithms require. But everywhere else — the local fort,
the suburban overpass, the contested building on the edge of a conflict
zone — there may be five photographs, taken by different phones over
several years, at wildly inconsistent angles and in different lighting
conditions. Classical and learned 3D methods collapse under these
conditions. The algorithms simply cannot stitch what they cannot match.
For OSINT practitioners, that wall has been the binding constraint
for a decade. The most consequential locations — precisely those where
journalists, researchers, or intelligence analysts lack physical access —
are also the least densely photographed. The tail wags the dog.
A paper published in April 2026, and accepted at CVPR 2026, appears to have punched through that wall.
II. The Long-Tail Problem, Solved by Subtraction
MegaDepth-X, produced by Yuan Li, Yuanbo Xiangli,
Hadar Averbuch-Elor, Noah Snavely, and Ruojin Cai at Cornell University,
attacks the training-data bootstrapping problem with an elegant trick
that, in retrospect, seems almost obvious. You cannot train a model to
reconstruct sparse, poorly-covered scenes because there is no ground
truth against which to verify correctness — nothing else can reconstruct
them, so there are no answer keys. The team's insight: simulate the
long tail from the data you already have.
They took the dense, well-reconstructed internet landmarks where
high-quality 3D ground truth exists, and deliberately discarded most of
the photographs — throwing away images to simulate the sparse, uneven,
poorly-connected coverage characteristic of obscure real-world sites.
The deliberately impoverished image sets were then used to train 3D
foundation models, giving those models supervised learning experience on
conditions that mimic the actual long tail of the internet photograph
corpus.
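The subtraction idea can be sketched in a few lines. This is a minimal illustration only: the function name `simulate_long_tail`, the `keep_fraction` parameter, and the uniform random sampling are assumptions made for the example, standing in for the paper's own sparsity-aware sampling strategy, which is not reproduced here.

```python
import random


def simulate_long_tail(image_ids, keep_fraction=0.05, min_keep=2, seed=0):
    """Keep only a small random subset of a densely photographed scene.

    Hypothetical helper: uniform sampling is an assumption, not the
    sampling strategy used by the MegaDepth-X authors.
    """
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    rng = random.Random(seed)  # seeded so the subset is reproducible
    n_keep = max(min_keep, round(len(image_ids) * keep_fraction))
    n_keep = min(n_keep, len(image_ids))
    return sorted(rng.sample(list(image_ids), n_keep))


# A landmark with 1,000 registered photos, thinned to a handful.
dense = [f"img_{i:04d}.jpg" for i in range(1000)]
sparse = simulate_long_tail(dense, keep_fraction=0.005)
```

Because the discarded photographs were part of a trusted dense reconstruction, the camera poses and depth for the surviving five images are still known, which is what makes the thinned set usable as supervised training data.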
The pipeline combines MASt3R-SfM for initial reconstruction with
Doppelgangers++ — a transformer trained to detect and resolve bilateral
symmetry ambiguities, the class of error that causes architecturally
symmetric buildings like Vienna's Belvedere Palace to fold in on
themselves in reconstruction — and COLMAP multi-view stereo for dense
depth maps. The result is a large-scale dataset of clean 3D
reconstructions with dense depth, paired with a sparsity-aware sampling
strategy.
Fine-tuning two leading 3D foundation models on MegaDepth-X produced
striking results. On the hardest sparse-scene benchmarks, the
state-of-the-art model π³ (Pi Cube) moved from 75% rotation accuracy to
86% after fine-tuning — a significant jump on a metric where every
percentage point represents a class of real-world scenes that previously
failed. The paper's conclusion is direct: the long-tail regime of
internet photo collections, comprising the vast majority of the world's
photographable surface, is now tractable.
"We believe that tackling this long-tail regime represents one of the next frontiers for 3D foundation models."
— Li et al., "Long-Tail Internet Photo Reconstruction," CVPR 2026 (arXiv:2604.22714)
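For readers unfamiliar with the metric, rotation accuracy is commonly computed as the fraction of camera rotations whose angular error against a reference falls under a threshold. The sketch below assumes that definition and an illustrative 5-degree threshold; the paper's exact protocol and threshold are not stated here, and the helper names are hypothetical.

```python
import math


def rot_z(deg):
    """3x3 rotation about the z axis, used only to build test inputs."""
    a = math.radians(deg)
    c, s = math.cos(a), math.sin(a)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def rotation_error_deg(r_pred, r_true):
    """Geodesic angle between two rotation matrices, in degrees."""
    # trace(R_true^T R_pred) equals the element-wise product summed.
    trace = sum(r_true[i][j] * r_pred[i][j] for i in range(3) for j in range(3))
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))  # clamp rounding noise
    return math.degrees(math.acos(cos_angle))


def rotation_accuracy(pairs, threshold_deg=5.0):
    """Fraction of (predicted, reference) pairs within the threshold."""
    hits = sum(rotation_error_deg(p, t) <= threshold_deg for p, t in pairs)
    return hits / len(pairs)


# Four cameras with errors of 2, 4, 10, and 30 degrees.
pairs = [(rot_z(a + err), rot_z(a)) for a, err in [(0, 2), (40, 4), (90, 10), (170, 30)]]
accuracy = rotation_accuracy(pairs, threshold_deg=5.0)
```

Under this definition, a move from 75% to 86% means that roughly one in nine cameras that previously landed outside the threshold now lands inside it.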
III. The Technological Stack Beneath the Breakthrough
MegaDepth-X did not arrive from nowhere. It
represents the capstone of a 17-year arc of compounding advances, each
of which expanded what was achievable from publicly available imagery.
| Year | Advance | Significance for OSINT |
|---|---|---|
| 2009 | Building Rome in a Day (Agarwal et al., UW) | Demonstrated landmark-scale 3D reconstruction from uncontrolled internet photos; established SfM toolkit |
| 2015 | Planet-scale Flickr reconstruction (UNC) | Same techniques at global scale in ~6 days; exposed long-tail limitation |
| 2018 | MegaDepth (Li & Snavely, Cornell) | Used dense reconstructions as training data for single-image depth prediction; enabled per-pixel geometry from one photo |
| 2020 | NeRF (Mildenhall et al., UCB / Google) | Neural representation of entire scenes, including lighting and view-dependent effects; set new quality ceiling |
| 2021 | NeRF in the Wild (Martin-Brualla et al., Google) | Handled uncontrolled internet photos with tourists; enabled temporal lighting disentanglement |
| 2023 | 3D Gaussian Splatting (Kerbl et al.) | 100+ FPS real-time rendering in browser; made scenes explicitly editable; replaced NeRF's computation bottleneck |
| 2024 | MegaScenes dataset (Tung, Snavely et al., Cornell/Stanford/Adobe) | 430K scenes, 8.8M images, 100K+ SfM reconstructions from Wikimedia Commons; infrastructure for foundation model training |
| 2024 | Wild Gaussians; DUSt3R / MASt3R (Leroy et al.) | Real-time browser lighting control; end-to-end feed-forward reconstruction bypassing SfM entirely |
| 2025 | VGGT (Wang et al., CVPR Best Paper) | 1.2B-parameter transformer produces cameras, depth maps, and 3D point maps in a single forward pass in seconds |
| 2025 | π³ / Pi Cube | Removed VGGT's anchor-photo dependency; robust reconstruction without a reference frame |
| 2025 | Skyfall-GS (Lee et al.) | Satellite imagery → walkable city-block 3D scenes using Gaussian splatting + diffusion gap-filling; no street-level data required |
| 2026 | MegaDepth-X (Li, Snavely et al., CVPR 2026) | Long-tail regime unlocked: sparse, noisy, uneven internet photo sets now yield coherent 3D models |
| 2014– | Sentinel-1 (ESA Copernicus), free open archive | Decade-long, all-weather C-band SAR archive of entire Earth; basis for InSAR/CCD damage detection worldwide |
| 2024– | Commercial SAR constellation: ICEYE (40+ sats), Capella (15 sats), Umbra; NRO SCE BAA contracts | Sub-25cm resolution, sub-3hr revisit; bistatic 3D collection; open data archives with no paywall |
| 2026 | SAR-Neural Surface Fusion (Li et al., arXiv:2601.22045) | First framework fusing 3D SAR point clouds with neural surface reconstruction; resolves sparse-optical ambiguity with radar geometry |
The transition from NeRF to 3D Gaussian splatting is particularly
consequential for operational use. Neural radiance fields encoded an
entire scene into a neural network's weights, delivering beautiful
results at prohibitive computational cost — rendering a single frame
required querying the network at millions of points. Gaussian splatting
instead represents scenes as clouds of oriented, semi-transparent
ellipsoidal primitives called Gaussians. Because GPUs can rasterize
these directly, rendering rates exceeding 100 frames per second in a
browser became achievable. Crucially, the scene representation is
explicit and editable: once you have Gaussians, every downstream
manipulation — segmentation, change detection, temporal comparison —
becomes tractable in a way it never was with implicit neural networks.
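Why an explicit representation is easier to edit can be seen in the blending step itself. The sketch below shows front-to-back alpha compositing of the primitives that overlap a single pixel, the standard blending rule for splatting renderers. It is a simplified illustration: the tuple layout and function name are assumptions, and a real renderer also projects each 3D Gaussian to the screen and evaluates its falloff to obtain the per-pixel alpha.

```python
def composite_front_to_back(splats):
    """Blend the splats covering one pixel.

    splats: iterable of (depth, alpha, (r, g, b)); layout is hypothetical.
    Returns the blended color and the remaining transmittance.
    """
    color = [0.0, 0.0, 0.0]
    transmittance = 1.0  # fraction of light still passing through
    for _, alpha, rgb in sorted(splats, key=lambda s: s[0]):  # nearest first
        weight = alpha * transmittance
        for k in range(3):
            color[k] += weight * rgb[k]
        transmittance *= 1.0 - alpha
    return tuple(color), transmittance


# A half-transparent red splat in front of an opaque blue one.
pixel = [(2.0, 1.0, (0.0, 0.0, 1.0)), (1.0, 0.5, (1.0, 0.0, 0.0))]
blended, remaining = composite_front_to_back(pixel)

# Editing the scene is just editing the list: drop the red splat.
edited, _ = composite_front_to_back([s for s in pixel if s[2] != (1.0, 0.0, 0.0)])
```

Removing or relabeling a primitive is a list operation here, whereas the equivalent edit in a NeRF would mean changing network weights that encode the whole scene at once.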
VGGT, which won the Best Paper Award at CVPR 2025 — the premier
computer vision conference, with roughly 13,000 submitted papers —
demonstrated that a 1.2-billion-parameter transformer trained on diverse
scene datasets could take any pile of photos and return camera poses,
per-image depth maps, and a unified 3D point cloud in a single forward
pass, in seconds. No iterative bundle adjustment. No feature-matching
pipeline. One pass through the network.
Key Concept: Feed-Forward 3D Reconstruction
Classical photogrammetry required iterative optimization: feature
matching across image pairs, geometric constraint solving, and bundle
adjustment — a process that could take hours or days for a large scene.
Modern feed-forward models like VGGT accomplish the same in a single
neural network forward pass lasting seconds. When combined with
MegaDepth-X fine-tuning, this speed now extends to the sparse, irregular
image sets characteristic of most of the world's photographable
surface.
IV. The Intelligence Community's Parallel Track
The timing of IARPA's WRIVA program is instructive.
On July 12, 2023 — well before MegaDepth-X and while VGGT was still in
development — the Intelligence Advanced Research Projects Activity, the
advanced research arm of the Office of the Director of National
Intelligence, publicly launched a 42-month effort with a strikingly
similar objective.
The Walk-through Rendering from Images of Varying Altitude (WRIVA)
program, as described in the official ODNI press release, "seeks to
produce innovations that will advance 3-D site modelling capabilities
far beyond today's state of the art, giving personnel virtual 'ground
truth' with unrivaled insights into locations that would be difficult,
if not impossible, to view." The program's stated goal is to build
photorealistic, navigable 3D site models from a highly limited corpus of
imagery — ground-level photographs, traffic camera footage, drone
shots, and satellite data — in precisely those areas where dense
coverage does not exist.
"WRIVA's ability to help users visually see and plan a mission or
activity, despite limited access to imagery, will be a game-changer for
the IC and others who require a deep grasp of the physical environment
they will be operating in," said WRIVA Program Manager Ashwini
Deshpande, who came to IARPA from the National Geospatial-Intelligence
Agency. "And while it will not be better than reality, it might mean the
difference between mission failure and success."
IARPA awarded WRIVA research contracts to organizations spanning 34
institutions, non-profits, and businesses. The test and evaluation team
consists of Johns Hopkins University Applied Physics Laboratory, the
MITRE Corporation, and MIT Lincoln Laboratory — three of the
organizations most deeply embedded in the intelligence community's
technical infrastructure. The program's public challenge data, released
through the ULTRRA workshop at WACV 2025, includes images calibrated
using RTK-corrected GPS coordinates to centimeter accuracy.
The Abbottabad comparison is apt and has been drawn explicitly in
public reporting. In that 2011 operation, the CIA famously built a
physical scale model of bin Laden's compound for mission rehearsal.
WRIVA would replace physical models with photorealistic navigable
digital replicas built from sparse, heterogeneous, publicly available
imagery — extending the same capability to any location on Earth that
appears in photographs at any altitude.
MegaDepth-X, it should be noted, was funded in part by Korea's
National Research Foundation AI Lab Project — a signal that multiple
state actors are simultaneously pursuing the same capability along
parallel tracks, whether through academic grants, intelligence programs,
or commercial development.
V. From Orbit to the Ground: Closing the Vertical Gap
One of the persistent limitations of satellite-based
site modeling has been the altitude discontinuity: satellite imagery
captures rooftops and urban canopies, but facades, entrances,
courtyards, and street-level geometry remain occluded or absent.
Skyfall-GS, published by Lee and colleagues in October 2025, directly
addresses this gap.
The system takes multi-view satellite imagery as its sole input and
produces immersive, navigable city-block-scale 3D scenes by combining 3D
Gaussian splatting — which provides coarse geometric scaffolding from
the overhead views — with a text-to-image diffusion model that
synthesizes realistic facade and street-level appearance where the
satellite data provides no direct evidence. A curriculum-learning
refinement strategy progressively lowers the virtual camera from
near-nadir (85 degrees) to oblique (45 degrees) across five passes,
generating 54 synthetic views per pass with text-guided appearance
synthesis filling occluded regions.
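The camera curriculum is easy to picture as a schedule. The sketch below assumes the five passes are evenly spaced between 85 and 45 degrees and that the 54 views in each pass are spread uniformly in azimuth around the scene; both spacings are illustrative assumptions, not details taken from the Skyfall-GS paper, and the function names are hypothetical.

```python
def elevation_schedule(start_deg=85.0, end_deg=45.0, passes=5):
    """Camera elevation for each refinement pass (linear spacing assumed)."""
    step = (end_deg - start_deg) / (passes - 1)
    return [start_deg + i * step for i in range(passes)]


def orbit_views(elevation_deg, views=54):
    """(elevation, azimuth) pairs for one pass, uniform azimuth assumed."""
    return [(elevation_deg, 360.0 * k / views) for k in range(views)]


schedule = elevation_schedule()
all_views = [view for elev in schedule for view in orbit_views(elev)]
```

Under these assumptions the refinement visits 270 synthetic viewpoints in total, each pass starting from a scene already improved by the steeper pass before it.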
The result is a free-flight walkable 3D scene of an urban area
requiring no street-level photography, no LiDAR, and no aerial survey —
only the satellite imagery that is commercially available for virtually
any populated place on Earth. Skyfall-GS does not require costly 3D
annotations and allows for real-time, immersive 3D exploration of the
final product.
When Skyfall-GS (which attacks the coverage problem from above) is
considered alongside MegaDepth-X (which attacks it from ground-level
sparse photo collections), the two systems converge on the same target
from opposite directions: a photorealistic, navigable, continuously
updated 3D representation of the world built from whatever imagery
happens to be available — satellite, tourist snapshot, social media
post, security camera frame — with generative diffusion models filling
the gaps.
VI. The Radar Dimension: Open SAR as the All-Weather Geometric Backbone
The photogrammetric pipeline described above shares a
fundamental constraint with all optical sensing: it requires light.
Clouds, darkness, smoke, and haze all degrade or destroy optical
imagery. Over a conflict zone, over a denied area in persistent cloud
cover, or over a facility actively obscured, the photograph corpus on
which photogrammetric reconstruction depends may be sparse not merely
because the site is obscure but because the sky is reliably opaque.
Synthetic aperture radar removes that constraint entirely — and a
growing archive of open, freely downloadable SAR data is now available
to anyone with an internet connection.
SAR operates by transmitting microwave pulses from an airborne or
spaceborne platform, illuminating the surface with its own energy and
recording the complex backscattered return. Because the wavelengths used
(centimeters for C-band, decimeters for L-band) pass through cloud
cover and precipitation with little attenuation, and because the sensor
supplies its own illumination, SAR produces coherent, repeatable imagery
day or night under nearly all atmospheric conditions. More
importantly for 3D reconstruction, SAR records a physically different
quantity than a camera: rather than the color and texture of a surface
as seen from a particular angle in particular lighting, it records the
complex amplitude and phase of the radar return, encoding information
about surface geometry, dielectric properties, and structural coherence
that optical sensors simply cannot see.
The cornerstone of open SAR access is the European Space Agency's
Copernicus Sentinel-1 constellation. The program has launched four
satellites: Sentinel-1A (April 2014), Sentinel-1B (April 2016, lost to a
power failure in December 2021), Sentinel-1C (December 2024), and
Sentinel-1D (November 2025), providing C-band SAR imagery at
resolutions down to 5 meters with swaths up to 400 kilometers. With the
full constellation operational, revisit times over most of the world's
land surface are measured in days. All Sentinel-1 data is free, full,
and open — accessible without registration through the Copernicus Data
Space Ecosystem and mirrored as a cloud-optimized GeoTIFF archive on
Amazon Web Services. Since the program's inception, the former Sentinel
Open Access Hub served nearly 760,000 registered users and disseminated
590 petabytes of data before transitioning to the new ecosystem in 2023.
This represents an essentially inexhaustible, freely available archive
of all-weather radar imagery of the entire Earth's surface, extending
back more than a decade.
The commercial SAR sector has expanded dramatically in parallel.
ICEYE, the Finnish firm that pioneered the smallsat SAR market, now
operates the world's largest commercial SAR constellation — more than 40
satellites as of 2026, including six launched on a SpaceX
Transporter-16 rideshare mission on March 30, 2026, delivering
25-centimeter resolution imagery with sub-3-hour revisit capability over
key regions. In December 2024, ICEYE announced an open data initiative
making portions of its archive available through Amazon Web Services
with no registration and no paywall. Capella Space operates 15
satellites capable of sub-25-centimeter spotlight imagery, and has
demonstrated bistatic collection — two satellites imaging the same scene
simultaneously from different angles — a capability that directly
enables 3D reconstruction from radar. The National Reconnaissance Office
renewed Stage III contracts with Capella, ICEYE US, and Umbra in July
2024 through July 2026, confirming that the intelligence community is
actively operationalizing commercial SAR alongside its own classified
assets.
Open SAR Archives: What Is Available Today
Sentinel-1 (ESA/Copernicus): free, full and open, 5m resolution, all
land surfaces, archive from 2014. Access: dataspace.copernicus.eu and
AWS Registry of Open Data.
ICEYE Open Data Initiative: no registration, no paywall, full archive
including GRD, SLC, and COG formats, accessible via STAC browser and S3.
ALOS-2 (JAXA): research datasets available under application;
specialized L-band penetration for vegetation and subsurface. JAXA's
PALSAR-2 global mosaics are publicly available annually.
Alaska Satellite Facility: free access to Sentinel-1, ALOS PALSAR, and
ERS data via Vertex search portal.
What SAR Adds That Photographs Cannot
SAR data contributes to the reconstruction pipeline in ways that are
qualitatively distinct from optical photography. Three capabilities are
particularly significant for OSINT applications.
First, SAR provides absolute geometric constraints that optical
reconstruction lacks. The phase of the SAR return encodes path length to
a fraction of the radar wavelength; when two SAR acquisitions of the same
scene are differenced interferometrically, a technique called InSAR, the
resulting phase difference map measures the change in path length along
the radar line of sight. When the two acquisitions are separated by a
spatial baseline, that phase encodes terrain and building height, and
across a city it yields a digital elevation model with meter-class
accuracy from a single interferometric pair. When stacked over time using
persistent scatterer techniques, InSAR time series can estimate building
heights and measure subsidence, uplift, and structural deformation to
millimeter precision. This is the geometric information that
photogrammetry requires many overlapping photographs to approximate; SAR
provides it directly from a side-looking acquisition geometry.
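The sensitivity of interferometric phase follows from simple arithmetic. For a repeat-pass pair, a change in range of d along the radar line of sight shifts the two-way phase by 4πd/λ. The sketch below applies that relation to Sentinel-1's C-band carrier (5.405 GHz); the function name is hypothetical, the sign convention varies between processors, and real workflows must also unwrap the phase and remove topographic and atmospheric terms.

```python
import math

SPEED_OF_LIGHT_M_S = 299_792_458.0
C_BAND_WAVELENGTH_M = SPEED_OF_LIGHT_M_S / 5.405e9  # about 5.55 cm


def los_displacement_m(delta_phase_rad, wavelength_m=C_BAND_WAVELENGTH_M):
    """Line-of-sight displacement implied by an unwrapped phase change.

    The factor 4*pi (not 2*pi) accounts for the two-way radar path.
    """
    return wavelength_m * delta_phase_rad / (4.0 * math.pi)


# One full interferometric fringe corresponds to half a wavelength.
one_fringe_m = los_displacement_m(2.0 * math.pi)
```

One fringe at C-band is therefore a little under 2.8 cm of line-of-sight motion, which is why small fractions of a fringe translate into millimeter-level deformation estimates.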
Second, SAR enables coherent change detection (CCD) — the detection
of subtle structural changes between two acquisitions by measuring the
loss of phase coherence between them. When a building is undamaged
between two passes, the complex SAR signal coherence between those
passes is high: the same scatterers return the same phase. When a
building is damaged, collapses, or has its rubble disturbed, coherence
drops sharply. This is a physically grounded, weather-independent,
day-and-night-capable damage signal that optical imagery cannot
replicate.
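The coherence signal described above has a compact definition: the magnitude of the normalized complex cross-correlation between two co-registered acquisitions, estimated over a small window of pixels. The sketch below uses plain Python complex numbers and synthetic four-pixel windows; the function name and the sample values are illustrative assumptions, and operational pipelines work on co-registered single-look complex rasters instead.

```python
import cmath
import math


def coherence(window_a, window_b):
    """Sample coherence of two equally sized windows of complex pixels."""
    cross = sum(a * b.conjugate() for a, b in zip(window_a, window_b))
    power = math.sqrt(
        sum(abs(a) ** 2 for a in window_a) * sum(abs(b) ** 2 for b in window_b)
    )
    return abs(cross) / power if power > 0.0 else 0.0


before = [cmath.rect(1.0, 0.2)] * 4

# Undamaged: every scatterer shifts by the same phase between passes.
stable = [pixel * cmath.rect(1.0, 0.3) for pixel in before]

# Disturbed: scatterer phases are scrambled between passes.
disturbed = [cmath.rect(1.0, k * math.pi / 2.0) for k in range(4)]

gamma_stable = coherence(before, stable)
gamma_disturbed = coherence(before, disturbed)
```

A uniform phase shift leaves coherence at 1, while scrambled phases drive it toward 0; damage maps are built by thresholding that drop building by building.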
Third, the temporal density of SAR archives gives it a
change-detection cadence that no photograph corpus can match. Sentinel-1
acquires global coverage every few days regardless of cloud cover. A
Gaussian splatting model trained on sparse optical photography provides
the photorealistic texture and appearance; the SAR time series provides
the continuous structural monitoring layer — flagging when anything in
the scene has physically changed between the two most recent passes.
SAR-Neural Reconstruction Fusion: A January 2026 Landmark
A paper published on arXiv in January 2026 by Da Li, Chen Yao, and
colleagues — "Urban Neural Surface Reconstruction from Constrained
Sparse Aerial Imagery with 3D SAR Fusion" — represents the first
framework to formally combine 3D SAR point clouds with aerial optical
imagery for neural surface reconstruction. The paper's opening framing
is directly relevant to the OSINT context: neural surface reconstruction
from multi-view aerial imagery suffers from geometric ambiguity under
sparse-view conditions — precisely the condition that characterizes the
intelligence-relevant long tail. Their solution: inject radar-derived
geometric priors directly into the neural surface reconstruction
pipeline, using the SAR point cloud to guide structure-aware ray
selection and adaptive sampling. The result is markedly improved
reconstruction accuracy, completeness, and robustness compared with
single-modality (optical only) baselines, under the highly sparse and
oblique-view conditions that characterize the practical limits of
available imagery.
The physical intuition is straightforward, and will be immediately
familiar to anyone with radar engineering experience: building facades,
edges, and window recesses exhibit strong, reproducible scattering
responses in SAR imagery because urban structures are highly efficient
radar reflectors. Double-bounce returns from wall-ground junctions,
trihedral corner reflections from building edges, and distributed
scattering from rough surfaces all produce characteristic signatures.
Where a sparse optical image set may leave a building's geometry
ambiguous, the SAR point cloud derived from even a single collection
pass provides robust geometric anchoring for the neural surface
reconstruction. The two modalities are physically complementary: SAR
sees geometry through the physics of microwave scattering; optical
imagery captures the photorealistic texture and color that SAR cannot
provide.
SAR in Active Conflict: Sentinel-1 as a War Crimes Monitor
The operational precedent for open SAR data in OSINT already exists,
and is substantial. Beginning with the Russian invasion of Ukraine in
February 2022, researchers at Oregon State University and elsewhere
demonstrated that Sentinel-1 InSAR coherent change detection could map
building damage across the entire country from freely available radar
imagery. The resulting paper, published in the journal Science of Remote
Sensing in 2025, produced nationwide damage data with three-month
latency from a purely open-source pipeline. The same team's follow-on
work addressed Gaza, analyzing 321 openly accessible Sentinel-1 SAR
images acquired before and during the conflict to produce more than
3,200 coherence images and track weekly damage trends across 330,079
building footprints. Their long temporal-arc CCD approach detected 92.5%
of damage labels in reference data from the United Nations Satellite
Center with a false positive rate of only 1.2% — a performance level
that provides legally useful evidence from satellites costing nothing to
access.
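The two headline figures measure different things, and it helps to see how each is computed. The counts below are hypothetical, chosen only so that the arithmetic reproduces the quoted rates; they are not the study's data, and the function name is an assumption made for the example.

```python
def detection_metrics(true_pos, false_neg, false_pos, true_neg):
    """Detection rate and false positive rate from confusion counts."""
    return {
        # Share of buildings labeled damaged that the method flagged.
        "detection_rate": true_pos / (true_pos + false_neg),
        # Share of undamaged buildings that the method wrongly flagged.
        "false_positive_rate": false_pos / (false_pos + true_neg),
    }


# Hypothetical sample: 1,000 damaged and 1,000 undamaged footprints.
metrics = detection_metrics(true_pos=925, false_neg=75, false_pos=12, true_neg=988)
```

The denominators differ: the detection rate is taken over damaged buildings only, the false positive rate over undamaged ones, so a low false positive rate can still mean many false alarms in a city where most buildings are intact.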
A parallel study applied Sentinel-1 CCD to the urban areas of
Mariupol and Kharkiv specifically, using the spatial and temporal
fidelity of the coherence time series to track the progression of the
siege of Mariupol and the battle of Kharkiv at the level of individual
city blocks, cross-referenced against cultural property identified in
OpenStreetMap. These results were produced by academic researchers using
free satellite data, open-source processing tools, and cloud computing
platforms — tools accessible to any well-resourced OSINT practitioner,
not merely national intelligence agencies.
VII. The OSINT Ecosystem That Will Use These Tools
Open-source intelligence has undergone its own
revolution over roughly the same period. Bellingcat, founded by British
journalist Eliot Higgins in 2014, demonstrated that rigorous analytical
work combining satellite imagery, social media photographs, shadow
geometry, road sign analysis, and architectural cross-referencing could
geolocate military hardware, document atrocities, and identify war
criminals — work previously requiring dedicated national intelligence
assets. Their investigation of the MH17 shootdown reconstructed the
route of the Buk missile launcher through eastern Ukraine from tourist
Instagram posts and dash-cam videos.
The conflict in Ukraine accelerated the maturation of the OSINT
ecosystem dramatically. Volunteer communities using platforms like
GeoConfirmed geolocate and verify visual content from conflict zones in
near-real time, building interactive verified-event maps with full
geolocation proofs, order-of-battle diagrams, and equipment tracking.
The Conflict Observatory, backed by the U.S. State Department and using geospatial AI tools,
has documented Russian war crimes from commercially available satellite
imagery and open social media sources for eventual use in legal
proceedings. The HALO Trust has used OSINT and GIS mapping to document
explosive hazard locations across Ukraine, verifying over 8,000 data
points to guide demining operations.
All of these workflows currently depend on 2D analysis: geolocating a
photograph means identifying what can be seen in the image and matching
it to known map features, satellite imagery, or other photographs. What
the new generation of 3D reconstruction tools enables is a qualitative
step beyond that: the ability to synthesize disparate, sparse,
uncoordinated image collections into a coherent volumetric model of a
site, from which analysts can extract viewing angles, measurement data,
change detection over time, and operational planning intelligence that
no single photograph can provide.
An analyst examining a target compound who previously had three
social media photographs, a satellite thumbnail, and a traffic camera
frame — insufficient for classical reconstruction — can now potentially
produce a navigable 3D site model integrating all of that imagery. The
same tools that let an entertainment company reconstruct a concert venue
for virtual reality will let an OSINT researcher reconstruct a
detention facility from a handful of prisoner photographs posted on
social media.
"The same tech shows up in a Cornell research paper, a Netflix VFX
tool, and an intelligence program — usually within months of each
other."
— Bilawal Sidhu, "Building a 3D Model of the World from Internet Photos," Spatial Intelligence, May 2026
VIII. YouTube and the Videogrammetry Frontier
Every argument made so far about still photographs
applies with equal or greater force to video — and the world's video
archive dwarfs its photograph archive by orders of magnitude. YouTube
alone reports more than 500 hours of video uploaded every minute. A
significant fraction of that upload stream consists of travel vlogs,
walking tours, drone footage, dashcam recordings, military equipment
reviews, conflict documentation, and urban exploration content —
precisely the kinds of footage that structure-from-motion algorithms can
exploit for 3D reconstruction. The question of whether anybody has
seriously looked at YouTube as a reconstruction source is easy to
answer: the computer vision community has been doing it for years, and
the results are now operationally relevant.
Videogrammetry — the systematic extraction of 3D geometry from video
streams — traces its academic lineage through the same
structure-from-motion literature as still photography reconstruction. A
2026 survey in the Journal of Imaging identified 6,863
published studies on video-based 3D reconstruction through the end
of 2024, covering structure-from-motion, multi-view stereo, and visual
SLAM approaches applied to monocular video. The fundamental mechanics
are the same as with photographs: each video frame is a photograph, and a
video stream at 30 frames per second taken while the camera moves
provides an extremely dense, continuous multi-view geometry problem with
known temporal ordering — in many ways a cleaner input for
reconstruction than a disorganized pile of tourist snapshots, because
the frame-to-frame motion is smooth and predictable.
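Since consecutive frames at 30 frames per second are nearly identical, a reconstruction pipeline typically keeps only a subset. The sketch below picks frame indices at a fixed stride; the function name and the 2-frames-per-second target are illustrative assumptions, and practical pipelines often select keyframes by parallax or blur instead of by a fixed rate.

```python
def keyframe_indices(total_frames, source_fps=30.0, target_fps=2.0):
    """Indices of frames to keep when thinning a video to target_fps."""
    if source_fps <= 0 or target_fps <= 0:
        raise ValueError("frame rates must be positive")
    stride = max(1, round(source_fps / target_fps))
    return list(range(0, total_frames, stride))


# A 10-second clip at 30 fps is 300 frames; keep two per second.
kept = keyframe_indices(total_frames=300)
```

The kept frames stay in temporal order, so each one can be matched first against its immediate neighbors, a shortcut that an unordered photo collection does not offer.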
The intelligence community appreciated this as a capability even
before the academic literature fully crystallized it. A U.S. Army Corps
of Engineers patent from the late 2010s specifically describes "a method
of processing full motion video data for photogrammetric
reconstruction," noting the military's goal of transforming millions of
terabytes of full motion video (FMV) from unmanned vehicles, wireless
cameras, and other sensors into usable 2D and 3D maps and models. That
work — extracting still frames, recovering camera exterior orientation
from on-board GPS and inertial data, and applying classical
photogrammetric computational models — has been an active DoD program
for over a decade.
What has changed is the democratization and scale. A YouTube walking
tour of a city neighborhood, a drone flyover of a port facility, a
dashcam recording of a road through a military installation's perimeter,
a tourist's GoPro footage of a foreign capital — each of these,
individually, is a structure-from-motion problem that feed-forward
models like VGGT can now solve in seconds. Aggregated across thousands
of videos uploaded by different people at different times, they become a
time-resolved, multi-angle, continuously updated 3D model of the world
with a temporal density no satellite constellation can match.
The mannequin challenge precedent noted in earlier research is
telling: Cornell researchers scraped 2,000 freeze-frame videos from
YouTube in 2016 specifically because the static-human, moving-camera
format was a free structure-from-motion training set with humans
embedded. The same logic applies to any video genre where the camera
moves through a scene while the background is static — which describes
the large majority of travel, urban exploration, and documentary video
content. MegaScenes and MegaDepth-X were built on still photographs from
Wikimedia Commons; there is no technical barrier to building an
equivalent dataset from YouTube at ten times the scale, and the
intelligence community almost certainly has programs doing exactly this
with non-public video archives.
Video as a Reconstruction Source: What Each Format Provides
Walking tour / vlog: dense, continuous camera trajectory at street
level; ideal for building facade reconstruction.
Drone footage: oblique-angle coverage of rooftops and courtyards
inaccessible to street-level photography; complementary to satellite
nadir imagery.
Dashcam video: road-corridor reconstruction with high temporal density;
effectively a private Street View.
Live-stream / conflict video: unplanned, often shaky, but
geolocation-anchored by background features and shadows; valuable
precisely because it is uncontrolled.
Sports broadcast / stadium footage: multi-camera, known camera
positions, high frame rate; near-perfect reconstruction input.
The feed-forward models that solve still-photo reconstruction in seconds
are equally applicable to extracted video frames, with temporal ordering
serving as an additional geometric constraint.
IX. The Corruption Vector: AI-Generated Imagery and the Integrity of the Reconstruction Pipeline
The same generative models that fill coverage gaps
in Skyfall-GS and the WRIVA pipeline introduce a profound vulnerability
to the broader reconstruction ecosystem: the progressive contamination
of the open internet photograph corpus with imagery that has no
geometric ground truth. The question is not merely whether AI-generated
images look convincing to a human viewer. The question is whether they
look convincing to a feature-matching algorithm — and the answer is
deeply problematic, for reasons that are both well-understood and not
yet solved.
AI-generated images produced by diffusion models and GANs share a
characteristic set of geometric failures. A 2024 forensic analysis found
that generated images frequently exhibit inconsistent vanishing point
geometry — projections of parallel lines that fail to converge correctly
in two-dimensional space, particularly in architectural scenes. They
often lack proper parallax: the apparent displacement of objects when
viewed from slightly different positions that is the fundamental signal
structure-from-motion algorithms exploit. They can produce physically
impossible depth cues, inconsistent shadows and lighting, anomalous
object proportions, and missing occlusion relationships. A
separate 2024 study demonstrated that generative models cannot reliably
replicate projective geometric relationships — the very relationships
that photogrammetric algorithms use as their primary matching signal.
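The vanishing-point failure can be made concrete with a small sketch. Image lines that depict parallel 3D edges should all pass through one vanishing point, so the distance from that point to each additional line is a simple consistency score. The code below is an illustrative check written for this article, not the method of the cited studies; the segment coordinates and the implied pixel threshold are made-up assumptions.

```python
import math


def cross(a, b):
    """Cross product of two 3-vectors (used for homogeneous lines and points)."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def line_through(p, q):
    """Homogeneous line through two image points given as (x, y)."""
    return cross((p[0], p[1], 1.0), (q[0], q[1], 1.0))


def vanishing_point(l1, l2):
    """Intersection of two homogeneous lines, returned as (x, y)."""
    x, y, w = cross(l1, l2)
    if abs(w) < 1e-12:
        raise ValueError("lines are parallel in the image; no finite vanishing point")
    return (x / w, y / w)


def distance_to_line(pt, line):
    """Perpendicular distance from a point (x, y) to a homogeneous line."""
    a, b, c = line
    return abs(a * pt[0] + b * pt[1] + c) / math.hypot(a, b)


def vp_residual(segments):
    """Vanishing point from the first two segments; worst miss among the rest."""
    lines = [line_through(p, q) for p, q in segments]
    vp = vanishing_point(lines[0], lines[1])
    return max(distance_to_line(vp, l) for l in lines[2:])


# Three edges that all converge on (100, 50), as parallel edges in a real photo would.
consistent = [((0, 0), (50, 25)), ((0, 100), (50, 75)), ((0, -40), (50, 5))]
# Same scene, but the third edge is drawn with a slightly wrong slope.
inconsistent = [((0, 0), (50, 25)), ((0, 100), (50, 75)), ((0, -40), (50, 15))]

if __name__ == "__main__":
    print(vp_residual(consistent))    # close to 0
    print(vp_residual(inconsistent))  # roughly 13.5 pixels
```

A real detector would need to find and group line segments first and would fit the vanishing point robustly; this sketch assumes the grouping is already known.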
From a structure-from-motion perspective, an AI-generated image of a
building injected into a reconstruction pipeline is not a photograph of a
building from a specific physical viewpoint. It is a plausible
hallucination of what such a photograph might look like, generated by
statistical processes that learned the appearance distribution of
buildings but were not constrained to maintain physical consistency
across viewpoints. Feature matching will fail unpredictably — the
algorithm will attempt to find corresponding points between the
AI-generated image and real photographs, and will either produce no
matches (causing the image to be excluded as an outlier) or, more
dangerously, produce spurious matches that inject incorrect geometric
constraints into the reconstruction, silently corrupting the resulting
3D model. The second failure mode is worse than the first because it is
invisible: the reconstruction appears to succeed, produces a
plausible-looking 3D model, but that model encodes false geometry.
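The two failure modes can be illustrated with the epipolar constraint, the standard two-view check that a correct match (x1, x2) satisfies x2^T E x1 = 0 for the essential matrix E. The sketch below assumes a known relative pose (a pure sideways translation with no rotation) and normalized image coordinates; the match values and the tolerance are invented for illustration.

```python
def skew(t):
    """Skew-symmetric matrix [t]_x, so that skew(t) applied to v equals t x v."""
    tx, ty, tz = t
    return [[0.0, -tz, ty],
            [tz, 0.0, -tx],
            [-ty, tx, 0.0]]


def epipolar_residual(E, x1, x2):
    """|x2^T E x1| for normalized image points x1, x2 given as (x, y)."""
    h1 = (x1[0], x1[1], 1.0)
    h2 = (x2[0], x2[1], 1.0)
    Ex1 = [sum(E[r][c] * h1[c] for c in range(3)) for r in range(3)]
    return abs(sum(h2[r] * Ex1[r] for r in range(3)))


def filter_matches(E, matches, tol=1e-3):
    """Keep only matches that satisfy the epipolar constraint within `tol`."""
    return [m for m in matches if epipolar_residual(E, m[0], m[1]) <= tol]


# Assumed pose: camera 2 is camera 1 shifted along x, with identical orientation.
# For that pose E = [t]_x and the constraint reduces to "same image row".
E = skew((1.0, 0.0, 0.0))

matches = [
    ((0.10, 0.20), (0.05, 0.20)),    # geometrically consistent
    ((-0.30, 0.10), (-0.42, 0.10)),  # geometrically consistent
    ((0.00, 0.00), (0.10, 0.25)),    # violates the constraint, gets rejected
]

if __name__ == "__main__":
    print(len(filter_matches(E, matches)))  # 2
```

The filter removes only matches that break the geometry. A spurious match that happens to land on the correct epipolar line passes untouched, which is the silent failure mode described above.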
Model Collapse: A Systemic Threat to Reconstruction Training Data
The problem extends beyond individual images. A landmark 2024 paper
in Nature by Ilia Shumailov, Zakhar Shumaylov, Yiren Zhao, and
colleagues at Oxford, Cambridge, and DeepMind formally characterized
"model collapse" — the process by which AI models trained on
AI-generated data experience compounding information loss across
successive generations. The mechanism is inexorable: AI-generated
content overrepresents the statistical center of the training
distribution and underrepresents the tails. When subsequent models train
on this content, rare but valid examples disappear from the learned
distribution. After multiple generations, the model converges on a
narrow, homogeneous, increasingly unrealistic output distribution. The
authors found this effect to be irreversible under standard training
conditions: once the tails of the distribution have been lost, they
cannot be recovered by further training on contaminated data.
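A toy simulation shows why tail loss is one-directional. Treat the corpus as a set of categories with frequencies, and let each generation be a finite sample drawn from the previous generation's frequencies. This is a deliberately simplified stand-in written for this article, not the experimental setup of the Nature paper; the category names, counts, and sample size are arbitrary assumptions.

```python
import random
from collections import Counter


def next_generation(counts, n_samples, rng):
    """Resample a categorical 'corpus' from the previous generation's frequencies."""
    categories = [c for c, k in counts.items() if k > 0]
    weights = [counts[c] for c in categories]
    return Counter(rng.choices(categories, weights=weights, k=n_samples))


def support_sizes(initial_counts, generations, n_samples, seed=0):
    """Number of categories still present after each generation."""
    rng = random.Random(seed)
    counts = Counter(initial_counts)
    sizes = [sum(1 for k in counts.values() if k > 0)]
    for _ in range(generations):
        counts = next_generation(counts, n_samples, rng)
        sizes.append(len(counts))  # a Counter holds only categories that were drawn
    return sizes


# Hypothetical corpus: one dominant mode plus a few rare but valid cases.
initial = {"common": 900, "uncommon": 90, "rare_a": 5, "rare_b": 3, "rare_c": 2}

if __name__ == "__main__":
    print(support_sizes(initial, generations=30, n_samples=200))
```

A category that draws zero samples in one generation has zero probability in every later one, so the printed sequence can never increase. That is the irreversibility the authors describe, reduced to its simplest form.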
For 3D reconstruction specifically, the tail of the distribution is
precisely what matters for OSINT. The unusual viewpoints, the partially
occluded scenes, the edge-case lighting conditions, the architecturally
ambiguous structures — these are the inputs where the models are most
needed and most likely to fail. If the training corpus for future
versions of VGGT, π³, and their successors is contaminated with
AI-generated photographs that systematically smooth over these hard
cases, the resulting models will be progressively worse at exactly the
long-tail reconstruction problem that MegaDepth-X was designed to solve.
A 2025 computational study tracking year-by-year semantic similarity in
major English-language corpora found an exponential acceleration in
synthetic contamination coinciding with the public release of large
language models between 2018 and 2022. The same trajectory is playing
out in the image domain.
Deepfake Geography: Weaponized Corruption
Beyond accidental contamination, there is the deliberate adversarial
case. A November 2025 study evaluated the detection of AI-generated
satellite imagery and found that generative models including StyleGAN2
and Stable Diffusion can now produce synthetic satellite images that are
visually convincing at terrain scale, exhibiting the characteristic
large-area texture consistency of real satellite imagery — even though
they encode fictional geography. As the National Geospatial-Intelligence
Agency director noted in 2019, AI-manipulated satellite images
represent a severe national security threat. The implication has
sharpened: if a state actor can inject convincing synthetic satellite
imagery into the open datasets that OSINT practitioners, reconstruction
pipelines, and AI training workflows rely on, they can corrupt the
geometric models derived from those datasets, create false confidence in
phantom infrastructure, and cause intelligence failures grounded in
plausible-looking but physically nonexistent evidence.
The formal research on deepfake geography, pioneered by Bo Zhao and
colleagues at the University of Washington, demonstrated that GANs can
learn the statistical appearance of satellite imagery from one location
and apply it to a different base map — effectively fabricating a
convincing satellite photograph of a place that looks different from how
it actually looks, or that shows infrastructure that does not exist.
Unlike photograph forgery in a narrow sense, this is a statistical
manipulation of the entire texture distribution of a geographic area,
producing results that defeat casual visual inspection and, more
importantly, that can pass standard automated quality filters that look
for obvious artifacts rather than geometric ground truth.
SAR as a Corruption Firewall
This is where the SAR data stream discussed in Section VI becomes not
merely complementary but potentially indispensable as a data integrity
mechanism. An AI-generated photograph has no corresponding SAR return. A
diffusion model hallucinating a building in a location where no
building exists will not produce the double-bounce radar signature, the
persistent scatterer behavior, or the InSAR phase coherence that a real
building of that size and construction would generate in Sentinel-1
data. The physical measurement — microwave backscatter from actual
matter — cannot be fabricated by a model that has never seen physical
matter and has only learned the statistical appearance of photographs.
SAR coherence time series, which can detect structural changes at
millimeter precision, provide a chronological ground truth against which
both claimed changes and newly injected imagery can be validated.
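The coherence measure behind such time series is a normalized cross-correlation of two co-registered complex SAR patches: values near 1 mean the scatterers are unchanged between passes, values near 0 mean the scene has changed. The sketch below uses the standard sample estimator on tiny hand-made patches; real processing works on windows of co-registered Sentinel-1 pixels, and the sample values here are illustrative assumptions.

```python
import cmath
import math


def coherence(s1, s2):
    """Magnitude of the sample complex coherence between two co-registered patches."""
    num = abs(sum(a * b.conjugate() for a, b in zip(s1, s2)))
    den = math.sqrt(sum(abs(a) ** 2 for a in s1) * sum(abs(b) ** 2 for b in s2))
    return num / den if den > 0 else 0.0


# Patch from the first pass (complex backscatter samples).
before = [1 + 0j, 1j, -1 + 0j, -1j]
# Same scatterers on a later pass: only a common phase offset has been added.
same_scene = [z * cmath.exp(0.7j) for z in before]
# Scatterers rearranged, for example after demolition or new construction.
changed = [1 + 0j, 1 + 0j, 1 + 0j, 1 + 0j]

if __name__ == "__main__":
    print(coherence(before, same_scene))  # close to 1: stable structure
    print(coherence(before, changed))     # close to 0: scene has changed
```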
This creates a testable framework for reconstruction pipeline
integrity: any image claiming to depict a specific location and time
should produce feature-consistent matches with the SAR data record for
that location and time. Buildings that appear in photographs but produce
no SAR return, or that appear at dates inconsistent with the SAR change
detection record, are candidates for synthetic fabrication. The fusion
of SAR geometric constraints with optical reconstruction is thus
simultaneously an accuracy improvement and a corruption detection
mechanism — radar physics as a cryptographic ground truth for the
photographic record.
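The cross-check described above can be sketched as a simple rule set. Everything in this example is hypothetical: the record fields, the site identifiers, and the flag names are invented for illustration, and a real system would derive the SAR side from change-detection products rather than a hand-filled table.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class OpticalClaim:
    site_id: str
    photo_date: date  # the date the image claims to depict


@dataclass
class SarRecord:
    site_id: str
    has_structure_return: bool      # e.g. a persistent double-bounce signature
    first_detected: Optional[date]  # earliest date the structure appears in SAR


def flag_claim(claim, sar_by_site):
    """Compare one optical claim against the SAR record for the same site."""
    rec = sar_by_site.get(claim.site_id)
    if rec is None:
        return "no_sar_coverage"
    if not rec.has_structure_return:
        return "suspect_no_sar_return"
    if rec.first_detected is not None and claim.photo_date < rec.first_detected:
        return "suspect_date_mismatch"
    return "consistent"


# Hypothetical SAR-derived records for two sites.
sar_by_site = {
    "site-A": SarRecord("site-A", True, date(2021, 6, 1)),
    "site-B": SarRecord("site-B", False, None),
}

if __name__ == "__main__":
    print(flag_claim(OpticalClaim("site-A", date(2023, 3, 1)), sar_by_site))  # consistent
    print(flag_claim(OpticalClaim("site-A", date(2020, 1, 1)), sar_by_site))  # suspect_date_mismatch
    print(flag_claim(OpticalClaim("site-B", date(2023, 3, 1)), sar_by_site))  # suspect_no_sar_return
```

A flag marks a candidate for review rather than proof of fabrication; small or radar-transparent structures may legitimately produce weak returns.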
The advances described above address static scenes.
The world, of course, is not static. Researchers are now beginning to
extend the same toolkit to casual handheld video, extracting
four-dimensional reconstructions — geometry plus motion over time — from
single phone video clips. Papers including MOSA and Shape of Motion are
demonstrating the ability to decompose a monocular video into
independently moving objects and their trajectories, not merely the
static scene geometry.
The intelligence implications extend beyond site modeling to activity
pattern analysis. Once the 3D structure of a site has been established,
recurring video from the same location — security camera footage, drone
surveillance, recurring social media posts — can in principle be
registered against that model to track changes in vehicle positions,
construction activity, personnel movements, and equipment staging over
time. The same Gaussian splatting techniques that enable "Wild
Gaussians" — separating a building from its variable lighting conditions
across thousands of photographs taken over years — can be adapted to
track the arrival and departure of specific objects against a stable 3D
background model.
The temporal dimension also works backward. Archived social media
photographs taken years before an event of interest can potentially be
incorporated into a retroactive reconstruction of what a site looked
like at a given time, providing historical baseline models against which
current imagery can be compared to detect changes.
X. Privacy, Civil Liberties, and the Absence of Law
The legal and regulatory frameworks governing these
capabilities lag the technology by years, and the gap is widening. A
tourist photographing a famous landmark and posting the image on
Instagram has consented, at most, to the platform's terms of service —
terms that do not anticipate the use of that photograph as a data point
in an intelligence-grade 3D reconstruction of the surrounding area,
complete with the reflections in sunglasses and the geometry of windows
visible in the background.
No current statute in the United States specifically addresses the
mass aggregation of publicly posted photographs for 3D scene
reconstruction. The Computer Fraud and Abuse Act governs unauthorized
computer access, not the analytical use of public data. The Fourth
Amendment's protections against unreasonable search and seizure have
been construed narrowly in the context of publicly accessible spaces.
The Electronic Communications Privacy Act does not address metadata
aggregation at this scale. At the European level, GDPR's provisions on
biometric data and location data offer somewhat stronger protection, but
enforcement has not been tested against photogrammetric reconstruction
pipelines that do not themselves collect data but analyze existing
public collections.
The asymmetry is significant. State and commercial actors with
computational resources can now build 3D models of any photographed
location on Earth from existing public image archives. Private
individuals whose photographs contribute to those models have no
notification, no opt-out, and no legal recourse under current
frameworks. This is the photogrammetric equivalent of the mass location
tracking at issue in Carpenter v. United States (2018), in which the
Supreme Court held that the government's warrantless collection of
cell-site location records violated the Fourth Amendment. Here, though,
the aggregation operates on imagery rather than telecommunications
metadata, and currently without a comparable legal check.
Legal Landscape: Current Gaps
- No U.S. statute specifically addresses the use of aggregated publicly
  posted photographs for 3D scene reconstruction.
- GDPR offers partial protections in Europe for identifiable personal
  data, but has not been tested against photogrammetric pipelines.
- Courts have not addressed whether a person photographing a public
  space retains any interest in that image when it is used as a
  geometric data point.
- The Carpenter precedent on location tracking has not been extended to
  photographic data aggregation.
- Operational security implications for military personnel and
  government officials who post photographs in sensitive areas are
  serious and inadequately addressed by current policy.
XI. Operational Security Implications
The operational security implications for military
and government personnel are immediate and concrete. A photograph posted
from a sensitive installation — even one in which the installation
itself is not visible — can contribute geometric and positional data to a
reconstruction of the surrounding area. Even with its metadata stripped,
a photograph may still carry exploitable cues: the angle of shadows
constrains the latitude, longitude, and time; the geometry of visible
skylines and structures provides cross-referencing anchor points; and
the aggregate of many such photographs taken by many individuals over
time can yield a high-fidelity reconstruction of areas that no single
photograph would reveal.
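The shadow constraint is simple trigonometry. The ratio of an object's height to its shadow length gives the solar elevation, and at local solar noon the elevation, the solar declination for that date, and the latitude are tied together by one equation. The sketch below assumes flat ground, a vertical object of known height, and a photo taken at solar noon with the sun to the south; the function names and the example numbers are illustrative.

```python
import math


def solar_elevation_from_shadow(object_height_m, shadow_length_m):
    """Solar elevation in degrees from a vertical object's height and shadow length."""
    return math.degrees(math.atan2(object_height_m, shadow_length_m))


def noon_latitude(elevation_deg, declination_deg):
    """Latitude implied at local solar noon, assuming the sun is due south.

    Uses the noon relation: elevation = 90 - latitude + declination.
    """
    return 90.0 - elevation_deg + declination_deg


if __name__ == "__main__":
    # A 2 m signpost casting a 2 m shadow puts the sun 45 degrees above the horizon.
    elevation = solar_elevation_from_shadow(2.0, 2.0)
    print(round(elevation, 3))                      # 45.0
    # At an equinox (declination about 0) that implies roughly 45 degrees north.
    print(round(noon_latitude(elevation, 0.0), 3))  # 45.0
```

Away from solar noon the same measurement yields a curve of possible latitude, longitude, and time combinations rather than a single point, which is why shadows narrow a location rather than fix it outright.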
The Department of Defense has maintained guidance restricting
photography on installations and prohibiting the posting of
operationally sensitive imagery on social media, but enforcement is
uneven and the underlying threat model has not been updated to account
for photogrammetric aggregation of innocuous-seeming photographs. The
same techniques that let Cornell researchers reconstruct a Viennese
palace from tourist Flickr posts can reconstruct the exterior geometry
of any base, facility, or compound from the holiday photographs, fitness
app routes, and social media posts of the personnel who work there.
XII. Looking Forward: A Convergent World Model
The technology now in place constitutes what
researchers have begun calling a "sensorium" — a continuously updated,
increasingly complete, three-dimensional model of the photographed and
radar-scanned world. The pieces are now assembled: a planetary
archive of photographs already uploaded and searchable; an equally
planetary archive of free SAR imagery extending back more than a decade
through Sentinel-1 and growing daily; feed-forward 3D foundation models
that can turn any collection of those photographs into a coherent
geometric model in seconds; SAR-neural fusion frameworks that inject
radar-derived geometric priors to constrain and complete what sparse
optical imagery cannot; Gaussian splatting rendering that makes the
resulting fused models interactable in real time in a browser;
diffusion-based in-painting that fills coverage gaps where neither
photographs nor radar provide sufficient detail; and now, with
MegaDepth-X, the ability to extend photogrammetric reconstruction to the
sparse, irregular, noisy image collections that characterize most of
the world's surface.
The four-dimensional extension — tracking change over time from video
— is underway. The fusion of ground-level, drone, and satellite imagery
into unified models is operational. The intelligence community has been
funding precisely this capability through WRIVA since 2023. The
academic community has solved the remaining algorithmic barrier. The
commercial ecosystem — in VFX, video gaming, real estate, autonomous
vehicles, and augmented reality — will drive the cost of the necessary
computation toward zero.
What has not kept pace is the governance framework. The photographic
commons of the internet, accumulated over two decades of social media,
smartphone ubiquity, and platform-enabled sharing, is about to be
transformed into something its contributors did not consent to: the raw
material for a god's-eye view of the world. The technology is ready. The
law is not. The policy is not. And the public, whose photographs make
it possible, is largely unaware that the transformation is underway.
Verified Sources & Formal Citations
- Li, Y., Xiangli, Y., Averbuch-Elor, H., Snavely, N., & Cai, R. (2026). Long-tail Internet photo reconstruction. CVPR 2026. arXiv:2604.22714. https://arxiv.org/abs/2604.22714 | Project page: https://megadepth-x.github.io/
- Wang, J., et al. (2025). VGGT: Visual Geometry Grounded Transformer. CVPR 2025 Best Paper Award. GitHub: https://github.com/facebookresearch/vggt
- Tung, J., Chou, G., Cai, R., Yang, G., Zhang, K., Wetzstein, G.,
Hariharan, B., & Snavely, N. (2024). MegaScenes: Scene-level view
synthesis at scale. ECCV 2024. arXiv:2406.11819. https://arxiv.org/abs/2406.11819 | Dataset: https://megascenes.github.io/
- Office of the Director of National Intelligence / IARPA. (2023,
July 12). IARPA launches effort to develop photorealistic site models
[Press release]. https://www.dni.gov/index.php/newsroom/press-releases/press-releases-2023/3707-iarpa-launches-effort-to-develop-photorealistic-site-models
- IARPA. (2023). WRIVA program page. https://www.iarpa.gov/research-programs/wriva
- Intelligence Community News. (2023, July 13). IARPA launches WRIVA program. https://intelligencecommunitynews.com/iarpa-launches-wriva-program/
- Dodd, T. (2023, July 14). Why US intelligence wants a new way to make virtual, 3D models. Popular Science. https://www.popsci.com/technology/iarpa-virtual-models/
- Lee, J., et al. (2025). Skyfall-GS: Synthesizing immersive 3D urban scenes from satellite imagery. arXiv:2510.15869. https://arxiv.org/abs/2510.15869
- The Decoder staff. (2025, November 2). Skyfall-GS turns satellite images into walkable 3D cities. The Decoder. https://the-decoder.com/skyfall-gs-turns-satellite-images-into-walkable-3d-cities/
- Kerbl, B., Kopanas, G., Leimkühler, T., & Drettakis, G.
(2023). 3D Gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics (SIGGRAPH 2023). https://repo-sam.inria.fr/fungraph/3d-gaussian-splatting/
- Martin-Brualla, R., Radwan, N., Sajjadi, M. S. M., Barron, J. T.,
Dosovitskiy, A., & Duckworth, D. (2021). NeRF in the Wild: Neural
radiance fields for unconstrained photo collections. CVPR 2021. https://nerf-w.github.io/
- Mildenhall, B., Srinivasan, P. P., Tancik, M., Barron, J. T.,
Ramamoorthi, R., & Ng, R. (2020). NeRF: Representing scenes as
neural radiance fields for view synthesis. ECCV 2020. https://www.matthewtancik.com/nerf
- Wang, S., et al. (2024). DUSt3R: Geometric 3D vision made easy. CVPR 2024. https://dust3r.europe.naverlabs.com/
- Leroy, V., et al. (2025). MASt3R: Matching and Stereo 3D Reconstruction. CVPR 2025. https://github.com/naver/mast3r
- Li, Z., & Snavely, N. (2018). MegaDepth: Learning single-view depth prediction from internet photos. CVPR 2018. https://www.cs.cornell.edu/projects/megadepth/
- ULTRRA Challenge. (2025). WRIVA workshop at WACV 2025. https://sites.google.com/view/ultrra-wacv-2025
- MegaDepth-X dataset (Hugging Face). https://huggingface.co/datasets/y-u-a-n-l-i/MegaDepth-X
- MegaScenes dataset (GitHub / AWS Open Data). https://github.com/MegaScenes/dataset
- Sidhu, B. (2026, May). Building a 3D model of the world from internet photos. Spatial Intelligence (Substack). https://www.spatialintelligence.ai/p/building-a-3d-model-of-the-world
- Chen, S. (2025, September 23). VGGT: How feed-forward 3D perception is redefining scene reconstruction. Medium / deMISTify. https://medium.com/demistify/vggt-how-feed-forward-3d-perception-is-redefining-scene-reconstruction
- Aira, L. S., Facciolo, G., & Ehret, T. (2024). Gaussian
splatting for efficient satellite image photogrammetry.
arXiv:2412.13047. https://arxiv.org/abs/2412.13047
- Aristotle University of Thessaloniki et al. (2023). Integrating
Earth observation IMINT with OSINT data: A case study of the
Ukraine–Russia war. Security and Defence Quarterly. https://securityanddefence.pl/…multisource,170901,0,2.html
- Esri / HALO Trust. (2022). Open-source data documents war atrocities in Ukraine. https://www.esri.com/about/newsroom/blog/ukraine-open-source-intelligence
- GeoConfirmed project page. https://geoconfirmed.org/
- Bellingcat Online Investigation Toolkit (GitBook, current version). https://bellingcat.gitbook.io/toolkit
- State of Surveillance. (2026, January 9). Geolocation OSINT: How investigators find where photos were taken. https://stateofsurveillance.org/articles/technical/geolocation-osint-photo-location-tracking/
- CVPR 2026 poster listing. Long-tail Internet photo reconstruction. https://cvpr.thecvf.com/virtual/2026/poster/37828
- Carpenter v. United States, 585 U.S. 296 (2018). [Cell-site location data / Fourth Amendment precedent.] https://www.supremecourt.gov/opinions/17pdf/16-402_h315.pdf
- Li, D., Yao, C., Mao, T., Bao, J., & Sun, H. (2026). Urban
neural surface reconstruction from constrained sparse aerial imagery
with 3D SAR fusion. arXiv:2601.22045. https://arxiv.org/abs/2601.22045
- Scher, C., & Van Den Hoek, J. (2025). Nationwide conflict
damage mapping with interferometric synthetic aperture radar: A study of
the 2022 Russia–Ukraine conflict. Science of Remote Sensing, 11, 100217. https://doi.org/10.1016/j.srs.2025.100217
- Scher, C., & Van Den Hoek, J. (2025). Active InSAR monitoring
of building damage in Gaza during the Israel-Hamas War.
arXiv:2506.14730. https://arxiv.org/abs/2506.14730
- Building Damage Assessment Portal — open-access Sentinel-1 InSAR CCD data for Ukraine and Gaza. https://rccd-damage-portal.netlify.app/
- Ballinger, O. (2024). Open access battle damage detection via pixel-wise T-Test on Sentinel-1 imagery. arXiv:2405.06323. Published in Remote Sensing of Environment, 2025. https://www.sciencedirect.com/science/article/pii/S0034425725004298
- Mavroulis, S., et al. (2024). Cultural heritage in times of
crisis: Damage assessment in urban areas of Ukraine using Sentinel-1 SAR
data. ISPRS International Journal of Geo-Information, 13(9), 319. https://www.mdpi.com/2220-9964/13/9/319
- ESA Copernicus Sentinel-1 constellation overview. https://sentinels.copernicus.eu/copernicus/sentinel-1
- Copernicus Data Space Ecosystem — Sentinel-1 free access portal. https://dataspace.copernicus.eu/data-collections/copernicus-sentinel-missions/sentinel-1
- Sentinel-1 AWS Open Data Registry (Element 84). https://registry.opendata.aws/sentinel-1/
- ICEYE Open Data Initiative (no-paywall SAR archive on AWS). https://www.iceye.com/open-data-initiative
- Janes. (2026, April 30). ICEYE launches six new SAR satellites [SpaceX Transporter-16 rideshare, March 30, 2026]. https://www.janes.com/osint-insights/defence-news/air/update-iceye-launches-six-new-sar-satellites
- Synthetic Aperture Radar news. (2024, December 17). NRO extends
SAR contracts to Capella, ICEYE, and Umbra, Stage III SCE BAA, July
2024–July 2026. https://syntheticapertureradar.com/nro-extends-sar-contracts-to-capella-iceye-and-umbra-advancing-commercial-radar-strategy/
- Atlas Institute for International Affairs. (2025, August 6). Open
eyes in the high north: OSINT capabilities including Sentinel-1, ICEYE,
and Capella Space. https://atlasinstitute.org/open-eyes-in-the-high-north-open-source-intelligence-capabilities-and-constraints/
- Capella Space. (2025, April 7). How SAR is reshaping the Earth observation industry in 2025. https://www.capellaspace.com/blog/how-sar-is-reshaping-the-earth-observation-industry-in-2025
- Afrosheh, S., & Askari, M. (2024). Geospatial data fusion:
Combining LiDAR, SAR, and optical imagery with AI for enhanced urban
mapping. arXiv:2412.18994. https://arxiv.org/abs/2412.18994
- Mouget, A., et al. (2026). Video-based 3D reconstruction: A review of photogrammetry and visual SLAM approaches. Journal of Imaging, 12(3), 128. https://www.mdpi.com/2313-433X/12/3/128
- U.S. Army Corps of Engineers / USPTO. Method of processing full
motion video data for photogrammetric reconstruction. US Patent
10,553,022. https://image-ppubs.uspto.gov/dirsearch-public/print/downloadPdf/10553022
- Shumailov, I., Shumaylov, Z., Zhao, Y., Papernot, N., Anderson,
R., & Gal, Y. (2024). AI models collapse when trained on recursively
generated data. Nature, 631, 755–759. https://doi.org/10.1038/s41586-024-07566-y
- Arabi, M., & Jalili, M. (2025). Future of AI models: A computational perspective on model collapse. arXiv:2511.05535. https://arxiv.org/abs/2511.05535
- Sarkar, S., et al. (2024). Qualitative failures of image
generation models and their application in detecting deepfakes.
arXiv:2304.06470. https://arxiv.org/html/2304.06470v6 [vanishing point, parallax, and projective geometry failures in AI-generated images]
- Okumura, R., Shiohara, K., & Yamasaki, T. (2024). ControlVP:
Interactive geometric refinement of AI-generated images with consistent
vanishing points. arXiv:2512.07504. https://arxiv.org/abs/2512.07504 [documents systematic vanishing point inconsistency in Sora, Stable Diffusion outputs]
- Yerzhanuly, M. (2025). Deepfake geography: Detecting AI-generated satellite images. arXiv:2511.17766. https://arxiv.org/abs/2511.17766 [ViT detectors achieve 95.11% accuracy vs. CNN; terrain-level inconsistencies as detection signals]
- Zhao, B., et al. (2021). A growing problem of 'deepfake geography': How AI falsifies satellite images. Cartography and Geographic Information Science. University of Washington news release: https://www.washington.edu/news/2021/04/21/a-growing-problem-of-deepfake-geography-how-ai-falsifies-satellite-images/
- Global Investigative Journalism Network. (2025, December 17). Deepfake geography: How AI can now falsify satellite images. https://gijn.org/stories/deepfake-geography-how-ai-can-now-falsify-satellite-images/