InstaDrive, proposed by researchers at Shanghai Jiao Tong University, tackles two persistent problems in autonomous driving: tedious manual annotation and long-tail data scarcity. It projects 3D vehicle bounding boxes and BEV map elements into a unified 2D instance segmentation map that serves as the control condition, ensuring multi-view consistency through unified occlusion modeling and an order-invariant instance encoder. On nuScenes, it outperforms baselines in FID and mAP, and it supports precise editing of vehicles and map elements for efficient labeled-data generation.
Autonomous driving models rely heavily on massive, high-quality labeled data—but manual annotation is tedious, and real-world data often suffers from a long-tail distribution (rare but critical scenarios are scarce). Worse, generating multi-view street scenes (a must for surround-view cameras) often fails to maintain consistency across perspectives.
Enter InstaDrive—a novel method proposed by researchers from Shanghai Jiao Tong University that solves these pain points with innovative control logic and efficient generation. Let’s dive into its core innovations and impressive results!
Core Innovations: Reimagining Control for Street View Generation
InstaDrive's strength lies in unifying physical scene information into intuitive, consistent control conditions—here's how it breaks new ground:
1. 2D Instance Segmentation: The "Single Source of Truth" for Multi-View Consistency
Instead of disjoint control signals, InstaDrive projects 3D vehicle bounding boxes and BEV (Bird's-Eye View) vectorized map elements (e.g., lane lines, crosswalks) into a single 2D instance segmentation map. This ensures that all perspectives share the same spatial constraints—eliminating inconsistencies like mismatched lane line colors or misplaced vehicles across front/back views.
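To make this concrete, here is a minimal sketch of how such a projection-and-rasterization step might look, assuming a pinhole camera model and a 4x4 world-to-camera transform. The helper names (project_points, rasterize_instance) are illustrative, not the paper's API:

```python
import numpy as np
import cv2  # OpenCV, used here for polygon rasterization

def project_points(points_3d, K, T_cam_from_world):
    """Project Nx3 world-frame points into pixels with a pinhole camera model."""
    pts_h = np.hstack([points_3d, np.ones((len(points_3d), 1))])  # homogeneous coords
    pts_cam = (T_cam_from_world @ pts_h.T).T[:, :3]               # world -> camera frame
    pts_img = (K @ pts_cam.T).T                                   # camera -> image plane
    return pts_img[:, :2] / pts_img[:, 2:3], pts_cam[:, 2]        # (u, v) and depth

def rasterize_instance(seg_map, corners_3d, instance_id, K, T_cam_from_world):
    """Fill the projected 2D footprint of a 3D box with its instance ID."""
    uv, depth = project_points(corners_3d, K, T_cam_from_world)
    if (depth <= 0).all():
        return  # instance lies entirely behind this camera
    hull = cv2.convexHull(uv.astype(np.int32))                    # 2D convex footprint
    cv2.fillPoly(seg_map, [hull], color=int(instance_id))
```

Because every camera rasterizes the same 3D instances into the same ID space, the six views cannot disagree about which object sits where.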
2. Unified Occlusion Modeling in 2D
To mimic real-world visibility, InstaDrive layers projections: map elements first, then vehicles ordered from far to near. Closer vehicles naturally occlude distant ones in the 2D map, making the generated scenes physically plausible without extra computation.
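One plausible way to compose this painter's-algorithm ordering, reusing the rasterize_instance helper sketched above (the dict-based scene schema here is an assumption, not the paper's data format):

```python
import numpy as np
# rasterize_instance: the projection helper from the previous sketch

def build_control_map(image_hw, map_elements, vehicles, K, T_cam, ego_pos):
    """Compose one camera's instance map with painter's-algorithm occlusion."""
    seg = np.zeros(image_hw, dtype=np.int32)  # 0 = background

    # 1. Ground-level map elements first: everything else may occlude them.
    for elem in map_elements:
        rasterize_instance(seg, elem["corners_3d"], elem["instance_id"], K, T_cam)

    # 2. Vehicles sorted far-to-near: nearer boxes are drawn last,
    #    so they overwrite (i.e., occlude) more distant ones.
    for veh in sorted(vehicles,
                      key=lambda v: np.linalg.norm(v["center_3d"] - ego_pos),
                      reverse=True):
        rasterize_instance(seg, veh["corners_3d"], veh["instance_id"], K, T_cam)
    return seg
```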
3. Order-Invariant Instance Encoder
The model's custom encoder is insensitive to the ordering of instance IDs: swapping the IDs of two lane lines, for example, leaves the result unchanged. By aggregating instance features with max-pooling (inspired by PointNet), it produces the same encoding regardless of how instances are ordered—critical for reliable, repeatable generation. A minimal sketch follows.
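Here is a minimal PyTorch sketch of such a PointNet-style, order-invariant encoder; the class name and dimensions are hypothetical:

```python
import torch
import torch.nn as nn

class OrderInvariantInstanceEncoder(nn.Module):
    """PointNet-style encoder: a shared MLP per instance, then max-pooling.

    Max over the instance axis is a symmetric function, so permuting the
    instances (or relabeling their IDs) leaves the output unchanged.
    """
    def __init__(self, feat_dim: int, embed_dim: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, embed_dim),
            nn.ReLU(),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, inst_feats: torch.Tensor) -> torch.Tensor:
        # inst_feats: (batch, num_instances, feat_dim), in arbitrary order
        per_inst = self.mlp(inst_feats)        # same weights for every instance
        scene_code, _ = per_inst.max(dim=1)    # symmetric: ordering drops out
        return scene_code

# Sanity check: shuffling the instance axis does not change the encoding.
enc = OrderInvariantInstanceEncoder(feat_dim=16, embed_dim=32)
x = torch.randn(2, 8, 16)
assert torch.allclose(enc(x), enc(x[:, torch.randperm(8)]))
```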
4. Targeted Editing for Corner Cases
Need to test a scenario with no vehicles, missing lane lines, or specific obstacle placements? InstaDrive lets you edit vehicle/map element positions directly via input controls—enabling proactive creation of rare "corner cases" that are hard to capture in real data.
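One way such editing could look in code, under an assumed dict-based control schema (the paper does not prescribe this interface):

```python
def edit_scene(control, drop_vehicle_ids=(), drop_map_ids=(), relocated=None):
    """Return an edited copy of the control input before generation.

    `control` follows a hypothetical schema: {'vehicles': [...],
    'map_elements': [...]}, where each entry carries an 'instance_id' and its
    3D geometry. Because the edited scene is re-projected into every camera's
    instance map, each change propagates consistently to all surround views.
    """
    relocated = relocated or {}
    vehicles = []
    for v in control["vehicles"]:
        if v["instance_id"] in drop_vehicle_ids:
            continue                                      # delete a vehicle
        v = dict(v)
        if v["instance_id"] in relocated:
            v["center_3d"] = relocated[v["instance_id"]]  # move an obstacle
        vehicles.append(v)
    map_elements = [m for m in control["map_elements"]
                    if m["instance_id"] not in drop_map_ids]  # e.g., erase a lane line
    return {"vehicles": vehicles, "map_elements": map_elements}
```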
Experimental Results: Proven Superiority on nuScenes
Evaluated on the nuScenes dataset (850 training scenes, 150 validation scenes), InstaDrive outperforms state-of-the-art baselines, including MagicDrive, DriveDreamer, and Panacea, in both generation quality and controllability:
FID Score: Achieves a lower FID (13.47) than baselines (ranging from 14.9 to 25.54), indicating generated images are closer to real street views.
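For reference, FID compares Gaussian fits of Inception-feature statistics for real and generated images; lower means the generated distribution is closer to the real one:

$$\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \operatorname{Tr}\!\left(\Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2}\right)$$

where $(\mu_r, \Sigma_r)$ and $(\mu_g, \Sigma_g)$ are the mean and covariance of Inception features for real and generated images, respectively.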
When MapTR (a leading vectorized HD map construction model) is used to evaluate the generated data:
InstaDrive’s total mAP (mean Average Precision) reaches 0.189 (vs. MagicDrive’s 0.165) when matched to ground truth.
It also maintains better alignment with real-world data (total mAP = 0.122 vs. MagicDrive’s 0.108), proving its ability to preserve accurate map structures.
Qualitative editing tests confirm this controllability: removing all vehicle bounding boxes from the input yields vehicle-free scenes with ground elements intact.
Erasing a single lane-line instance removes that line consistently across all six surround-view perspectives, with no cross-view mismatches.
Why It Matters for Autonomous Driving
InstaDrive isn't just a generation tool—it's a data engine for autonomous driving:
Reduces Annotation Burden: Generates large-scale labeled data automatically, cutting down manual work.
Covers Long-Tail Scenarios: Enables targeted creation of rare but critical cases (e.g., construction zones, unusual lane configurations).
Guarantees Multi-View Consistency: Solves a key pain point for surround-view camera systems, making generated data usable for real-world model training.
Future Outlook
With its efficient, controllable design, InstaDrive paves the way for scalable, high-quality training data in autonomous driving. As generative models evolve, this method could integrate more dynamic elements (e.g., moving pedestrians, changing weather) to further enhance real-world relevance.
For researchers and engineers in autonomous driving, InstaDrive offers a practical solution to data scarcity and consistency—proving that unifying physical constraints with generative AI is the way forward.
The paper "InstaDrive: Street View Generation Based on the Unified Instance Segmentation Input of Vehicles and Map Elements" was published in Robot Learning.
Citation: Wang Q, Wang Y, Wang H. InstaDrive: street view generation based on the unified instance segmentation input of vehicles and map elements. Robot Learn. 2026(1):0004. https://doi.org/10.55092/rl20260004.