Enhancing 2D VLA Models with 3D Spatial Awareness
Vision-Language-Action (VLA)
3D Diffusion Policy
DINOv2
Depth Anything v3
Robot Learning
MuJoCo
Research Objective
Investigate the practicality of converting single-view 2D RGB inputs into rich 3D scene representations for Vision-Language-Action (VLA) models. This work explores whether "software-only 3D" (pseudo-depth) can replace physical depth sensors, evaluates the performance gap between predicted and ground-truth 3D cues, and tests if VLAs can achieve superior geometric reasoning through lightweight spatial feature injection.
Key Features Achieved
- Dual Pipeline Architecture: Implemented two distinct methodologies: (1) generating pseudo-3D point clouds via Depth Anything v3 to drive a 3D Diffusion Policy, and (2) injecting spatially-aware semantic features from DINOv2 backbones.
- Pseudo-3D Scene Reconstruction: Developed a pipeline to unproject monocular depth maps into 3D point clouds, enabling 3D-native policies to operate on standard RGB video data (see the unprojection sketch after this list).
- Semantic Feature Injection: Used DINOv2 as a high-resolution feature encoder, achieving superior performance on dynamic tasks without explicit 3D reconstruction (see the feature-extraction sketch after this list).
- Cross-Resolution Evaluation: Benchmarked model robustness across multiple input resolutions (128×128 to 512×512) and point cloud densities.
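The unprojection step can be illustrated with a minimal sketch. It assumes a pinhole camera model with known intrinsics (fx, fy, cx, cy) and a metric depth map; the actual Depth Anything v3 inference call and the project's camera calibration are not shown, and the helper names (`unproject_depth`, `sample_points`) are hypothetical.

```python
# Illustrative sketch: unproject a monocular depth map into a camera-frame point cloud.
# Intrinsics and depth scale are assumptions; the depth-prediction step is not shown.
import numpy as np

def unproject_depth(depth: np.ndarray, fx: float, fy: float, cx: float, cy: float) -> np.ndarray:
    """Convert an (H, W) depth map in meters into an (N, 3) point cloud in the camera frame."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))        # pixel coordinates (column, row)
    z = depth
    x = (u - cx) * z / fx                                  # back-project with the pinhole model
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)   # flatten to N x 3
    return points[points[:, 2] > 0]                        # drop invalid / zero-depth pixels

def sample_points(points: np.ndarray, n: int = 1024, seed: int = 0) -> np.ndarray:
    """Downsample to a fixed point budget before feeding a 3D-native policy."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(points), size=min(n, len(points)), replace=False)
    return points[idx]
```

The feature-injection side can be sketched in a similar spirit. The torch.hub entrypoint and the `forward_features` output key follow standard DINOv2 usage; how the resulting patch grid is fused into the VLA policy (projection layer, token concatenation, etc.) is an assumption not specified by this summary, and `dinov2_patch_features` is a hypothetical helper.

```python
# Illustrative sketch: extract DINOv2 patch features as spatial cues for a policy head.
import torch

model = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
model.eval()

@torch.no_grad()
def dinov2_patch_features(rgb: torch.Tensor) -> torch.Tensor:
    """rgb: (B, 3, H, W) with H, W multiples of 14; returns a (B, H/14, W/14, C) feature grid."""
    out = model.forward_features(rgb)
    tokens = out["x_norm_patchtokens"]        # (B, N_patches, C) normalized patch tokens
    b, n, c = tokens.shape
    side = int(n ** 0.5)                      # assumes a square patch grid
    return tokens.reshape(b, side, side, c)

# Example: a 224x224 input yields a 16x16 grid of 384-dim features for ViT-S/14.
feats = dinov2_patch_features(torch.randn(1, 3, 224, 224))
```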
Challenges & Learnings
- Monocular Depth Artifacts: Observed that artifacts in high-resolution monocular depth estimates can degrade performance on geometrically complex tasks relative to lower-resolution, smoother predictions.
- Geometric Precision Gap: While software-only 3D pipelines excel at planar tasks, high-precision contact tasks such as "Assembly" remain sensitive to the fidelity of the predicted depth.
- Overfitting Risks: Identified a tendency for policies to overfit to perfect simulator depth, necessitating noise-robust representations for real-world transfer (a sketch of one such augmentation follows this list).
- Computational Efficiency: Learned the trade-offs between dense 3D representation and lightweight semantic cueing for real-time robot control.
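One way to counter overfitting to clean simulator depth is to corrupt the depth maps during training. The noise model below (depth-proportional speckle plus random pixel dropout) and its magnitudes are illustrative assumptions, not the exact scheme used in this project.

```python
# Hedged sketch: perturb clean (H, W) simulator depth to mimic real sensor noise.
import numpy as np

def corrupt_depth(depth: np.ndarray, rng: np.random.Generator,
                  speckle_std: float = 0.01, dropout_p: float = 0.02) -> np.ndarray:
    """Apply depth-proportional Gaussian noise and random missing-pixel dropout."""
    noisy = depth * (1.0 + rng.normal(0.0, speckle_std, size=depth.shape))  # speckle noise
    mask = rng.random(depth.shape) < dropout_p                              # random holes
    noisy[mask] = 0.0
    return noisy
```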
Quantitative Results
- Planar Task Success: Achieved a 100% success rate on the "Plate-Slide-Side" task using high-resolution monocular depth.
- Dynamic Task Performance: Demonstrated an 85% success rate on "Hammer" and 80% on "Basketball" tasks using the DINOv2 semantic feature pipeline.
- Resolution Stability: Maintained a 90% success rate on planar tasks even at low resolution (128×128), highlighting the efficacy of lightweight 3D cues.
- Benchmark Comparison: Evaluated against native 3D Diffusion Policy (DP3) baselines to quantify the "software-only" performance gap.