
Enhancing 2D VLA Models with 3D Spatial Awareness

Timeline: Fall 2025
Role: Robot Learning & Perception Researcher
Team: 2 Members
Focus: 3D Spatial Reasoning for VLA Models
Tags: Vision-Language-Action (VLA) · 3D Diffusion Policy · DINOv2 · Depth Anything v3 · Robot Learning · MuJoCo
[Figure: DINOv2 pipeline architecture]
[Figure: Ground truth vs. predicted depth comparison]
[Figure: Task rollout success rates]

Research Objective

Investigate the practicality of converting single-view 2D RGB inputs into rich 3D scene representations for Vision-Language-Action (VLA) models. This work explores whether "software-only 3D" (pseudo-depth) can replace physical depth sensors, evaluates the performance gap between predicted and ground-truth 3D cues, and tests whether lightweight spatial feature injection can improve the geometric reasoning of VLAs.

Key Features Achieved

  • Dual Pipeline Architecture: Implemented two distinct methodologies: (1) generating pseudo-3D point clouds via Depth Anything v3 to drive a 3D Diffusion Policy, and (2) injecting spatially-aware semantic features from DINOv2 backbones.
  • Pseudo-3D Scene Reconstruction: Developed a pipeline to unproject monocular depth maps into 3D point clouds, enabling 3D-native policies to operate on standard RGB video data (see the unprojection sketch after this list).
  • Semantic Feature Injection: Used DINOv2 as a high-resolution feature encoder, achieving superior performance on dynamic tasks without explicit 3D reconstruction (see the feature-extraction sketch after this list).
  • Cross-Resolution Evaluation: Benchmarked model robustness across multiple input resolutions (128×128 to 512×512) and point cloud densities.
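
To make the unprojection step concrete, below is a minimal sketch of turning a predicted depth map into a point cloud under a pinhole camera model. The function name `depth_to_pointcloud` and the intrinsics `fx`, `fy`, `cx`, `cy` are illustrative assumptions, not the project's actual code.

```python
import numpy as np

def depth_to_pointcloud(depth: np.ndarray, fx: float, fy: float,
                        cx: float, cy: float) -> np.ndarray:
    """Unproject an HxW metric depth map into an (N, 3) point cloud.

    Assumes a pinhole camera: a pixel (u, v) with depth z maps to
    X = (u - cx) * z / fx,  Y = (v - cy) * z / fy,  Z = z.
    """
    h, w = depth.shape
    # u varies along columns, v along rows (both shaped (h, w)).
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    # Drop invalid pixels (zero, negative, or non-finite predictions).
    valid = np.isfinite(points).all(axis=1) & (points[:, 2] > 0)
    return points[valid]
```

Note that monocular estimators often predict depth only up to an unknown scale and shift, so in practice the cloud may need metric rescaling, and it is typically downsampled to a fixed point count before a 3D policy consumes it.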
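For the feature-injection branch, one plausible way to obtain a dense spatial feature grid from DINOv2 is through its official `torch.hub` entry point; `extract_patch_features` below is a hypothetical helper under assumed input conventions, not the project's code.

```python
import torch

# Load a small DINOv2 backbone from the official hub (patch size 14,
# so input H and W must be multiples of 14).
dinov2 = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
dinov2.eval()

@torch.no_grad()
def extract_patch_features(images: torch.Tensor) -> torch.Tensor:
    """Return a (B, H/14, W/14, C) grid of per-patch DINOv2 features.

    `images` is a (B, 3, H, W) batch, ImageNet-normalized, with H and W
    divisible by the patch size 14.
    """
    b, _, h, w = images.shape
    out = dinov2.forward_features(images)
    tokens = out["x_norm_patchtokens"]  # (B, N, C), N = (H/14) * (W/14)
    return tokens.reshape(b, h // 14, w // 14, -1)
```

Reshaping the patch tokens back into a 2D grid preserves their spatial layout, which is what lets these features act as spatially-aware cues for the policy without any explicit 3D reconstruction.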

Challenges & Learnings

  • Monocular Depth Artifacts: Observed that artifacts in high-resolution monocular depth estimation can cause performance degradation in geometrically complex tasks compared to lower-resolution, smoother predictions.
  • Geometric Precision Gap: While software-only 3D pipelines excel at planar tasks, high-precision contact tasks such as "Assembly" remain sensitive to the fidelity of predicted depth.
  • Overfitting Risks: Identified a tendency for policies to overfit to perfect simulator depth, necessitating noise-robust representations for real-world transfer (see the depth-corruption sketch after this list).
  • Computational Efficiency: Learned the trade-offs between dense 3D representation and lightweight semantic cueing for real-time robot control.
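
As one illustration of building such noise robustness, clean simulator depth can be corrupted during training so the policy never sees perfect geometry. This is a sketch under assumed noise characteristics; `corrupt_depth` and its parameters are hypothetical, not the project's actual augmentation.

```python
import numpy as np

def corrupt_depth(depth: np.ndarray, rng: np.random.Generator,
                  noise_std: float = 0.01, dropout_p: float = 0.02) -> np.ndarray:
    """Perturb a clean simulator depth map for training-time augmentation.

    Adds per-pixel Gaussian noise scaled by depth (roughly mimicking how
    estimation error grows with distance) and zeroes random pixels to
    imitate holes in predicted depth.
    """
    noisy = depth + rng.normal(0.0, noise_std, size=depth.shape) * depth
    holes = rng.random(depth.shape) < dropout_p
    noisy[holes] = 0.0
    return np.clip(noisy, 0.0, None)
```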

Quantitative Results