Min Aung Paing, USC SLURM Lab
OpenVLA is an open-source 7B-parameter vision-language-action (VLA) model trained on 970k robot episodes from the Open X-Embodiment dataset. It takes a single RGB image and a natural-language instruction and outputs a 7-dimensional robot action (end-effector pose delta plus a gripper command), represented as discrete tokens predicted autoregressively.
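For reference, a single inference step looks roughly like the snippet below, following the usage shown in the public OpenVLA README; the exact prompt template and the available `unnorm_key` options are assumptions that may vary between model revisions.

```python
# Minimal single-step OpenVLA inference via Hugging Face Transformers.
# Prompt format and unnorm_key follow the public OpenVLA README; treat this
# as a sketch rather than our exact deployment code.
import torch
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

processor = AutoProcessor.from_pretrained("openvla/openvla-7b", trust_remote_code=True)
vla = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).to("cuda:0")

image = Image.open("frame.png")  # one third-person RGB frame
instruction = "pick up red cylinder"
prompt = f"In: What action should the robot take to {instruction}?\nOut:"

inputs = processor(prompt, image).to("cuda:0", dtype=torch.bfloat16)
# Returns a 7-dimensional continuous action:
# (dx, dy, dz, droll, dpitch, dyaw, gripper), un-normalized with the
# statistics of the training dataset named by unnorm_key.
action = vla.predict_action(**inputs, unnorm_key="bridge_orig", do_sample=False)
print(action)
```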
Our goal was simple: “Can we deploy OpenVLA on our xArm7 robot, zero-shot, and have it perform real tasks?” The catch: no fine-tuning of any kind, and we wanted to see exactly what breaks.
Our setup used an xArm7 robotic arm with a static third-person camera. Tasks included reaching for colored cubes and following simple instructions like "pick up red cylinder".
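To make the deployment loop concrete, here is a rough sketch of the control loop: grab a frame from the static camera, query OpenVLA, and execute the predicted end-effector delta on the xArm7 through the xArm Python SDK. The controller IP, unit conversions, gripper mapping, and the `predict_action` wrapper are illustrative placeholders rather than our exact values.

```python
# Sketch of the closed-loop deployment: camera frame -> OpenVLA -> xArm7 delta move.
# The IP address, unit scaling, gripper thresholds, and predict_action() are
# placeholders for illustration, not exact values from our setup.
import math
import cv2
from PIL import Image
from xarm.wrapper import XArmAPI

arm = XArmAPI("192.168.1.1")       # placeholder controller IP
arm.motion_enable(True)
arm.set_mode(0)
arm.set_state(0)
arm.set_gripper_enable(True)

cam = cv2.VideoCapture(0)          # static third-person camera

def step(instruction):
    ok, frame = cam.read()
    image = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    # predict_action() wraps the OpenVLA inference snippet shown earlier.
    dx, dy, dz, droll, dpitch, dyaw, grip = predict_action(image, instruction)
    # Model deltas are (roughly) meters/radians; the xArm SDK expects mm/degrees.
    arm.set_position(
        x=dx * 1000, y=dy * 1000, z=dz * 1000,
        roll=math.degrees(droll), pitch=math.degrees(dpitch), yaw=math.degrees(dyaw),
        relative=True, wait=True,
    )
    arm.set_gripper_position(850 if grip > 0.5 else 0, wait=True)

for _ in range(100):               # fixed-horizon rollout
    step("pick up red cylinder")
```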
One key insight from our zero-shot deployment experiments was that OpenVLA only produced meaningful motions when our physical setup closely matched one of the training environments.
Initially, we tried different camera angles and workspace layouts while using the original normalization statistics from OpenVLA's training dataset. Despite correct prompts, the robot often produced random movements or froze mid-task.
Our initial setup (different camera angles and workspace layouts) → random motions
After analyzing the training dataset, we found a scene to replicate. Here's the training environment we aimed to match:
Target training environment from Open-X dataset
We then replicated this scene as closely as possible, matching the camera angle, background, and overall scene composition:
Our matched setup → meaningful motions
Result: After matching these characteristics, OpenVLA generated semantically correct motions (e.g., reaching toward the correct object and attempting a grasp). This highlights how sensitive foundation models are to camera angle, background, and scene composition, and why environment matching is crucial for zero-shot deployment.
Task: "reach orange cube" → successfully follows orange cube
We tested simulation tasks from the LIBERO benchmark, like “put ketchup bottle in basket”. Even without fine-tuning, OpenVLA understood semantics and generated reasonable motions:
Example: “Put ketchup bottle in basket” → grasp succeeded, placement succeeded
Example: “Put tomato sauce bottle in basket” → grasp succeeded, placement failed (robot did not move to basket)
Insight: Even in sim, zero-shot models show semantic grounding but lack fine motor reliability.
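For the simulation runs, our rollout loop looked roughly like the sketch below, adapted from the example usage in the LIBERO repository. The suite name, observation key, image flip, and the `predict_action` wrapper are assumptions that may differ across LIBERO versions, and the raw OpenVLA action may need remapping (e.g., gripper sign) to LIBERO's controller convention.

```python
# Rough sketch of a LIBERO rollout driven by OpenVLA, adapted from LIBERO's
# example usage. Suite/task IDs, observation keys, and preprocessing are
# assumptions and may differ across versions.
import os
from PIL import Image
from libero.libero import benchmark, get_libero_path
from libero.libero.envs import OffScreenRenderEnv

task_suite = benchmark.get_benchmark_dict()["libero_object"]()
task = task_suite.get_task(0)    # e.g. a "put ... in the basket" task
bddl = os.path.join(get_libero_path("bddl_files"), task.problem_folder, task.bddl_file)

env = OffScreenRenderEnv(bddl_file_name=bddl, camera_heights=256, camera_widths=256)
env.seed(0)
env.reset()
obs, _, done, _ = env.step([0.0] * 7)   # no-op step to get the first observation

steps = 0
while not done and steps < 300:
    # robosuite renders images upside-down, hence the vertical flip.
    frame = Image.fromarray(obs["agentview_image"][::-1])
    # predict_action() wraps the OpenVLA inference snippet shown earlier; its
    # 7-D output may need remapping to LIBERO's OSC controller convention.
    action = predict_action(frame, task.language)
    obs, reward, done, info = env.step(action)
    steps += 1
env.close()
```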
Zero-shot deployment produced mixed results:
Task: "Pick up gray cube"
Task: "Pick up red cylinder" → did not produce meaningful motion
Task: "reach the ketchup bottle" → follows ketchup bottle
Task vs Success Rate (zero-shot) – success is task-dependent and viewpoint-sensitive.
Insight: Matching the camera and background to the training dataset improved success rates.
While OpenVLA produced meaningful motions in certain tasks, we observed several failure modes during zero-shot deployment:
Wrong Object Selection: Prompted to “pick up blue cube” but grasped an orange cube.
Policy Freeze: Robot moved partially, then stopped mid-task with no recovery behavior.
Unreliable Motor Control: Prompted to "Grasp gray cube" but grasped the area around the gray cube.
These failure cases often stemmed from viewpoint sensitivity, environment mismatch, and action normalization statistics inherited from a training dataset that does not match our robot's workspace. Addressing these would likely require fine-tuning or additional sensory inputs.
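To make the normalization point concrete: OpenVLA predicts each action dimension in a normalized range and maps it back to metric deltas using per-dataset statistics, so statistics from a robot with a different workspace scale produce deltas that are wrong for the xArm7. The sketch below reflects our reading of the released de-normalization scheme; attribute and key names (e.g., `norm_stats`, `q01`, `q99`) are assumptions that may differ across versions.

```python
# Sketch of OpenVLA-style action un-normalization (our reading of the released
# code; key names such as "q01"/"q99" are assumptions). Predicted actions live
# in [-1, 1] and are mapped back to real deltas with per-dataset quantiles.
import numpy as np

def unnormalize(normalized_action, stats):
    q01 = np.asarray(stats["q01"])   # per-dimension 1st-percentile action value
    q99 = np.asarray(stats["q99"])   # per-dimension 99th-percentile action value
    # Map [-1, 1] back to the source dataset's action range, dimension by dimension.
    return 0.5 * (np.asarray(normalized_action) + 1.0) * (q99 - q01) + q01

# The released checkpoint ships one statistics entry per training dataset,
# selected via the unnorm_key argument of predict_action(); picking a dataset
# whose robot and workspace resemble yours is part of "environment matching".
```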
To help others reproduce our work and understand the technical details, we've made our code and documentation available:
OpenVLA shows promise as a generalist, open-source robotic policy. Zero-shot, it can generate semantically aligned motions and even succeed at simple reaching tasks. However, precise manipulation and multi-step tasks still require environment matching or fine-tuning.
As robotic foundation models mature, bridging the gap between semantic understanding and physical reliability will be key. Our next steps: fine-tune OpenVLA for xArm7 and test whether small datasets can unlock robust generalization.