Min Aung Paing, USC SLURM Lab
OpenVLA is an open-source 7B-parameter vision-language-action (VLA) model trained on 970k robot episodes from the Open X-Embodiment dataset. It takes a single RGB image and a natural-language instruction and outputs a 7-dimensional robot action (end-effector pose delta plus a gripper command), represented as discrete tokens predicted autoregressively.
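For reference, a single inference step looks roughly like the snippet below, following the usage shown in the public OpenVLA README; the exact prompt template and the available `unnorm_key` options are assumptions that may vary between model revisions.

```python
# Minimal single-step OpenVLA inference via Hugging Face Transformers.
# Prompt format and unnorm_key follow the public OpenVLA README; treat this
# as a sketch rather than our exact deployment code.
import torch
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

processor = AutoProcessor.from_pretrained("openvla/openvla-7b", trust_remote_code=True)
vla = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).to("cuda:0")

image = Image.open("frame.png")  # one third-person RGB frame
instruction = "pick up red cylinder"
prompt = f"In: What action should the robot take to {instruction}?\nOut:"

inputs = processor(prompt, image).to("cuda:0", dtype=torch.bfloat16)
# Returns a 7-dimensional continuous action:
# (dx, dy, dz, droll, dpitch, dyaw, gripper), un-normalized with the
# statistics of the training dataset named by unnorm_key.
action = vla.predict_action(**inputs, unnorm_key="bridge_orig", do_sample=False)
print(action)
```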
Our goal was simple: “Can we deploy OpenVLA on our xArm7 robot, zero-shot, and have it perform real tasks?” The catch: no fine-tuning of any kind, and we wanted to see exactly what breaks.
Our setup used an xArm7 robotic arm with a static third-person camera. Tasks included reaching for colored cubes and following simple instructions like "pick up red cylinder".
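To make the deployment loop concrete, here is a rough sketch of the control loop: grab a frame from the static camera, query OpenVLA, and execute the predicted end-effector delta on the xArm7 through the xArm Python SDK. The controller IP, unit conversions, gripper mapping, and the `predict_action` wrapper are illustrative placeholders rather than our exact values.

```python
# Sketch of the closed-loop deployment: camera frame -> OpenVLA -> xArm7 delta move.
# The IP address, unit scaling, gripper thresholds, and predict_action() are
# placeholders for illustration, not exact values from our setup.
import math
import cv2
from PIL import Image
from xarm.wrapper import XArmAPI

arm = XArmAPI("192.168.1.1")       # placeholder controller IP
arm.motion_enable(True)
arm.set_mode(0)
arm.set_state(0)
arm.set_gripper_enable(True)

cam = cv2.VideoCapture(0)          # static third-person camera

def step(instruction):
    ok, frame = cam.read()
    image = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    # predict_action() wraps the OpenVLA inference snippet shown earlier.
    dx, dy, dz, droll, dpitch, dyaw, grip = predict_action(image, instruction)
    # Model deltas are (roughly) meters/radians; the xArm SDK expects mm/degrees.
    arm.set_position(
        x=dx * 1000, y=dy * 1000, z=dz * 1000,
        roll=math.degrees(droll), pitch=math.degrees(dpitch), yaw=math.degrees(dyaw),
        relative=True, wait=True,
    )
    arm.set_gripper_position(850 if grip > 0.5 else 0, wait=True)

for _ in range(100):               # fixed-horizon rollout
    step("pick up red cylinder")
```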
One key insight from our zero-shot deployment experiments was that OpenVLA only produced meaningful motions when our physical setup closely matched one of the training environments.
Initially, we tried different camera angles and workspace layouts while using the original normalization statistics from OpenVLA's training dataset. Despite correct prompts, the robot often produced random movements or froze mid-task.
Our initial setup (different camera angles and workspace layouts) → random motions
After analyzing the training dataset, we found a scene to replicate. Here's the training environment we aimed to match:
Target training environment from Open-X dataset
We then replicated this scene as closely as possible, matching the camera angle, background, and overall scene composition:
Our matched setup → meaningful motions
Result: After matching these characteristics, OpenVLA generated semantically correct motions (e.g., reaching toward the correct object and attempting a grasp). This highlights how sensitive foundation models are to camera angle, background, and scene composition, and why environment matching is crucial for zero-shot deployment.
Task: "reach orange cube" → successfully follows orange cube
We tested simulation tasks from the LIBERO benchmark, like “put ketchup bottle in basket”. Even without fine-tuning, OpenVLA understood semantics and generated reasonable motions:
Example: “Put ketchup bottle in basket” → grasp succeeded, placement succeeded
Example: “Put tomato sauce bottle in basket” → grasp succeeded, placement failed (robot did not move to basket)
Insight: Even in sim, zero-shot models show semantic grounding but lack fine motor reliability.
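For the simulation runs, our rollout loop looked roughly like the sketch below, adapted from the example usage in the LIBERO repository. The suite name, observation key, image flip, and the `predict_action` wrapper are assumptions that may differ across LIBERO versions, and the raw OpenVLA action may need remapping (e.g., gripper sign) to LIBERO's controller convention.

```python
# Rough sketch of a LIBERO rollout driven by OpenVLA, adapted from LIBERO's
# example usage. Suite/task IDs, observation keys, and preprocessing are
# assumptions and may differ across versions.
import os
from PIL import Image
from libero.libero import benchmark, get_libero_path
from libero.libero.envs import OffScreenRenderEnv

task_suite = benchmark.get_benchmark_dict()["libero_object"]()
task = task_suite.get_task(0)    # e.g. a "put ... in the basket" task
bddl = os.path.join(get_libero_path("bddl_files"), task.problem_folder, task.bddl_file)

env = OffScreenRenderEnv(bddl_file_name=bddl, camera_heights=256, camera_widths=256)
env.seed(0)
env.reset()
obs, _, done, _ = env.step([0.0] * 7)   # no-op step to get the first observation

steps = 0
while not done and steps < 300:
    # robosuite renders images upside-down, hence the vertical flip.
    frame = Image.fromarray(obs["agentview_image"][::-1])
    # predict_action() wraps the OpenVLA inference snippet shown earlier; its
    # 7-D output may need remapping to LIBERO's OSC controller convention.
    action = predict_action(frame, task.language)
    obs, reward, done, info = env.step(action)
    steps += 1
env.close()
```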
Zero-shot deployment produced mixed results:
Task: "Pick up gray cube"
Task: "Pick up red cylinder" → did not produce meaningful motion
Task: "reach the ketchup bottle" → follows ketchup bottle
Task vs Success Rate (zero-shot) – success is task-dependent and viewpoint-sensitive.
Insight: Matching the camera and background to the training dataset improved success rates.
While OpenVLA produced meaningful motions in certain tasks, we observed several failure modes during zero-shot deployment:
Wrong Object Selection: Prompted to “pick up blue cube” but grasped an orange cube.
Policy Freeze: Robot moved partially, then stopped mid-task with no recovery behavior.
Unreliable Motor Control: Prompted to "Grasp gray cube" but grasped the area around the gray cube.
These failure cases often stemmed from viewpoint sensitivity, environment mismatch, and action normalization statistics inherited from a training dataset that does not match our robot's workspace. Addressing these would likely require fine-tuning or additional sensory inputs.
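To make the normalization point concrete: OpenVLA predicts each action dimension in a normalized range and maps it back to metric deltas using per-dataset statistics, so statistics from a robot with a different workspace scale produce deltas that are wrong for the xArm7. The sketch below reflects our reading of the released de-normalization scheme; attribute and key names (e.g., `norm_stats`, `q01`, `q99`) are assumptions that may differ across versions.

```python
# Sketch of OpenVLA-style action un-normalization (our reading of the released
# code; key names such as "q01"/"q99" are assumptions). Predicted actions live
# in [-1, 1] and are mapped back to real deltas with per-dataset quantiles.
import numpy as np

def unnormalize(normalized_action, stats):
    q01 = np.asarray(stats["q01"])   # per-dimension 1st-percentile action value
    q99 = np.asarray(stats["q99"])   # per-dimension 99th-percentile action value
    # Map [-1, 1] back to the source dataset's action range, dimension by dimension.
    return 0.5 * (np.asarray(normalized_action) + 1.0) * (q99 - q01) + q01

# The released checkpoint ships one statistics entry per training dataset,
# selected via the unnorm_key argument of predict_action(); picking a dataset
# whose robot and workspace resemble yours is part of "environment matching".
```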
To help others reproduce our work and understand the technical details, we've made our code and documentation available:
OpenVLA shows promise as a generalist, open-source robotic policy. Zero-shot, it can generate semantically aligned motions and even succeed at simple reaching tasks. However, precise manipulation and multi-step tasks still require environment matching or fine-tuning.
As robotic foundation models mature, bridging the gap between semantic understanding and physical reliability will be key. Our next steps: fine-tune OpenVLA for xArm7 and test whether small datasets can unlock robust generalization.