From Backflips to Folding Laundry: How X Square Robot Is Building the Missing 'Brain' for Embodied AI
Robotics & AI News • OriginOfBots
Published: June 17, 2026
Author: Origin Of Bots Editorial Team

The Real Challenge Isn't Hardware Anymore
While robotics companies around the world continue to showcase humanoids performing backflips, running obstacle courses, and dancing on stage, one Chinese firm is pursuing a more difficult and arguably more consequential goal: teaching robots to operate in the messy, unpredictable environments where people actually live and work. According to X Square Robot founder and CEO Wang Qian, the industry's hardware foundations are largely in place. Humanoid locomotion, dexterous hands, and force-control systems have all advanced rapidly. The remaining challenge is intelligence. "The hardware is largely there," Wang said. "The real bottleneck is the brain."
X Square Robot's Three-Part Strategy
To address that gap, X Square Robot has open-sourced three technologies over the past several weeks:

Wall-OSS-0.5, a Vision-Language-Action (VLA) model
WALL-WM, a World Action Model designed to understand physical events
XRZero-G0, a robot-free data collection and training framework aimed at dramatically reducing data costs

Together, these projects target some of the biggest challenges in embodied AI.
Can Pretraining Teach Robots Real Skills?
VLA models have become one of embodied AI's dominant approaches, but a fundamental question remains: does pretraining itself teach robots useful skills, or is it merely preparation for task-specific fine-tuning? Wall-OSS-0.5 was designed to answer that question. Rather than evaluating a fine-tuned model, X Square Robot deployed the pretrained model directly on physical robots and tested it across 17 real-world tasks. The system achieved strong zero-shot performance in object sorting, ring stacking, and even deformable-object manipulation.

Inside Wall-OSS-0.5's Unified Learning Architecture
At the core of the model is a "gradient-bridged" training framework. Instead of separating perception and control into different modules, Wall-OSS-0.5 converts robot actions into action tokens that are learned alongside language and visual representations during pretraining. This allows perception, language understanding, and action generation to evolve within a unified model. The company found that action training not only improved manipulation ability but also enhanced visual grounding performance, suggesting that physical interaction can strengthen a model's understanding of the world.
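To make the idea of "action tokens" concrete, here is a minimal sketch of one common way to turn continuous robot actions into discrete tokens that can share a sequence with image and text tokens. The article does not describe how Wall-OSS-0.5 tokenizes actions, so the uniform binning scheme, the bin count, the action range, and the vocabulary offset below are all illustrative assumptions rather than the company's method.

```python
from typing import List, Sequence

# All constants are hypothetical values chosen for illustration only.
NUM_BINS = 256               # assumed number of discrete bins per action dimension
ACTION_LOW = -1.0            # assumed lower bound of a normalized action value
ACTION_HIGH = 1.0            # assumed upper bound of a normalized action value
ACTION_TOKEN_OFFSET = 32000  # assumed first token id reserved for actions


def action_to_tokens(action: Sequence[float]) -> List[int]:
    """Map each continuous action dimension to one discrete token id."""
    tokens = []
    for value in action:
        # Clip to the assumed range, then scale to [0, 1].
        clipped = min(max(value, ACTION_LOW), ACTION_HIGH)
        fraction = (clipped - ACTION_LOW) / (ACTION_HIGH - ACTION_LOW)
        # The top edge of the range falls into the last bin.
        bin_index = min(int(fraction * NUM_BINS), NUM_BINS - 1)
        tokens.append(ACTION_TOKEN_OFFSET + bin_index)
    return tokens


def tokens_to_action(tokens: Sequence[int]) -> List[float]:
    """Decode token ids back to the center value of each bin."""
    width = (ACTION_HIGH - ACTION_LOW) / NUM_BINS
    return [ACTION_LOW + (t - ACTION_TOKEN_OFFSET + 0.5) * width for t in tokens]


def build_sequence(text_tokens: Sequence[int],
                   image_tokens: Sequence[int],
                   action: Sequence[float]) -> List[int]:
    """Place image, text, and action tokens in one sequence for joint training."""
    return list(image_tokens) + list(text_tokens) + action_to_tokens(action)
```

With this kind of encoding, a single model can be trained with one objective over all three token types, which is one way the perception, language, and action pathways could end up sharing gradients. The decoding error is bounded by half a bin width, so the bin count trades vocabulary size against control precision.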

Why Imitation Learning Isn't Enough
While Wall-OSS-0.5 demonstrated the promise of VLA pretraining, X Square Robot believes imitation alone is not enough. Most VLA systems learn action trajectories but do not truly understand physical cause and effect. They can repeat behaviors seen during training but often struggle when confronted with unfamiliar situations. To address this limitation, the company introduced WALL-WM.

WALL-WM: Teaching Robots How the World Works
WALL-WM is a World Action Model that shifts learning from fixed action sequences to meaningful physical events such as reaching, grasping, lifting, and placing. Unlike traditional architectures that separate perception, language, and control, WALL-WM aligns visual observations, language descriptions, and actions around real-world events. The goal is to enable robots not only to act, but also to predict outcomes, reason about physical changes, and adapt when plans fail. According to the company, this approach represents a step toward robots that learn from experience and continuously improve their understanding of the physical world.
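As a rough illustration of what organizing data around events rather than fixed action sequences could look like, the sketch below groups a trajectory into segments by event label, so each segment carries its own label, time span, and actions. The article does not describe WALL-WM's data format or how events are detected; the per-step labels, the EventSegment type, and the segment_by_event helper are hypothetical.

```python
from dataclasses import dataclass
from itertools import groupby
from typing import List, Sequence


@dataclass
class EventSegment:
    """One contiguous stretch of a trajectory that shares an event label."""
    label: str                      # e.g. "reach", "grasp", "lift", "place"
    start: int                      # first step index, inclusive
    end: int                        # last step index, exclusive
    actions: List[Sequence[float]]  # actions recorded during the event


def segment_by_event(labels: Sequence[str],
                     actions: Sequence[Sequence[float]]) -> List[EventSegment]:
    """Group consecutive steps with the same label into event segments.

    Assumes one event label per time step is already available; producing
    those labels is outside the scope of this sketch.
    """
    if len(labels) != len(actions):
        raise ValueError("labels and actions must have the same length")
    segments = []
    index = 0
    for label, run in groupby(labels):
        length = len(list(run))
        segments.append(
            EventSegment(label, index, index + length,
                         list(actions[index:index + length]))
        )
        index += length
    return segments
```

A model trained on segments like these can be asked to predict what the scene should look like at the end of a "grasp" or "lift", which gives it a target to check against when a plan fails partway through.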

Solving Embodied AI's Data Bottleneck with XRZero-G0
If world models are the brain, data remains the fuel. Collecting high-quality robot demonstrations is expensive, time-consuming, and difficult to scale. X Square Robot's answer is XRZero-G0, a hardware-software framework for robot-free data collection and training. The system combines wearable interfaces, multi-view sensing, automated quality inspection, and real-robot validation to improve data quality while lowering collection costs. Through controlled experiments, X Square Robot found that combining ten robot-free demonstrations with one real-robot demonstration could achieve performance comparable to datasets built entirely from real-robot data. The company has also released more than 2,000 hours of multimodal data covering roughly 3,000 tasks to support broader research in embodied AI.
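The ten-to-one mix mentioned above can be sketched as a simple dataset-building step. The article reports only the ratio and the outcome, not how the demonstrations were sampled or combined, so the mix_demonstrations helper, its parameters, and the random sampling strategy below are assumptions for illustration.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def mix_demonstrations(robot_free: Sequence[T],
                       real_robot: Sequence[T],
                       robot_free_per_real: int = 10,
                       seed: int = 0) -> List[T]:
    """Build a training set with a fixed robot-free to real-robot ratio.

    Uses as many real-robot demonstrations as both pools can support at
    the requested ratio, then shuffles the combined set.
    """
    if robot_free_per_real <= 0:
        raise ValueError("robot_free_per_real must be positive")
    # The scarcer pool limits how many real demonstrations can be used.
    n_real = min(len(real_robot), len(robot_free) // robot_free_per_real)
    rng = random.Random(seed)
    chosen_real = rng.sample(list(real_robot), n_real)
    chosen_free = rng.sample(list(robot_free), n_real * robot_free_per_real)
    mixed = chosen_real + chosen_free
    rng.shuffle(mixed)
    return mixed
```

For example, with 25 robot-free demonstrations and 5 real-robot demonstrations, the helper keeps 2 real-robot and 20 robot-free demonstrations, since the robot-free pool can only support two full groups of ten.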

Building the Missing Infrastructure for Embodied Intelligence
The three releases address some of the most important challenges facing embodied AI. Wall-OSS-0.5 explores whether pretraining can directly produce transferable robot skills. WALL-WM examines how robots can model and reason about the physical world. XRZero-G0 tackles the data bottleneck that underpins both approaches. Taken together, they form a full-stack framework spanning data, world models, and robot foundation models.

For Wang, the industry's defining moment may be closer than many expect. The challenge is no longer teaching robots how to move, but teaching them how to understand the world they navigate. "The Aha Moment for embodied intelligence," he said, "may be much closer than people think."

GitHub pages:
Wall-OSS-0.5: https://github.com/X-Square-Robot/wall-x
WALL-WM: https://github.com/X-Square-Robot/wall-x
XRZero-G0: https://github.com/X-Square-Robot/XRZero-G0
