From Backflips to Folding Laundry: How X Square Robot Is Building the Missing 'Brain' for Embodied AI

The Real Challenge Isn't Hardware Anymore

While robotics companies around the world continue to showcase humanoids performing backflips, running obstacle courses, and dancing on stage, one Chinese firm is pursuing a more difficult and arguably more consequential goal: teaching robots to operate in the messy, unpredictable environments where people actually live and work. According to X Square Robot founder and CEO Wang Qian, the industry's hardware foundations are largely in place. Humanoid locomotion, dexterous hands, and force-control systems have all advanced rapidly. The remaining challenge is intelligence. "The hardware is largely there," Wang said. "The real bottleneck is the brain."

X Square Robot's Three-Part Strategy

To address that gap, X Square Robot has open-sourced three technologies over the past several weeks:

- Wall-OSS-0.5, a Vision-Language-Action (VLA) model
- WALL-WM, a World Action Model designed to understand physical events
- XRZero-G0, a robot-free data collection and training framework aimed at dramatically reducing data costs

Together, these projects target some of the biggest challenges in embodied AI.

Can Pretraining Teach Robots Real Skills?

VLA models have become one of embodied AI's dominant approaches, but a fundamental question remains: does pretraining itself teach robots useful skills, or is it merely preparation for task-specific fine-tuning? Wall-OSS-0.5 was designed to answer that question. Rather than evaluating a fine-tuned model, X Square Robot deployed the pretrained model directly on physical robots and tested it across 17 real-world tasks. The system achieved strong zero-shot performance in object sorting, ring stacking, and even deformable-object manipulation.

Inside Wall-OSS-0.5's Unified Learning Architecture

At the core of the model is a "gradient-bridged" training framework. Instead of separating perception and control into different modules, Wall-OSS-0.5 converts robot actions into action tokens that are learned alongside language and visual representations during pretraining. This allows perception, language understanding, and action generation to evolve within a unified model. The company found that action training not only improved manipulation ability but also enhanced visual grounding performance, suggesting that physical interaction can strengthen a model's understanding of the world.
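To make the idea of "action tokens" concrete, here is a minimal sketch of one common way to turn continuous robot actions into discrete tokens: uniform binning. The article does not describe Wall-OSS-0.5's actual tokenizer, so the bin count, the normalized action range, and the function names below are all illustrative assumptions.

```python
from typing import List

# Hypothetical settings, not taken from Wall-OSS-0.5.
NUM_BINS = 256                        # size of the assumed action-token vocabulary
ACTION_LOW, ACTION_HIGH = -1.0, 1.0   # assumed normalized range per action dimension


def action_to_tokens(action: List[float]) -> List[int]:
    """Map each continuous action dimension to a discrete bin index."""
    tokens = []
    for value in action:
        # Clip out-of-range values so they fall into the first or last bin.
        clipped = min(max(value, ACTION_LOW), ACTION_HIGH)
        fraction = (clipped - ACTION_LOW) / (ACTION_HIGH - ACTION_LOW)
        # The upper bound maps to NUM_BINS, so clamp it to the last valid bin.
        tokens.append(min(int(fraction * NUM_BINS), NUM_BINS - 1))
    return tokens


def tokens_to_action(tokens: List[int]) -> List[float]:
    """Recover an approximate action by taking the center of each bin."""
    width = (ACTION_HIGH - ACTION_LOW) / NUM_BINS
    return [ACTION_LOW + (t + 0.5) * width for t in tokens]


if __name__ == "__main__":
    tokens = action_to_tokens([-1.0, 0.0, 0.37, 1.0])
    print(tokens)
    print(tokens_to_action(tokens))
```

Once actions are integers like these, they can share a sequence with text and image tokens, which is what lets a single model be trained on all three at once. The round trip loses at most half a bin width per dimension.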

Why Imitation Learning Isn't Enough

While Wall-OSS-0.5 demonstrated the promise of VLA pretraining, X Square Robot believes imitation alone is not enough. Most VLA systems learn action trajectories but do not truly understand physical cause and effect. They can repeat behaviors seen during training but often struggle when confronted with unfamiliar situations. To address this limitation, the company introduced WALL-WM.

WALL-WM: Teaching Robots How the World Works

WALL-WM is a World Action Model that shifts learning from fixed action sequences to meaningful physical events such as reaching, grasping, lifting, and placing. Unlike traditional architectures that separate perception, language, and control, WALL-WM aligns visual observations, language descriptions, and actions around real-world events. The goal is to enable robots not only to act, but also to predict outcomes, reason about physical changes, and adapt when plans fail. According to the company, this approach represents a step toward robots that learn from experience and continuously improve their understanding of the physical world.
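As a rough illustration of what organizing data around events rather than fixed action sequences might look like, the sketch below groups a trajectory into contiguous events such as reaching, grasping, lifting, and placing. It assumes each timestep already carries an event label, which is a simplification made here; the article does not say how WALL-WM derives or represents events, and the `Event` class and `segment_into_events` function are hypothetical names.

```python
from dataclasses import dataclass
from itertools import groupby
from typing import List, Sequence, Tuple

Action = Tuple[float, ...]


@dataclass
class Event:
    """One physical event: a label plus the timesteps and actions it spans."""
    label: str
    start: int            # first timestep of the event (inclusive)
    end: int              # one past the last timestep (exclusive)
    actions: List[Action]


def segment_into_events(step_labels: Sequence[str],
                        actions: Sequence[Action]) -> List[Event]:
    """Group consecutive timesteps that share a label into a single event."""
    if len(step_labels) != len(actions):
        raise ValueError("step_labels and actions must have the same length")
    events: List[Event] = []
    index = 0
    for label, run in groupby(step_labels):
        length = len(list(run))
        events.append(Event(label, index, index + length,
                            list(actions[index:index + length])))
        index += length
    return events


if __name__ == "__main__":
    labels = ["reach", "reach", "grasp", "lift", "lift", "lift", "place"]
    acts = [(float(i),) for i in range(len(labels))]
    for event in segment_into_events(labels, acts):
        print(event.label, event.start, event.end)
```

Each resulting event bundles a language-like label with the span of observations and actions it covers, which is the kind of alignment the paragraph above describes.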

Solving Embodied AI's Data Bottleneck with XRZero-G0

If world models are the brain, data remains the fuel. Collecting high-quality robot demonstrations is expensive, time-consuming, and difficult to scale. X Square Robot's answer is XRZero-G0, a hardware-software framework for robot-free data collection and training. The system combines wearable interfaces, multi-view sensing, automated quality inspection, and real-robot validation to improve data quality while lowering collection costs. In controlled experiments, X Square Robot found that combining ten robot-free demonstrations with one real-robot demonstration could achieve performance comparable to that of models trained entirely on real-robot data. The company has also released more than 2,000 hours of multimodal data covering roughly 3,000 tasks to support broader research in embodied AI.
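The ten-to-one finding can be sketched as a simple data-mixing step. The example below reads the claim as a 10:1 ratio of robot-free to real-robot demonstrations, which is an interpretation rather than something the article spells out. The function name, the source tags, and the choice to drop leftover demonstrations are assumptions for illustration and do not reflect XRZero-G0's actual pipeline.

```python
from typing import List, Sequence, Tuple

Tagged = Tuple[str, str]  # (source tag, demonstration id)


def build_mixed_dataset(robot_free: Sequence[str],
                        real_robot: Sequence[str],
                        ratio: int = 10) -> List[Tagged]:
    """Pair each real-robot demo with `ratio` robot-free demos.

    Demonstrations that cannot fill a complete group are left out, so the
    result always keeps the exact ratio.
    """
    if ratio < 1:
        raise ValueError("ratio must be at least 1")
    mixed: List[Tagged] = []
    for i, real_demo in enumerate(real_robot):
        chunk = robot_free[i * ratio:(i + 1) * ratio]
        if len(chunk) < ratio:
            break  # not enough robot-free demos left for a full group
        mixed.extend(("robot_free", demo) for demo in chunk)
        mixed.append(("real_robot", real_demo))
    return mixed


if __name__ == "__main__":
    free = [f"wearable_{i:03d}" for i in range(20)]
    real = [f"robot_{i:03d}" for i in range(3)]
    dataset = build_mixed_dataset(free, real)
    print(len(dataset), "demonstrations in the mixed set")
```

With 20 robot-free and 3 real-robot demonstrations, only two full groups can be formed, so the third real-robot demonstration is left out rather than diluting the ratio.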

Building the Missing Infrastructure for Embodied Intelligence

Together, the three releases address some of the most important challenges facing embodied AI. Wall-OSS-0.5 explores whether pretraining can directly produce transferable robot skills. WALL-WM examines how robots can model and reason about the physical world. XRZero-G0 tackles the data bottleneck that underpins both approaches. As a set, they form a full-stack framework spanning data, world models, and robot foundation models. For Wang, the industry's defining moment may be closer than many expect. The challenge is no longer teaching robots how to move, but teaching them how to understand the world they navigate. "The Aha Moment for embodied intelligence," he said, "may be much closer than people think."

GitHub pages:

- Wall-OSS-0.5: https://github.com/X-Square-Robot/wall-x
- WALL-WM: https://github.com/X-Square-Robot/wall-x
- XRZero-G0: https://github.com/X-Square-Robot/XRZero-G0
