Xiaomi has announced the open-source release of its first-generation Vision-Language-Action (VLA) model, Xiaomi-Robotics-0, marking a notable step forward in embodied artificial intelligence. The model contains 4.7 billion parameters and is designed to combine high-level vision-language understanding with real-time robotic execution.
Unlike earlier VLA systems, which often struggle with latency and fragmented motion, Xiaomi-Robotics-0 focuses on real-time responsiveness in physical environments. The model introduces a hybrid architecture that separates perception and reasoning from motor execution, allowing each component to run at its own optimal speed and precision.
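Xiaomi has not published implementation details alongside the announcement, but the general shape of such a split can be illustrated with a minimal Python sketch. All class names, method signatures, and dimensions below are illustrative assumptions, not Xiaomi's API: a slow perception-and-reasoning path produces a compact plan, and a fast motor path turns the latest plan into joint commands.

```python
# Hypothetical sketch of a two-component VLA split; names and sizes
# are illustrative assumptions, not Xiaomi's published interface.

from dataclasses import dataclass

import numpy as np


@dataclass
class LatentPlan:
    """Slow-path output: a compact goal representation for the fast path."""
    embedding: np.ndarray


class PerceptionReasoner:
    """Slow path: vision-language understanding, runs at a few Hz."""

    def plan(self, image: np.ndarray, instruction: str) -> LatentPlan:
        # A real system would run the VLM backbone here; we return a
        # placeholder embedding to keep the sketch self-contained.
        return LatentPlan(embedding=np.zeros(512, dtype=np.float32))


class MotorExecutor:
    """Fast path: maps the latest plan plus proprioception to joint
    commands, running at control rate (e.g. hundreds of Hz)."""

    def act(self, plan: LatentPlan, joint_state: np.ndarray) -> np.ndarray:
        # Placeholder policy: a real action head would decode the plan
        # into a short horizon of joint-space actions.
        return np.zeros_like(joint_state)


reasoner = PerceptionReasoner()
executor = MotorExecutor()
plan = reasoner.plan(np.zeros((224, 224, 3)), "fold the towel")
command = executor.act(plan, joint_state=np.zeros(7))
```

The point of the split is that the expensive vision-language step no longer sits on the control loop's critical path, which is what makes the real-time behavior described above possible.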
A key technical innovation is the asynchronous inference mechanism, which decouples model reasoning from robotic control. This enables robots to continue moving smoothly even while new inferences are being computed, avoiding the pauses and jitter commonly seen in real-world deployments.
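As a hedged illustration rather than Xiaomi's actual code, the sketch below shows one common way such decoupling is implemented: a background thread periodically refreshes a chunk of upcoming actions, while the control loop reads the most recent chunk at full rate and never blocks on the model. The names, rates, and dimensions are assumptions introduced for this example.

```python
# Hypothetical sketch of asynchronous inference: the control loop keeps
# executing the latest action chunk while a background thread computes
# the next one. All names, rates, and shapes are assumptions.

import threading
import time

import numpy as np


class AsyncPolicy:
    def __init__(self):
        self._lock = threading.Lock()
        self._chunk = np.zeros((16, 7))  # placeholder: 16 steps x 7 joints
        self._step = 0
        threading.Thread(target=self._inference_loop, daemon=True).start()

    def _inference_loop(self):
        while True:
            new_chunk = self._run_model()  # slow: ~100 ms per call here
            with self._lock:
                self._chunk = new_chunk
                self._step = 0  # restart at the head of the fresh chunk

    def _run_model(self) -> np.ndarray:
        time.sleep(0.1)  # stand-in for model forward-pass latency
        return np.zeros((16, 7))

    def next_action(self) -> np.ndarray:
        """Called at control rate; never blocks on the model."""
        with self._lock:
            # Hold the last action if inference is slower than consumption,
            # so the arm keeps moving smoothly instead of pausing.
            action = self._chunk[min(self._step, len(self._chunk) - 1)]
            self._step += 1
        return action


policy = AsyncPolicy()
for _ in range(5):  # control loop ticking at, say, 50 Hz
    command = policy.next_action()
    time.sleep(0.02)
```

Because the control loop only ever takes a brief lock to read the current chunk, model latency shows up as slightly staler plans rather than as the pauses and jitter the article describes.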
In testing, the model demonstrated strong performance across multiple simulation benchmarks and completed complex physical tasks such as towel folding and block disassembly using a dual-arm robotic system. The results suggest improved generalization across both rigid and deformable objects.
By open-sourcing the model, Xiaomi aims to contribute to the broader robotics research ecosystem and reduce barriers for real-world experimentation with embodied AI on consumer-grade hardware.