The era of "tripartite dominance" in robot data has arrived; fragmentation is a thing of the past.

The period of wild, unregulated growth in robot learning data has come to an end.
The days when datasets sprang up everywhere and operated in isolation are over. By mid-2025, the open-source robotics ecosystem has settled into a clear three-way pattern: Open X-Embodiment (OXE), LeRobot, and InternData-A1. These three ecosystems define the current benchmarks for robot foundation models, and most standalone datasets from before 2023 have long since lost their competitiveness.
From Fragmentation to Unification: It’s No Coincidence
The evolution of robot datasets on the Hugging Face Hub shows an industry shifting from specialized, fragmented, institution-specific collections toward large-scale, standardized, community-driven unified ecosystems.
This transformation is no accident, nor was it imposed from above. Fundamentally, training general-purpose robot policies requires scale and standardization; from an ecosystem standpoint, projects with strong tooling support that align with mainstream frameworks naturally attract more developers.
The Three Ecosystems Show Their Strengths
OXE: The ImageNet Moment for Robotics
Open X-Embodiment is a consortium launched at the end of 2023 by 34 leading robotics laboratories. It’s not a single dataset but a large integration of over 60 existing datasets under a unified architecture.
Numbers speak:
Over 1 million real-world trajectories
Coverage of 22 different robot forms (from industrial arms to quadrupeds and mobile manipulators)
All data converted into the TensorFlow-based RLDS (Reinforcement Learning Datasets) standard format
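To make the unified format concrete, here is a minimal sketch of the RLDS-style episode layout that OXE standardizes on. This is an illustrative mock, not the real RLDS API: actual RLDS episodes are TFDS datasets of nested tensors, and the field names below (image shape, 7-DoF action) are simplifying assumptions.

```python
# Illustrative sketch of an RLDS-style episode: a dict of episode metadata
# plus a sequence of steps, each holding an observation/action pair.

def make_episode(num_steps):
    """Build a mock episode in the nested steps layout RLDS uses."""
    return {
        "steps": [
            {
                "observation": {"image": f"frame_{t}.jpg", "state": [0.0] * 7},
                "action": [0.1] * 7,          # hypothetical 7-DoF command
                "is_first": t == 0,
                "is_last": t == num_steps - 1,
            }
            for t in range(num_steps)
        ],
        "episode_metadata": {"robot": "example_arm", "dataset": "example"},
    }

def flatten_to_transitions(episode):
    """Turn an episode into (observation, action) pairs for policy training."""
    return [(s["observation"], s["action"]) for s in episode["steps"]]

episode = make_episode(5)
transitions = flatten_to_transitions(episode)
print(len(transitions))  # 5
```

The value of the shared schema is exactly this: once 60+ datasets expose the same steps/observation/action structure, one flattening function feeds them all into the same training loop.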
The key takeaway is straightforward: simple manipulation has already been commoditized. Basic tasks such as pick-and-place, drawer opening, and single-arm assembly are "solved" at the data level. The era of making money by selling basic teleoperation data is over. Future commercial value lies in high-precision expert data, long-horizon tasks in real household environments, and scarce embodiments (humanoid, soft-bodied).
LeRobot: The Standard Answer in the PyTorch Camp
Unlike the Google/TensorFlow research lineage that OXE represents, Hugging Face's LeRobot has quickly become the de facto standard for the broader open-source community, especially PyTorch users.
The ecosystem’s killer feature is the complete stack: datasets + models + training code + evaluation tools.
An innovation worth noting is storage: LeRobot Dataset v3.0 uses Apache Parquet + compressed MP4/AV1, improving storage efficiency by 5-10 times and significantly speeding up loading.
Flagship datasets include:
DROID 1.0.1: About 76,000 trajectories contributed by over 50 collection teams, deliberately gathered "in the wild" to maximize real-world variability
Aloha Series: High-precision dual-arm and mobile dual-arm datasets
Practical insight: data delivery standards have permanently shifted to Parquet + MP4. Any commercial provider still shipping ROS bags or raw video is simply imposing unnecessary technical burden on its customers.
The “Counterattack” of Synthetic Data: InternData-A1
The third force comes from large-scale high-fidelity synthetic data. Shanghai AI Laboratory’s InternData-A1 represents the latest progress in this direction:
Scale: 630,000 trajectories, equivalent to 7,433 hours of robot work
Physical Diversity: Not just rigid objects, but also articulated objects, fluids, particles, and deformable materials (cloth, ropes, etc.)
The Reality Gap: The Ceiling of Synthetic Data
But here is the critical caveat: synthetic data keeps improving, yet it is not a panacea.
A comprehensive survey in October 2025 found that despite significant engineering progress, the core differences between simulation and reality have not been eliminated, only compressed into narrower but still crucial domains.
Main challenges include:
Dynamics Gap: Even the best physics engines of 2025 struggle with chaotic phenomena, deformable objects, thin-shell objects (e.g., fabric bending and wrinkle memory), and numerical integration error. Policies that perform well in simulation can fail on real contact-rich tasks.
Perception and Sensing Gap: Although synthetic rendering has reached photo-realism, systematic artifacts remain: imperfect camera-defect models, missing subsurface scattering, glare, dust, and so on.
Execution and Control Gap: Real robots have hidden controller behavior that drifts over time, requiring per-robot calibration and fine-tuning.
Systemic Environmental Gap: Safety controllers, communication delays, and unmodeled floor compliance are difficult to reproduce accurately in simulation.
Reported results show that current foundation models (RT-2-X, Octo, etc.) often see success rates drop by 40-80% when transferred from simulation to real robots, especially on deformable, contact-rich, and long-horizon tasks.
Real-World Data Has Not Been Displaced
Despite progress in large-scale domain randomization, residual modeling, and hybrid training (90-99% synthetic + 1-10% real), the bottom line in 2025 is: zero-shot sim-to-real transfer remains limited to moderately complex rigid-body tasks and controlled environments.
For applications involving deformable objects, fluids, high-precision assembly, or unstructured household operations, real-world data—especially high-quality expert demonstrations—still holds irreplaceable value.
What does this mean for data providers? The business opportunities from 2026 to 2028 lie in hybrid schemes that combine large-scale synthetic data with carefully selected real trajectories, particularly in “challenging” domains (cloth, liquids, cluttered scenes, multi-step reasoning). Pure synthetic data will not be sufficient to support production-level deployment in the foreseeable future.
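The hybrid recipe mentioned above (90-99% synthetic plus 1-10% real) comes down to controlling the mixing ratio when sampling training batches. Below is a minimal sketch under assumed names; real pipelines typically weight at the dataset-sampler level and may upweight the scarce real data further, but the mechanics are the same.

```python
import random

def mixed_batch(synthetic, real, real_fraction, batch_size, rng):
    """Sample a training batch mixing synthetic and real trajectories.

    real_fraction follows the hybrid recipes discussed in the text,
    e.g. 0.01-0.10 real with the remainder synthetic.
    """
    n_real = max(1, round(batch_size * real_fraction))  # never drop real data entirely
    batch = [rng.choice(real) for _ in range(n_real)]
    batch += [rng.choice(synthetic) for _ in range(batch_size - n_real)]
    rng.shuffle(batch)  # interleave so gradients see both sources per step
    return batch

rng = random.Random(0)
synthetic = [f"sim_{i}" for i in range(1000)]   # plentiful simulated trajectories
real = [f"real_{i}" for i in range(20)]         # scarce, expensive expert demos

batch = mixed_batch(synthetic, real, real_fraction=0.05, batch_size=256, rng=rng)
print(sum(t.startswith("real_") for t in batch))  # 13 real samples (≈5% of 256)
```

Note that the 20 real trajectories are sampled with replacement, so each real demonstration is revisited far more often than any single simulated one; that oversampling of scarce real data is precisely where its commercial value concentrates.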
Postscript: From “Which Dataset” to “How to Mix”
The convergence of OXE, LeRobot, and InternData-A1 marks the true end of the fragmentation era in robot learning data. We have entered a "post-dataset" phase, where the key question is no longer which dataset to use, but rather:
How to most effectively blend real, synthetic, and distilled data?
How should metadata be designed to survive model distillation?
Which embodied and physical phenomena remain critical bottlenecks?
The winners in the next 2-3 years will be those who can produce high-quality, standardized data while maintaining an advantage in collecting real data in the increasingly narrow “challenging” domains.