Data Flywheel? Repeated Samples? Robotics Should Say Goodbye to "Hours Worship"

Robotics researcher Animesh Garg, formerly an adjunct professor at the University of Toronto and now at Georgia Tech, compares the data race in embodied intelligence to baseball's "Moneyball" moment in an article titled "Moneyball for Physical AI."

He seeks to challenge an increasingly common funding narrative: that robotics companies only need to accumulate more teleoperation, more real-world deployments, and more operational hours to form a data flywheel. For investors, this is not an academic quibble. The cost structure, commercialization speed, and model moats of embodied intelligence companies are often packaged into the term "data closed loop." If cumulative hours do not equate to effective model progress, the market needs to reassess these companies' data assets.

"Data Hours" May Be the Batting Average Superstition of the Robotics Industry

Garg borrows the classic analogy from Moneyball. In 2002, the Oakland Athletics won 103 games with one of the lowest payrolls in the league, not by buying more expensive players, but by discovering the market had mispriced player value. Traditional scouts valued batting average, stolen bases, and swing mechanics, but the metric that better explained a team's scoring ability was on-base percentage.

In his view, Physical AI may be in a similar phase. The industry agrees that data is essential for general-purpose robot models, but it easily mistakes the metrics that are easiest to show off for the ones that matter most: accumulated teleoperation hours, number of demonstration trajectories, number of deployed robots, and operational hours in production scenarios.

The supply of robot data differs from that of text data. Large language models can obtain massive amounts of low-cost text from the internet, code repositories, books, and web pages, with bottlenecks mainly in compute, cleaning, and training efficiency. Robot models require data that involves physical interaction, action feedback, and environmental changes. Every hour of effective data must be genuinely created, incurring costs in equipment, labor, space, sensors, failure handling, and safety.

Robotics researcher Ken Goldberg once described the gap between robotics and internet-scale AI data as a "100k-year data gap." More precisely, the text and image data consumed by modern large vision-language models, if converted into human reading or viewing time, is equivalent to about 100,000 years, while robots lack the same scale of real-world interaction data. This statement does not set an exact threshold for robot models but serves as a reminder to the industry: real-world interaction data cannot be cheaply scraped like web text.

This is also why Garg opposes the "sweatshop-style teleoperation" narrative. Large-scale manual teleoperation can indeed produce action-dense training samples, but if companies evaluate data solely by total hours, funds may flow to repetitive, low-difficulty, low-information-density samples rather than the scenarios that best reduce failure rates.

Three Types of Data Buy Different Things

In Garg's classification, Physical AI data roughly falls into three categories: observation data, intervention data, and deployment data. All may be useful, but their costs, constraints, and information density vary significantly.

The first category is observation data, such as first-person or third-person videos. Its advantage is low cost and broad coverage, helping models understand objects, spaces, action outcomes, and environmental distributions. The shortcoming is clear: the model can see what happens to people or objects but may not know what action a robot should output in a given state.

The second category is intervention data, i.e., state-to-action trajectories generated through teleoperation, demonstration, and human intervention. This type of data is more direct for robot training because it captures the full chain: what the robot saw, what action was taken, and what happened next. The cost is that each high-quality trajectory must be paid for, and labor and equipment costs are unlikely to fall as quickly as the cost of software data.

The third category is deployment data, i.e., telemetry data generated when robots operate in real commercial scenarios. This sounds closest to a business flywheel: robots work, earn money, and produce training data simultaneously. But there is a statistical trap here.

The scenarios where robots are first deployed today are typically those with the least variation, the most fixed processes, and the most controllable risks, such as highly structured warehouses, factories, or single-task environments. The amount of such production data can be large, but the distribution is narrow and the repetition rate is high. Once the model learns local patterns, each additional hour of operation brings diminishing new information.

Deployment data is not worthless. What is truly valuable is often not the large number of "successful task" routine segments, but failures, stuck states, anomalous objects, boundary conditions, and rare perturbations. The problem is that these long-tail samples do not appear at the pace a company desires; the costs of discovery, screening, and review are also higher.

More Data Is Useful, But Repeated Samples Quickly Become Expensive

Garg borrows cautiously from language-model scaling laws: more data usually reduces model loss, but with diminishing returns. If samples are repeated, near-duplicate, or drawn from the same narrow distribution, the benefit of additional data shrinks even faster.

In robotics, this issue is more intuitive. A robot learning to pick a fixed box from a fixed shelf may find the first few thousand demonstrations, failures, and corrections very valuable. Once the actions, objects, lighting, and paths have been collected repeatedly, new data mostly duplicates local experience the model has already learned.
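The diminishing-returns argument can be made concrete with a toy curve. The sketch below assumes a power-law relation between unique samples and loss. The functional form, the constants `a`, `b`, and `floor`, and the `unique_fraction` parameter are illustrative assumptions, not figures from Garg's article or from any robot dataset.

```python
def loss(n_unique: float, a: float = 1.0, b: float = 0.3, floor: float = 0.1) -> float:
    """Hypothetical power-law curve: loss falls as unique samples accumulate."""
    return a * n_unique ** (-b) + floor


def marginal_gain(n_before: float, n_added: float, unique_fraction: float = 1.0) -> float:
    """Loss reduction from n_added samples when only unique_fraction of them are new."""
    n_after = n_before + n_added * unique_fraction
    return loss(n_before) - loss(n_after)


# The same 1,000 extra demonstrations, added at three different points.
early = marginal_gain(1_000, 1_000)                                 # task barely covered
late = marginal_gain(100_000, 1_000)                                # task well covered
late_repeated = marginal_gain(100_000, 1_000, unique_fraction=0.1)  # mostly repeats

for label, gain in [("early", early), ("late", late), ("late, 90% repeats", late_repeated)]:
    print(f"{label:>18}: loss reduction = {gain:.6f}")
```

Under these assumed constants, the same batch buys far less once the task is well covered, and less again when most of the batch repeats earlier samples.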

Similar experiences exist in language model training: repeated and near-duplicate data waste training budgets, and excessive repetition can harm generalization. Garg does not directly apply these conclusions to robot training but uses them to illustrate a direction: measuring data value cannot rely solely on quantity; one must also consider how different the samples are from each other.
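One way to account for how different samples are from each other is a greedy novelty filter. The sketch below assumes each trajectory has already been summarized as a fixed-length feature vector by some encoder, which the article does not specify. The function names and the `min_distance` threshold are hypothetical choices for illustration.

```python
import math
from typing import List, Sequence


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """1 - cosine similarity; treats a zero vector as maximally distant."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def select_novel(embeddings: List[Sequence[float]], min_distance: float = 0.05) -> List[int]:
    """Greedily keep a sample only if it is far enough from every sample kept so far."""
    kept: List[int] = []
    for i, emb in enumerate(embeddings):
        if all(cosine_distance(emb, embeddings[j]) >= min_distance for j in kept):
            kept.append(i)
    return kept


# Hypothetical trajectory embeddings: indices 1 and 3 are near-duplicates.
trajectories = [
    [1.0, 0.0, 0.0],
    [0.99, 0.01, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.98, 0.02],
    [0.0, 0.0, 1.0],
]
kept = select_novel(trajectories)
print(f"kept {len(kept)} of {len(trajectories)} trajectories: {kept}")
```

The greedy pass compares each sample against everything already kept, so it is quadratic in the worst case. That is acceptable for a sketch; a large dataset would likely need an approximate nearest-neighbor index instead.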

For Physical AI, diversity has at least two meanings. First, let the model see more objects, spaces, materials, lighting, occlusions, and manipulation methods. Second, avoid a model that performs well on an overly simplistic task distribution but fails when moved to slightly different scenarios.

Long-tail failure cases thus become critical. The real physical world is not uniformly distributed; low-frequency anomalies often determine commercial usability: an object placed slightly off, deformed packaging, reflective surfaces, gripper slipping, sudden human intervention, sensor occlusion, ground friction changes. No matter how well the model performs on routine samples, if it cannot handle these tail events, deployment will still be held back by a few failures.

The Deployment Flywheel Works Only If Early Scenarios Are Sufficiently "New"

What this article truly challenges is the common commercialization path of embodied intelligence companies: first deploy robots in narrow scenarios, use human remote intervention to ensure availability, collect production data, then train stronger models with this data to open up more scenarios.

Garg calls such paths "neo-integrator" thinking. It attempts to bypass pure data collection costs by placing robots in commercial production, letting operational revenue offset data costs. Compared to building a dedicated teleoperation factory, this path sounds more efficient.

But the flywheel has a prerequisite: the data generated by early commercial scenarios must be sufficiently new and diverse to help the model transfer to more tasks. If the deployment scenarios are just low-variation, low-entropy, heavily engineered custom tasks, data will quickly saturate. The company may end up not with a general capability flywheel, but a set of custom projects requiring continuous integration, maintenance, and exception handling.
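The "low-entropy" description can be given a rough number. The sketch below assumes every logged episode carries a scenario label and computes the Shannon entropy of the label distribution. The labels and counts are invented for illustration, and a label histogram is only a crude proxy for the diversity the article has in mind.

```python
import math
from collections import Counter
from typing import Iterable


def shannon_entropy_bits(labels: Iterable[str]) -> float:
    """Shannon entropy, in bits, of the empirical distribution over labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical deployment logs: one engineered task versus four equally common tasks.
narrow_site = ["pick_box"] * 98 + ["conveyor_jam"] * 2
varied_site = (["pick_box"] * 25 + ["fold_towel"] * 25
               + ["open_drawer"] * 25 + ["sort_parts"] * 25)

narrow_bits = shannon_entropy_bits(narrow_site)
varied_bits = shannon_entropy_bits(varied_site)
print(f"narrow site: {narrow_bits:.3f} bits, varied site: {varied_bits:.3f} bits")
```

Both hypothetical sites log 100 episodes, but the narrow one carries a small fraction of the scenario entropy, which is the sense in which its data saturates quickly.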

This brings two types of costs. First, every time a new scenario is entered, investment is needed in environmental modifications, process adaptation, failure fallback, and safety mechanisms. Second, if deployment itself has not yet reached breakeven, scaling up may not mean collecting data at low cost, but losing money to acquire a large number of low-novelty samples.

Therefore, early deployment is not useless, but needs to be examined more closely: how much new task coverage does it bring, how many failure and anomaly samples does it generate, can these samples transfer to other scenarios, and after deducting hardware, labor, maintenance, and integration costs, how much model improvement does each dollar buy?
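The last of these questions can be turned into a simple back-of-the-envelope ratio. The sketch below is one possible formulation, not Garg's: it uses novel hours per net dollar as a stand-in for model improvement per dollar. The `DataSource` class, the `novel_fraction` estimate, and all dollar figures are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DataSource:
    name: str
    hours: float              # raw hours collected
    novel_fraction: float     # assumed share of hours that adds new information
    cost_usd: float           # hardware + labor + maintenance + integration
    revenue_usd: float = 0.0  # operating revenue that offsets the cost

    def novel_hours(self) -> float:
        return self.hours * self.novel_fraction

    def net_cost_usd(self) -> float:
        return self.cost_usd - self.revenue_usd

    def novel_hours_per_dollar(self) -> float:
        net = self.net_cost_usd()
        if net <= 0:
            raise ValueError("source pays for itself; the ratio is undefined")
        return self.novel_hours() / net


# Hypothetical comparison: a loss-making narrow deployment versus targeted collection.
sources = [
    DataSource("narrow deployment", hours=10_000, novel_fraction=0.02,
               cost_usd=900_000, revenue_usd=400_000),
    DataSource("targeted long-tail collection", hours=1_000, novel_fraction=0.60,
               cost_usd=150_000),
]
ranked = sorted(sources, key=lambda s: s.novel_hours_per_dollar(), reverse=True)
for s in ranked:
    print(f"{s.name}: {s.novel_hours():.0f} novel hours, "
          f"{s.novel_hours_per_dollar() * 1000:.2f} per $1,000")
```

With these invented numbers, the source with ten times fewer raw hours ranks first. The hard part in practice is estimating `novel_fraction`, which this sketch simply takes as given.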

Valuation Narratives Should Not Only Ask How Many Hours Have Been Accumulated

Garg's recommendation is not to stop collecting data, but to replace evaluation metrics. Cumulative operational hours, teleoperation hours, and trajectory counts can serve as operational indicators, but they should not be directly equated to model progress.

More explanatory questions include: when does data for a single task become saturated, how much engineering integration cost is needed to add a new task, how many different scenarios and action clusters does the data cover, how much of the production data consists of genuine distribution shifts and anomaly samples, and how many routine success segments in the deployment stream should be filtered out instead of continuing to feed the model?

Capital allocation should differ across the three data types. Observation data should prioritize low cost, diversity, and broad coverage to expand basic capability boundaries. For high-cost teleoperation and demonstration data, once a single task reaches saturation, the budget should shift to more tasks rather than repeating the same action. Deployment data should be filtered for failures, boundary conditions, and out-of-distribution samples, with large amounts of low-information-density routine operational records discarded.
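The filtering rule for deployment data can be sketched as a simple triage pass. The `Episode` fields, the `novelty_score` (a stand-in for some out-of-distribution detector), and the 0.7 threshold are all assumptions made for illustration. The article does not prescribe a specific mechanism.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    episode_id: str
    success: bool
    human_intervened: bool
    novelty_score: float  # hypothetical out-of-distribution score in [0, 1]


def keep_for_training(ep: Episode, novelty_threshold: float = 0.7) -> bool:
    """Keep failures, human interventions, and unusual episodes; drop routine successes."""
    return (not ep.success) or ep.human_intervened or ep.novelty_score >= novelty_threshold


# Hypothetical slice of a deployment stream.
stream = [
    Episode("ep-001", success=True, human_intervened=False, novelty_score=0.05),
    Episode("ep-002", success=False, human_intervened=False, novelty_score=0.10),
    Episode("ep-003", success=True, human_intervened=True, novelty_score=0.20),
    Episode("ep-004", success=True, human_intervened=False, novelty_score=0.90),
    Episode("ep-005", success=True, human_intervened=False, novelty_score=0.08),
]
kept_ids = [ep.episode_id for ep in stream if keep_for_training(ep)]
discard_rate = 1 - len(kept_ids) / len(stream)
print(f"kept {kept_ids}, discarded {discard_rate:.0%} of the stream")
```

A real pipeline would probably retain a small random sample of routine successes as well, so the model does not forget the common case. That choice is outside what the article discusses.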

This perspective has practical implications for Physical AI valuation narratives. Owning more robots, logging longer operational hours, and running a larger teleoperation team do not automatically indicate a stronger model moat. The harder-to-replicate capability may be the ability to keep finding valuable long-tail data, to judge when a given type of data is saturated, and to cover more task distributions at lower cost.

However, this remains a capital allocation perspective, not an industry conclusion. Three questions still need to be answered by more empirical results: whether robot models will exhibit scaling benefits similar to those of language models, whether deployment data can keep generating new information in certain high-dimensional scenarios, and how efficiently capabilities transfer between different tasks.

Garg's reminder lands on a more specific question: the "Moneyball metric" for Physical AI may not be data hours, but the novel samples bought per dollar. For robotics companies still telling data flywheel stories, the market may ultimately look not at how long cumulative operational time is, but at how much new information was generated during that time.
