Data Flywheel or Repeated Samples? Physical AI Should Bid Farewell to “Hour Worship”

> TL;DR
> · Roboticist Animesh Garg questions the industry’s use of teleoperation hours as a metric for model capability.
> · Robot data collection is costly; deployment data often comes from narrow scenarios, and repeated samples quickly become expensive.
> · What may be more valuable are long-tail failures, task coverage, and novel samples, rather than total runtime.

Animesh Garg, a roboticist now at Georgia Tech who has also held positions at the University of Toronto, wrote an article titled “Moneyball for Physical AI,” comparing the data race in embodied intelligence to the “Moneyball” moment in baseball history. What he aims to challenge is an increasingly common fundraising narrative: that robot companies can build a data flywheel simply by stacking up more teleoperation hours, more real-world deployments, and more runtime hours.

For investors, this is not just an academic debate. The cost structure, commercialization speed, and model moats of embodied intelligence companies are often packaged into the term “data closed loop.” If cumulative hours do not equate to effective model improvement, the market will need to reassess the data assets of these companies.

“Data Hours” May Be the Batting Average Superstition of the Robotics Industry
=====================

Garg borrows the classic analogy from *Moneyball*. In 2002, the Oakland Athletics won 103 games with one of the lowest payrolls in the league. The key was not buying more expensive players, but discovering that the market mispriced player value. Traditional scouts valued batting average, stolen bases, and form, but the metric that better explained a team’s scoring ability was on-base percentage.

In his view, Physical AI may be at a similar stage.
The industry acknowledges that data is essential for achieving general-purpose robot models, but it tends to treat the easiest-to-demonstrate metrics as the most important: cumulative teleoperation hours, number of demonstration trajectories, number of deployed robots, and runtime in production scenarios.

The supply of robot data differs from that of text data. Large language models can obtain massive amounts of low-cost text from the internet, code repositories, books, and web pages, with bottlenecks mainly in compute, cleaning, and training efficiency. Robot models require data that involves physical interaction, action feedback, and environmental changes. Each hour of effective data must be genuinely created, incurring costs for equipment, labor, space, sensors, failure handling, and safety.

Roboticist Ken Goldberg once described the gap between robot data and internet-scale AI data as the “100k-year data gap.” More precisely, the amount of text and image data consumed in training modern large vision-language models, if converted to human reading or viewing time, is equivalent to about 100,000 years, while robots lack a comparable scale of real-world interaction data. The statement does not set a precise threshold for robot models; it reminds the industry that real-world interaction data cannot be scraped as cheaply as web text.

This is also why Garg opposes the “sweatshop teleoperation” narrative. Massive human teleoperation can indeed generate action-dense training samples, but if companies evaluate data solely by total hours, capital may flow to repetitive, low-difficulty, low-information-density samples rather than to the scenarios that best reduce failure rates.

Three Types of Data Buy Different Things
=============

In Garg’s classification, Physical AI data falls roughly into three categories: observation data, intervention data, and deployment data. All are potentially useful, but their costs, constraints, and information density vary greatly.
The first category is observation data, such as first-person or third-person videos. Its advantage is low cost and broad coverage, helping models understand objects, spaces, action outcomes, and environmental distributions. The downside is clear: the model can see what a human or object does but may not know what actions the robot should output in a given state.

The second category is intervention data, which includes teleoperation, demonstrations, and state-to-action trajectories generated by human intervention. This type is more direct for robot training because it contains the chain of “what is seen, how to move, and what happens after moving.” The cost is that each high-quality trajectory must be paid for, and labor and equipment costs do not scale down as quickly as software data.

The third category is deployment data, i.e., telemetry generated when robots run in real commercial scenarios. This sounds closest to a business flywheel: robots work, earn money, and generate training data simultaneously. However, there is a statistical trap here. Today’s earliest deployed robot scenarios are typically those with the least variation, most fixed processes, and most controllable risks, such as highly structured warehouses, factories, or single-task environments. This kind of production data may be large in volume but narrow in distribution and high in repetition. Once a model learns local patterns, each additional hour of runtime brings diminishing new information.

Deployment data is not without value. What is truly valuable is often not the large volume of routine “successful task completion” segments, but failures, stuck states, abnormal objects, edge cases, and rare disturbances. The problem is that these long-tail samples do not appear steadily at a company’s desired pace, and the costs of detecting, filtering, and reviewing them are higher.
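
To make the point about deployment data concrete, here is a minimal sketch of how a deployment stream might be triaged so that routine successes are set aside and only failures, human takeovers, and unusual episodes are sent for review. It is not Garg’s method: the `Episode` record, its fields, the `anomaly_score`, and the `0.8` threshold are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    """One deployment episode (hypothetical schema for illustration)."""
    episode_id: str
    success: bool          # did the robot complete the task?
    human_takeover: bool   # did a remote operator have to step in?
    anomaly_score: float   # assumed 0..1 score from some anomaly detector


def select_for_review(episodes, anomaly_threshold=0.8):
    """Keep failures, takeovers, and high-anomaly episodes; drop routine successes."""
    return [
        ep for ep in episodes
        if (not ep.success)
        or ep.human_takeover
        or ep.anomaly_score >= anomaly_threshold
    ]


# Illustrative data only: two routine successes and three long-tail episodes.
episodes = [
    Episode("a", True, False, 0.05),
    Episode("b", True, False, 0.10),
    Episode("c", False, False, 0.40),  # failure
    Episode("d", True, True, 0.30),    # needed human takeover
    Episode("e", True, False, 0.92),   # succeeded, but looked unusual
]
kept_ids = [ep.episode_id for ep in select_for_review(episodes)]
print(kept_ids)
```

In this toy stream, three of five episodes survive the filter. In a real low-variation deployment the surviving fraction would presumably be far smaller, which is the article’s point about the cost of finding long-tail samples.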

More Data Helps, but Repeated Samples Quickly Become Expensive
=================

Garg is cautious in borrowing scaling laws from language models: more data typically reduces model loss, but with diminishing returns. If samples are repeated, nearly repeated, or come from the same narrow distribution, the benefit of additional data diminishes faster.

In robotics, this issue is more intuitive. For a robot learning to grasp a fixed product box from a fixed shelf, the first few thousand demonstrations, failures, and corrections can be very valuable. Once the actions, objects, lighting, and path are repeatedly collected, new data increasingly replicates already learned local experience.

Similar lessons exist in language model training: repeated and near-duplicate data wastes training budgets, and excessive repetition can even harm generalization. Garg does not directly apply these conclusions to robot training but uses them to indicate a direction: measuring data value should not rely solely on quantity but also on how different samples are from each other.

For Physical AI, diversity has at least two meanings. First, expose the model to more objects, spaces, materials, lighting, occlusions, and manipulation methods. Second, avoid the model performing well in an overly simple task distribution but failing when moved to slightly different scenarios.

Long-tail failure cases thus become critical. The real physical world is not uniformly distributed; low-frequency anomalies often determine commercial viability: slightly misplaced objects, deformed packaging, reflective surfaces, gripper slippage, sudden human intervention, sensor occlusion, changes in floor friction. No matter how well the model performs on routine samples, if it cannot handle these tail events, deployment will be dragged down by a few failures.
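
One way to picture “how different samples are from each other” is a greedy novelty filter: a sample is kept only if it lies far enough from everything already kept. The sketch below is an illustration, not a technique from Garg’s article. It assumes each sample has been reduced to a small numeric feature vector, and the function name `novelty_filter`, the two-dimensional features, and the `min_distance` value are all hypothetical.

```python
import math


def novelty_filter(samples, min_distance):
    """Greedily keep samples at least `min_distance` away from all kept ones."""
    kept = []
    for sample in samples:
        if all(math.dist(sample, k) >= min_distance for k in kept):
            kept.append(sample)
    return kept


# Illustrative data: 1000 near-identical grasps from a fixed shelf,
# plus 3 situations that differ noticeably (hypothetical 2-D features).
routine = [(0.50 + 0.0001 * (i % 10), 0.20) for i in range(1000)]
unusual = [(0.90, 0.20), (0.50, 0.80), (0.10, 0.10)]

all_samples = routine + unusual
kept = novelty_filter(all_samples, min_distance=0.05)
print(len(all_samples), "collected,", len(kept), "kept as novel")
```

Under these assumptions, the thousand routine grasps collapse to a single representative, while each of the three unusual situations is retained. The greedy pass compares every sample with every kept sample, so it is meant for intuition rather than for large datasets.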

For the Deployment Flywheel to Work, Early Scenarios Must Be Sufficiently “Novel”
==================

What Garg’s article truly challenges is the common commercialization path of embodied intelligence companies: first deploy robots in narrow scenarios, ensure usability with human remote takeover, collect production data, then use that data to train stronger models and unlock more scenarios.

Garg calls this approach the “neo-integrator” strategy. It attempts to bypass pure data collection costs by placing robots in commercial production, letting operating revenue offset data costs. Compared to building a dedicated teleoperation factory, this path sounds more efficient.

But the flywheel has a prerequisite: the data generated from early commercial scenarios must be sufficiently novel and diverse to help the model transfer to more tasks. If the deployment scenario is just a low-variation, low-entropy, heavily engineered narrow task, the data will quickly saturate. What the company gets may not be a general-purpose capability flywheel, but a set of customized projects requiring continuous integration, maintenance, and exception handling.

This leads to two types of costs. First, entering each new scenario requires investment in environment modification, workflow adaptation, failure mitigation, and safety mechanisms. Second, if the deployment itself has not reached breakeven, scaling up may not be about collecting low-cost data but about trading losses for large volumes of low-novelty samples.

Thus, early deployment is not useless, but it needs finer scrutiny: how much new task coverage does it bring, how many failures and anomalies does it generate, can these samples transfer to other scenarios, and after deducting hardware, labor, maintenance, and integration costs, how much model improvement does each dollar buy?
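
The last question can be turned into simple arithmetic. The sketch below compares two invented data programs by novel samples per dollar instead of by hours. Every number is made up for illustration, and the function name, the cost categories as parameters, and the notion of a counted “novel sample” are assumptions rather than anything specified in Garg’s article.

```python
def novel_samples_per_dollar(novel_samples, hardware, labor, maintenance, integration):
    """Novel samples obtained per dollar of total program cost."""
    total_cost = hardware + labor + maintenance + integration
    if total_cost <= 0:
        raise ValueError("total cost must be positive")
    return novel_samples / total_cost


# Hypothetical program A: many hours in one narrow, repetitive scenario.
program_a = novel_samples_per_dollar(
    novel_samples=200, hardware=200_000, labor=200_000,
    maintenance=50_000, integration=50_000,
)

# Hypothetical program B: fewer hours, spread over more varied tasks.
program_b = novel_samples_per_dollar(
    novel_samples=450, hardware=100_000, labor=120_000,
    maintenance=30_000, integration=50_000,
)

print(f"A: {program_a:.4f} novel samples per dollar")
print(f"B: {program_b:.4f} novel samples per dollar")
```

With these invented figures, program B yields more novel samples per dollar than program A even if A logged several times as many hours, which is the kind of comparison an hours-only metric would hide.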

Valuation Narratives Should Not Just Ask How Many Hours Were Accumulated
==============

Garg’s advice is not to stop collecting data but to replace evaluation metrics. Cumulative runtime, teleoperation hours, and trajectory counts can serve as operational metrics but should not be directly equated to model progress.

More explanatory questions include: when does data for a single task saturate, how much engineering integration cost is needed for each new task, how many different scenarios and action clusters does the data cover, how much of the production data consists of true distribution shifts and anomaly samples, and how many routine success segments in the deployment stream should be filtered out rather than fed to the model.

Corresponding to the three data types, capital allocation will also differ. Observation data should prioritize low cost, diversity, and broad coverage to expand basic capability boundaries. For high-cost teleoperation and demonstration data, after reaching saturation for a single task, the budget should shift to more tasks instead of repeating the same actions. Deployment data should focus on filtering failures, edge cases, and out-of-distribution samples, discarding large amounts of low-information-density routine records.

This perspective has practical implications for Physical AI valuation narratives. A company owning more robots, longer runtime, and larger teleoperation teams does not automatically have a stronger model moat. The harder-to-replicate capability may be the ability to continuously find high-value long-tail data, determine when a certain data type saturates, and cover more task distributions at lower cost.

However, this remains a capital allocation perspective, not an industry consensus.
Whether robot models will exhibit scaling gains similar to those of language models, whether deployment data can continuously generate new information in certain high-dimensional scenarios, and how efficiently skills transfer between different tasks all require more empirical results to answer.

Garg’s reminder lands on a more specific question: the “Moneyball metric” for Physical AI may not be data hours, but the novel samples bought per dollar. For robot companies still telling data flywheel stories, the market may ultimately care less about how long the cumulative runtime is, and more about how much new information was generated during that time.
