DeepMind's Sander Dieleman, famed for his work on diffusion models, quickly boosted it on Twitter, calling it an interesting piece of LLM history:

The original scaling law was wrong due to a bug, and it likely caused the industry to waste massive amounts of compute on a bunch of "overly large, undertrained" models.

One bug, two years burned.

When the bug was exposed, what we saw was not just a black hole of compute, but a boundary of intelligence reshaped by language itself — far deeper than imagined.

Scaling Law: The LLM Version of the Geocentric Model

In 2020, OpenAI concluded: Under a fixed compute budget, you should prioritize making the model larger rather than feeding it more data.

In formula terms, the optimal parameter count is proportional to compute raised to the power of 0.73 — parameters are the variable that should be aggressively scaled.
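To make the 0.73 exponent concrete, here is a minimal Python sketch of how a larger budget would be split under such a power law. It assumes compute is roughly proportional to parameters times training tokens, so the data exponent is one minus the parameter exponent. That assumption and the helper name `split_growth` are illustrative and are not taken from the OpenAI paper.

```python
def split_growth(compute_multiplier: float, param_exponent: float) -> tuple[float, float]:
    """Return (parameter growth, data growth) for a given compute multiplier.

    Assumes compute ~ parameters * tokens, so the two growth factors
    multiply back to the compute multiplier.
    """
    param_growth = compute_multiplier ** param_exponent
    data_growth = compute_multiplier ** (1.0 - param_exponent)
    return param_growth, data_growth


# With 10x more compute and the 0.73 exponent quoted above:
params_x, data_x = split_growth(10.0, 0.73)
print(f"parameters grow {params_x:.2f}x, data grows {data_x:.2f}x")
```

Under these assumptions, a tenfold budget increase goes mostly into parameters (more than 5x) while data grows by less than 2x. Passing an exponent of 0.5 instead splits the growth evenly.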

This statement directly defined the shape of GPT-3's generation. Stack parameters. Stack them to death. 175 billion.

It told developers worldwide: Don't ask, just stack parameters; as long as you make the model big enough, miracles will happen.

Two years later, DeepMind dropped Chinchilla, overturning this conclusion entirely: Models and data should be scaled together with roughly equal importance, with about 20 tokens per parameter being optimal.

They trained a 70-billion-parameter Chinchilla with 1.4 trillion tokens — less than half the size of GPT-3, but over four times the data.

Result: Under the same compute budget, it comprehensively outperformed Gopher, which had 280 billion parameters but only 300 billion tokens.
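The "same compute budget" claim can be sanity-checked with a few lines of Python. This sketch uses the common rule of thumb that training compute is about 6 × parameters × tokens. The article does not state that rule, so treat it as an assumption, and the function names are hypothetical.

```python
import math

TOKENS_PER_PARAM = 20  # the rough Chinchilla ratio quoted above


def train_flops(params: float, tokens: float) -> float:
    """Approximate training compute with the 6 * N * D rule of thumb (an assumption)."""
    return 6.0 * params * tokens


def balanced_split(flops: float) -> tuple[float, float]:
    """Solve 6 * N * D = flops with D = 20 * N, returning (params, tokens)."""
    params = math.sqrt(flops / (6.0 * TOKENS_PER_PARAM))
    return params, TOKENS_PER_PARAM * params


chinchilla = train_flops(70e9, 1.4e12)
gopher = train_flops(280e9, 300e9)
print(f"Chinchilla: {chinchilla:.2e} FLOPs, {1.4e12 / 70e9:.1f} tokens per parameter")
print(f"Gopher:     {gopher:.2e} FLOPs, {300e9 / 280e9:.1f} tokens per parameter")

n, d = balanced_split(chinchilla)
print(f"Balanced split of Chinchilla's budget: {n:.2e} params, {d:.2e} tokens")
```

Under this approximation the two budgets land within roughly 17% of each other, yet Chinchilla sees 20 tokens per parameter and Gopher only about one.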

In plain English: With the same budget, one became a "puffy" strongman, the other a lean fighter.

Three years later, Peking University alum Weng Li examined the mainstream explanation for the discrepancy in later research: that it stems from how the two papers counted the total number of parameters.

And it gets worse. Even the "correct" Chinchilla itself isn't clean.

In 2024, Besiroglu et al. re-ran the data points from the Chinchilla paper and found that its own fitting also contained a bug:

The loss scale seen by the optimizer was poorly chosen: the Huber loss was averaged over samples instead of summed, which caused the fit to terminate prematurely.

The paper that corrected the bug carried another bug itself.
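Why would averaging instead of summing stop a fit early? The toy Python sketch below shows one possible mechanism: averaging shrinks the gradient by a factor of 1/n, so an absolute gradient tolerance is satisfied further from the optimum. It is not the Chinchilla fitting code or the replication code. The one-parameter problem, the plain gradient descent loop, the tolerance value, and the synthetic data are all assumptions made for illustration.

```python
def huber_grad(residual: float, delta: float = 1.0) -> float:
    """Derivative of the Huber loss with respect to the residual."""
    if abs(residual) <= delta:
        return residual
    return delta if residual > 0 else -delta


def fit_location(data: list[float], reduction: str, grad_tol: float = 1e-3,
                 max_steps: int = 1000) -> tuple[float, int]:
    """Fit a single location m by gradient descent on a Huber objective.

    Stops when the absolute gradient drops below grad_tol, a stand-in for
    an optimizer's absolute convergence tolerance.
    """
    n = len(data)
    scale = 1.0 if reduction == "sum" else 1.0 / n
    m, steps = 0.0, 0
    while steps < max_steps:
        grad = -scale * sum(huber_grad(x - m) for x in data)
        if abs(grad) < grad_tol:
            break
        # Step size is rescaled so both reductions follow the same path;
        # only the stopping test sees the different objective scale.
        m -= 0.5 / (scale * n) * grad
        steps += 1
    return m, steps


# Synthetic data with mean 1.5 (hypothetical, for illustration only).
data = [1.5 + 0.01 * (i % 11 - 5) for i in range(1100)]
m_sum, steps_sum = fit_location(data, "sum")
m_mean, steps_mean = fit_location(data, "mean")
print(f"sum : m={m_sum:.7f} after {steps_sum} steps")
print(f"mean: m={m_mean:.7f} after {steps_mean} steps")
```

Both runs take identical steps toward 1.5, but the averaged objective passes the tolerance test sooner and stops with a larger error.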

At this point, the "first principles" that countless people have been chanting suddenly seem shaky.

The so-called Scaling Law was never an ironclad physical law like Newton's three laws; it's just an empirically fitted curve.

Diogo Almeida, however, believes the truth lies elsewhere. It is not a matter of different methodology: "the original scaling law itself had a bug."

Three Moves by OpenAI That Fooled the Global AI Industry?

To create a falsehood that the entire global AI community believes, you need only three steps.

Step 1: Imprison the data.

The OpenAI paper fed all models — whether they were toddlers learning to walk (small models) or giants (large models) — the exact same amount of "food." About 130B tokens.

Small models were thus "overfed" or even "stuffed," while large models that truly needed massive data to fill their capacity were severely malnourished under the same token budget.

The Chinchilla paper later pointed this out sharply: They used a "fixed number of training tokens and learning rate schedule" for all models.

It's like making kindergarteners and PhD students take the same test with the same time limit, and then claiming that "performance depends only on talent."
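A quick Python sketch shows how lopsided a fixed token budget becomes across model sizes. The model sizes are hypothetical, picked only to span a wide range. Measuring them against Chinchilla's rough 20 tokens per parameter is an illustrative choice, not something the OpenAI paper did.

```python
FIXED_TOKENS = 130e9   # the fixed budget quoted above
REFERENCE_RATIO = 20   # Chinchilla's rough tokens-per-parameter figure

# Hypothetical model sizes, chosen only to span a wide range.
model_sizes = {"100M": 100e6, "1B": 1e9, "10B": 10e9, "100B": 100e9}


def tokens_per_param(params: float, tokens: float = FIXED_TOKENS) -> float:
    """Training tokens seen per model parameter."""
    return tokens / params


for name, params in model_sizes.items():
    ratio = tokens_per_param(params)
    verdict = "overfed" if ratio > REFERENCE_RATIO else "underfed"
    print(f"{name:>5} params: {ratio:8.1f} tokens per parameter ({verdict})")

# Model size at which the fixed budget matches the reference ratio.
print(f"break-even size: {FIXED_TOKENS / REFERENCE_RATIO:.2e} parameters")
```

With these numbers, only models of about 6.5 billion parameters would sit near the reference ratio. Everything much smaller is overfed and everything much larger is starved.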

Step 2: The self-deceiving LR decay.

They used cosine learning rate decay, which smoothly pushes the learning rate toward zero near the end of training.

As training approached the preset endpoint, the learning rate was artificially pushed to zero step by step, naturally flattening the model's progress.

Once the curve flattened, it looked like: The model has learned all it can; feeding more is useless.

The researchers then concluded: "Adding more data is useless; the model is saturated."

This wasn't the model's limit; it was the learning rate schedule artificially cutting off the model's growth path. It created a perfect illusion that performance had hit a ceiling.

But now we know those large models were nowhere near their limit.
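The effect of the schedule length is easy to see in code. Below is a minimal Python sketch of a cosine decay schedule without warmup. The step counts and the normalized peak learning rate are assumptions chosen for illustration, not values from either paper.

```python
import math


def cosine_lr(step: int, total_steps: int, peak_lr: float = 1.0, min_lr: float = 0.0) -> float:
    """Cosine decay from peak_lr at step 0 to min_lr at total_steps."""
    progress = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


eval_step = 10_000  # hypothetical point at which the loss is read off

# Schedule that ends exactly at the evaluation point.
lr_short = cosine_lr(eval_step, total_steps=10_000)
# Schedule planned for four times as many steps.
lr_long = cosine_lr(eval_step, total_steps=40_000)

print(f"schedule ends at eval step : lr = {lr_short:.3f} of peak")
print(f"schedule 4x longer         : lr = {lr_long:.3f} of peak")
```

At the same step, the short schedule has already decayed to zero while the longer one is still running at about 85% of peak. A loss curve read at that point therefore reflects where the schedule was set to end, not only what the model could still learn.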

Step 3: The arrogance of authority.

The third step is the most insidious: The paper stated that the results were "largely independent of learning rate schedule."

Many people, including Diogo Almeida (who was at OpenAI at the time), vaguely sensed something was off. Yet under the fixed token limit, the conclusion was technically correct.

But it simply does not apply to the ideal world of "unlimited data" that scaling law truly aims to describe.

They took a local truth under limited conditions and presented it as a universal cosmic law.

Stack these three steps, and you get a law that is both wrong and extremely hard to debug.

Even Diogo himself admits: He was also working on optimization at OpenAI at the time and didn't spot the bug — the learning rate curve looked too much like a "carefully tuned" one; who would suspect it?

GPUs Wasted, Compute Misallocated Severely

Guided by OpenAI's erroneous formula, the AI industry entered an era of "brute force works miracles."

This means that over the past few years, the world's brightest minds and scarcest compute have been wasted on ineffective scaling.

This isn't just about money; it's about humanity collectively sprinting thousands of kilometers down the wrong track in the race to AGI (Artificial General Intelligence), all because of a learning rate setting.

If the discovery of the bug was painful, the subsequent deep reflections are chilling.

Researcher Adam Zachary Wasserman pointed out a blind spot that everyone had overlooked: Even after the formula is fixed, the current Scaling Law is just an "English Scaling Law."

He conducted a counterintuitive experiment: training models on different languages with the same architecture and the same compute.

Result: The French model achieved a certain grammatical ability with an efficiency 50 to 100 times higher than the English model.

Why? Because English is a "morphologically poor" language.

It relies too heavily on distributional patterns, requiring the model to guess word meanings from massive data; while languages like French or Chinese, which are morphologically rich or structurally tight, carry a lot of explicit information in the vocabulary itself.

This means that all our current compute allocation schemes are based on the most "data-hungry," least efficient language.

When you think you're exploring the physical laws of "general intelligence," you're actually just measuring "how wasteful English is with compute."

It's like trying to establish nutritional standards for all life in the universe by studying the appetite of a pig — not just bias, but cognitive limitation.

We could have achieved stronger performance with smaller models and more high-quality data.

We could have saved the electricity and heat of tens of thousands of H100 running hours.

We could have entered an era of "efficient AI" two years earlier.

Source: Xin Zhi Yuan

