OpenAI Collapses! The original Scaling Law paper reveals a bug, and trillions of compute are all wasted.
OpenAI Misled the Entire AI Industry for Years!
Over the past five years, the whole AI industry has been driven forward by Scaling Law.
Altman's confidence in AGI stems from this curve.
Now, someone has stepped forward and said: This curve was wrong from the start.
This isn't hindsight. The person saying this is a researcher who was optimizing large models at OpenAI back then: Diogo Almeida.
He just published a blog post with a chilling title — Scaling Laws, Honestly.
The opening line cuts straight to the point: The original scaling law was wrong because of a bug.
DeepMind's Sander Dieleman, famed for his work on diffusion models, quickly boosted it on Twitter, calling it an interesting piece of LLM history:
One bug, two years burned.
When the bug was exposed, what we saw was not just a black hole of compute, but a boundary of intelligence reshaped by language itself — far deeper than imagined.
Scaling Law: The LLM Version of the Geocentric Model
In 2020, OpenAI concluded: Under a fixed compute budget, you should prioritize making the model larger rather than feeding it more data.
In formula terms, the optimal parameter count is proportional to compute raised to the power of 0.73 — parameters are the variable that should be aggressively scaled.
This statement directly defined the shape of GPT-3's generation. Stack parameters. Stack them to death. 175 billion.
It told developers worldwide: Don't ask, just stack parameters; as long as you make the model big enough, miracles will happen.
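To make the 0.73 exponent concrete, here is a minimal sketch of how extra compute would be split between parameters and data. It assumes training compute is proportional to parameters times tokens (C ∝ N·D), a common approximation that this article does not state; the function name `scale_split` is illustrative, and the 0.5 exponent is included only as an even-split reference point.

```python
def scale_split(compute_multiplier: float, param_exponent: float) -> tuple[float, float]:
    """Return (parameter multiplier, token multiplier) for a compute multiplier.

    Assumption: C is proportional to N * D, so if N grows as C**a,
    then D must grow as C**(1 - a).
    """
    param_mult = compute_multiplier ** param_exponent
    token_mult = compute_multiplier ** (1.0 - param_exponent)
    return param_mult, token_mult


for exponent in (0.73, 0.5):
    params, tokens = scale_split(10.0, exponent)
    print(f"exponent {exponent}: 10x compute -> params x{params:.2f}, tokens x{tokens:.2f}")
```

Under that assumption, a 10x compute increase with exponent 0.73 sends most of the growth to model size (roughly 5.4x parameters against 1.9x tokens), which is the "stack parameters" recipe described above.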
Two years later, DeepMind dropped Chinchilla, overturning this conclusion entirely: Models and data should be scaled together with roughly equal importance, with about 20 tokens per parameter being optimal.
They trained a 70-billion-parameter Chinchilla with 1.4 trillion tokens — less than half the size of GPT-3, but over four times the data.
Result: Under the same compute budget, it comprehensively outperformed Gopher, which had 280 billion parameters but only 300 billion tokens.
In plain English: With the same budget, one became a "puffy" strongman, the other a lean fighter.
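The "same compute budget" claim can be sanity-checked with a rough calculation. The sketch below uses the widely cited rule of thumb that training cost is about 6 FLOPs per parameter per token (C ≈ 6·N·D); that rule is an assumption brought in here, not something this article states, and `train_flops` is an illustrative name. The parameter and token counts are the ones quoted above.

```python
def train_flops(params: float, tokens: float) -> float:
    """Approximate training FLOPs via the C ~ 6 * N * D rule of thumb (an assumption)."""
    return 6.0 * params * tokens


# (parameters, training tokens) as quoted in the text above
models = {
    "Chinchilla": (70e9, 1.4e12),
    "Gopher": (280e9, 300e9),
}

for name, (n, d) in models.items():
    print(f"{name}: {train_flops(n, d):.2e} FLOPs, {d / n:.1f} tokens per parameter")
```

With these numbers the two budgets come out within about 20% of each other, while the tokens-per-parameter ratio differs by roughly a factor of twenty.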
Three years later, Peking University alum Weng Li examined the mainstream explanation for the discrepancy in later research: that it stems from how the two papers counted the total number of parameters.
And it gets worse. Even the "correct" Chinchilla itself isn't clean.
In 2024, Besiroglu et al. re-ran the data points from the Chinchilla paper and found that its own fitting also contained a bug.
At this point, the "first principles" that countless people have been chanting suddenly seem shaky.
The so-called Scaling Law was never an ironclad physical law like Newton's three laws; it's just an empirically fitted curve.
Diogo Almeida, however, argues that this is not the real story. The problem is not a difference in methodology: "the original scaling law itself had a bug."
Three Moves by OpenAI That Fooled the Global AI Industry?
Creating a falsehood that the entire global AI community came to believe took only three steps.
Step 1: Imprison the data.
The OpenAI paper fed all models — whether they were toddlers learning to walk (small models) or giants (large models) — the exact same amount of "food." About 130B tokens.
Small models were thus "overfed" or even "stuffed," while large models that truly needed massive data to fill their capacity were severely malnourished under the same token budget.
The Chinchilla paper later pointed out sharply: They used "fixed number of training tokens and learning rate schedule" for all models.
It's like making kindergarteners and PhD students take the same test with the same time limit, and then claiming that "performance depends only on talent."
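A quick calculation shows how uneven a fixed budget of about 130B tokens is across model sizes. The model sizes below are hypothetical, chosen only for illustration, and `tokens_per_param` is an illustrative helper; the comparison point is the roughly 20 tokens per parameter figure quoted earlier from Chinchilla.

```python
FIXED_TOKENS = 130e9  # fixed training budget described above


def tokens_per_param(params: float, tokens: float = FIXED_TOKENS) -> float:
    """Tokens seen per model parameter under a fixed token budget."""
    return tokens / params


# Hypothetical model sizes, for illustration only.
for params in (100e6, 1e9, 10e9, 100e9):
    print(f"{params:.0e} params -> {tokens_per_param(params):,.1f} tokens per parameter")
```

In this sketch the smallest model sees over a thousand tokens per parameter while the largest sees barely more than one, so the models are not being compared at comparable stages of training.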
Step 2: The self-deceiving LR decay.
They used cosine learning rate decay, which smoothly pushes the learning rate toward zero near the end of training.
As training approached the preset endpoint, the learning rate was artificially pushed to zero step by step, naturally flattening the model's progress.
Once the curve flattened, it looked like: The model has learned all it can; feeding more is useless.
The researchers then concluded: "Adding more data is useless; the model is saturated."
This wasn't the model's limit; this was the learning rate artificially cutting off the model's growth path. It created a perfect illusion: Performance had hit a ceiling, and adding more data wouldn't help.
But now we know those large models were nowhere near their limit.
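The effect of tying the schedule to a preset endpoint can be seen in a minimal cosine decay sketch. This is a generic textbook form without warmup, not the paper's actual training code; the peak learning rate and step counts are made-up values for illustration.

```python
import math


def cosine_lr(step: int, total_steps: int, peak_lr: float, min_lr: float = 0.0) -> float:
    """Cosine decay from peak_lr at step 0 down to min_lr at total_steps (no warmup)."""
    progress = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


PEAK = 3e-4  # illustrative value, not taken from any paper

# Same training step, two different preset endpoints.
print(f"step 1000 of 1000: lr = {cosine_lr(1000, 1000, PEAK):.2e}")
print(f"step 1000 of 2000: lr = {cosine_lr(1000, 2000, PEAK):.2e}")
```

At step 1000 the learning rate is zero if the run was planned to end there, but still half of peak if the run was planned for 2000 steps. A loss curve that flattens at the end of a short schedule therefore reflects the schedule as much as the model.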
Step 3: The arrogance of authority.
Step three, and the most insidious: The paper stated that the results were "largely independent of learning rate schedule."
Although many people, including Diogo Almeida (who was at OpenAI at the time), vaguely sensed something was off, under the fixed token limit, the conclusion was technically correct.
But it simply does not apply to the ideal world of "unlimited data" that scaling law truly aims to describe.
They took a local truth under limited conditions and presented it as a universal cosmic law.
Stack these three steps, and you get a law that is both wrong and extremely hard to debug.
Even Diogo himself admits: He was also working on optimization at OpenAI at the time and didn't spot the bug — the learning rate curve looked too much like a "carefully tuned" one; who would suspect it?
GPUs Wasted, Compute Misallocated Severely
Guided by OpenAI's erroneous formula, the AI industry entered an era of "brute force and miracles."
This means that over the past few years, the world's brightest minds and most scarce compute have been wasted on ineffective scaling.
This isn't just about money; it's about humanity collectively sprinting thousands of kilometers down the wrong track in the race to AGI (Artificial General Intelligence), all because of a learning rate setting.
If the discovery of the bug was painful, the subsequent deep reflections are chilling.
Researcher Adam Zachary Wasserman pointed out a blind spot overlooked by everyone: Even after the formula is fixed, current Scaling Law is just "English Scaling Law."
He ran a counterintuitive experiment: training models on different languages with the same architecture and the same compute.
Result: The French model achieved a certain grammatical ability with an efficiency 50 to 100 times higher than the English model.
Why? Because English is a "morphologically poor" language.
It relies too heavily on distributional patterns, requiring the model to guess word meanings from massive data; while languages like French or Chinese, which are morphologically rich or structurally tight, carry a lot of explicit information in the vocabulary itself.
This means that all our current compute allocation schemes are based on the most "data-hungry," least efficient language.
When you think you're exploring the physical laws of "general intelligence," you're actually just measuring "how wasteful English is with compute."
It's like trying to establish nutritional standards for all life in the universe by studying the appetite of a pig — not just bias, but cognitive limitation.
We could have achieved stronger performance with smaller models and more high-quality data.
We could have saved the electricity and heat of tens of thousands of H100 running hours.
We could have entered an era of "efficient AI" two years earlier.
Source: Xin Zhi Yuan