DeepSeek's Path to Trillions of Dollars: Leveraging Open Source to Drive the Trillion-Dollar Hardware Ecosystem
Editor's Note: Over the past year, discussions around DeepSeek have mostly focused on model performance, open-source strategies, and price wars. But if you only understand DeepSeek through "selling subscriptions," "multimodal capabilities," or "building a coding agent," you might underestimate what it truly aims to change.
This article presents a more radical judgment: DeepSeek's goal may not be short-term monetization through application layers, but rather reshaping the cost structure of AI training and inference through a series of foundational architectural innovations, indirectly promoting the formation of a new hardware ecosystem. From MoE, MLA to DSA, CSA, mHC, Engram, and then to Dual Path and TileLang, DeepSeek’s technical route consistently revolves around a core question: how to run more powerful models with less high-end computing power, given constraints like HBM, advanced process nodes, packaging, and CUDA ecosystem.
The most noteworthy aspect of this article is not whether "DeepSeek can make hundreds of millions of dollars through APIs or subscriptions," but whether it is binding model capabilities, memory systems, and domestic hardware ecosystems together. KV Cache compression reduces dependence on HBM; NAND and SSD can handle long-term caching; LPDDR can be used for streaming weights and Engram storage; TileLang aims to weaken the CUDA moat. If these innovations continue to spread, the beneficiaries will not only be DeepSeek itself but also storage, ASICs, GPUs, network chips, and the entire AI infrastructure chain.
Of course, the judgments about a "10 trillion USD industry ecosystem" and a "trillion-dollar valuation" still carry strong speculative elements. But they provide an important path to understanding DeepSeek: open source does not necessarily mean abandoning commercialization; low prices are not necessarily just market subsidies. For DeepSeek, the real business may not be at the application layer but in helping more hardware become available and enabling lower-cost AI supply. In other words, what it sells may not be just the model itself, but the feasibility of the next-generation AI infrastructure.
Below is the original text:
Have you ever wondered how DeepSeek will make money, and possibly make a lot of it?
It hasn't launched coding subscription plans like GLM, MoonShot, or MiniMax, nor does it have multimodal, audio, or video models. So far, it doesn't even have its own harness (an outer framework for model invocation, tool integration, and task execution), though it has recently started recruiting for related positions to build one.
Meanwhile, DeepSeek seems to remain firmly committed to open source, even openly sharing its "secrets." Isn’t that crazy? Isn’t it just burning money? Are the investors willing to pour 10 billion dollars into it, only to throw that money down the drain?
I personally believe the answer is quite the opposite.
Next, I will share some observations based on what DeepSeek has done so far and analyze the strategic path it appears to be following. The goal of DeepSeek CEO Liang Wenfeng might go far beyond competing on models today. He may be aiming for a bigger prize: DeepSeek has the potential to hit a 1 trillion USD valuation while driving the formation of a new industry worth 10 trillion USD.
Revisiting DeepSeek’s "Hero’s Journey"
DeepSeek has been swimming against the tide. It hasn't chosen to keep releasing slightly better models and then rush to package them into directly monetizable applications like coding subscriptions. On January 27, 2025, I posted a widely circulated tweet describing what I see as DeepSeek’s "Hero’s Journey." Now, this story has become even more interesting.
While others were still building dense models, DeepSeek chose mixture-of-experts (MoE) models, which are harder to train.
They adopted a "first principles" approach, inventing a new GRPO algorithm to replace the then-mainstream but more costly PPO reinforcement learning algorithm.
They discovered that reinforcement learning with verifiable rewards (RLVR) is a key strategy for enhancing a model’s reasoning capabilities.
They also proposed a simple decoding strategy called "Multi Token Prediction," which makes training signals denser.
They improved the "Zero Bubble" pipeline to better utilize limited GPU resources.
They released expert load balancers, making it easier for everyone to deploy MoE models. Especially through the "Wide Expert Parallel" strategy, models can serve larger batches, significantly reducing inference costs.
They invented mechanisms such as MLA, DSA, CSA, and HCA to reduce KV Cache requirements and to keep compute as close to constant as possible as context length grows.
They created Engram, trading memory for computational efficiency.
They also invented mHC, enabling stable training even as model size scales up. Many similar innovations exist.
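To make the GRPO item above concrete: its core idea is to score each sampled response relative to the other responses drawn for the same prompt, which removes the need for the separate value model that PPO relies on. The sketch below illustrates only that group-relative advantage; it is not DeepSeek's training code, and the function name and the `eps` guard are assumptions made for illustration.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the group sampled for one prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps (an assumption) avoids division by zero when all rewards match.
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled answers to one prompt, scored 1.0 if verified correct, else 0.0.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct answers receive a positive advantage and incorrect ones a negative advantage, using nothing beyond the rewards of the group itself.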
In the "Hero’s Journey" narrative, heroes don’t initially know where their journey will lead. They learn along the way, gradually discovering their true mission, overcoming obstacles, ignoring skeptics, and facing malicious actors. They have flaws or shortcomings but ultimately overcome them to fulfill their mission. They confront seemingly insurmountable challenges, find ways to form alliances, and learn to use limited resources wisely. It’s this process that makes audiences root for the hero. This is also why DeepSeek has gained followers, global respect, and opponents.
As I will detail next, DeepSeek has been on this path for a long time, gradually discovering its ultimate destiny: its goal is not to sell coding subscriptions but to foster a Chinese AI hardware ecosystem on a 10 trillion USD scale and achieve a 1 trillion USD valuation. In this process, it will also create opportunities for many new entrants in the Western hardware ecosystem.
Let’s start with some interesting KV Cache calculations
Please see @SemiAnalysis_’s recent timely tweet:
DeepSeek has already solved this problem better than anyone else!
Let’s do some fun KV Cache calculations. Don’t worry if you dislike math. We will use the recently released KV Cache calculator to see how much KV Cache DeepSeek V4 Pro can save, and compare it with the latest GLM and Qwen models.
Here, I calculate with a context length of around 1 million tokens, assuming 8-bit KV precision and 16-bit indexer precision. You can also try the calculator yourself: https://kvcache.ai/tools/kv-cache-calculator/
At a context length of 1 million tokens:
· DeepSeek V4 only needs 5.48GB of HBM;
· GLM-5 requires 60GB of HBM;
· Qwen3-235B-A22B needs up to 89GB of HBM.
Note that:
· DeepSeek V4 Pro is a 1.6-trillion-parameter model;
· GLM-5 has about 700 billion parameters and already uses DeepSeek’s MLA and DSA, but not the latest compression attention mechanisms;
· Qwen3-235B-A22B has about 235 billion parameters, using GQA attention.
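The GQA figure can be sanity-checked by hand. The sketch below applies the standard per-token KV Cache formula for grouped-query attention; the Qwen3-235B-A22B configuration values (94 layers, 4 KV heads, head dimension 128) are assumptions taken from the model's public configuration rather than from this article, and the function name is hypothetical.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value):
    # One key vector and one value vector per KV head, per layer.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value

# Assumed Qwen3-235B-A22B attention config; 8-bit KV means 1 byte per value.
per_token = kv_cache_bytes_per_token(
    num_layers=94, num_kv_heads=4, head_dim=128, bytes_per_value=1
)

context_tokens = 1_000_000
total_gib = per_token * context_tokens / 2**30

# Ratios between the figures quoted above (in GB).
qwen_vs_deepseek = 89 / 5.48
glm_vs_deepseek = 60 / 5.48
```

Under these assumptions the formula gives about 89.6 GiB, in line with the 89GB quoted above. By the quoted figures, DeepSeek V4's cache is roughly one sixteenth the size of Qwen3's and one eleventh the size of GLM-5's.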
DeepSeek’s innovations in alleviating memory pressure are fundamental. If these innovations are widely adopted, they will significantly reduce the operating costs of long-running agents and unlock new application scenarios.
The Methodology Behind "Madness"
The reason KV Cache size can be so small without sacrificing model quality is precisely why DeepSeek can offer long-term caching at extremely low prices—less than 3% of Sonnet 4.6’s cache hit cost, and DeepSeek can retain cache for hours.
For long-horizon tasks, a smaller KV Cache can be offloaded to SSD more economically and reloaded when needed. This reduces dependence on HBM. From the perspective of China’s AI hardware industry, HBM is not only in short supply but also one of the hardest memory types to manufacture.
Additionally, DeepSeek has developed technology to load KV Cache from SSD faster, as described in its Dual Path paper.
DeepSeek V4’s KV Cache compression is so aggressive that this loading acceleration might even become unnecessary.
So, who benefits most directly from KV Cache compression?
Who supplies large quantities of SSDs? Don’t forget, YMTC (Yangtze Memory Technologies) is growing into a giant in 3D NAND. NAND can help DeepSeek avoid recomputing KV. Conversely, DeepSeek creates a huge market for NAND and SSDs—benefiting not only Yangtze Memory but other related manufacturers as well.
But it’s not just about NAND and SSD.
LPDDR memory also has huge potential. It can serve as storage for model weights, streaming them into HBM as needed, alleviating pressure on HBM. The SGLang team published a good blog explaining this. The diagram below shows how this scheme works.
Although DeepSeek has not specifically designed for this scheme, its MoE architecture, large number of experts, and 4-bit weights make this approach more feasible.
If combined with extremely compact, lossless KV Cache, this innovation could significantly reduce the demand for HBM.
So, who produces LPDDR in China? The answer is CXMT, or Changxin Memory Technologies. They are only about half a generation behind in speed and a generation behind in density, so the gap isn’t large.
Besides ample NAND, China’s AI ecosystem will soon have sufficient LPDDR supply. Can this ease computational pressure? The answer is yes. Keep reading.
Intelligent memory usage can also ease GPU/ASIC pressure
Using NAND to store KV Cache is quite straightforward: it allows KV Cache to be retained longer, reducing HBM pressure and avoiding recomputation, thus easing GPU and ASIC workloads.
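The idea of retaining KV Cache on cheaper storage instead of recomputing it can be sketched as a two-tier cache. This is a toy illustration under stated assumptions, not DeepSeek's serving stack: the class name, the session-keyed layout, and the LRU eviction policy are all hypothetical, and plain Python containers stand in for HBM and SSD.

```python
from collections import OrderedDict

class TieredKVCache:
    """Toy cache: a small fast tier (standing in for HBM) evicts its
    least-recently-used entries to a large slow tier (standing in for SSD)
    instead of discarding them."""

    def __init__(self, fast_capacity):
        self.fast_capacity = fast_capacity
        self.fast = OrderedDict()  # session_id -> KV blob
        self.slow = {}             # session_id -> KV blob
        self.recomputes = 0

    def put(self, session_id, kv_blob):
        self.fast[session_id] = kv_blob
        self.fast.move_to_end(session_id)
        while len(self.fast) > self.fast_capacity:
            old_id, old_blob = self.fast.popitem(last=False)
            self.slow[old_id] = old_blob  # offload rather than drop

    def get(self, session_id, recompute):
        if session_id in self.fast:
            self.fast.move_to_end(session_id)
            return self.fast[session_id]
        if session_id in self.slow:
            blob = self.slow.pop(session_id)  # reload from the slow tier
        else:
            self.recomputes += 1
            blob = recompute(session_id)      # expensive prefill on GPU/ASIC
        self.put(session_id, blob)
        return blob

cache = TieredKVCache(fast_capacity=1)
cache.put("a", "kv-a")
cache.put("b", "kv-b")  # "a" is offloaded to the slow tier
value = cache.get("a", recompute=lambda sid: "recomputed-" + sid)
```

Here the lookup for session "a" is served from the slow tier, so `recompute` is never called. The smaller each KV blob is, the more sessions the slow tier can hold and the cheaper each reload becomes.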
So, can LPDDR also play a similar role? Besides being a storage location for streaming weights into HBM on demand, can it further reduce computational load?
The answer is yes.
LPDDR can store large amounts of what are called Engram contents. In DeepSeek’s Engram paper, they point out that MoE can expand model capacity via conditional computation, but Transformers lack a native "knowledge retrieval" mechanism. As a result, Transformers often have to inefficiently simulate retrieval through computation.
To address this, DeepSeek proposed the Engram module. It modernizes the classic N-gram embedding into a hash-based O(1) lookup mechanism, creating a complementary sparse path called "conditional memory."
This approach saves computation but requires memory to hold the embedding table, which can be very large.
Essentially, this is a typical "memory-for-computation" trade-off. The key insight is: from the perspective of data read costs per bit, "memory" is much cheaper—one LPDDR lookup is far cheaper than passing data through multiple Transformer layers for a forward pass. In large-scale scenarios, this is a very cost-effective exchange.
This is how DeepSeek sacrifices some memory to save computation.
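A hash-based N-gram lookup of the kind described above can be sketched in a few lines. This is a toy illustration only: the rolling hash, the table size, and the class and method names are assumptions, and it does not reproduce Engram's actual hashing scheme or the way the retrieved vectors are fused into the Transformer.

```python
import random

class HashedNgramMemory:
    """Toy hashed N-gram embedding table with constant-time lookup."""

    def __init__(self, num_slots, dim, n=2, seed=0):
        rng = random.Random(seed)
        self.n = n
        self.num_slots = num_slots
        # The table is the "memory" being traded for computation.
        self.table = [[rng.uniform(-1.0, 1.0) for _ in range(dim)]
                      for _ in range(num_slots)]

    def slot(self, ngram):
        # Polynomial rolling hash over token ids. The cost depends on n only,
        # not on the table size or the sequence length.
        h = 0
        for token_id in ngram:
            h = (h * 1_000_003 + token_id) % self.num_slots
        return h

    def lookup(self, token_ids, position):
        # Embedding for the n-gram ending at `position`, or None when fewer
        # than n tokens are available.
        start = position - self.n + 1
        if start < 0:
            return None
        return self.table[self.slot(tuple(token_ids[start:position + 1]))]

memory = HashedNgramMemory(num_slots=1024, dim=4)
tokens = [17, 42, 17, 42]
first = memory.lookup(tokens, 1)   # bigram (17, 42)
second = memory.lookup(tokens, 3)  # the same bigram later in the sequence
```

Both lookups land on the same table row without any matrix multiplication, which is the sense in which a memory read replaces computation. Distinct N-grams can collide in a hashed table; this sketch does nothing to mitigate that.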
Trade-offs Worth Making
Without chips of equivalent transistor density, and without EUV, Chinese GPUs and ASICs are likely to remain behind Western GPUs in raw FLOPs for a long time. They also still lag significantly in advanced packaging. That makes trading memory for computation very worthwhile, especially given China’s ability to mass-produce NAND and LPDDR.
Reviewing DeepSeek’s Long-term Strategy
From these innovations, it seems DeepSeek’s goal is not to make a few hundred million dollars in profit now. Many of its past choices—no multimodal, no speech models, no video models—indicate this.
The game it is really playing is a patient, long-term one that could reach a scale of 10 trillion USD: promoting the formation of an alternative AI hardware ecosystem.
This is not only to make Chinese memory manufacturers key players in the Chinese and global AI hardware markets but also to fundamentally reduce resource demands, making AI model training and deployment more cost-efficient. This way, many GPU, ASIC, and network chip manufacturers could become viable options.
Meanwhile, these innovations will also benefit the Western open-source ecosystem and new hardware manufacturers.
All signs point to this already happening. Let’s review some of DeepSeek’s innovations so far:
DeepSeek introduced MoE and MLA in V2. MoE reduces the training compute by about 40-50%; MLA cuts KV Cache needs by 90%.
This makes offloading KV Cache to SSD very efficient.
These ideas first appeared in DeepSeek’s May 2024 paper on DeepSeek V2 and later laid the groundwork for training DeepSeek V3. At that time, DeepSeek trained a model that came close to closed-source models using only 2,048 cut-down H800 GPUs.
The core role of DSA is to ensure that compute does not grow with increasing context length. The chart below shows that processing time in DeepSeek-V3.2 remains relatively stable as context length increases.
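One way to see why attention cost can stay roughly flat is to assume a sparse-attention design in which a cheap scoring function picks a fixed number k of past tokens and full attention runs only over those. The sketch below illustrates that assumption in plain Python; it is not DeepSeek's DSA kernel, and the function name and the precomputed `index_scores` input are hypothetical.

```python
import heapq
import math

def sparse_attention(query, keys, values, index_scores, k):
    """Attend only to the k past tokens with the highest index scores."""
    top = heapq.nlargest(k, range(len(keys)), key=lambda i: index_scores[i])
    # Dot-product logits over the selected tokens only.
    logits = [sum(q * x for q, x in zip(query, keys[i])) for i in top]
    peak = max(logits)
    weights = [math.exp(l - peak) for l in logits]  # stable softmax numerators
    total = sum(weights)
    dim = len(values[0])
    output = [sum(w * values[i][d] for w, i in zip(weights, top)) / total
              for d in range(dim)]
    return top, output

keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
index_scores = [0.1, 0.9, 0.2, 0.8]  # assumed output of a cheap scoring pass
selected, out = sparse_attention([1.0, 1.0], keys, values, index_scores, k=2)
```

In this sketch the softmax attention step touches k tokens regardless of how long the context is. The scoring pass still visits every past token, so the saving depends on that pass being much cheaper than full attention.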
mHC is a macro-architecture innovation that redesigns information flow between Transformer layers.
Historically, models since ResNet used standard residual connections, i.e., x + F(x). mHC extends residual flow into multiple parallel channels, allowing learnable mixing between them. The key is constraining the mixing matrix to a bistochastic matrix via Sinkhorn-Knopp projection, ensuring signal stability regardless of model depth.
This solves the instability problem caused by unconstrained Hyper-Connections, which initially caused training collapse at 100k parameters due to exponential signal growth (up to 3,000x).
mHC’s computational cost is low—only about 6.7% additional training time—since it doesn’t alter FLOPs in attention or FFN layers, just the routing of their outputs.
Its performance improvements are significant: at 27 billion parameters, mHC boosts BIG-Bench Hard reasoning scores by 7.2 points, improves DROP by 3.2, raises GSM8K math scores by 2.8, and lifts MMLU general knowledge by 1.4, all under similar model size and computational budget.
Essentially, mHC provides a richer, more expressive inter-layer routing topology, achieving higher parameter efficiency with minimal extra FLOPs.
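The Sinkhorn-Knopp projection mentioned above can be sketched as alternating row and column normalization of a strictly positive matrix. This is a minimal illustration of the classical algorithm, not mHC's implementation; the function name, the exponentiation of raw logits, and the iteration count are assumptions.

```python
import math

def sinkhorn_knopp(logits, iterations=100):
    """Push a square matrix of logits toward a doubly stochastic matrix."""
    m = [[math.exp(x) for x in row] for row in logits]  # strictly positive
    n = len(m)
    for _ in range(iterations):
        # Normalize rows so each sums to 1.
        m = [[v / sum(row) for v in row] for row in m]
        # Normalize columns so each sums to 1.
        col_sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
        m = [[m[i][j] / col_sums[j] for j in range(n)] for i in range(n)]
    return m

mix = sinkhorn_knopp([[2.0, 0.0, -1.0], [0.5, 1.0, 0.0], [-1.0, 0.3, 1.5]])
row_sums = [sum(row) for row in mix]
col_sums = [sum(mix[i][j] for i in range(3)) for j in range(3)]
```

Because every row sums to 1 and every entry is non-negative, each mixed channel is a weighted average of its inputs, and a product of doubly stochastic matrices is again doubly stochastic. That is why stacking many such mixing steps does not amplify the signal the way unconstrained mixing can.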
CSA and HCA aim to reduce KV token requirements by 90% through KV Cache compression, while also significantly lowering FLOPs, alleviating pressure on HBM and on GPUs/ASICs.
The detailed diagram below shows that, with the same total parameter budget, Engram provides clear performance gains.
With TileLang, developers can write a single kernel—low-level code for computation—and have it run successfully across multiple hardware platforms, provided those platforms have TileLang backends.
I expect other Chinese AI labs will join in, helping Chinese hardware vendors indirectly tackle the so-called "CUDA moat." It will also unlock more potential in Western hardware, like AMD.
It’s worth noting that many Chinese AI hardware platforms, such as Moore Threads, MuX, Biren, and Tianshu Zhixin, already achieve high CUDA compatibility through translation layers. So, in theory, they may not need TileLang.
Large-scale Reinforcement Learning and RSI
As DeepSeek gains access to more hardware options and as the model’s own compute demands decrease, it can pursue more ambitious training projects, especially in reinforcement learning post-training.
Reinforcement learning requires generating a vast number of trajectories, on the order of trillions of tokens, which quickly becomes extremely expensive. Furthermore, training a model for a context length of 1 million tokens requires generating trajectories of comparable length. Only training on such ultra-long trajectories can truly support long-horizon tasks.
Additionally, with more hardware options, DeepSeek can marshal more resources for automated research, namely recursive self-improvement (RSI). Here RSI refers to AI designing and executing experiments itself. This involves a lot of trial and error, with costs rising rapidly. But RSI is crucial for exploring the full model design space. Before reaching AGI, and later ASI, DeepSeek must develop RSI capabilities.
What DeepSeek is doing today, the entire industry will follow tomorrow
Innovations around mixture-of-experts models, MLA, DSA, and others are already being adopted by other AI labs in China and worldwide.
For example, ZAI, the developer of the GLM series, uses MLA and DSA. Kimi, or MoonShot, also adopted MLA and openly states its architecture is based on DeepSeek’s design. Conversely, DeepSeek also uses the Muon optimizer, which was first employed by Kimi (MoonShot) in large-scale training.
It’s important to note:
The sparsely gated MoE layer used in today’s large models was proposed by Google in 2017, with Noam Shazeer as a key author. DeepSeek’s contribution is applying MoE at scale and inventing its own supporting techniques.
Muon, or the MomentUm Orthogonalized by Newton-Schulz optimizer, was proposed by researcher Keller Jordan at the end of 2024. Kimi (MoonShot) was the first team to use it for large-scale training.
What about making money?
Let’s look at an interesting example from OpenAI.
OpenAI obtained warrants/options to buy AMD and Cerebras stock at lower prices, linked to their compute milestones. For AMD and Cerebras, this is a very cost-effective deal. Once OpenAI commits to using their hardware, their long-term success prospects increase significantly.
AMD’s announcement states:
"Under the agreement, to further coordinate strategic interests, AMD issued warrants to OpenAI to purchase up to 160 million AMD common shares, which will vest gradually upon reaching specific milestones. The first batch will vest upon initial deployment of 1 GW, with subsequent batches vesting as procurement scales to 6 GW. Vesting conditions also depend on AMD reaching certain stock price targets and OpenAI achieving technical and commercial milestones necessary for large-scale AMD deployment."
I expect DeepSeek will also reach similar agreements with multiple Chinese memory, ASIC, CPU, and network technology vendors, collaborating deeply to enable their hardware stacks to handle leading AI workloads.
Considering that the combined market value of AI stocks in the West and its East Asian allies already exceeds 10 trillion USD, this "equity partnership" approach could help China build a similarly large industry, with DeepSeek taking its share and ultimately reaching a 1 trillion USD valuation.
This would not only generate profits far beyond traditional application subscriptions but also realize its goal of "benefiting everyone with AGI." Liang Wenfeng, a devoted Jim Simons fan and a savvy capital player, will not miss this opportunity.
Looking back at everything DeepSeek has done so far, this is the explanation that makes the most sense.