Editor's Note: Over the past year, discussions around DeepSeek have mostly focused on model performance, open-source strategies, and price wars. But if you only understand DeepSeek through "selling subscriptions," "multimodal capabilities," or "building a coding agent," you might underestimate what it truly aims to change.

This article presents a more radical judgment: DeepSeek's goal may not be short-term monetization through application layers, but rather reshaping the cost structure of AI training and inference through a series of foundational architectural innovations, indirectly promoting the formation of a new hardware ecosystem. From MoE and MLA to DSA, CSA, mHC, and Engram, and on to Dual Path and TileLang, DeepSeek’s technical route consistently revolves around one core question: how to run more powerful models with less high-end computing power, given constraints such as HBM, advanced process nodes, packaging, and the CUDA ecosystem.

The most noteworthy aspect of this article is not whether "DeepSeek can make hundreds of millions of dollars through APIs or subscriptions," but whether it is binding model capabilities, memory systems, and domestic hardware ecosystems together. KV Cache compression reduces dependence on HBM; NAND and SSD can handle long-term caching; LPDDR can be used for streaming weights and Engram storage; TileLang aims to weaken the CUDA moat. If these innovations continue to spread, the beneficiaries will not only be DeepSeek itself but also storage, ASICs, GPUs, network chips, and the entire AI infrastructure chain.

Of course, the judgments about a "10 trillion USD industry ecosystem" and a "trillion-dollar valuation" still carry strong speculative elements. But they provide an important path to understanding DeepSeek: open source does not necessarily mean abandoning commercialization; low prices are not necessarily just market subsidies. For DeepSeek, the real business may not be at the application layer but in helping more hardware become available and enabling lower-cost AI supply. In other words, what it sells may not be just the model itself, but the feasibility of the next-generation AI infrastructure.

Below is the original text:

Have you ever wondered how DeepSeek will make money, and possibly make a lot of it?

It hasn't launched a competitive coding subscription plan like GLM, MoonShot, or MiniMax, nor does it have multimodal, audio, or video models. So far, it doesn't even have its own harness (the outer framework for model invocation, tool integration, and task execution), though it has recently started recruiting for related positions to build one.

Meanwhile, DeepSeek seems to remain firmly committed to open source, even openly sharing its "secrets." Isn’t that crazy? Isn’t it just burning money? Are the investors willing to pour 10 billion dollars into it, only to throw that money down the drain?

I personally believe the answer is quite the opposite.

Next, I will share some observations based on what DeepSeek has done so far and analyze the strategic path it appears to be following. The goal of DeepSeek CEO Liang Wenfeng may go far beyond competing on models today. He may be aiming for a bigger prize: DeepSeek has the potential to reach a 1 trillion USD valuation while driving the formation of a new industry worth 10 trillion USD.

TechInAsia on DeepSeek’s latest funding round

Revisiting DeepSeek’s "Hero’s Journey"

DeepSeek has been swimming against the current. It hasn't chosen to keep releasing slightly better models and then rush to package them into directly monetizable applications like coding subscriptions. On January 27, 2025, I posted a widely circulated tweet describing what I see as DeepSeek’s "Hero’s Journey." Now, this story has become even more interesting.

While others were still building dense models, DeepSeek chose mixture-of-experts (MoE) models, which are harder to train.

They adopted a "first principles" approach, inventing a new GRPO algorithm to replace the then-mainstream but more costly PPO reinforcement learning algorithm.

They found that Reinforcement Learning with Verifiable Rewards (RLVR) is a key strategy for improving a model's reasoning capabilities.

They also used a simple training objective called "Multi-Token Prediction," which makes training signals denser.

They improved the "Zero Bubble" pipeline to better utilize limited GPU resources.

They released expert load balancers, making it easier for everyone to deploy MoE models. Especially through the "Wide Expert Parallel" strategy, models can serve larger batches, significantly reducing inference costs.

They invented mechanisms like MLA, DSA, CSA, HCA to reduce KV Cache requirements and to keep compute growth with context length as flat as possible.

They created Engram, trading memory for computational efficiency.

They also invented mHC, enabling stable training even as model size scales up. Many similar innovations exist.
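Of the techniques listed above, GRPO's core idea is compact enough to sketch. Rather than training a separate value network as PPO does, GRPO samples a group of responses to the same prompt and scores each one against the group's own mean and standard deviation. The snippet below is a minimal illustration of that normalization step only; the function name and sample rewards are made up for illustration, and the clipped policy-gradient objective and KL penalty are omitted.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group (illustrative helper)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation over the group
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical verifiable rewards (1 = correct, 0 = wrong) for 4 sampled answers
# to the same prompt.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Because the baseline comes from the group itself, no critic model has to be trained or held in memory, which is where the cost saving over PPO comes from.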

In the "Hero’s Journey" narrative, heroes don’t initially know where their journey will lead. They learn along the way, gradually discovering their true mission, overcoming obstacles, ignoring skeptics, and facing malicious actors. They have flaws or shortcomings but ultimately overcome them to fulfill their mission. They confront seemingly insurmountable challenges, find ways to form alliances, and learn to use limited resources wisely. It’s this process that makes audiences root for the hero. This is also why DeepSeek has gained followers, global respect, and opponents.

As I will detail next, DeepSeek has been on this path for a long time, gradually discovering its ultimate destiny: its goal is not to sell coding subscriptions but to foster a Chinese AI hardware ecosystem worth 10 trillion USD and to reach a 1 trillion USD valuation itself. In the process, it will also create opportunities for many new entrants in the Western hardware ecosystem.

Let’s start with some interesting KV Cache calculations

Please see @SemiAnalysis_’s recent timely tweet:

DeepSeek has already solved this problem better than anyone else!

Let’s do some fun KV Cache calculations. Don’t worry if you dislike math. We will use the recently released KV Cache calculator to see how much KV Cache DeepSeek V4 Pro can save, and compare it with the latest GLM and Qwen models.

Here, I calculate with a context length of around 1 million tokens, assuming 8-bit KV precision and 16-bit indexer precision. You can also try the calculator yourself: https://kvcache.ai/tools/kv-cache-calculator/

At a context length of 1 million tokens:

· DeepSeek V4 only needs 5.48GB of HBM;
· GLM-5 requires 60GB of HBM;
· Qwen3-235B-A22B needs up to 89GB of HBM.

Note that:

· DeepSeek V4 Pro is a 1.6 trillion parameter model;
· GLM-5 has about 700 billion parameters and already uses DeepSeek’s MLA and DSA, but not the latest compression attention mechanisms;
· Qwen3-235B-A22B has about 235 billion parameters, using GQA attention.

DeepSeek’s innovations in relieving memory pressure are fundamental. If widely adopted, they will significantly reduce the operating costs of long-running agents and unlock new application scenarios.
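As a rough cross-check of the figures above, the KV Cache of a standard GQA model can be estimated by hand: one key vector and one value vector per KV head, per layer, per token. The sketch below applies that formula to Qwen3-235B-A22B. The attention shape (94 layers, 4 KV heads, head dimension 128) is an assumption based on the publicly released model configuration and is not stated in this article, so it should be verified before relying on the result. The function name is illustrative.

```python
def gqa_kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value):
    """KV cache size for a GQA model: one K and one V vector per KV head, layer, and token."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens

GIB = 2**30

# Assumed Qwen3-235B-A22B attention shape (not given in the article), 8-bit KV.
qwen_bytes = gqa_kv_cache_bytes(
    tokens=1_000_000, layers=94, kv_heads=4, head_dim=128, bytes_per_value=1
)
qwen_gib = qwen_bytes / GIB

# Figures quoted in the article, in GB.
quoted = {"DeepSeek V4": 5.48, "GLM-5": 60.0, "Qwen3-235B-A22B": 89.0}
ratios = {name: size / quoted["DeepSeek V4"] for name, size in quoted.items()}

print(f"Estimated Qwen3 KV cache: {qwen_gib:.1f} GiB")
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}x the DeepSeek V4 cache")
```

Under these assumptions the estimate lands close to the quoted 89GB figure, and the quoted numbers put the GLM-5 and Qwen3 caches at roughly 11 and 16 times the size of DeepSeek V4's.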

KV Cache occupancy comparison at around 1 million tokens context and model scale

The Methodology Behind "Madness"

The fact that the KV Cache can be this small without sacrificing model quality is precisely why DeepSeek can offer long-lived caching at extremely low prices: less than 3% of Sonnet 4.6’s cache-hit cost, with the cache retained for hours.

For long-horizon tasks, a smaller KV Cache can be offloaded to SSD and reloaded when needed far more economically. This reduces dependence on HBM. From the perspective of China’s AI hardware industry, HBM is not only in short supply but also one of the most difficult types of memory to manufacture.
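To make the offloading argument concrete, here is a back-of-the-envelope sketch of how long it would take to stream a 1-million-token KV Cache back from SSD. The cache sizes are the ones quoted earlier in this article; the 7 GB/s read bandwidth is a hypothetical figure for a single NVMe drive, chosen only for illustration, and the helper name is made up.

```python
def reload_seconds(cache_gb, read_gb_per_s):
    """Time to stream a KV cache of the given size back from storage."""
    return cache_gb / read_gb_per_s

ASSUMED_SSD_READ_GB_PER_S = 7.0  # hypothetical single NVMe drive; not from the article

for name, cache_gb in [("DeepSeek V4", 5.48), ("GLM-5", 60.0), ("Qwen3-235B-A22B", 89.0)]:
    seconds = reload_seconds(cache_gb, ASSUMED_SSD_READ_GB_PER_S)
    print(f"{name}: {seconds:.1f} s to reload")
```

Whatever the real bandwidth turns out to be, reload time scales linearly with cache size, so a cache that is an order of magnitude smaller is an order of magnitude faster to bring back.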

Additionally, DeepSeek has developed technology to load KV Cache from SSD faster, as described in its Dual Path paper.

DeepSeek V4’s KV Cache compression is so extensive that this step might even become unnecessary.

So, who benefits most directly from KV Cache compression?

Who supplies large quantities of SSDs? Don’t forget, YMTC (Yangtze Memory Technologies) is growing into a giant in 3D NAND. NAND can help DeepSeek avoid recomputing KV. Conversely, DeepSeek creates a huge market for NAND and SSDs—benefiting not only Yangtze Memory but other related manufacturers as well.

But it’s not just about NAND and SSD.

LPDDR memory also has huge potential. It can serve as storage for model weights, streaming them into HBM as needed, alleviating pressure on HBM. The SGLang team published a good blog explaining this. The diagram below shows how this scheme works.

Although DeepSeek has not designed specifically for this scheme, its MoE architecture, large number of experts, and 4-bit weights make the approach more feasible.

This diagram illustrates how memory might be used and how model weights could stream from LPDDR to HBM. Highly recommended reading SGLang’s blog.

If combined with extremely compact, lossless KV Cache, this innovation could significantly reduce the demand for HBM.
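A quick calculation shows why weight streaming matters at this scale. The sketch below uses the 1.6 trillion parameter count and the 4-bit weight precision mentioned in this article, and assumes for simplicity that every weight is stored at that precision, ignoring quantization scales and other metadata. The 80 GB of HBM per accelerator is a hypothetical figure used only to give a sense of proportion.

```python
def weight_bytes(num_params, bits_per_weight):
    """Raw storage for the weights, ignoring quantization scales and metadata."""
    return num_params * bits_per_weight // 8

PARAMS = 1_600_000_000_000            # 1.6 trillion, as quoted in the article
four_bit = weight_bytes(PARAMS, 4)    # 4-bit weights, as quoted in the article
sixteen_bit = weight_bytes(PARAMS, 16)

ASSUMED_HBM_PER_DEVICE = 80 * 10**9   # hypothetical accelerator; not from the article
devices_for_weights = -(-four_bit // ASSUMED_HBM_PER_DEVICE)  # ceiling division

print(f"4-bit weights:  {four_bit / 1e9:.0f} GB")
print(f"16-bit weights: {sixteen_bit / 1e9:.0f} GB")
print(f"Devices needed just to hold 4-bit weights in HBM: {devices_for_weights}")
```

Since an MoE model only activates a fraction of its experts per token, keeping the full set in cheaper LPDDR and streaming in what is needed avoids paying for HBM that sits idle most of the time.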

So, who produces LPDDR in China? The answer is CXMT, or Changxin Memory Technologies. They are only about half a generation behind in speed and a generation behind in density, so the gap isn’t large.

Besides ample NAND, China’s AI ecosystem will soon have sufficient LPDDR supply. Can this ease computational pressure? The answer is yes. Keep reading.

Intelligent memory usage can also ease GPU/ASIC pressure

Using NAND to store KV Cache is quite straightforward: it allows KV Cache to be retained longer, reducing HBM pressure and avoiding recomputation, thus easing GPU and ASIC workloads.

So, can LPDDR also play a similar role? Besides being a storage location for streaming weights into HBM on demand, can it further reduce computational load?

The answer is yes.

LPDDR can store large amounts of what are called Engram contents. In DeepSeek’s Engram paper, they point out that MoE can expand model capacity via conditional computation, but Transformers lack a native "knowledge retrieval" mechanism. As a result, Transformers often have to inefficiently simulate retrieval through computation.

To address this, DeepSeek proposed the Engram module. It modernizes the classic N-gram embedding into a hash-based O(1) lookup mechanism, creating a complementary sparse path called "conditional memory."

This approach saves computation but requires memory to hold the embedding table, which can be very large.

Essentially, this is a typical "memory-for-computation" trade-off. The key insight is: from the perspective of data read costs per bit, "memory" is much cheaper—one LPDDR lookup is far cheaper than passing data through multiple Transformer layers for a forward pass. In large-scale scenarios, this is a very cost-effective exchange.
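The lookup idea can be illustrated with a toy hashed N-gram table. This is a simplified sketch of the general technique, not DeepSeek's Engram implementation: the class name, the rolling hash, the table size, and the use of a single hash function are all assumptions made for illustration.

```python
import numpy as np

class HashedNgramMemory:
    """Toy hashed N-gram embedding table (illustrative; not DeepSeek's Engram code)."""

    def __init__(self, num_slots, dim, n=2, seed=0):
        rng = np.random.default_rng(seed)
        # In a real model this table would be learned; here it is random.
        self.table = rng.standard_normal((num_slots, dim)).astype(np.float32)
        self.num_slots = num_slots
        self.n = n

    def _slot(self, ngram):
        # Simple polynomial rolling hash over token ids. A production system
        # would likely combine several hash functions to reduce collisions.
        h = 0
        for token_id in ngram:
            h = (h * 1_000_003 + token_id) % self.num_slots
        return h

    def lookup(self, token_ids, position):
        """Return the memory vector for the N-gram ending at `position`."""
        start = max(0, position - self.n + 1)
        ngram = tuple(token_ids[start:position + 1])
        return self.table[self._slot(ngram)]

memory = HashedNgramMemory(num_slots=1 << 16, dim=8)
tokens = [17, 42, 99, 42, 99]
vec_a = memory.lookup(tokens, position=2)  # bigram (42, 99)
vec_b = memory.lookup(tokens, position=4)  # same bigram at a later position
print(np.array_equal(vec_a, vec_b))
```

Each lookup costs one hash and one row read regardless of sequence length or table size, which is why the table can live in large, inexpensive memory rather than being recomputed by Transformer layers.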

This is how DeepSeek sacrifices some memory to save computation.

Trade-offs Worth Making

Without chips of equivalent transistor density, and without EUV, Chinese GPUs and ASICs are likely to trail Western GPUs in raw FLOPs for a long time. They also still lag significantly in advanced packaging. That makes these trade-offs very worthwhile, especially given China’s ability to mass-produce NAND and LPDDR.

Reviewing DeepSeek’s Long-term Strategy

From these innovations, it seems DeepSeek’s goal is not to make a few hundred million dollars in profit now. Many of its past choices—no multimodal, no speech models, no video models—indicate this.

What it is really playing is a patient, long-term game that could reach a scale of 10 trillion USD: promoting the formation of an alternative AI hardware ecosystem.

This is not only to make Chinese memory manufacturers key players in the Chinese and global AI hardware markets but also to fundamentally reduce resource demands, making AI model training and deployment more cost-efficient. This way, many GPU, ASIC, and network chip manufacturers could become viable options.

Meanwhile, these innovations will also benefit the Western open-source ecosystem and new hardware manufacturers.

All signs point to this already happening. Let’s review some of DeepSeek’s innovations so far:

Introduction of MoE and MLA in DeepSeek V2

DeepSeek introduced MoE and MLA in V2. MoE reduces the training compute by about 40-50%; MLA cuts KV Cache needs by 90%.
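The KV Cache saving from MLA comes from caching one small latent vector per token instead of full per-head keys and values, and rebuilding the keys and values from that latent when attention is computed. The following is a minimal sketch of that low-rank idea with toy dimensions chosen for illustration. It omits the separate positional (RoPE) key that the real design also caches, and the comparison baseline is plain multi-head attention with the same head layout, so the percentage it prints is not the figure quoted above.

```python
import numpy as np

# Toy sizes chosen for illustration only.
d_model, n_heads, head_dim, d_latent = 64, 4, 16, 8

rng = np.random.default_rng(0)
W_down = rng.standard_normal((d_latent, d_model))             # compress hidden state
W_up_k = rng.standard_normal((n_heads * head_dim, d_latent))  # expand latent to keys
W_up_v = rng.standard_normal((n_heads * head_dim, d_latent))  # expand latent to values

h = rng.standard_normal(d_model)  # hidden state of one token
latent = W_down @ h               # the only thing cached for this token
k = (W_up_k @ latent).reshape(n_heads, head_dim)  # rebuilt at attention time
v = (W_up_v @ latent).reshape(n_heads, head_dim)

mha_cached = 2 * n_heads * head_dim  # standard attention caches full K and V
mla_cached = d_latent                # latent-style caching keeps only the latent
saving = 1 - mla_cached / mha_cached

print(f"Cached values per token: {mha_cached} vs {mla_cached} ({saving:.1%} smaller)")
```

The trade is extra matrix multiplications at attention time in exchange for a much smaller cache, which suits hardware where memory is scarcer than compute.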

This makes offloading KV Cache to SSD very efficient.

These ideas first appeared in DeepSeek’s May 2024 paper on DeepSeek V2. Later, they laid the groundwork for training DeepSeek V3. At that time, DeepSeek trained a model close to closed-source models in capability using only 2,048 H800 GPUs, the cut-down export version of the H100.

DSA: introduced in DeepSeek V3.2 Exp, to reduce compute overhead in long-context scenarios and ease HBM bandwidth pressure.

The core role of DSA is to keep compute from growing steeply as context length increases. The chart below shows that processing time in DeepSeek-V3.2 remains relatively stable as context length increases.
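The general mechanism behind this kind of sparse attention can be sketched as top-k selection: a cheap scoring pass ranks the cached tokens, and full attention is computed only over the best k of them. The code below is a simplified single-query illustration, not DeepSeek's implementation. In particular, the indexer scores here are just a stand-in computed from the keys themselves, whereas a real system would use a separately learned lightweight indexer.

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def dense_attention(q, K, V):
    scores = K @ q / np.sqrt(q.shape[0])
    return softmax(scores) @ V

def topk_sparse_attention(q, K, V, index_scores, k):
    """Attend only to the k cached tokens ranked highest by a cheap indexer score."""
    k = min(k, K.shape[0])
    keep = np.argpartition(index_scores, -k)[-k:]  # indices of the k largest scores
    return dense_attention(q, K[keep], V[keep])

rng = np.random.default_rng(0)
L, d = 1024, 32
q = rng.standard_normal(d)
K = rng.standard_normal((L, d))
V = rng.standard_normal((L, d))
index_scores = K @ q  # stand-in for a learned lightweight indexer (assumption)

out = topk_sparse_attention(q, K, V, index_scores, k=64)
print(out.shape)
```

With k fixed, the expensive attention step costs the same whether the context holds a thousand tokens or a million; only the cheap ranking pass still scales with context length.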

mHC: proposed in DeepSeek’s December 2025 paper "mHC: Manifold-Constrained Hyper-Connections."

mHC is a macro-architecture innovation that redesigns information flow between Transformer layers.

Historically, models since ResNet used standard residual connections, i.e., x + F(x). mHC extends residual flow into multiple parallel channels, allowing learnable mixing between them. The key is constraining the mixing matrix to a bistochastic matrix via Sinkhorn-Knopp projection, ensuring signal stability regardless of model depth.
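The Sinkhorn-Knopp step can be sketched in a few lines: exponentiate the raw mixing weights so they are positive, then alternately normalize rows and columns until both sum to one. This is a minimal illustration of the projection itself, with a made-up function name and a toy 4-channel mixing matrix; it does not reproduce how the mHC paper integrates the step into training.

```python
import numpy as np

def sinkhorn_knopp(logits, iters=100):
    """Push a square matrix of logits toward the doubly stochastic set."""
    M = np.exp(logits - logits.max())  # strictly positive entries
    for _ in range(iters):
        M = M / M.sum(axis=1, keepdims=True)  # make rows sum to 1
        M = M / M.sum(axis=0, keepdims=True)  # make columns sum to 1
    return M

rng = np.random.default_rng(0)
mix = sinkhorn_knopp(rng.uniform(-1.0, 1.0, size=(4, 4)))

# Stacking many layers multiplies their mixing matrices. A product of doubly
# stochastic matrices is again doubly stochastic, so repeated mixing cannot
# amplify the signal without bound.
deep = np.linalg.matrix_power(mix, 64)

print(mix.sum(axis=0), mix.sum(axis=1))
print(deep.max())
```

An unconstrained mixing matrix has no such guarantee: if its largest singular value is even slightly above one, repeated multiplication across layers grows exponentially with depth.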

This solves the instability problem caused by unconstrained Hyper-Connections, which initially caused training collapse at 100k parameters due to exponential signal growth (up to 3,000x).

mHC’s computational cost is low—only about 6.7% additional training time—since it doesn’t alter FLOPs in attention or FFN layers, just the routing of their outputs.

Its performance improvements are significant: at 27 billion parameters, mHC boosts BIG-Bench Hard reasoning scores by 7.2 points, improves DROP by 3.2, raises GSM8K math scores by 2.8, and improves MMLU general knowledge by 1.4, all under a similar model size and compute budget.

Essentially, mHC provides a richer, more expressive inter-layer routing topology, achieving higher parameter efficiency with minimal extra FLOPs.

mHC is a complex architecture but offers more stable training and higher parameter efficiency.

CSA, HSA: introduced in DeepSeek V4 in April 2026.

CSA and HSA aim to reduce KV Token requirements by 90% through KV Cache compression, while also significantly lowering FLOPs, alleviating pressure on HBM and GPU/ASIC.

Engram: introduced in early 2026, essentially using memory—LPDDR—to trade for computational efficiency.

The detailed diagram below shows that, with the same total parameter budget, Engram provides clear performance gains.

This is advice shared with hardware manufacturers in DeepSeek’s V4 paper. I am quite sure that offline feedback from them would be even more extensive.

Investment in TileLang also points in the same direction: DeepSeek is not just solving its own compute bottleneck but is pushing China’s hardware ecosystem to compete with the West.

With TileLang, developers can write a single kernel—low-level code for computation—and have it run successfully across multiple hardware platforms, provided those platforms have TileLang backends.

I expect other Chinese AI labs will join in, helping Chinese hardware vendors indirectly tackle the so-called "CUDA moat." It will also unlock more potential in Western hardware, like AMD.

It’s worth noting that many Chinese AI hardware platforms, such as Moore Threads, MetaX (Muxi), Biren, and Iluvatar CoreX (Tianshu Zhixin), already achieve high CUDA compatibility through translation layers. So, in theory, they may not need TileLang.

Large-scale Reinforcement Learning and RSI

As DeepSeek gains access to more hardware options and as the model’s own compute demands decrease, it can pursue more ambitious training projects, especially in reinforcement learning post-training.

Reinforcement learning requires generating a vast number of trajectories, amounting to trillions of tokens. This quickly becomes extremely expensive. Furthermore, training a model with a context length of 1 million tokens requires generating trajectories of the same length. Only training on such ultra-long trajectories can truly support long-horizon tasks.

Additionally, with more hardware options, DeepSeek can marshal more resources for automated research, namely RSI (recursive self-improvement). Here RSI refers to AI designing and executing experiments itself. This involves a lot of trial and error, with costs rising rapidly. But RSI is crucial for exploring the full model design space. Before reaching AGI, and later ASI, DeepSeek must develop RSI capabilities.

What DeepSeek is doing today, the entire industry will follow tomorrow

Innovations around mixture-of-experts models, MLA, DSA, and others are already being adopted by other AI labs in China and worldwide.

For example, Z.ai, the developer of the GLM series, uses MLA and DSA. MoonShot, the developer of Kimi, also adopted MLA and openly states that its architecture is based on DeepSeek’s design. Conversely, DeepSeek uses the Muon optimizer, which MoonShot was the first to employ in large-scale training.

It’s important to note:

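The sparsely gated form of MoE used in today's large models was proposed by Google in 2017, with Noam Shazeer as a key author; the underlying mixture-of-experts idea dates back to the early 1990s. DeepSeek’s contribution is applying MoE at scale and inventing its own supporting techniques.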

Muon, or the MomentUm Orthogonalized by Newton-Schulz optimizer, was proposed by researcher Keller Jordan at the end of 2024. Kimi (MoonShot) was the first team to use it for large-scale training.
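The "orthogonalized by Newton-Schulz" part of the name refers to replacing the momentum-averaged gradient matrix with an approximately orthogonal matrix using only matrix multiplications. The sketch below uses the classic cubic Newton-Schulz iteration to show the idea. It is not Muon's actual update: Muon uses a tuned higher-order polynomial run for only a few steps, and the function name and example matrix here are illustrative.

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=15):
    """Approximate the orthogonal polar factor of G via cubic Newton-Schulz."""
    # Scale by the Frobenius norm so every singular value is at most 1,
    # which keeps the iteration inside its convergence region.
    X = G / (np.linalg.norm(G) + 1e-12)
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X  # pushes each singular value toward 1
    return X

G = np.array([[3.0, 1.0], [0.0, 2.0]])  # stand-in for a momentum-averaged gradient
O = newton_schulz_orthogonalize(G)

print(np.round(O @ O.T, 6))
```

Because the iteration needs nothing but matrix products, it runs efficiently on accelerators, which is part of what makes the approach practical for large-scale training.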

What about making money?

Let’s look at an interesting example from OpenAI.

OpenAI obtained warrants/options to buy AMD and Cerebras stock at lower prices, linked to their compute milestones. For AMD and Cerebras, this is a very cost-effective deal. Once OpenAI commits to using their hardware, their long-term success prospects increase significantly.

AMD’s announcement states:

"Under the agreement, to further coordinate strategic interests, AMD issued warrants to OpenAI to purchase up to 160 million AMD common shares, which will vest gradually upon reaching specific milestones. The first batch will vest upon initial deployment of 1 GW, with subsequent batches vesting as procurement scales to 6 GW. Vesting conditions also depend on AMD reaching certain stock price targets and OpenAI achieving technical and commercial milestones necessary for large-scale AMD deployment."

I expect DeepSeek will also reach similar agreements with multiple Chinese memory, ASIC, CPU, and network technology vendors, collaborating deeply to enable their hardware stacks to handle leading AI workloads.

Considering that the total market value of AI stocks in the West, including its East Asian allies, already exceeds 10 trillion USD, this "equity partnership" approach could help China build a similarly huge industry, with DeepSeek taking its share and ultimately reaching a 1 trillion USD valuation.

This would not only generate profits far beyond traditional application subscriptions but also realize its goal of "benefiting everyone with AGI." Liang Wenfeng, a devoted Jim Simons fan and a savvy capital player, will not miss this opportunity.

Looking back at everything DeepSeek has done so far, this is the explanation that makes the most sense.

These are key AI stocks. The chart does not include hyperscalers—large-scale cloud providers—and many other related companies.


