Don't overestimate NVIDIA, don't underestimate DeepSeek
DeepSeek-V4 is finally here.
On April 24th, the preview version of the all-new series model DeepSeek-V4 was officially launched and simultaneously open-sourced.
DeepSeek-V4 pioneers a new attention mechanism that compresses along the token dimension, combined with DSA (DeepSeek Sparse Attention), achieving globally leading long-context capability while sharply reducing compute and memory requirements compared with traditional dense attention.
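The details of DSA are not spelled out here, but the general idea behind sparse attention can be shown with a toy top-k sketch (a generic illustration, not DeepSeek's actual kernel): each query attends only to its k highest-scoring keys, so the softmax and the weighted sum scale with k instead of the full sequence length.

```python
import numpy as np

def topk_sparse_attention(Q, K, V, k):
    # Full attention logits: (n_queries, n_keys)
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    out = np.zeros_like(Q)
    for i, row in enumerate(scores):
        idx = np.argpartition(row, -k)[-k:]  # keep only the top-k keys per query
        w = np.exp(row[idx] - row[idx].max())
        w /= w.sum()                         # softmax over just the k selected keys
        out[i] = w @ V[idx]                  # weighted sum over k values, not n
    return out

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((8, 16)) for _ in range(3))
print(topk_sparse_attention(Q, K, V, k=4).shape)  # (8, 16)
```

With k fixed, per-query work is O(k) rather than O(n), which is why sparse attention eases both compute and memory pressure as context length grows.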
Miaotou believes this will directly weaken NVIDIA’s GPU advantage. It’s worth noting that DeepSeek-V4 also prioritizes compatibility with domestic chip manufacturers.
In other words, don’t overestimate NVIDIA’s moat, and don’t underestimate the architectural revolution that DeepSeek is sparking. The key isn’t “who replaces whom,” but rather how profit distribution, deployment paths, and investment logic in the AI industry chain may be changing.
Dancing with “shackles”
Over the past two years, AI large models have mainly focused on training, competing with computing power.
To some extent, the competition among foundational AI models is essentially a competition over GPU computing infrastructure: whoever acquires more high-end GPUs and builds larger clusters has a better chance of developing more powerful foundation models.
However, because US export controls ban the sale of top-tier chips like the H100/H200 to China, and US restrictions also block access to TSMC’s advanced process nodes, domestic GPUs still lag behind NVIDIA’s cards.
“Domestic GPU manufacturers are competing alongside NVIDIA while wearing ‘shackles,’” a GPU industry insider once described to Miaotou.
Interestingly, under such adverse conditions, the gap between Chinese and US large models has begun to narrow over the past two years, even approaching parity.
By the end of 2023, the performance gap between top Chinese and US models was still around 20%-30% across various dimensions. On April 14th, Stanford University’s HAI Lab released the 2026 AI Index Report, a 423-page report widely regarded as an industry authority, which showed that the performance gap between Chinese and US large models has narrowed to 2.7%, approaching technical parity.
Miaotou believes that if we see the performance gap between Chinese and US AI large models as a result, then NVIDIA GPUs are not the decisive factor.
This is partly due to the rise of domestic chips and the complete infrastructure of China’s power grid.
In a recent interview, Jensen Huang stated, “AI is fundamentally a parallel computing problem. China can fully compensate for the process gap of individual chips by stacking more chips. China has abundant energy; if willing, it can assemble more chips together, even if the process lags by a few nanometers.”
In fact, many domestic GPU manufacturers have already built ten-thousand-card clusters to compensate for the shortfall in single-card computing power, for example Moore Threads’ KUAE ten-thousand-card cluster and Muxi’s Xiyun ten-thousand-card cluster.
On the other hand, the emergence of large model companies represented by DeepSeek is also a key factor.
DeepSeek uses forward-looking software design to actively adapt to and empower domestic hardware, paving the way for domestic chips.
For example: DeepSeek-V3 verified the usability of FP8 in large-scale model training, expanding training scale without additional costs and without affecting training quality.
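The memory arithmetic behind this is simple: weight memory is parameter count times bytes per parameter, so moving from FP16/BF16 to FP8 halves it, and from FP32 it drops by 4x. A quick sketch, using DeepSeek-V3’s published total of 671B parameters as the illustrative count:

```python
def param_memory_gb(n_params, bytes_per_param):
    # Weight storage only; activations, optimizer state, and KV cache are extra.
    return n_params * bytes_per_param / 1e9

n = 671e9  # DeepSeek-V3's total parameter count (MoE, all experts)
for name, b in [("FP32", 4), ("FP16/BF16", 2), ("FP8", 1)]:
    print(f"{name}: {param_memory_gb(n, b):.0f} GB")
# FP32: 2684 GB
# FP16/BF16: 1342 GB
# FP8: 671 GB
```

The same parameter budget therefore fits on roughly half as many cards in FP8 as in BF16, which is the "expanding training scale without additional cost" effect described above.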
To illustrate: in the past, completing a complex AI computation task required several large, precise, and expensive German-imported machine tools (representing NVIDIA’s high-precision GPUs). Now, DeepSeek changes the processing flow (i.e., data format), enabling this task to be efficiently completed by dozens of small, simple, and inexpensive domestic machine tools (representing domestic GPU compute units).
Even so, NVIDIA GPUs still give overseas large models an edge in training.
But from an industry evolution perspective, large model training is only the first stage. Once the models are built, the real determinant of commercialization speed and industry penetration is inference—especially after the rise of agents like Openclaw and Hermes.
NVIDIA won in training, but inference is just beginning
Training and inference are two different modes.
The explosion of Claw-like agents is driven by their core ability for long-context memory.
Previously, AI chatted and then forgot, like a goldfish; Claw, by contrast, remembers everything, keeps working, and improves with use, turning it from a “toy” into a “tool.”
As context length increases, agent memory deepens, and tool calls become more frequent, GPU memory can be overwhelmed by the KV cache (the key-value cache that stores attention state), causing a decline in large model inference quality.
Therefore, the first bottleneck in the inference explosion isn’t insufficient computing power, but “memory” and “computation” competing for the same GPU memory.
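Why the KV cache dominates at long context can be shown with the standard sizing formula: two tensors (keys and values) per layer, per KV head, per token. The config below is Llama-2-70B-like (80 layers, 8 grouped-query KV heads, head dimension 128) and is purely illustrative:

```python
def kv_cache_gb(n_layers, n_kv_heads, head_dim, seq_len, batch, dtype_bytes=2):
    # 2x for keys and values; one entry per layer, KV head, and token position
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * dtype_bytes / 1e9

# Illustrative 70B-class config: 80 layers, 8 KV heads (GQA), head_dim 128, FP16
for ctx in (4_096, 32_768, 131_072):
    print(f"{ctx:>7} tokens: {kv_cache_gb(80, 8, 128, ctx, batch=1):.1f} GB")
#   4096 tokens: 1.3 GB
#  32768 tokens: 10.7 GB
# 131072 tokens: 42.9 GB
```

The cache grows linearly with context length and batch size, so at agent-scale contexts it crowds the weights out of an 80GB card, which is exactly the "memory versus computation" contention described above.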
For domestic GPUs, the bottleneck isn’t peak TFLOPS, but memory. NVIDIA GPUs lead other manufacturers by 1-2 generations in memory technology.
Mainstream data center GPUs from NVIDIA (like A100, H100) typically come with 80GB of memory per card, while the latest Rubin GPU features 8 HBM4 memory chips of 36GB each (total 288GB), with a total bandwidth of 13 TB/s.
Domestic chips are limited by advanced process technology, with lower memory capacity and bandwidth, still needing breakthroughs. For example: Huawei’s Ascend 910B has 64GB of memory.
According to a paper published earlier by Liang Wenfeng, DeepSeek-V4 is likely to adopt a unique Engram architecture, which specifically addresses the memory capacity bottleneck.
DeepSeek-V4’s approach is to extract static “rote” knowledge from the model and store it in a huge memory table; during inference, the CPU handles “dictionary lookups” (knowledge retrieval), while the GPU focuses on “logical reasoning” (computation).
These two processes overlap completely: while the GPU is computing the logic for the next word, the CPU has already retrieved the necessary knowledge. Thanks to this parallel architecture, the retrieval latency is fully hidden, output per unit time rises sharply, and GPU memory is no longer overwhelmed by the KV cache.
For example: a long-context inference task that would normally require 80GB of memory might only need 8GB under the Engram architecture.
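The overlap pattern described above can be sketched generically as a prefetch pipeline (a toy illustration of latency hiding, not the actual Engram implementation): a background thread performs the next “dictionary lookup” while the main thread computes on the current one, so the two stages run concurrently.

```python
import threading, queue, time

def lookup(key):
    # Stand-in for CPU-side knowledge retrieval from a large memory table
    time.sleep(0.01)
    return f"knowledge[{key}]"

def compute(step, knowledge):
    # Stand-in for GPU-side logical reasoning over the retrieved knowledge
    time.sleep(0.01)
    return f"step {step} used {knowledge}"

def run_pipeline(steps):
    q = queue.Queue(maxsize=1)  # small buffer: prefetch runs one step ahead
    def prefetcher():
        for s in range(steps):
            q.put(lookup(s))    # CPU fetches while GPU computes the previous step
    threading.Thread(target=prefetcher, daemon=True).start()
    return [compute(s, q.get()) for s in range(steps)]

results = run_pipeline(4)
print(results[-1])  # step 3 used knowledge[3]
```

Because lookup(s+1) runs while compute(s) is in flight, total wall time approaches max(lookup, compute) per step rather than their sum, which is the sense in which the retrieval delay is “masked.”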
This means domestic GPUs could perform the same tasks within their memory constraints, while the scarcity premium on HBM memory that NVIDIA prides itself on could collapse. Meanwhile, demand for CPUs would surge.
Furthermore, it’s worth noting that ahead of DeepSeek-V4’s release, and contrary to industry convention, early testing access was not granted to NVIDIA; instead, Huawei and Cambricon were given the opportunity for early adaptation, with the goal of migrating from the CUDA ecosystem to Huawei’s CANN framework.
Although NVIDIA’s CUDA ecosystem won’t be replaced in the short term, cracks are already appearing. This indicates that DeepSeek still holds a strong ecological niche both in open-source and domestic autonomous ecosystems.
According to media reports, to meet the demand for cloud services based on this model, tech giants like Alibaba, ByteDance, and Tencent have already placed orders for Huawei’s new generation AI chips, with order volumes reaching hundreds of thousands of units.
It’s foreseeable that the upcoming DeepSeek-V4 will also bring new expectations for AI investment.
New investment expectations
From an investment perspective, Miaotou believes DeepSeek-V4 will directly benefit two major directions: domestic computing power and AI applications.
1. Domestic Computing Power
If DeepSeek-V4 is confirmed to have been trained entirely on domestically produced computing power, it will mark a “DeepSeek moment” for China’s chip industry, proving that world-class large models can be trained even without the H100.
The marginal change this brings exceeds expectations; it is comparable to Google training Gemini on its self-developed TPU chips. Notably, Google has since become a holding in Warren Buffett’s Berkshire Hathaway portfolio.
Previously, market expectations for domestic computing power mostly stayed within the “independent controllability” grand narrative. V4 will push this logic toward “useful and essential” commercial logic.
The biggest beneficiaries this time will be domestic GPU manufacturers. Huawei and Cambricon have already made this clear. Other domestic GPU vendors will also actively adapt to DeepSeek large models. From a certainty standpoint, the most assured beneficiaries are Chinese chips, servers, and related supporting manufacturers represented by Huawei and Cambricon.
Looking ahead to 2026, five listed AI chip companies, including Cambricon, Biren Technology, and Tianshu Zhixin, are projected by Wind consensus estimates to post revenue growth of roughly 120% year-on-year, reaching about 25.7 billion RMB.
Additionally, in terms of upside, Muxi expects to turn profitable by 2026, potentially becoming the second profitable domestic GPU manufacturer after Cambricon and completing the commercial cycle.
Therefore, domestic computing power will remain a key focus for AI investment.
2. AI Applications
Besides adapting inference to domestic computing power, DeepSeek-V4 may further reduce training and inference costs through innovative architectures (mHC and Engram technology), accelerating the cycle of AI value chain innovation in China.
At the same time, DeepSeek is expected to help global large language model and AI application companies accelerate commercialization, alleviating the increasing capital expenditure pressures.
With the deployment of the Engram architecture, GPU memory demand drops by 90%, significantly reducing hardware costs for inference. This is a major benefit for terminal deployment (edge AI inference).
Moreover, since January this year, the AI application sector in A-shares has been sluggish, mainly on fears that “large models will swallow software,” pushing AI applications into a phase where their investment logic is being torn down.
However, the release of DeepSeek-V4 may improve this sentiment. For domestic A-share application companies, large models are more like a cheap infrastructure, helping optimize costs.
Miaotou believes that AI application companies closely tied to core data and related cloud service providers will also see marginal improvements.
Summary
NVIDIA remains, without question, the most powerful infrastructure for training large models. In the short term, its advantages in high-end training GPUs, the CUDA ecosystem, and cluster capabilities will be hard to replace.
It’s important to note, however, that NVIDIA’s dominance is gradually being eroded by DeepSeek’s roundabout, system-level approach.
DeepSeek-V4’s early adaptation to domestic chips and ongoing innovation are attempting to prove that AI inference doesn’t necessarily have to rely solely on the most expensive GPUs. System-level optimization, software-hardware synergy, and localized deployment can also open new pathways. Meanwhile, domestic computing power can take another step forward.
Don’t overestimate NVIDIA, and don’t underestimate DeepSeek and domestic computing power.