DeepSeek once again acts as a "price butcher," but this time, it's not just slaughtering prices.

Author: Xiaojing

Tokens are reshaping the value coordinates of the AI era. The DeepSeek V4 preview has been released, once again playing the “price butcher,” but this time it brings a new proposition for token pricing: the same number of tokens can differ in actual cost by an order of magnitude across different systems, and large models are moving toward system-level pricing.

The DeepSeek V4 preview has finally been released, and, true to DeepSeek’s signature style, it once again lowers large-model prices.

V4-Flash pricing: 1 yuan per million input tokens and 2 yuan per million output tokens, with cache-hit input at just 0.2 yuan. V4-Pro pricing: 12 yuan per million input tokens and 24 yuan per million output tokens, with cache-hit input at 1 yuan and a limited-time 25% launch discount valid until May 5. Both models natively support contexts of around one million tokens.

This weekend, DeepSeek V4-Pro continues its limited-time offer, dropping the price to 25% of list, with cache-hit input prices discounted by a further 10%. An AI engineer joked, “After this weekend, DeepSeek V4-Pro is only 0.025 yuan away from free.”

It has now been exactly two years since DeepSeek V2 kicked off the price war in 2024. Over those two years, large-model inference costs have fallen exponentially; factoring in cache hits and other effects, effective costs have dropped by as much as a hundredfold.

But today, lowering prices matters even more than it did then. AI has shifted to an agent paradigm built around long, complex tasks, where a single task involves dozens or hundreds of model calls.

Against this industry backdrop, the DeepSeek V4 preview release comes with two key points. First, contexts of around one million tokens are now native features of both models; second, cache pricing is emphasized, with additional discounts. Their combined effect pushes input/output prices to the low end of the model class, aiming to minimize the total bill for an agent completing a task and maximize competitiveness.


Token now has a new pricing system

Looking back at the 2024 price cuts, they were fundamentally about turning large models from “expensive experiments” into “usable tools.” Architectural innovations that improved inference efficiency quickly compressed the cost per million tokens from the $10–$30 of the GPT-4 era to around $1.

Figure: Exponential decline in token prices over the past two years

This is a typical “absolute price plunge”: developers could invoke large models at low cost, truly opening up the application layer. But at that stage, prices still corresponded to the cost of a single call, tokens were treated as a uniform billing unit, and cost scaled roughly linearly with call frequency.

Two years later, DeepSeek V4’s pricing structure has also changed. As cache mechanisms become mainstream, tokens are now split into “new computation” and “repeated computation” costs. In scenarios with high cache hit rates, the same input price can drop to one-tenth or even lower. Prices have shifted from static tags to variables closely tied to system design.
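The split can be made concrete with a little arithmetic. A minimal sketch, using the V4-Flash list prices quoted in this article (1 yuan per million input tokens on a cache miss, 0.2 yuan on a hit) and assumed hit rates, of how the blended input price becomes a function of system design:

```python
# Illustrative only: list prices are from the article; the hit rates
# are assumed scenarios, not measured figures.

def effective_input_price(miss_price, hit_price, hit_rate):
    """Blended price per million input tokens at a given cache hit rate."""
    return miss_price * (1 - hit_rate) + hit_price * hit_rate

MISS, HIT = 1.0, 0.2  # V4-Flash: 1 yuan on cache miss, 0.2 yuan on hit

for hit_rate in (0.0, 0.5, 0.9):
    price = effective_input_price(MISS, HIT, hit_rate)
    print(f"hit rate {hit_rate:.0%}: {price:.2f} yuan / 1M input tokens")
```

At a 90% hit rate, the effective input price is 0.28 yuan per million tokens, less than a third of the list price: this is what it means for price to shift from a static tag to a variable tied to system design.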

Figure: Tokens split into “new computation” and “repeated computation”

Looking only at list prices, V4 continues DeepSeek’s consistent low-price strategy. In the domestic market, comparable models such as Alibaba’s Tongyi, Zhipu’s GLM, and Moonshot AI’s Kimi are priced roughly at 1–4 yuan for input and 4–12 yuan for output per million tokens, while V4-Flash’s 1 yuan input and 2 yuan output come to about a quarter to a third of the industry average.

The Pro version’s 12/24 yuan is close to flagship pricing, but contexts of around one million tokens are a default capability, not a premium option. Globally, the gap is even more striking: roughly one-tenth to one-fiftieth of some competitors. GPT-5.5’s official pricing, for example, is $5 per million input tokens ($0.5 with cache hits) and $30 per million output tokens; Claude Opus 4.7 continues Opus 4.6’s pricing at roughly $5 per million input tokens and $25 per million output tokens.

Although overseas flagship models differ in capabilities, ecosystem maturity, token utilization, and other dimensions, price remains a key factor. In the same agent tasks, cost differences directly impact commercial viability. Overseas vendors also face pricing pressures: Sam Altman publicly admitted that ChatGPT Pro’s subscription is loss-making, and Dario Amodei warned of “overly aggressive pricing” in the industry. To some extent, prices reflect systemic factors like compute supply, R&D amortization, and market strategies.

This is why the current price advantage is more meaningful. In 2024, the industry was focused on “whether it can be used”; today’s agent-centric AI paradigm emphasizes “whether it can run at scale.”

An agent task often involves dozens or hundreds of model calls, with large inputs from system prompts, tool schemas, and historical memories. These components are highly reusable and are also the parts most prone to “cost inflation.”

DeepSeek V4’s key focus is on compressing this “repeated computation” cost.


Figure: DeepSeek V4 turns “cost” into a variable that can be optimized by engineering. The left side shows capability alignment, the right side shows a cost cliff. Under contexts of around one million tokens, inference compute and cache usage drop significantly, making long-term tasks no longer grow exponentially in cost. This is the real driving force behind the current price war.

From the price evolution of DeepSeek’s own products, this change was foreseeable. The previous generation, V3.2, priced input at 2 yuan (cache miss) and 0.2 yuan (cache hit), with output at 3 yuan; V4-Flash cuts input to 1 yuan and output to 2 yuan, the most direct change being a halving of the cache-miss input cost. In multi-round agent scenarios, accumulated input costs often dominate, so the leverage of this adjustment far exceeds the headline price cut.
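To see the leverage effect, consider a toy multi-round agent whose input per round is a reusable prefix (system prompt, tool schemas, history) that hits the cache after the first round, plus some fresh tokens that always miss. The round and token counts below are invented for illustration; the prices are the V3.2 and V4-Flash list prices quoted in this article (yuan per million input tokens):

```python
# Toy cost model: reusable prefix hits the cache from round 2 onward,
# fresh tokens always miss. Token counts are invented for illustration.

def input_bill(rounds, prefix_tokens, fresh_tokens, miss_price, hit_price):
    total = 0.0
    for r in range(rounds):
        if r == 0:
            billed_miss = prefix_tokens + fresh_tokens  # first round: all misses
            billed_hit = 0
        else:
            billed_miss = fresh_tokens
            billed_hit = prefix_tokens  # reusable prefix hits the cache
        total += (billed_miss * miss_price + billed_hit * hit_price) / 1e6
    return total

args = dict(rounds=50, prefix_tokens=20_000, fresh_tokens=2_000)
v32 = input_bill(**args, miss_price=2.0, hit_price=0.2)   # V3.2 list prices
v4f = input_bill(**args, miss_price=1.0, hit_price=0.2)   # V4-Flash list prices
print(f"V3.2 input bill:     {v32:.3f} yuan")
print(f"V4-Flash input bill: {v4f:.3f} yuan")
```

Under these assumed numbers, halving only the cache-miss price cuts the whole 50-round input bill from 0.436 to 0.316 yuan, because the miss-priced tokens are exactly the ones that accumulate.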

The Pro version’s 12/24 yuan pricing looks an order of magnitude higher than Flash’s, but DeepSeek’s technical report states, “The Pro version is constrained by high-end compute capacity. After batch deployment of the Huawei Ascend 950 supernodes in the second half of the year, Pro’s price will be significantly reduced.” In other words, the current price reflects a supply bottleneck, not actual cost.

The positioning of both models is also clear: Flash targets high concurrency, low latency batch tasks; Pro handles complex agent workflows, long-chain code generation, and deep inference. According to the technical report, DeepSeek has begun evaluating V4’s code agent capabilities with real R&D tasks, directly benchmarking against Claude series internally.


Behind the “price butcher”

How does DeepSeek manage to lower prices?

Traditional attention mechanisms scale quadratically with sequence length. Going from a 128K-token context to 1 million tokens is roughly an 8x increase in length, but about a 60x increase in attention compute. This was the main reason “million-token-context” models were hard to commercialize. KV-cache memory, meanwhile, grows linearly with sequence length, so serving 1 million tokens means either cutting concurrency or multiplying hardware, neither of which pencils out.
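A back-of-envelope check of those two scaling laws (generic arithmetic, not figures from the report):

```python
# Dense attention compute grows with the square of sequence length,
# while KV-cache memory grows linearly. Generic arithmetic only.

short, long = 128_000, 1_000_000          # 128K-token vs 1M-token context
compute_ratio = (long / short) ** 2       # quadratic in sequence length
memory_ratio = long / short               # linear in sequence length
print(f"attention compute: ~{compute_ratio:.0f}x")
print(f"KV-cache memory:   ~{memory_ratio:.1f}x")
```

An 8x longer context thus costs about 61x the attention compute and about 8x the KV-cache memory, which is why long windows were surcharged rather than bundled.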

This is also why overseas vendors generally adopt a “short window by default, long window with surcharge” strategy. Anthropic even charges separately for sequences over 200K tokens, doubling the price.

Figure: DeepSeek V4’s CSA (Compressed Sparse Attention) compresses KV caches first, then uses Top-k to select key contexts, computing only the most important information, greatly reducing compute and cache costs in long-text scenarios.

A simple way to understand V4’s approach is that it stacks “compression” and “sparsity.” First, compress the KV cache of every m tokens into a single entry (compression ratio 4 for CSA, 128 for HCA); then have each query attend only to the top-k key entries. The first step cuts memory, the second cuts compute, attacking both bottlenecks at once.
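The two-step idea can be sketched in toy form. This is not DeepSeek’s actual CSA/HCA implementation: keys and values are reduced to one scalar per position, compression is a simple mean-pool, and scoring is a plain product, purely to show the compress-then-top-k control flow:

```python
import math

# Toy sketch of "compress + sparsify" attention (NOT DeepSeek's real
# CSA/HCA kernels; scalar keys/values for readability).
# Step 1: pool every m positions' KV into one compressed entry.
# Step 2: each query attends only to its top-k compressed entries,
# making per-query cost O(k) instead of O(sequence length).

def compress_kv(keys, values, m):
    """Mean-pool each block of m (key, value) pairs into one entry."""
    entries = []
    for i in range(0, len(keys), m):
        k_block, v_block = keys[i:i + m], values[i:i + m]
        entries.append((sum(k_block) / len(k_block),
                        sum(v_block) / len(v_block)))
    return entries

def sparse_attend(query, entries, k):
    """Softmax-attend over only the top-k highest-scoring entries."""
    scored = sorted(entries, key=lambda e: query * e[0], reverse=True)[:k]
    weights = [math.exp(query * key) for key, _ in scored]
    total = sum(weights)
    return sum(w / total * value for w, (_, value) in zip(weights, scored))

keys = [0.1 * i for i in range(32)]
values = [float(i) for i in range(32)]
entries = compress_kv(keys, values, m=4)   # 32 positions -> 8 entries
out = sparse_attend(query=1.0, entries=entries, k=2)
print(f"{len(entries)} compressed entries, attended output {out:.2f}")
```

The compression step shrinks the stored cache by a factor of m; the top-k step means each query touches a fixed number of entries regardless of how long the sequence grows.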


Figure: DeepSeek V4’s HCA (Heavy Compression Attention) compresses longer sequence KV caches into a few representations, preserving local window information while further reducing compute and storage costs. This is key to lowering costs for contexts of around one million tokens.

The technical report shows that at 1 million tokens, V4-Pro’s per-token inference FLOPs are only 27% of V3.2’s, and KV cache usage is just 10%; V4-Flash is even more aggressive, with FLOPs at 10% and KV cache at 7%. When combined with FP4 quantization-aware training, Muon optimizer, self-developed mega-kernel MegaMoE, and other infrastructure optimizations, V4 has compressed costs across the entire training and inference pipeline.

Low prices are a natural result of architectural cost efficiency. A core member of a domestic large-model company told Tencent Tech: “API pricing for domestic large models (including ours) mainly depends on cost capability. No one is willing to ‘burn money’ on costs, so a cost advantage at the technical level is extremely important.”

Alibaba Cloud CTO Zhou Jingren also emphasized: “Every price reduction is a very serious process, involving considerations of industry development, developer and enterprise feedback, etc., not just a price war.”


Why is this “price reduction” more important now?

From the demand side, it’s more urgent to systematically lower “costs” now. Deloitte’s latest Token Economics report cites AT&T as an example: after introducing an agent system, the company’s daily token consumption increased from 8 billion to 27 billion. A study by Stevens Institute of Technology pointed out that agent systems face a “quadratic token growth” trap in multi-turn conversations: by the 10th turn, token usage per call can reach 7 times that of the first.
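The mechanism behind that trap is easy to model: if each turn re-sends the full conversation history, per-call input grows linearly with the turn number and cumulative input grows quadratically. A minimal sketch with invented token counts (a deliberate simplification; real agents trim history, which is one reason the study measures 7x rather than the naive 10x at turn 10):

```python
# Toy model of "quadratic token growth": re-sending the whole history
# each turn makes per-call input linear in the turn number and the
# cumulative total quadratic. Token counts are invented.

def tokens_per_turn(turn, base=1_000):
    """Input tokens for one call when the whole history is re-sent."""
    return base * turn  # turn 1 sends 1x base, turn 10 sends 10x base

per_call = [tokens_per_turn(t) for t in range(1, 11)]
print(f"turn 10 vs turn 1 per-call input: {per_call[-1] // per_call[0]}x")
print(f"cumulative input over 10 turns:   {sum(per_call):,} tokens")
```

In this toy, ten turns cost 55x the first turn's input in total, which is why cache-hit pricing on the repeated prefix matters far more than the headline per-token rate.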

Model prices determine whether an agent can be commercially viable.

CIO magazine, three weeks ago, quoted Ayesha Khanna, CEO of AI solutions firm Addo AI: “If you run a persistent agent connecting to cutting-edge model APIs, with high token consumption, long contexts, multi-step reasoning, and heavy output, the economics will rapidly deteriorate. In some cases, the cost of a single task can be more expensive than having a person do it.” This is the most realistic bottleneck for agent commercialization: the technology can run, but the bills can’t.

Looking at V4’s recent moves, almost all target this industry bottleneck: making contexts of around one million tokens a default capability, so agents no longer pay a premium for long context; and pushing cache-hit input prices down to industry lows, matching the way agent scenarios reuse the same system prompts. The technical report also specifically notes that V4 fully preserves reasoning content across tool-call turns (V3.2 would discard it at each new user message), to serve multi-turn agent calls.


Can V4 lower the entire agentic AI industry’s cost line?

Finally, an important question remains: can V4 push down the cost curve of the entire agentic AI industry? The answer is more complex this time.

First, whether other vendors follow suit. If V4 triggers similar synchronized price cuts, the overall industry cost curve will truly shift downward. But as analyzed above, model prices are now more determined by cost structures; model vendors’ gross margins have little room for short-term compression, limiting their ability to follow.

Second, the supply of high-end compute. As DeepSeek’s technical report notes, V4-Pro’s current throughput is limited. Whether the low price can hold depends on the second-half deployment of domestically produced hardware such as the Huawei Ascend 950 supernodes, and on DeepSeek’s engineering progress across hardware platforms.

Section 3.1 of the technical report explicitly states that DeepSeek has validated fine-grained expert parallelism on both NVIDIA GPUs and Huawei Ascend NPUs. This is the first time DeepSeek has listed Ascend alongside NVIDIA in hardware validation, attempting to decouple inference paths from reliance on a single hardware platform. If validated, this could have significant long-term value for the domestic large model industry.

Third, whether the token structure of agent scenarios can be further optimized. Agent models today consume tokens heavily, and a significant share of the waste comes from the architecture itself. Beyond model pricing, how an agent uses tokens is a separate question: even if V4 pushes token prices to the floor, a poorly designed agent can still let bills spiral out of control. This underscores the importance of the agent harness layer.

DeepSeek V4’s preview indeed lowers prices on the price list, making contexts of around one million tokens a default capability, with output prices below one dollar per million tokens, based on a solid architecture and without relying on subsidies.

But lowering costs industry-wide this time isn’t simple; it faces a more complex systemic challenge.
