AI Semiconductor Endgame: Will the Shortage Last at Least Another Five Years?

> Original title: "AI Semiconductor Endgame Deduction 2026 (II)"
>
> Original author: fin, AI Analyst

When the structural evolution of semiconductors reaches the AI inference main line, memory and storage become the biggest bottlenecks. The market's biggest doubt about memory and storage is: **Will HBM/DRAM/SSD break free from the traditional cycle?**

Will the evolutionary path of GPU architecture, which relies on exponential growth of HBM, stop? When will it stop? What impact will CXMT's capacity expansion have? Will it drag the market back into the cyclical quagmire?

**This article attempts to establish a framework to sort out these issues.**

Everything is cyclical, and memory is particularly cyclical. The biggest source is the excessively long capacity expansion cycle: production cannot be ramped up quickly, so supply arrives out of step with periods of demand shortage.

Several possible ways to break free from traditional cycles:

1. Customization: Products are not interchangeable, capacity cannot be easily transferred, and long-term contracts are required.
2. Structural exponential demand growth: The demand curve itself is very steep, and supply keeps falling behind.
3. Rapid technology iteration upgrades: Each generation of products quickly eliminates the previous generation.

Fulfilling any one of these can partially break free from traditional cycles; fulfilling two or three can break free from most of the traditional cycle. According to this framework, HBM satisfies about two and a half of the three conditions.

1. Customization, requires signing long-term contracts (Weak, counts as half)
====================

HBM does have some elements of customization and co-design with Nvidia, but they are not very strong. The truly customized parts are only the packaging and base die; the dozen or so layers of DRAM die above are still fully JEDEC standardized.
For example, when Samsung's HBM3E failed Nvidia's qualification and its share dropped from about 60% to 20%, it did not scrap that capacity as a loss but instead supplied it to Google's TPU and AMD. Physically, the HBM3E for Nvidia and the HBM3E for AMD are the same thing. So capacity is still partially freely transferable.

After HBM4, there will be more customization, including integrating custom logic and/or caches on the base die. A more complex approach is to directly integrate HBM4E memory controllers and custom die-to-die interfaces into the logic base die. SemiAnalysis mentioned that OpenAI, Nvidia, and AMD are each working on customized HBM, but this refers to customization of the base die, while the DRAM layers above are still standard.

Because it is only partially customized, HBM requires cooperation mainly in packaging, which still forces customers to sign long-term contracts. But capacity can indeed be transferred, so HBM can barely count as half.

2. Structural exponential demand growth (Satisfied)
=================

The most intuitive reason is Nvidia's token factory: the hardware upgrades needed to raise token throughput drive extremely fast generational upgrades in HBM bandwidth and exponential growth in demand for HBM size.

This point is actually the conclusion from the previous article, "AI Semiconductor Endgame Deduction 2026 (I)": Token throughput = HBM size × HBM bandwidth, doubling each generation. HBM size per GPU grows by over 40% per year. The steepness of this demand curve makes it difficult for the DRAM supply side, with 14% wafer growth multiplied by 9% density improvement, to catch up.

In the hardware domain, the high bandwidth and high memory size requirements of KV cache in the attention phase also give HBM a unique position. Even if HBM prices triple or quintuple, the marginal token throughput improvement from spending money on HBM is still far more cost-effective than spending elsewhere.
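
The gap between the two growth rates above can be made concrete with a small sketch. It uses only the rates quoted in the text (14% wafer growth, 9% density improvement, 40% HBM size growth per GPU); the five-year horizon and the function names are assumptions for illustration, and GPU unit growth is deliberately ignored, which understates demand.

```python
# Rates quoted in the article; the 5-year horizon is an illustrative assumption.
WAFER_GROWTH = 0.14             # annual DRAM wafer growth
DENSITY_GROWTH = 0.09           # annual bit density improvement per wafer
HBM_SIZE_GROWTH_PER_GPU = 0.40  # annual growth of HBM capacity per GPU


def supply_bit_growth(wafer: float, density: float) -> float:
    """Annual bit supply growth when wafer count and density compound."""
    return (1 + wafer) * (1 + density) - 1


def compound(rate: float, years: int) -> float:
    """Cumulative multiple after compounding `rate` for `years` years."""
    return (1 + rate) ** years


supply_rate = supply_bit_growth(WAFER_GROWTH, DENSITY_GROWTH)
print(f"supply bit growth: {supply_rate:.1%} per year")

for year in range(1, 6):
    supply = compound(supply_rate, year)
    demand = compound(HBM_SIZE_GROWTH_PER_GPU, year)
    print(f"year {year}: supply x{supply:.2f}, HBM per GPU x{demand:.2f}, "
          f"demand/supply {demand / supply:.2f}")
```

Even before counting growth in the number of GPUs shipped, the per-GPU HBM curve alone pulls away from total DRAM bit supply every year under these assumptions.
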
Other memory routes like SRAM, HBF, CXL, and PIM currently cannot compete head-on with HBM in its main track of KV cache/attention, and an alternative route is unlikely to emerge for at least the next 5 years or even longer.

3. Rapid technology iteration upgrades (Satisfied)
===============

The DDR3 era lasted 15 years, and we are still only in the DDR5 era. But HBM's upgrade cycle is basically every two years, much faster than traditional DDR, and the trend seems to be accelerating recently.

HBM size × HBM BW doubles each generation, and the current roadmap fully conforms to this pattern. With HBM upgrades every two years, Nvidia GPU memory bandwidth basically increases exponentially: 2TB/s → 3.5TB/s → 4.8TB/s → 8TB/s → 22TB/s. Moreover, HBM speed is linearly proportional to inference token throughput.

Using the previous generation of HBM becomes uneconomical at the margin, so everyone is motivated to use the latest products. Although they are more expensive, the benefit they bring (token throughput) is greater. The logic of the token factory era is that the more you upgrade the technology (HBM bandwidth), the more you earn.

This speed difference creates a situation similar to CPUs: old products depreciate quickly, making hoarding less valuable. For example, HBM3 depreciates very fast; mainstream products basically no longer use it today.

Therefore, the rational choice for HBM manufacturers has shifted from competing for current capacity to capture market share (quantity competition) to competing on stability and HBM speed, vying for qualification share on Nvidia's next-generation platform (quality competition). This avoids the prisoner's dilemma in the downward phase of the traditional cycle, where everyone is reluctant to cut production and lose market share.

**Compared with traditional DRAM, HBM satisfies two and a half of the three conditions. So, can HBM break free from the traditional cycle?**

According to the mainstream narrative, the source of memory cyclicality is that DRAM has commodity attributes (no differentiation → price war → hoardable inventory). But commodity attributes themselves do not create cycles; they are just an amplitude amplifier. The DRAM field in particular once had a prisoner's dilemma: during a downturn, Samsung expanded production to grab market share. Whoever cut production first lost, so no one dared to cut production easily, leading to heavy losses for everyone.

In fact, the main structural source of cyclicality is that the supply cycle is too long and easily misaligned with the demand cycle. Building a fab takes 3 years and costs tens of billions of dollars. Once the decision is made it is irreversible, while demand growth is unstable. Whenever a new paradigm emerges, such as cloud services, the mobile internet, or pandemic-era online demand, there is explosive growth. But after two years, growth slows down, supply exceeds demand, prices fall too sharply, and the industry enters a loss-making cycle.

Everything is cyclical, and HBM cannot avoid this either. However, as long as token demand continues to grow exponentially, structural exponential growth will weaken cyclicality because demand is more predictable. Moreover, once prices drop, customers have an incentive to increase HBM size (thereby increasing token throughput). Combined with the long-term contracts that HBM's customization requires, this transforms cyclicality into growth cyclicality, and this cycle will be particularly long.

· Cyclicality: Earn a lot during the upcycle, lose a lot during the downcycle.

· Growth cyclicality: Earn a lot during the upcycle, earn less during the downcycle.

Additionally, besides the three conditions for breaking free from traditional cycles, HBM/DRAM has another important advantage:

4. Because DRAM density scaling is getting slower, and HBM upgrades lead to an increase in the number of stacked DRAM layers, the difficulty of expanding supply side capacity continues to increase.
====================================================================

Around 2000, DRAM bit density per wafer grew by about 45% per year. That is, even without increasing wafer count, annual supply-side DRAM bits could still grow by 45%. Ten years ago, DRAM bit density growth dropped to 20% per year. Now, it has dropped to 9% per year.

In the past, DRAM capacity expansion didn't even require much new fab construction to achieve a 20-30% annual increase in bit volume. Now, DRAM capacity expansion relies more on wafer count growth, i.e., building new fabs and clean rooms.

Another difficulty in rapid HBM capacity expansion is that HBM3E requires about 3 times the DRAM wafer count to produce the same number of bits, and HBM4, due to increased stacking density, requires about 4 times. This means HBM bits are becoming increasingly harder to manufacture relative to DRAM bits. The number of HBM bits produced per unit DRAM wafer is decreasing, effectively making HBM supply deflationary.

Will HBM one day regress from growth cyclicality back to traditional cyclicality? The most important factor is structural exponential growth. So, in the AI inference era, will this GPU architecture evolution path relying on exponential HBM growth stop? When will it stop?

Token throughput = HBM size × HBM bandwidth. In this first principle of exponential HBM growth, the reason for HBM size growth is precisely the growth of KV cache. The characteristics of KV cache and attention are also very well suited to HBM. This even allows HBM to lead other technology routes, maximizing utilization in the KV cache and attention phases.

In other words, if KV cache architecturally disappears, then the logic of exponential HBM size growth will also be challenged.
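
The density and wafer-multiple figures in this section can be turned into two quick numbers: how long it takes bits per wafer to double at each historical growth rate, and how many bits an HBM wafer yields relative to a commodity DRAM wafer. This is a minimal sketch using only the rates quoted above; the dictionary labels and function names are illustrative assumptions.

```python
import math

# Annual bit density growth per wafer, as quoted in the text.
DENSITY_GROWTH = {"around 2000": 0.45, "ten years ago": 0.20, "now": 0.09}

# DRAM wafers needed per bit, relative to commodity DRAM, as quoted in the text.
WAFER_MULTIPLE = {"commodity DRAM": 1, "HBM3E": 3, "HBM4": 4}


def doubling_time_years(rate: float) -> float:
    """Years for bits per wafer to double at a constant annual growth rate."""
    return math.log(2) / math.log(1 + rate)


def relative_bits_per_wafer(multiple: float) -> float:
    """Bits obtained per wafer, relative to a commodity DRAM wafer."""
    return 1 / multiple


for era, rate in DENSITY_GROWTH.items():
    print(f"{era}: density doubles every {doubling_time_years(rate):.1f} years")

for product, multiple in WAFER_MULTIPLE.items():
    print(f"{product}: {relative_bits_per_wafer(multiple):.0%} of commodity bits per wafer")
```

Read this way, density scaling that used to double bits per wafer in under two years now needs roughly eight, while each HBM generation lowers the bits obtained per wafer, which is why the text says expansion increasingly depends on new fabs.
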
So the essence of this question is: Will this round of attention mechanisms represented by the Transformer, and the KV cache mechanism derived from it, disappear? Will it be replaced after the tide recedes?

From historical patterns: in every AI model architecture revolution, what truly survives are the primitive operations that are mathematically universal in some sense.

For example, the FFN (feedforward network, i.e., the massive MLP layers in models) has been in use since the deep learning era of 2012, but it has survived all the way into today's large language models and still occupies a significant portion of model parameters. Why did it survive? Because it is backed by the universal approximation theorem: a sufficiently wide MLP can approximate any continuous function.

Attention is likely also such a primitive that will be retained, because it solves a similarly fundamental problem: dynamic routing within a sequence, allowing any two positions to establish connections as needed. Once this capability is proven effective, it is hard to abandon.

So even if future architectures evolve from pure Transformer to hybrid architectures or world models, attention layers will still exist, KV cache (or its equivalent after latent compression) will still be needed, and HBM will still be one of the cores of inference. The evolution path of GPU KV cache architecture relying on exponential HBM growth will not stop.

What about DRAM? Is there a possibility for it to break free from traditional cycles in the future?
===========================

There is some market consensus on HBM breaking free from cyclicality, but for DRAM there is currently almost none.

Returning to the framework just discussed: among the three conditions for breaking free from traditional cycles, DRAM has no customization, so we can only look at the other two, demand growth and technology iteration speed. The most critical point is whether there is structural exponential growth.
The answer is yes. In the concept of the AI token factory, structural exponential growth is indeed mainly for HBM. But things changed after the end of 2025: as agentic CPU begins to unleash its potential, the DRAM demand attached to CPUs is becoming a new source of structural exponential growth for DRAM.

The growth logic of this part has two layers: **The first layer is the rapid growth of CPU server TAM. The second layer is the rapid growth of DRAM per server CPU core due to agentic flow.**

**The four logics for the rapid growth of server CPU TAM were detailed in the April CPU special article. Briefly:**

1. The CPU-to-GPU ratio in AI accelerator clusters is changing from the traditional 1:4 to 1:2, and may even move toward 1:1.
2. In agentic flows, the share of latency spent on the CPU is very high, 50-90%, making the CPU a major bottleneck that has to be scaled at the same time.
3. AI coding significantly improves SDE efficiency, with code volume increasing by orders of magnitude and software API calls growing exponentially, directly translating into exponential growth in CPU hours.
4. Sandboxes for data security and isolation, e.g., Analytical Agents, need to replicate large amounts of databases and user contexts for each task, leading to severe waste of memory (DRAM) and CPU cores. This waste problem cannot be solved for at least five years.

Additionally, CPU hours are technically difficult to deflate through optimization.

This is why, the quarter before last, AMD's earnings report said CPU TAM would reach $60B by 2030. Two months ago, AMD/ARM doubled their 2030 CPU TAM forecast to $120B. A month ago, Nvidia raised the 2030 CPU TAM forecast again, to $200B. Last week, Bernstein raised its 2030 CPU TAM estimate to $223B. In my view, it's almost certain that the 2031 CPU TAM will be revised up to $400B in the future. The only question is when the major players will announce this upward revision.

**Second layer, why is the DRAM capacity per server CPU core growing rapidly in the agentic era?**

1. Agents are stateful long-running processes, not stateless request-response.
------------------------------

Traditional web/SaaS is stateless: a request comes in, memory is allocated, the request is processed, and the memory is immediately released. But an agent task can run from one minute to one hour. During this entire time, its message history, system prompt, working memory, long-term memory, and tool result buffer are all resident in DRAM. Like CPU hours, the memory footprint per task, due to statefulness and sandbox isolation (each task replicates database and context), is technically difficult to compress.

2. Context windows are growing exponentially, the working set per session expands accordingly, and concurrency × single-session memory footprint creates a multiplier effect.
-----------------------------------------------------------

Context window: 32K → 256K → 1M. The sequence length for reasoning/test-time compute is exploding, and will continue to increase in the future. The resident messages per active session grow linearly with context length.

Now multiply the two layers.

First layer: CPU server TAM, looking toward 2030~2031, grows about 5 to 7 times (60B → 120B → 200B → 223B, and I believe it will reach 400B).

Second layer: DRAM per CPU core grows about 3 to 4 times (4~8GB → 16~32GB/core), but this growth is mostly a one-time dividend.

Multiplying the two independent variables, server-side DRAM demand grows by an order of magnitude.
=================================

By 2030, even with a conservative 300B CPU TAM, assuming $50 per CPU core and a conservative 16GB/core in the agentic era, the new demand is at least 96EB. This year's total DRAM production is only 47EB, and next year's barely 60EB. This is an astonishing increment.
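
The 96EB figure follows from three inputs in the text: CPU TAM, price per core, and DRAM per core. A minimal sketch of that arithmetic, assuming decimal units (1 EB = 10^9 GB) and treating the whole TAM as spent on cores at the stated price; the function name is illustrative.

```python
GB_PER_EB = 1e9  # decimal units: 1 EB = 10^9 GB


def dram_demand_eb(cpu_tam_usd: float, usd_per_core: float, gb_per_core: float) -> float:
    """DRAM attached to server CPUs, in EB, implied by a CPU TAM."""
    cores = cpu_tam_usd / usd_per_core  # number of CPU cores the TAM buys
    return cores * gb_per_core / GB_PER_EB


# Inputs quoted in the article for 2030.
demand_2030 = dram_demand_eb(cpu_tam_usd=300e9, usd_per_core=50, gb_per_core=16)

# Total DRAM production figures quoted in the article.
PRODUCTION_THIS_YEAR_EB = 47
PRODUCTION_NEXT_YEAR_EB = 60

print(f"implied server DRAM demand in 2030: {demand_2030:.0f} EB")
print(f"vs this year's total output: {demand_2030 / PRODUCTION_THIS_YEAR_EB:.1f}x")
print(f"vs next year's total output: {demand_2030 / PRODUCTION_NEXT_YEAR_EB:.1f}x")
```

Under these inputs, CPU-attached demand alone would be about twice this year's entire DRAM output, which is the comparison the paragraph above is making.
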
Although this exponential growth in DRAM driven by agentic CPU is largely a one-time dividend in the second layer, it will last for a very long time because the gap is so huge.

Back to the framework at the beginning of the article. Among the three conditions for breaking free from traditional cycles, the first condition, DRAM customization, is basically negligible.

The second condition: a structurally exponential and hard-to-reverse demand source does exist. Commodity DRAM now also qualifies to partially break free from traditional cyclicality. Not as thoroughly as HBM (two and a half), but it is already a substantive change.

The third condition, technology iteration speed: DRAM's pace is also different from before. In the past, DRAM technology iteration was heavily dependent on consumer electronics, where progress in DDR did little for performance. But in the foreseeable future, the DRAM consumed by carbon-based users (traditional consumer electronics) will be far smaller in volume than the DRAM consumed by silicon-based users (CPU servers).

Previously, the marginal benefit of DRAM speed upgrades was very low. But now, due to increased CPU server demand for memory and increased demand for DDR speed in edge AI (e.g., Apple needing faster LPDDR for local large models), the marginal benefit of speed upgrades has become much higher. So the iteration speed requirements for DDR6 and LPDDR6 have increased significantly compared to before. This can also be seen in the figure: the iteration time for LPDDR6/DDR6 has shortened, and the speed slope is rising again.

In the past, when a new generation of DDR/LPDDR technology came out, people were indifferent and waited for prices to drop before adopting. Now, with LPDDR6 coming out, everyone is scrambling to get on board as early as possible because the performance improvement from speed increases is tangible.

Additionally, DDR supply is further taxed by HBM.
HBM's expansion is too fast each year, pulling a batch of wafers that could have been used for commodity DDR into HBM production. And HBM's conversion ratio is extremely low: HBM3E takes about 3 times the DDR wafer capacity to produce the same bit quantity; HBM4 takes 4 times. So each year, about 3% to 5% of DDR bit growth is directly consumed by this HBM bit tax.

Therefore, although DRAM bit volume will grow by about 24% per year in the future (14% from wafer growth, 9% from per-wafer DRAM density growth), considering the HBM bit tax, traditional, non-HBM commodity DDR will see annual bit growth of only about 20% (about 10% wafer growth × about 9% node density improvement).

What is the impact of China's CXMT capacity expansion? If they expand aggressively without following the rules, will it drag the market back into the cyclical quagmire?

CXMT's expansion has been quite fast in recent years. In 2025, it's still 200K wafers per month. By 2026, with contributions from the Beijing fab and new production lines, it could reach 320K-350K. The Shanghai fab currently under construction has a Phase I and a Phase II: Phase I is expected to add 100K wafers per month by 2027, and Phase II is expected to add 100K wafers per month by 2028. That means 420K wafers per month in 2027 and 500K wafers per month in 2028.

However, it's worth noting that CXMT's DRAM bit density is only about half that of the Big Three. So CXMT's 500K wafers per month can only produce half the DRAM bit volume that the same wafer count would yield at other players. When calculating wafers per month, it's counted as half equivalent.

With this discount applied, CXMT's impact on the entire DRAM industry is much smaller. From the end of 2025 to the end of 2028, CXMT's impact on DRAM bit capacity CAGR is only about 1.5%, raising the industry-wide DRAM capacity CAGR from about 12.7% to 14.2%.
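
The 12.7% and 14.2% figures can be reproduced from the capacity numbers the article tabulates for 2025E and 2028E. A minimal sketch, with the data copied from that table and a three-year compounding window assumed (end of 2025 to end of 2028); the helper names are illustrative.

```python
# (2025E, 2028E) capacity in thousand wafers per month, copied from the article's table.
CAPACITY_K_WPM = {
    "Samsung": (685, 920),
    "SK Hynix": (519, 725),
    "Micron": (340, 560),
    "Other non-China": (150, 218),
    "China (density halved)": (117, 274),
}
YEARS = 3  # end of 2025 to end of 2028


def cagr(start: float, end: float, years: int) -> float:
    """Compound annual growth rate between two values."""
    return (end / start) ** (1 / years) - 1


def total(names):
    """Sum start and end capacity over the given suppliers."""
    start = sum(CAPACITY_K_WPM[n][0] for n in names)
    end = sum(CAPACITY_K_WPM[n][1] for n in names)
    return start, end


all_names = list(CAPACITY_K_WPM)
ex_china = [n for n in all_names if not n.startswith("China")]

cagr_with_china = cagr(*total(all_names), YEARS)
cagr_ex_china = cagr(*total(ex_china), YEARS)

for name, (start, end) in CAPACITY_K_WPM.items():
    print(f"{name}: {cagr(start, end, YEARS):.1%}")
print(f"total including China: {cagr_with_china:.1%}")
print(f"total excluding China: {cagr_ex_china:.1%}")
print(f"China's contribution: {cagr_with_china - cagr_ex_china:.1%} points")
```

The difference between the two totals is the roughly 1.5 percentage points attributed to CXMT once its wafers are counted at half density.
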

| DRAM monthly capacity (wafer starts per month) | 2025E | 2028E | CAGR |
|---|---|---|---|
| Samsung | 685K | 920K | 10.3% |
| SK Hynix | 519K | 725K | 11.8% |
| Micron | 340K | 560K | 18.1% |
| Other non-China | 150K | 218K | 13.3% |
| China (density halved) | 117K | 274K | 32.8% |
| Total including China | 1811K | 2697K | 14.2% |
| Total excluding China | 1694K | 2423K | 12.7% |

Even if CXMT maintains its expansion pace in the future, its impact on the industry's annual DRAM bit volume CAGR by 2030 is probably less than 3%, changing it from 20% CAGR to 23% CAGR, nothing more.

Additionally, CXMT is constrained by lithography machines. DDR6 requires higher speeds (starting at 14400 MT/s) and higher density. The Big Three will likely use 1c or more advanced nodes (~12nm or below) for DDR6, fully utilizing EUV. CXMT may be limited in speed for DDR6, and its density is only half.

Even with growth cyclicality, why will this DRAM super cycle last a long time, at least five years without end?
==========================================

The first reason is the massive increase in CPU server demand mentioned earlier, bringing structural exponential DRAM demand growth. Combining this with the supply side's DRAM bit volume CAGR of about 20%, it becomes clear why the DRAM gap will widen in the coming years:

Non-HBM traditional DRAM supply grows about 20% per year.

On the demand side, based on a 60B CPU TAM in 2026, with each CPU consuming an average of 8GB/core and each core costing $30~35, the demand is 16EB. By 2030, assuming a 400B CPU TAM, with each CPU consuming an average of 16GB/core and each core costing $80 (CPU price more than doubles), the demand is 80EB. The CAGR for this part of DRAM demand is about 50%, far exceeding current estimates.

Unlike HBM, which is directly tied to token throughput and thus directly to GPU profitability, insufficient DRAM affects agent flow mainly in speed. For example, with 8GB/core vs. 16GB/core, some workloads may slow down by 30%. Some low-value tasks can tolerate waiting.
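
The demand estimate above (16EB in 2026 growing to 80EB in 2030) and the resulting ~50% CAGR can be checked against the ~20% supply growth in a few lines. This is a sketch under the article's own inputs, taking the low end of the $30~35 per-core range for 2026 and decimal units (1 EB = 10^9 GB); the function names and the indexed comparison are illustrative assumptions.

```python
GB_PER_EB = 1e9  # decimal units: 1 EB = 10^9 GB


def dram_demand_eb(cpu_tam_usd: float, usd_per_core: float, gb_per_core: float) -> float:
    """DRAM attached to server CPUs, in EB, implied by a CPU TAM."""
    return cpu_tam_usd / usd_per_core * gb_per_core / GB_PER_EB


def cagr(start: float, end: float, years: int) -> float:
    """Compound annual growth rate between two values."""
    return (end / start) ** (1 / years) - 1


demand_2026 = dram_demand_eb(60e9, 30, 8)    # $30/core is the low end of the quoted range
demand_2030 = dram_demand_eb(400e9, 80, 16)
demand_cagr = cagr(demand_2026, demand_2030, years=4)

SUPPLY_CAGR = 0.20  # non-HBM DRAM bit supply growth quoted in the article

print(f"2026 demand: {demand_2026:.0f} EB, 2030 demand: {demand_2030:.0f} EB")
print(f"demand CAGR: {demand_cagr:.1%} vs supply CAGR: {SUPPLY_CAGR:.0%}")

# Index both curves to 1.0 in 2026 to show how the gap compounds.
for offset in range(5):
    demand_index = (1 + demand_cagr) ** offset
    supply_index = (1 + SUPPLY_CAGR) ** offset
    print(f"{2026 + offset}: demand x{demand_index:.2f}, supply x{supply_index:.2f}")
```

Note that the supply index covers all non-HBM DRAM while the demand index covers only CPU-attached DRAM, so the two curves show relative growth rather than an absolute shortfall.
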
The motivation for structural exponential growth is strong, but the demand is not as rigid as GPU demand.

SemiAnalysis said the DRAM gap this year is a single-digit percentage, and next year it will be over 10%. From the perspective of structural DRAM growth due to the surge in agent CPU numbers, this gap will continue to widen each year, with no possibility of narrowing before 2030.

Another reason DRAM will remain strong for a long time is that the demand eliminated by DRAM price increases is not truly gone; it's just delayed. There are too many demand reservoirs.

The so-called demand reservoirs refer to "potential demand that would be immediately released once memory prices drop." Their existence means that even if supply catches up temporarily, prices are unlikely to collapse because new demand will always flow from these reservoirs to take the baton:

**Memory computing power/speed is a demand reservoir:** There is a large amount of demand that relies on additional memory to optimize speed and computing power. When memory is too expensive, this demand is suppressed; once memory prices drop, it will be released. For example, Nvidia's CPX was originally designed as a dedicated prefill accelerator using additional low-cost GDDR7. But because LPDDR/GDDR is now too expensive, even more expensive than HBM was before the price increases, the ROI of this solution is no longer worthwhile. However, when commodity memory prices drop, optimization solutions like CPX will return.

**Low-value tasks are a demand reservoir:** When memory prices rise, token prices remain high. High-value tasks are prioritized, while low-value tasks are postponed. Once memory prices drop, this delayed demand returns.

**Edge AI is a demand reservoir:** AI PC memory configurations may rise from 24GB all the way to 128GB. Apple's latest full-strength edge AI already requires a memory upgrade from 8GB to 12GB.
Conventional consumer electronics, Agent PCs, and low-end phones all have reduced demand due to memory price increases, and all are demand reservoirs.

Stacking so many reservoirs together creates an extremely thick demand cushion. This is why the structural growth of DDR this round will have more staying power than the market expects.

Another reason DRAM prices are unlikely to drop significantly is that HBM and DRAM capacity can be swapped, so the entire DRAM complex is re-rated together. During an upturn, DRAM margins far exceed HBM margins, and HBM price increases are even driven by DRAM. The newly contracted HBM4 price this year is the current DRAM price × 4, which is the normal stacking multiple for HBM4. Once DRAM prices drop and margins decline, because HBM long-term contracts are transparent and margins are guaranteed, HBM will indirectly pull more DRAM capacity away. HBM price drops will also give GPU manufacturers more incentive to upgrade HBM size as much as possible, indirectly protecting the floor price for DRAM.

DRAM has structural exponential demand growth, density scaling is slowing, expansion difficulty is increasing, manufacturers are cautious with expansion plans, CXMT's impact is limited in the coming years, and the demand reservoirs are enormous. These reasons together lead to the conclusion that for the foreseeable future, at least five years, DRAM is unlikely to enter a cyclical trough.

Can NAND SSD break free from traditional cycles?
=====================

NAND's structural growth driver is not as strong as DDR's. This year's shortage is mainly due to the major players maintaining good production discipline without large-scale expansion. Annual capacity increases mainly come from technological improvements: increasing NAND stacking layers.

The first structural growth comes from AI, mainly KV cache offloading: moving warm/cold KV cache that overflows from HBM to NAND SSD.
But surprisingly, this KV cache offloading growth hasn't even happened on a large scale yet, and SSDs are already in shorter supply than DRAM, with higher price increases than DRAM. Next year, when Rubin CMX ramps up and KV cache offloading is widely adopted, SSD shortages will also grow due to this structural growth.

Second, another structural growth mentioned in last year's annual summary, AI video, is already gaining traction this year. Seedance's volume is growing at a rate of 10 to 40 times per year. It's currently still stuck in the stage of insufficient GPU computing power, with demand suppressed by compute constraints. But once the GPU shortage passes, the structural demand for NAND storage from AI video will last for a considerable time.

The third structural growth also comes from the exponential increase in sandbox usage driven by agent flows. Sandboxes, for data security and isolation, require copying large databases and user contexts for each task, leading to severe waste of memory (DRAM) and CPU cores, as well as significant SSD waste (demand).

The fourth structural growth, which may take effect after 2030, comes from the HBF route requiring SSDs. It is considered promising in many investment bank analyses, but this technology route is still far away. Its main role is to store large model weights, written once and then read-only, and it must be packaged together with GPU/HBM (48TBps/96TBps); otherwise, PCIe 7/8 speeds are too slow to be usable. It's only something to look forward to. The next article, "AI Semiconductor Endgame Deduction 2026 (III)," will have a more detailed analysis.

In summary, NAND SSD's structural growth is not as strong as HBM's, but its advantage is being cheap. By 2027, the price will still be only $0.8/GB, one-fortieth of DRAM's price. So it acts as a versatile component in the multi-level cache hierarchy, with too many sources of structural growth.

That is to say, it is impossible for DRAM/HBM to rise alone while SSD does not rise.
If such a situation occurred, people would find ways to use SSD to take over part of DRAM/HBM's functions, achieving similar effects at lower cost. HBM, DRAM, and NAND are not three independent stories but structural growth in different temperature layers of the same AI memory hierarchy.

With structural exponential demand growth, has NAND SSD escaped the cycle? To answer that, we need to look at the production discipline of NAND SSD manufacturers. The only one that might not follow production discipline is YMTC. After all, this is a prisoner's dilemma, and capacity is much easier to expand in NAND than in DRAM: once one player aggressively expands without following the rules, the entire industry can follow quickly.

But at the very least, this round of NAND is also a super cycle. The demand from several structural exponential growth sources will make it easy to postpone the downturn to 2030.
