In recent years, memory-hungry workloads such as large-model inference, in-memory databases, and high-performance computing have expanded rapidly, pushing data centers to the limit of their memory resources. DRAM, once a standard server component, has become the most expensive and scarcest infrastructure resource. Soaring prices and rigid supply are now key constraints on the pace of AI computing deployments.

According to tracking data from Counterpoint Research, the price of 64GB DIMM memory rose 3.5-fold between the third quarter of 2025 and the first quarter of 2026, and the uptrend has not yet peaked. By the third quarter of 2026, the cumulative increase is expected to reach 5-fold.

TrendForce’s data is even more striking: in Q1 2026, DRAM contract prices rose 93% to 98% quarter-over-quarter, driving the global DRAM industry’s overall revenue up 81% quarter-over-quarter to $97 billion. In the second quarter the surge shows no sign of stopping, and contract prices are expected to rise another 58% to 63%.

Signals from the spot market are even more direct: server-grade DDR5 RDIMM currently trades at $27 to $37 per GB. Building even a single 12TB memory pool costs close to $500,000 in DRAM hardware alone.

DRAM Crisis, Fully Erupting

The root cause of this round of price hikes is the continued erosion of DRAM production capacity by HBM.

According to disclosed data, as AI training and inference have driven an explosion in demand for high-bandwidth memory, HBM’s share of DRAM wafer capacity has risen from 2% in 2020 to an estimated 25% in 2026. The three major memory makers, Samsung, SK hynix, and Micron, are all shifting their best capacity toward higher-margin HBM. From 2025 to 2027, HBM dies are expected to account for 18%, 22%, and about 30% of total DRAM shipments, respectively. One HBM wafer consumes production capacity equivalent to about three DDR5 wafers. The three manufacturers have actively cut low-margin orders for mobile phones and PCs and redirected the capacity to AI. Because hyperscale cloud providers are also locking in future wafer output with multi-year contracts, the standard DRAM supply available to the server market is compressed further.

And the rigidity on the supply side means shortages are unlikely to be relieved in the short term.

Advanced DRAM process development heavily depends on EUV lithography machines. The price of a single machine is as high as about $200 million. A modern wafer fab typically requires investment of several tens of billions of dollars. Even if everything goes smoothly, the construction cycle still takes several years. The pace of capacity expansion is far behind the growth trajectory of AI demand.

Jefferies estimates that, excluding the impact of Chinese manufacturers, global memory bit supply will grow only 7% to 8% in 2026. The combined DRAM and NAND supply gap could be approximately 150,000 to 200,000 wafers per month. Micron Technology said in its fiscal Q3 2026 earnings report that even if industry supply gradually improves in 2028, it remains difficult to judge when memory supply will catch up with continuously growing demand.

The pressure has also long since spread from data centers to the consumer side.

Xbox CEO Asha Sharma has publicly stated that memory costs have increased about fivefold over the past two years, directly preventing the company from producing enough game consoles to meet market demand. Apple has also announced successive price increases for products including the iPhone, Mac, and iPad.

Morgan Stanley analyst Shawn Kim’s team went further, stating bluntly that the surge in memory prices and supply scarcity are evolving into a comprehensive risk for the digital economy: “from bottlenecks in AI infrastructure to hardware profit margins, device affordability, cloud costs, inflation, and even the policy level.”

Changes in DRAM’s share within the server bill of materials provide even clearer evidence. In 2023, DRAM accounted for about 50% of a server’s total cost. By mid-2026, this proportion has risen to 60% to 90%, averaging about 75%. CPU prices have not fallen, but against the backdrop of exploding memory prices, the increase in CPU prices looks relatively negligible.
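
The shift in DRAM's share of the bill of materials can be sanity-checked with simple arithmetic. The sketch below assumes, purely for illustration, that DRAM starts at 50% of server cost and that every other component stays flat while DRAM is repriced by the multiples quoted earlier in the article (3.5x and 5x). The function name is hypothetical, and real server configurations vary widely.

```python
def dram_share_after_increase(initial_share: float, price_multiple: float) -> float:
    """Return DRAM's share of server cost after its price is multiplied.

    Assumes every non-DRAM component keeps its original cost.
    """
    dram_cost = initial_share * price_multiple  # DRAM cost, in units of the old server cost
    other_cost = 1.0 - initial_share            # everything else, unchanged
    return dram_cost / (dram_cost + other_cost)


if __name__ == "__main__":
    for multiple in (3.5, 5.0):
        share = dram_share_after_increase(0.50, multiple)
        print(f"DRAM price x{multiple}: DRAM is {share:.0%} of server cost")
```

Under these assumptions the share comes out at about 78% for a 3.5x increase and about 83% for a 5x increase, both inside the 60% to 90% band cited above.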

What’s even more ironic is that the memory purchased at high cost is not utilized very efficiently. Test data from hyperscale players such as Meta shows that in data centers, memory is generally only about half-filled with active “hot data,” while large amounts of cold data remain occupying expensive DRAM resources for long periods of time.

Faced with the high cost and scarcity of DRAM, industry players have started to look for alternative routes. They no longer simply stack more hardware; instead, they use technical approaches to reduce reliance on DRAM.

AMD: AI Predictive Scheduling, Making Flash “Invisible” as Memory

AMD chose the lightest-weight, software-first approach.

In June 2026, AMD announced the acquisition of MEXT, a memory optimization company. The core goal is to introduce AI-driven memory tiering technology to sink cold data from high-cost DRAM down to low-cost NAND flash, enabling cost-effective expansion of effective memory capacity.

MEXT was founded in 2023 by a well-credentialed team. Co-founder and CEO Gary Smerdon previously served as Chief Strategy and Product Officer at Fusion-io, an early pioneer in commercializing flash storage at scale that counted Apple and Meta Platforms among its major customers more than a decade ago.

MEXT built an AI-based tiered-memory technology to address the memory efficiency bottleneck. It moves infrequently accessed data from expensive DRAM to NAND flash, whose cost per unit of capacity is far lower, without affecting running applications.

MEXT’s core product is the Predictive Memory Engine, a software-only memory tiering solution. It continuously monitors application access patterns at the memory page granularity and automatically migrates cold data accessed infrequently into NAND flash—where the cost per bit is only about 1/55 of DRAM. At the same time, by using an AI model to learn the access patterns of workloads, it predicts data pages that are about to be called and proactively prefetches them back into DRAM before the application issues a request. This allows software to read data as if it were directly accessing main memory, thereby ensuring that performance is not impacted.
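
The general mechanism described above (page-level tracking, demotion of cold pages, predictive prefetch) can be illustrated with a toy simulation. This is a minimal sketch, not MEXT's engine: the `TieredMemory` class is hypothetical, eviction is plain LRU, and a trivial stride detector stands in for the AI model that the article says predicts upcoming pages.

```python
from collections import OrderedDict


class TieredMemory:
    """Toy two-tier memory: a small DRAM tier backed by unlimited flash."""

    def __init__(self, dram_pages: int, prefetch: bool = True):
        self.dram_pages = dram_pages
        self.prefetch = prefetch
        self.dram = OrderedDict()  # page id -> None, coldest page first
        self.last_page = None
        self.last_stride = None
        self.hits = 0    # accesses served from DRAM
        self.faults = 0  # accesses that had to wait for flash

    def _load(self, page: int) -> None:
        """Bring a page into DRAM, demoting the coldest page to flash if full."""
        if page in self.dram:
            self.dram.move_to_end(page)
            return
        if len(self.dram) >= self.dram_pages:
            self.dram.popitem(last=False)
        self.dram[page] = None

    def access(self, page: int) -> None:
        if page in self.dram:
            self.hits += 1
        else:
            self.faults += 1
        self._load(page)
        if self.last_page is not None:
            stride = page - self.last_page
            # Stand-in predictor: if the same stride repeats, fetch the next page early.
            if self.prefetch and stride != 0 and stride == self.last_stride:
                self._load(page + stride)
            self.last_stride = stride
        self.last_page = page


def count_faults(trace, dram_pages: int, prefetch: bool) -> int:
    """Replay a page-access trace and return how many accesses missed DRAM."""
    memory = TieredMemory(dram_pages, prefetch)
    for page in trace:
        memory.access(page)
    return memory.faults


if __name__ == "__main__":
    scan = range(100)  # sequential scan over 100 pages, DRAM holds only 8
    print("faults without prefetch:", count_faults(scan, 8, prefetch=False))
    print("faults with prefetch:   ", count_faults(scan, 8, prefetch=True))
```

On this deliberately easy sequential trace, the prefetching variant misses only on the first three pages, while the plain LRU variant misses on every page. Real workloads are far less regular, which is where a learned predictor would be expected to matter.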


The entire mechanism is fully transparent to the operating system and upper-layer applications. It requires no modification to any business code and no need for additional dedicated hardware, and can be deployed within minutes.

According to official data, this solution can increase system effective memory capacity by 2 to 4 times, while lowering overall infrastructure costs by about 50%. In typical scenarios such as Neo4j graph databases, EDA simulation, and film rendering, a configuration with a 1:1 DRAM-to-flash ratio can achieve about 95% of the throughput of a pure-DRAM configuration, but at a dramatically lower cost.
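
The capacity and cost figures can be related with a back-of-the-envelope calculation. The sketch below uses only the 1/55 NAND-to-DRAM cost ratio quoted earlier in the article and counts nothing but media cost, which is an assumption: it ignores software licensing, controllers, and the rest of the server. Function names are hypothetical.

```python
NAND_COST_RATIO = 1 / 55  # NAND cost per bit relative to DRAM, as quoted in the article


def blended_cost_per_gb(dram_parts: float, flash_parts: float,
                        nand_cost_ratio: float = NAND_COST_RATIO) -> float:
    """Media cost per GB of a DRAM+flash mix, relative to pure DRAM (= 1.0)."""
    total_parts = dram_parts + flash_parts
    return (dram_parts + flash_parts * nand_cost_ratio) / total_parts


def capacity_multiple(dram_parts: float, flash_parts: float) -> float:
    """Effective capacity relative to the DRAM-only portion."""
    return (dram_parts + flash_parts) / dram_parts


if __name__ == "__main__":
    for dram, flash in ((1, 1), (1, 3)):
        print(f"DRAM:flash {dram}:{flash} -> "
              f"{capacity_multiple(dram, flash):.0f}x capacity at "
              f"{blended_cost_per_gb(dram, flash):.2f}x the DRAM cost per GB")
```

With these inputs, a 1:1 mix gives 2x capacity at roughly 0.51x the per-GB media cost, and a 1:3 mix gives 4x capacity at roughly 0.26x, which is in line with the 2x to 4x capacity range and the roughly 50% cost reduction cited above.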

MEXT previously ran comparative tests on Dell servers and AWS cloud instances, with and without MEXT memory expansion. The tests covered the performance and cost-effectiveness of the Neo4j graph database at DRAM-to-flash ratios of 1:1 and 1:3.

[Figure: Dell servers and AWS with and without MEXT expanded memory; Neo4j performance and cost-effectiveness at 1:1 and 1:3 ratios. Image source: Nextplat]

Although MEXT’s approach is not revolutionary—concepts like memory tiering and moving cold data to cheaper storage have existed for a long time—previous technologies failed to be deployed at scale in data centers. The key issue was insufficient accuracy of the prediction algorithms. Once predictions miss, when programs need data and it has to be moved from flash back to DRAM, the added latency becomes immediately visible, and the resulting performance loss is simply unacceptable.
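
Why prediction accuracy dominates can be seen from the expected access latency, which is a weighted average of the two tiers. The sketch below is illustrative only: the 100 ns DRAM and 50 µs flash latencies are assumed round numbers, not measurements from the article or from any specific device.

```python
DRAM_NS = 100.0      # assumed DRAM access latency in nanoseconds (illustrative)
FLASH_NS = 50_000.0  # assumed NAND read latency in nanoseconds (illustrative)


def expected_latency_ns(hit_rate: float,
                        dram_ns: float = DRAM_NS,
                        flash_ns: float = FLASH_NS) -> float:
    """Average latency when a fraction `hit_rate` of accesses is served from DRAM."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * dram_ns + (1.0 - hit_rate) * flash_ns


if __name__ == "__main__":
    for hit_rate in (0.90, 0.99, 0.999):
        latency = expected_latency_ns(hit_rate)
        print(f"hit rate {hit_rate:.1%}: {latency:,.0f} ns "
              f"({latency / DRAM_NS:.1f}x pure DRAM)")
```

Under these assumed latencies, a 90% hit rate averages about 51 times the DRAM latency, 99% about 6 times, and 99.9% about 1.5 times, so each additional "nine" of prediction accuracy changes the outcome substantially.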

MEXT’s breakthrough is using AI models to do this. Its predictive memory engine continuously analyzes memory access patterns, uses AI to determine which data pages are most likely to be used next, and then proactively migrates the data from flash back to DRAM before the application truly issues a request.

For AMD, this acquisition fills a crucial piece in building its end-to-end capabilities. Beyond the EPYC CPU, Instinct GPU, and ROCm software stack, the memory efficiency layer introduced by MEXT enables AMD to offer customers a complete solution from chip-level components to data-flow scheduling. It helps reduce customers’ total cost of ownership, reduces idle time for GPUs “waiting for data,” and strengthens AMD’s competitiveness in the AI infrastructure market.

On the day the acquisition news was announced, AMD’s stock rose by nearly 7% intraday. The market effectively voted in favor of this path.

Of course, how far MEXT’s technology will ultimately be integrated into AMD’s data center products remains to be seen. The latency gap between NAND flash and DRAM is a physical fact, and whether software-level AI prediction can truly bridge it depends on real-world performance after large-scale deployment.

Apple: On-Device Large Models, Storing the Model in Flash

While data centers struggle with DRAM costs, the consumer side faces the same constraint. Devices such as phones have extremely limited DRAM capacity, yet they must serve inference from on-device large models. Apple’s answer is to keep large models resident in flash and load them into memory on demand.

Apple’s latest AFM 3 Core Advanced is an on-device large model with 20 billion parameters. If loaded entirely into DRAM using the traditional approach, it would far exceed the memory limit of consumer devices. Apple solves this with a sparse activation architecture: the complete model is stored entirely in NAND flash. During inference, instead of loading all weights, it selects the expert modules needed for the current inference prompt in one step, and only brings the working set of 1 to 4 billion parameters into DRAM.

[Figure: AFM 3 Core Advanced model architecture diagram]

Unlike traditional MoE models, which switch experts token by token and therefore move data frequently, Apple routes at prompt granularity and keeps a high proportion of shared experts resident in DRAM. This significantly reduces the number of exchanges between flash and memory and minimizes loading latency. Combined with optimizations such as instruction-following pruning (IFP) and transformer layer simplification, the peak DRAM usage of the 20-billion-parameter model is held to between 2GB and 8GB. This addresses the excessive DRAM footprint of MoE models on devices, letting the model run smoothly on devices such as iPhones and deliver “large model, small memory” on-device inference.
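
The difference between token-level and prompt-level routing can be sketched by counting how often an expert has to be copied from flash into a small DRAM cache. Everything here is a toy assumption: the expert count, cache size, token count, the cyclic token-level pattern, and the chosen prompt experts are made up for illustration and do not describe Apple's actual model.

```python
from collections import OrderedDict


def count_flash_loads(expert_requests, dram_slots: int) -> int:
    """Count copies from flash into an LRU DRAM cache holding `dram_slots` experts."""
    cache = OrderedDict()
    loads = 0
    for expert in expert_requests:
        if expert in cache:
            cache.move_to_end(expert)  # already resident, just mark as recently used
            continue
        loads += 1
        if len(cache) >= dram_slots:
            cache.popitem(last=False)  # evict the least recently used expert
        cache[expert] = None
    return loads


NUM_EXPERTS = 16  # hypothetical number of experts stored in flash
DRAM_SLOTS = 4    # hypothetical number of experts that fit in DRAM
TOKENS = 64       # hypothetical number of tokens processed

# Token-level routing: each token may need a different expert (toy cyclic pattern).
token_level = [(7 * t) % NUM_EXPERTS for t in range(TOKENS)]

# Prompt-level routing: experts are chosen once per prompt and reused by every token.
prompt_experts = [2, 5, 11]
prompt_level = [prompt_experts[t % len(prompt_experts)] for t in range(TOKENS)]

if __name__ == "__main__":
    print("token-level flash loads: ", count_flash_loads(token_level, DRAM_SLOTS))
    print("prompt-level flash loads:", count_flash_loads(prompt_level, DRAM_SLOTS))
```

In this toy setup the token-level pattern reloads an expert on every one of the 64 tokens, while the prompt-level pattern loads its three experts once and then runs entirely from DRAM. The token-level pattern is chosen as a worst case; real routers show more reuse, so the gap would be smaller in practice.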

This architecture is not the result of a temporary sprint.

In fact, as early as 2024, Apple’s research team published the paper “LLM in a Flash,” which systematically validated the approach of storing large-model parameters in flash and loading them on demand. The paper reports inference speeds 4 to 5 times faster on CPU and 20 to 25 times faster on GPU than naive loading. Besides reducing cloud-side compute costs, the approach gives on-device AI applications a workable memory architecture.

As DRAM price increases move from the industrial end to consumer electronics, this solution not only supports the experience of on-device AI, but also reduces devices’ reliance on high-capacity DRAM.

Overall, AMD and Apple’s two approaches evolve in parallel for data centers and the edge, respectively, but they point to the same conclusion: the memory hierarchy for AI inference is being rebuilt. Low-frequency KV caches, model weights, and on-device data will gradually move from expensive HBM/DRAM down to NAND Flash/SSD layers, forming a multi-tier storage architecture.

This architectural shift is creating multi-tier propagation effects across the industry chain. According to Citrini Research, the most direct beneficiaries are NAND manufacturers.

Marvell: Hardware Compression + CXL, Expanding Physical Memory

If AMD and Apple are pursuing routes centered on software and architecture optimization, then Marvell has chosen a hardware-level breakthrough. Leveraging the CXL high-speed interconnect protocol, it uses hardware inline compression to directly increase the effective capacity of physical DRAM.

In June 2026, Marvell released the Structera series of CXL controllers: Structera X (a memory expansion controller) and Structera A (a near-memory accelerator). Both chips include an in-house CDB (Compression-Decompression Block) hardware compression module.

It is reported that when writing data into DRAM, the CDB module compresses in real time using a customized LZ4 lossless algorithm; during reads, it decompresses synchronously. The entire process is performed independently within the memory link, without consuming the host CPU’s compute resources, and is completely transparent to upper-layer applications. Depending on the data type, 1GB of physical DRAM can deliver 2 to 3.64 times the equivalent logical capacity. In mixed database workload scenarios, the average compression ratio can reach 3.64:1—meaning less than one-third of the physical memory can satisfy the same business requirements.
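
The relationship between compression ratio and effective capacity can be shown in a few lines. Python's standard library has no LZ4 binding, so the sketch below swaps in `zlib` (DEFLATE) purely as a stand-in lossless compressor. It says nothing about the ratios Marvell's customized LZ4 hardware achieves, and the sample rows are synthetic and far more repetitive than most real data.

```python
import zlib


def make_sample_rows(count: int) -> bytes:
    """Build synthetic, highly repetitive database-like rows (illustrative only)."""
    return b"".join(
        b"id=%06d;status=ACTIVE;region=us-east-1\n" % i for i in range(count)
    )


def compression_ratio(data: bytes) -> float:
    """Original size divided by compressed size, after checking the round trip."""
    compressed = zlib.compress(data)
    if zlib.decompress(compressed) != data:  # lossless: must restore every byte
        raise ValueError("round trip failed")
    return len(data) / len(compressed)


def effective_capacity_gb(physical_gb: float, ratio: float) -> float:
    """Logical capacity delivered by `physical_gb` of DRAM at a given ratio."""
    return physical_gb * ratio


if __name__ == "__main__":
    rows = make_sample_rows(20_000)
    print(f"zlib ratio on synthetic rows: {compression_ratio(rows):.1f}:1")
    # Ratios quoted in the article for the Structera CDB module:
    for ratio in (2.0, 3.64):
        print(f"1 GB physical at {ratio}:1 -> "
              f"{effective_capacity_gb(1.0, ratio):.2f} GB logical")
```

The ratio depends heavily on the data: already-compressed or encrypted pages gain almost nothing, which is presumably why the article quotes a range rather than a single figure.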

This solution also brings two further cost savings. The first is reuse of old memory: the Structera X controller supports DDR4, so retired DDR4 modules can be incorporated into the CXL memory pool, reducing the need to buy expensive DDR5. The second is memory pooling: the CXL protocol breaks a single CPU’s exclusive control over its memory, so multiple servers can share memory resources and absorb idle capacity in the system.

At the current DDR5 spot price of $27 to $37 per GB, the DRAM hardware for a single 12TB memory pool costs close to $500,000. At a 3x compression ratio, the physical DRAM that must be purchased falls by two-thirds, so a single pool can save more than $300,000 at the top of that price range.
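
The pool-cost arithmetic can be reproduced directly from the figures in the article. The sketch assumes 1 TB = 1,024 GB and counts only DRAM purchase cost; the function names are hypothetical.

```python
GB_PER_TB = 1024  # assumption: binary terabytes


def pool_cost_usd(capacity_tb: float, price_per_gb: float) -> float:
    """DRAM purchase cost for a pool of the given capacity."""
    return capacity_tb * GB_PER_TB * price_per_gb


def savings_with_compression(capacity_tb: float, price_per_gb: float,
                             ratio: float) -> float:
    """Money saved when compression lets 1/ratio of the DRAM deliver the same capacity."""
    full_cost = pool_cost_usd(capacity_tb, price_per_gb)
    return full_cost - full_cost / ratio


if __name__ == "__main__":
    for price in (27, 37):
        cost = pool_cost_usd(12, price)
        saved = savings_with_compression(12, price, 3.0)
        print(f"${price}/GB: 12TB pool costs ${cost:,.0f}, "
              f"3x compression saves ${saved:,.0f}")
```

With these inputs the pool costs $331,776 at $27 per GB and $454,656 at $37 per GB, and a 3x ratio saves $221,184 and $303,104 respectively, so the saving passes $300,000 only near the upper end of the quoted price range.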

SanDisk: Put NAND Under the GPU

SanDisk’s solution is even more aggressive—reconstructing the memory architecture of AI chips at the packaging level.

SanDisk is working with SK hynix to push the standardization of High Bandwidth Flash (HBF), aiming to bring NAND flash even closer to the compute core and build a new storage tier between HBM and SSD.

SanDisk’s patented solution proposes a “NAND under GPU” architecture: stacking high-capacity NAND flash directly beneath a GPU or AI accelerator, with HBM stacks surrounding it. By greatly shortening the data transfer distance, it improves the access bandwidth of the flash. According to the plan, HBF will be physically compatible with HBM4, with capacity reaching 8 to 16 times that of an HBM of the same volume, while also offering significant cost advantages. It is designed for read-intensive scenarios such as long-context inference, KV cache, and streaming model weights.

HBF sits between HBM and SSD. If HBM is the “reference book” on the desk, NAND-based HBF is the “bookshelf” next to the GPU. HBM handles data that must respond immediately, while the NAND storage beneath the GPU holds larger amounts of data and supports repeated reads and writes.

SanDisk aims to develop HBF with bandwidth close to HBM, providing 8 to 16 times the HBM capacity at similar cost. In February 2026, SanDisk and SK hynix officially launched the HBF specification standardization alliance. SK hynix contributes the stacking, packaging, and interposer-layer technologies it has accumulated in HBM, while SanDisk brings its capabilities in NAND and flash design. The two parties plan to release the first batch of HBF samples in the second half of 2026, with applications in AI inference devices in early 2027. The goal is to build a three-tier memory architecture: HBM for ultra-low-latency real-time computing, HBF for large-capacity, high-throughput repeated reads, and SSD for cold storage—each layer performing its respective role.
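
The three-tier split can be pictured as a placement rule keyed on how often data is read and written. The sketch below is purely hypothetical: the thresholds, the `DataObject` type, and the rule that write-heavy data stays out of flash are illustrative assumptions, not part of any HBF specification.

```python
from dataclasses import dataclass


@dataclass
class DataObject:
    name: str
    reads_per_sec: float
    writes_per_sec: float


def choose_tier(obj: DataObject,
                hot_reads: float = 1_000.0,      # hypothetical threshold
                cold_rate: float = 0.01,         # hypothetical threshold
                max_flash_writes: float = 1.0    # hypothetical threshold
                ) -> str:
    """Pick a tier: HBM for hot or write-heavy data, HBF for read-mostly data, SSD for cold."""
    if obj.reads_per_sec < cold_rate and obj.writes_per_sec < cold_rate:
        return "SSD"
    if obj.writes_per_sec > max_flash_writes or obj.reads_per_sec >= hot_reads:
        return "HBM"
    return "HBF"


if __name__ == "__main__":
    examples = [
        DataObject("activations", reads_per_sec=50_000, writes_per_sec=50_000),
        DataObject("model weights", reads_per_sec=200, writes_per_sec=0),
        DataObject("archived checkpoint", reads_per_sec=0.001, writes_per_sec=0),
    ]
    for obj in examples:
        print(f"{obj.name}: {choose_tier(obj)}")
```

A production scheduler would have to weigh capacity limits, bandwidth, and flash endurance as well; the point here is only that read-mostly, high-volume data is the natural fit for the middle tier.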

Of course, for HBF to move toward large-scale commercialization, it still needs to clear multiple hurdles. These include high thermal density resulting from stacking computing chips and NAND, yield challenges from hybrid bonding and complex wiring, and the software ecosystem needed for tiered scheduling of hot and cold data—all of which require time to refine gradually.

According to Korea’s Shin Young Securities, the HBF market is expected to take shape in 2027, reaching a scale of $12 billion by 2030.

Marvell’s compression approach, by comparison, is already in production. For cloud providers with tens of thousands of nodes, compression savings of this kind mean a major reduction in capital expenditure. Structera is the industry’s first mass-produced CXL controller with hardware inline compression. The relevant technical solutions have been submitted to the Open Compute Project (OCP) for standardization, and compatibility is expected to broaden further.

A Lesson From the Past: The Unfinished Journey of 3D XPoint

Using flash to expand main memory is not a new idea.

As early as 2015, Intel and Micron jointly launched 3D XPoint storage technology. Its vision precisely matched the industry’s pain points today: creating a new storage medium with performance between DRAM and NAND flash, supporting byte-level addressing, with costs close to flash, and building a new tier between memory and traditional storage.

Unfortunately, 3D XPoint ultimately failed to deliver on its original promise.

Because process development lagged, its cost ended up approaching that of DRAM, while its performance was only several times better than ordinary flash. In addition, Intel’s closed strategy of tying it to its own Xeon processors kept it out of the mainstream market. Ultimately, the project was terminated, and Intel’s flash business was sold to SK hynix. A technology once full of promise became a regrettable footnote in the storage industry.

If Intel hadn’t given up on 3D XPoint back then, how lucrative would the profits be today? Unfortunately, there is no “what if” in history.

Some Chinese startups working on in-memory computing and memory pooling are also likely to attract more attention next. With DRAM prices staying high and supply squeezed, whoever delivers a truly reliable memory optimization solution may win the next ticket into the capital markets.

Closing Thoughts

From the failure of 3D XPoint to today’s multiple parallel routes, the storage industry’s exploration of memory efficiency has never stopped.

AMD uses AI prediction to schedule hot and cold data. Apple uses sparse activation and flash storage to compress on-device memory usage. Marvell uses hardware compression to make physical memory go further. SanDisk uses 3D stacking to move NAND beneath the GPU. The four companies take different technical paths, but they point in the same direction: the memory hierarchy for AI inference is being restructured. Hot data stays in DRAM and HBM to protect performance, while warm and cold data gradually sinks into the flash layer to provide capacity. Coordination across multiple storage media will balance performance and cost.

The high cost of DRAM is forcing the entire industry to “change its behavior under pressure.” But it is precisely this pressure that has also driven a series of eye-catching technological innovations.

Undeniably, the physical latency gap between flash and DRAM will not go away, and the real-world performance of these approaches still needs validation through large-scale deployments. What is certain, however, is that the era of solving problems by simply stacking more DRAM is coming to an end. A more efficient, more tiered memory system is the new direction for the industry.

Source: Semiconductor Industry Observer
