HBM vs GDDR: How High-Bandwidth Memory Can Break Through the "Memory Wall" Bottleneck in AI Training and Inference

In an AI race where the largest models now exceed a trillion parameters, GPU compute gets most of the attention, but a less visible component increasingly sets the ceiling: High Bandwidth Memory (HBM). If a GPU is a high-performance engine with tens of thousands of cylinders, HBM is the fuel system that keeps it supplied with data. When the fuel supply falls behind, even the strongest engine just idles.

It is widely recognized across the industry that the bottleneck in AI computing is no longer the compute units alone; it is increasingly the efficiency of data movement. By commonly cited estimates, moving data accounts for 60%-80% of total system energy in traditional computing architectures, and in some inference scenarios GPU utilization can fall as low as 1%. The key limiting factor behind both figures is memory bandwidth.

Thanks to 3D stacking and Through-Silicon Via (TSV) technology, HBM achieves bandwidth and energy efficiency far exceeding traditional memory per unit area, making it the standard for AI accelerators from giants like NVIDIA, AMD, and Google.

Technical Principles: How HBM Reshapes GPU and Memory Data Channels

From “Flat Car” to “Vertical Elevator”

HBM is not a brand-new storage medium but a set of interface and packaging standards defining “how to connect DRAM at extremely high bandwidth.” Its core technological approach can be broken down into three levels:

3D Stacking: multiple DRAM dies are stacked vertically (8 to 12 layers is currently mainstream, with HBM4 advancing to 16), multiplying storage density and the number of parallel channels within the same physical footprint.

TSV (Through-Silicon Via): microvias only 5-10 micrometers in diameter are etched through each DRAM die and filled with conductive material to form vertical channels, giving each stack thousands of inter-layer connections. This contrasts sharply with traditional PCB wiring: board-level traces run for millimeters to centimeters, while TSV signal paths are on the order of tens of micrometers per die, which greatly reduces signal attenuation, delay, and I/O energy.

Silicon Interposer: the HBM stacks and the GPU/CPU die are both mounted on a silicon interposer through micro-bumps, and the interposer's fine-pitch wiring links them over extremely short distances inside a single package. The whole structure is built with advanced 2.5D packaging processes such as CoWoS.

The core breakthrough of this architecture is bus width. An HBM stack through HBM3E exposes a 1024-bit interface, and HBM4 doubles it to 2048 bits. For example, SK Hynix's mass-produced HBM3E reaches 24GB per stack with bandwidth exceeding 1TB/s. In contrast, GDDR has a bus width of only 32 bits per chip, or 384 bits when twelve chips are combined on a high-end card, so the difference in data transfer capacity per device is massive.

HBM’s underlying design philosophy is “wide and slow”—exchanging a large number of parallel channels for total bandwidth, with each channel operating at relatively low frequency, thus achieving significantly better energy efficiency than high-frequency solutions. Conversely, GDDR follows a “narrow and fast” approach—relying on higher operational frequencies to squeeze bandwidth from fewer channels. These two philosophies are suited to entirely different application scenarios: HBM aims for maximum throughput, while GDDR seeks a balance between throughput and cost.
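The “wide and slow” versus “narrow and fast” trade-off can be put in numbers with the standard peak-bandwidth formula: bus width times per-pin data rate, divided by 8. The sketch below is illustrative only; the per-pin rates (9.6 Gbit/s for HBM3E, 21 Gbit/s for GDDR6X) are assumptions that do not come from this article.

```python
def peak_bandwidth_gb_s(bus_width_bits: int, pin_rate_gbit_s: float) -> float:
    """Peak bandwidth in GB/s = bus width (bits) * per-pin rate (Gbit/s) / 8."""
    return bus_width_bits * pin_rate_gbit_s / 8


# Assumed per-pin data rates (illustrative, not taken from the article).
hbm3e_stack = peak_bandwidth_gb_s(1024, 9.6)   # one wide, slower HBM3E stack
gddr6x_card = peak_bandwidth_gb_s(384, 21.0)   # twelve 32-bit GDDR6X chips

print(f"HBM3E, one stack : {hbm3e_stack:.1f} GB/s")
print(f"GDDR6X, full card: {gddr6x_card:.1f} GB/s")
```

Under these assumptions a single HBM stack running each pin at less than half the GDDR6X rate still delivers more bandwidth than a full 384-bit GDDR6X card, and an accelerator usually carries several stacks.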

HBM vs GDDR6: The “Wide and Slow” vs “Narrow and Fast” Showdown

HBM and GDDR6 are both part of the DRAM family, with the core mission of providing data access channels for GPUs, but they differ fundamentally in design goals, performance characteristics, and cost structures.

Bandwidth: HBM3E reaches about 1.2TB/s per stack, and an accelerator typically carries several stacks; next-generation HBM4 is expected to exceed 2.0TB/s per stack. GDDR6X tops out at about 1TB/s for an entire card, and top products are already close to what the interface can deliver. HBM is also significantly more efficient in energy per unit of bandwidth, which translates directly into quantifiable operating-cost advantages in large-scale AI data center deployments.

Power and Latency: Thanks to the extremely short TSV and interposer paths, HBM spends less energy per bit moved; the commonly quoted figure is about 30% lower power than GDDR5 (note that this comparison is against GDDR5, not GDDR6). Latency is a different story. Both HBM and GDDR are DRAM, so access latency for either is governed mainly by DRAM core timings and sits in the range of tens to hundreds of nanoseconds; in-package wiring shortens the signal path but does not turn a microsecond-class delay into a nanosecond-class one. Random-access latency on HBM can even be slightly higher than on GDDR, but for the large-scale parallel streaming access patterns typical of AI training and inference, throughput rather than latency is the critical bottleneck.

Cost: This is HBM's biggest disadvantage. Industry data puts HBM above $25 per GB, versus only about $5-8 per GB for GDDR6, and by some estimates HBM accounts for 60%-80% of the total cost of a high-end GPU. Measured as cost per unit of bandwidth, GDDR6 actually comes out ahead of HBM, so when an application's absolute peak bandwidth demand is modest, GDDR6 offers a clear cost-performance advantage.
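The cost-per-bandwidth comparison can be checked with a rough calculation. This is a sketch under stated assumptions: the per-GB prices and the 24GB, roughly 1.2TB/s HBM3E stack are the article's figures, while the GDDR6 chip (2GB, 32-bit, 16 Gbit/s per pin, so 64 GB/s) is a hypothetical part chosen for illustration.

```python
def dollars_per_gb_s(capacity_gb: float, price_per_gb: float, bandwidth_gb_s: float) -> float:
    """Memory cost divided by the bandwidth that memory delivers ($ per GB/s)."""
    return capacity_gb * price_per_gb / bandwidth_gb_s


# HBM3E stack: 24 GB at $25/GB (the article's lower bound), about 1200 GB/s.
hbm = dollars_per_gb_s(24, 25.0, 1200)

# Hypothetical GDDR6 chip: 2 GB, 32 bits * 16 Gbit/s / 8 = 64 GB/s (assumed).
gddr6_low = dollars_per_gb_s(2, 5.0, 64)
gddr6_high = dollars_per_gb_s(2, 8.0, 64)

print(f"HBM3E: ${hbm:.2f} per GB/s")
print(f"GDDR6: ${gddr6_low:.2f} to ${gddr6_high:.2f} per GB/s")
```

With these inputs GDDR6 costs roughly a third to a half as much per GB/s, which is consistent with the claim that it wins on cost per bandwidth even though it cannot reach HBM's absolute peak.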

Overall, choosing between HBM and GDDR is fundamentally a trade-off between performance boundaries and cost constraints. HBM is suited for scenarios where “a certain bandwidth threshold must be met to operate”—for example, inference of trillion-parameter models—below which the system cannot function effectively. GDDR6, on the other hand, targets “seeking the lowest cost at acceptable performance levels,” typical for deploying medium-sized models with 7B-13B parameters.

They are not replacements for each other but are parallel technological routes serving different demand levels. However, in AI training and large-scale inference, HBM’s advantages are gradually pushing GDDR out of the core competitive arena.

The “Memory Wall”: Why Larger AI Models Exponentially Increase HBM Demand

Understanding the explosive growth in HBM demand requires returning to a core bottleneck in AI computing: the “Memory Wall.”

The Divergence of Computing Power and Bandwidth Growth

Over the past three decades, processor computing power has roughly tracked Moore's Law, doubling approximately every 18-24 months, while memory bandwidth has grown much more slowly. Studies of AI and the memory wall find that peak hardware compute has increased about 3-fold every two years, whereas memory bandwidth has grown only about 1.6-fold over the same period and interconnect bandwidth even less. Each generation of compute capability therefore widens the gap between how fast data can be processed and how fast it can be delivered.

This contradiction is especially pronounced during inference. Training is dominated by matrix-matrix multiplication (GEMM), which has high compute density and an arithmetic intensity above 100 FLOPs/byte. Token-by-token generation at small batch sizes is instead dominated by matrix-vector multiplication (GEMV), whose arithmetic intensity is often below 2 FLOPs/byte. The lower the arithmetic intensity, the more performance depends on memory bandwidth rather than compute power; this is the essence of the “bandwidth wall.”
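To see how quickly the two growth rates diverge, the short sketch below compounds the quoted figures (about 3x per two years for compute, about 1.6x for memory bandwidth). The time horizons are arbitrary choices for illustration.

```python
def growth_over(years: float, factor_per_two_years: float) -> float:
    """Cumulative growth after `years`, given a growth factor per two-year step."""
    return factor_per_two_years ** (years / 2)


for years in (2, 10, 20):
    compute = growth_over(years, 3.0)     # compute: about 3x every two years
    bandwidth = growth_over(years, 1.6)   # memory bandwidth: about 1.6x
    gap = compute / bandwidth
    print(f"{years:>2} years: compute x{compute:,.0f}, "
          f"bandwidth x{bandwidth:,.0f}, gap x{gap:,.0f}")
```

Because the gap itself compounds at 3.0 / 1.6 = 1.875x every two years, two decades at these rates leave compute several hundred times further ahead of bandwidth than it started.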
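The link between arithmetic intensity and bandwidth dependence is usually expressed with the roofline model: attainable throughput is the smaller of the compute peak and intensity times memory bandwidth. The sketch below applies it with the 3.35TB/s H100 SXM5 bandwidth quoted later in this article; the 989 TFLOPS peak is an assumed value used only for illustration.

```python
def attainable_tflops(intensity_flops_per_byte: float,
                      peak_tflops: float,
                      bandwidth_tb_s: float) -> float:
    """Roofline bound: min(compute peak, arithmetic intensity * bandwidth)."""
    # FLOPs/byte * TB/s gives TFLOP/s, so the units line up directly.
    return min(peak_tflops, intensity_flops_per_byte * bandwidth_tb_s)


PEAK_TFLOPS = 989.0     # assumed dense FP16 peak (illustrative)
BANDWIDTH_TB_S = 3.35   # memory bandwidth quoted in the article

workloads = [("GEMV, decode", 2.0), ("GEMM, training", 100.0), ("large GEMM", 400.0)]
for name, intensity in workloads:
    bound = attainable_tflops(intensity, PEAK_TFLOPS, BANDWIDTH_TB_S)
    share = 100 * bound / PEAK_TFLOPS
    print(f"{name:<15} {intensity:>5.0f} FLOPs/byte -> {bound:7.1f} TFLOPS ({share:.1f}% of peak)")
```

Under these assumptions a 2 FLOPs/byte workload can use well under 1% of the compute peak, which matches the very low inference utilization figures cited earlier.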

The “Transport Burden” of Large Models in Inference

The basic process of large model inference can be summarized as follows: at batch size 1, every generated token requires reading all of the model's weights from memory into the compute cores. For example, Llama 3 70B in FP16 precision (2 bytes per parameter) has about 140GB of weights, so generating one token means moving roughly 140GB. To generate 30 tokens per second, the bandwidth between HBM and the compute cores must reach roughly 140GB × 30 = 4.2TB/s.

This demand already meets or exceeds current hardware limits. NVIDIA's H100 SXM5 offers 3.35TB/s of memory bandwidth, which caps a 140GB model at about 24 tokens per second for a single request, and its 80GB of HBM cannot hold the FP16 weights on one card in the first place. As models grow to hundreds of billions or trillions of parameters, the required bandwidth grows at least linearly with parameter count.
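The bandwidth arithmetic above can be reproduced in a few lines. This is a simplified sketch: it assumes batch size 1, counts only weight traffic (no KV cache or activations), and ignores whether the weights fit on a single device.

```python
def weights_gb(params_billion: float, bytes_per_param: float) -> float:
    """Model weight size in GB (1e9 parameters * bytes per parameter)."""
    return params_billion * bytes_per_param


def required_bandwidth_tb_s(model_gb: float, tokens_per_s: float) -> float:
    """Bandwidth needed if every token reads all weights once."""
    return model_gb * tokens_per_s / 1000


def max_tokens_per_s(model_gb: float, bandwidth_tb_s: float) -> float:
    """Upper bound on single-request decode speed for a given bandwidth."""
    return bandwidth_tb_s * 1000 / model_gb


llama_70b_fp16 = weights_gb(70, 2)                       # 140 GB
needed = required_bandwidth_tb_s(llama_70b_fp16, 30)     # for 30 tokens/s
ceiling = max_tokens_per_s(llama_70b_fp16, 3.35)         # at 3.35 TB/s

print(f"weights: {llama_70b_fp16:.0f} GB")
print(f"bandwidth for 30 tokens/s: {needed:.1f} TB/s")
print(f"ceiling at 3.35 TB/s: {ceiling:.1f} tokens/s")
```

Batching several requests together amortizes each weight read over more tokens, which is why real serving throughput can exceed this single-request ceiling.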

Capacity and Bandwidth Double Constraints

Memory capacity is the other critical dimension. If a model's weights exceed the HBM capacity of a single GPU, the model must be split across multiple GPUs, for example with tensor parallelism. This introduces a new communication bottleneck: intermediate results must be exchanged between GPUs frequently, which can further reduce overall efficiency.
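A rough way to estimate how many GPUs a model needs on capacity grounds alone is sketched below. The `reserve_fraction` parameter is a hypothetical allowance for KV cache, activations, and runtime overhead, and the 192GB device is a made-up larger part for comparison; neither value comes from this article.

```python
import math


def gpus_needed(model_gb: float, hbm_per_gpu_gb: float,
                reserve_fraction: float = 0.2) -> int:
    """Minimum GPU count so the weights fit in the usable share of HBM."""
    # reserve_fraction is an assumed share of HBM kept free for everything
    # other than weights (KV cache, activations, runtime buffers).
    usable_gb = hbm_per_gpu_gb * (1 - reserve_fraction)
    return math.ceil(model_gb / usable_gb)


MODEL_GB = 140  # 70B parameters in FP16, 2 bytes each

print(gpus_needed(MODEL_GB, 80, reserve_fraction=0.0))  # weights only, 80 GB GPUs
print(gpus_needed(MODEL_GB, 80))                        # with a 20% reserve
print(gpus_needed(MODEL_GB, 192))                       # hypothetical 192 GB GPU
```

Every additional GPU in the split adds inter-GPU traffic, so higher per-device HBM capacity reduces both the device count and the communication overhead.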

Thus, HBM’s value lies in two aspects: bandwidth determines the minimum inference latency and output speed; capacity determines whether the model fits into a single card, how many cards are needed, and the cost of cross-GPU communication.

The industry path is clear: HBM is shifting from a “high-end optional” to a “standard configuration” for AI computing. According to TrendForce, demand for HBM will grow over 130% year-over-year in 2025, and continue to grow over 70% in 2026 on a high base. HBM has evolved from a supporting role in graphics to an indispensable core component of AI compute infrastructure.

Industry-Wide Impact: From Technology Choices to a Trillion-Dollar Market Imbalance

Market Scale Surge

The expansion of the HBM market has outpaced early industry forecasts. According to SEMI China data, by 2026, the HBM market size is expected to grow 58% to $54.6 billion, accounting for nearly 40% of the entire DRAM market. Micron projects a compound annual growth rate (CAGR) of about 40% for the HBM total addressable market (TAM), rising from approximately $35 billion in 2025 to over $100 billion by 2028—surpassing the entire DRAM market size in 2024.
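As a quick consistency check on the projection, the sketch below compounds $35 billion at 40% for three years and also solves for the growth rate that would reach $100 billion. Treating 2025 to 2028 as exactly three compounding periods is an assumption.

```python
def project(value: float, cagr: float, years: int) -> float:
    """Value after compounding at `cagr` for `years` periods."""
    return value * (1 + cagr) ** years


def implied_cagr(start: float, end: float, years: int) -> float:
    """Annual growth rate that takes `start` to `end` in `years` periods."""
    return (end / start) ** (1 / years) - 1


tam_2028 = project(35.0, 0.40, 3)        # $ billions
rate = implied_cagr(35.0, 100.0, 3)

print(f"$35B at 40% for 3 years: ${tam_2028:.1f}B")
print(f"CAGR needed to reach $100B: {rate:.1%}")
```

The two figures line up to within rounding: a rate of about 40% lands just under $100 billion, and reaching $100 billion needs a rate closer to 42%.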

Supply-Side Rigid Constraints

However, the explosive demand and the rigid capacity of supply create a sharp contradiction. SEMI data shows that although Samsung, SK Hynix, and Micron have shifted 70% of their new or adjustable capacity toward HBM production, overall HBM capacity still falls short by 50%-60%.

The root of the capacity bottleneck lies in the high barriers to HBM manufacturing. Producing HBM requires advanced DRAM process nodes (currently the 1β, or 1-beta, node at the leading manufacturers), as well as TSV etching, micro-bump bonding, wafer-level packaging, and other advanced packaging technologies. TSMC's CoWoS packaging capacity, a key platform for integrating HBM with GPUs, is expected to expand to over 125k wafers per month by late 2026, up about 79% year-over-year, but still cannot fully meet orders from NVIDIA, AMD, Broadcom, and others.

Supply Chain Risks and Price Transmission

The capacity gap directly impacts prices. The unit price of HBM3E rose 5%-10% in 2025. More concerningly, as the three major manufacturers shift large-scale capacity toward HBM, supply of consumer-grade DDR memory shrinks significantly, with prices expected to continue rising through 2026. The shortage of HBM is exerting upward pressure on the entire memory supply chain through capacity constraints.

In June 2026, Jensen Huang confirmed that SK Hynix, Samsung, and Micron have all completed certification and begun large-scale supply of HBM4 chips. Samsung led the industry with mass production starting in February 2026. Even with these capacity expansions, the supply-demand gap for HBM in 2025-2026 remains around 50%. Achieving short-term supply-demand balance remains difficult. The combined effects of upstream capacity expansion, packaging bottlenecks, and the rapid growth of AI compute needs create a dynamic but persistently tight supply-demand landscape.

Conclusion

Across the underlying technology, the hard dependence of AI workloads on bandwidth, and the supply-demand imbalance running through the whole industry chain, HBM has evolved from a branch of memory technology into a core battleground of AI infrastructure competition.

Its irreplaceability in AI training and inference stems from a fundamental computing logic: once model parameters pass a certain threshold, bandwidth stops being an “optimization” and becomes an “enabler,” and below that threshold the system cannot operate effectively. GDDR6 has cost advantages, but its narrow, high-frequency architecture cannot deliver the bandwidth and energy efficiency that trillion-parameter models require. This structural difference means that in the core AI compute race, HBM and GDDR are not simply competitors; they serve different tiers of demand.

Looking ahead, the mass production of HBM4 (with single-stack bandwidth expected to exceed 2TB/s), the maturation of 16-layer stacking technology, and the adoption of new packaging processes like hybrid bonding will further push HBM’s performance ceiling. However, manufacturers like Huawei are exploring algorithmic optimizations to reduce reliance on HBM, as well as alternatives such as SRAM and in-memory computing architectures. Whether HBM can maintain its technological lead amid ongoing iterations, and whether its supply bottlenecks can be effectively alleviated during capacity expansion cycles, will be among the most critical variables in the AI compute industry in the coming years.
