
# Carved in Stone

How AI Chips Break Through the “Memory Wall”

Traditionally, consumer GPUs are designed for gaming and rendering. However, they are also capable of performing other tasks that require parallel computing.

For example, proof-of-work cryptocurrency miners can run on graphics cards, but competition from specialized hardware has pushed GPU mining farms into niche projects.

A similar situation is developing in the AI field. Graphics cards have become the primary computing tool for neural networks. But as the industry evolves, there is increasing demand for specialized solutions for AI workloads. ForkLog examined the current state of this new wave in artificial intelligence.

Silicon Optimization for AI

There are several approaches to creating specialized hardware for AI tasks.

Consumer GPUs can be considered the starting point for specialization. Their ability to handle parallel matrix computations has been useful for deploying neural networks and deep learning, but there was still room for improvement.

One of the main problems with AI on graphics cards is the constant need to transfer large amounts of data between system memory and the GPU. These accompanying processes can take more time and energy than the actual computations.
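A back-of-envelope comparison makes the cost of this shuttling concrete. The figures below are nominal peak numbers (PCIe 4.0 x16 at roughly 32 GB/s versus H200-class HBM3e at 4.8 TB/s), and the 10 GB payload is an arbitrary illustration:

```python
# Back-of-envelope: moving 10 GB over the PCIe link between system memory
# and the GPU vs. reading the same data from on-package HBM.
# Nominal peak bandwidths; real sustained figures are lower.
GB = 1e9

pcie4_x16_bw = 32 * GB    # ~32 GB/s, PCIe 4.0 x16 peak
hbm3e_bw = 4.8e12         # 4.8 TB/s, H200-class HBM3e peak

data = 10 * GB

pcie_time = data / pcie4_x16_bw   # seconds spent just on the transfer
hbm_time = data / hbm3e_bw

print(f"PCIe transfer: {pcie_time * 1e3:.1f} ms")
print(f"HBM read:      {hbm_time * 1e3:.2f} ms")
print(f"ratio: ~{pcie_time / hbm_time:.0f}x")
```

Even with ideal peak numbers, the host-to-GPU transfer is two orders of magnitude slower than reading the same bytes from local high-bandwidth memory, which is why minimizing off-chip traffic dominates accelerator design.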

Another issue stems from the versatility of GPUs. The architecture of graphics cards is designed for a wide range of tasks—from rendering graphics to general-purpose computing. As a result, some hardware blocks are redundant for specialized AI workloads.

Data format is also a limiting factor. Historically, graphics processors were optimized for FP32 operations—32-bit floating-point numbers. For inference and training, lower-precision formats are typically used: 16-bit FP16 and BF16, as well as integer formats like INT4 and INT8.
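The practical impact of format choice is easy to quantify: halving the bit width halves both the memory footprint and the bandwidth needed to stream the weights. A sketch for a hypothetical 8-billion-parameter model:

```python
# Memory footprint of a hypothetical 8-billion-parameter model
# at the precisions mentioned above.
params = 8e9

bits_per_format = {"FP32": 32, "FP16": 16, "BF16": 16, "INT8": 8, "INT4": 4}

for fmt, bits in bits_per_format.items():
    gigabytes = params * bits / 8 / 1e9   # bits -> bytes -> GB
    print(f"{fmt:>5}: {gigabytes:5.1f} GB")
```

Dropping from FP32 to INT4 shrinks the same model from 32 GB to 4 GB, which is the difference between needing a data-center accelerator and fitting on a consumer device.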

Nvidia H200 and B200

Some of the most popular products for inference and training are the H200 chips and DGX B200 server systems—essentially “enhanced” GPUs for data centers.

The main AI-oriented component of these accelerators is the tensor cores, units built for fast matrix multiplication, the operation that dominates workloads such as model training and batch inference.

To reduce data access latency, Nvidia equips its cards with large amounts of high-performance memory (HBM, High Bandwidth Memory). The H200 features 141 GB of HBM3e with a bandwidth of 4.8 TB/s, and the B200’s specs are even higher depending on the configuration.
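Why bandwidth matters so much: during single-stream decoding, each generated token requires reading essentially all model weights from memory once, so peak bandwidth sets a hard ceiling on tokens per second. A rough roofline-style estimate, assuming a hypothetical 70-billion-parameter model stored in FP16:

```python
# Roofline-style upper bound on decode speed for a memory-bound LLM:
# each generated token streams every weight from HBM once.
bandwidth = 4.8e12        # H200 HBM3e peak, bytes/s
model_bytes = 70e9 * 2    # hypothetical 70B-parameter model in FP16

tokens_per_s = bandwidth / model_bytes
print(f"~{tokens_per_s:.0f} tokens/s per chip (single-stream ceiling)")
```

Real throughput is lower still (attention caches and activations also consume bandwidth), which is why batching requests and shrinking weight formats are the standard levers for inference efficiency.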

Tensor Processing Unit

By 2015, Google had developed the Tensor Processing Unit (TPU), an ASIC built around systolic arrays and designed for machine-learning workloads.

Tensor Processing Unit 3.0. Source: Wikipedia.

In conventional processor architectures—CPUs and GPUs—each operation involves reading, processing, and writing intermediate data to memory.

A TPU instead passes data through an array of blocks, each performing a mathematical operation and handing its result to the next. Memory access occurs only at the beginning and end of the computation sequence.
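The dataflow can be sketched as a toy output-stationary systolic array for square matrices: each processing element accumulates one output value, and operands arrive skewed by one cycle per row and column. This is a simplified illustration of the principle, not Google's actual design:

```python
def systolic_matmul(A, B):
    """Toy output-stationary systolic array for square matrices.

    PE (i, j) accumulates C[i][j]; rows of A stream in from the left and
    columns of B from the top, each skewed by one cycle per row/column.
    Memory is only touched when operands enter and results leave.
    """
    n = len(A)
    C = [[0] * n for _ in range(n)]
    # With unit skew, an n x n array finishes in 3n - 2 cycles.
    for t in range(3 * n - 2):
        for i in range(n):
            for j in range(n):
                k = t - i - j          # operand pair reaching PE(i, j) now
                if 0 <= k < n:
                    C[i][j] += A[i][k] * B[k][j]
    return C
```

Every partial product stays inside the array; only the final `C` is written back, which is exactly the property that saves the TPU time and energy relative to a load-compute-store loop.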

This approach allows for less time and energy consumption on AI calculations compared to non-specialized graphics processors, but working with external memory remains a limiting factor.

Cerebras

The American company Cerebras found a way to use an entire silicon wafer as a single processor; ordinarily, a wafer is cut into many individual dies during chip manufacturing.

In 2019, it introduced its first Wafer-Scale Engine, built from a full 300-mm wafer. In 2024, the company released an improved processor, the WSE-3, containing 900,000 cores.

Cerebras WSE-3 and two Nvidia B200 chips. Source: Cerebras.

The Cerebras architecture distributes SRAM blocks close to the logic modules on the same silicon wafer. Each core works with its own 48 KB of local memory and does not compete with others for access.
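Multiplying out the published per-core figure gives the aggregate on-wafer SRAM:

```python
# Aggregate on-wafer SRAM implied by the per-core figure above.
cores = 900_000
sram_per_core = 48 * 1024          # 48 KB per core, in bytes

total = cores * sram_per_core
print(f"total on-chip SRAM: {total / 1e9:.1f} GB")
```

That works out to roughly 44 GB of SRAM directly on the wafer, enough to hold a mid-sized model's weights entirely in on-chip memory with no external-memory round trips.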

According to developers, many inference models can run on a single WSE-3. For larger tasks, a cluster of multiple such chips can be assembled.

Groq LPU

Groq (not to be confused with Grok from xAI) offers its own ASICs for inference based on the Language Processing Unit (LPU) architecture.

Groq chip. Source: Groq.

One key feature of Groq chips is optimization for sequential operations.

Inference relies on sequential token generation: each step depends on the result of the previous one. Under these conditions, performance hinges more on the speed of a single execution thread than on the number of threads.
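This dependency chain is easy to see in code. The `model` function below is a toy stand-in for a real forward pass; the point is that each call must finish before the next can begin, so extra parallel units cannot shorten the chain:

```python
def model(tokens):
    # Toy "next token" rule (illustration only): sum of context mod vocab.
    return sum(tokens) % 50_000

def generate(prompt, n_tokens):
    """Autoregressive decoding: a strictly sequential dependency chain."""
    tokens = list(prompt)
    for _ in range(n_tokens):
        nxt = model(tokens)   # step N cannot start before step N-1 finishes
        tokens.append(nxt)
    return tokens
```

Total latency is (per-step latency) x (number of tokens), which is why an architecture tuned for fast single-thread execution, like the LPU, targets exactly this bottleneck.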

Unlike traditional general-purpose processors and some AI-specific devices, Groq does not generate machine instructions on the fly. Each operation is pre-planned in a kind of “schedule” and tied to a specific moment in the processor’s operation.

At the same time, like many other AI accelerators, LPU combines logic and memory modules on a single chip to minimize data transfer costs.

Taalas

All the above examples imply a high degree of programmability. The model and necessary weights are loaded into rewritable memory. At any moment, an operator can load a completely different model or make adjustments.

With this approach, performance depends on the availability, speed, and volume of memory.

Developers at Taalas went further, deciding to “embed” a specific model with preloaded weights directly into the chip at the transistor architecture level.

The model, normally pure software, is implemented in hardware, eliminating the separate general-purpose weight storage and its associated costs.

In their first solution—the HC1 inference card—the company used the open-source Llama 3.1 8B model.

Taalas HC1. Source: Taalas.

The card stores parameters at low precision (6-bit and even 3-bit), enabling faster processing. According to Taalas, the HC1 processes up to 17,000 tokens per second while remaining a relatively inexpensive device with low power consumption.

The company claims roughly a thousandfold advantage over GPUs in energy efficiency and cost per unit of performance.

However, this method has a fundamental drawback—it’s impossible to update the model without replacing the entire chip.

At the same time, HC1 is equipped with support for LoRA—a method of “fine-tuning” LLMs by adding extra weights. With the right LoRA configuration, the model can be specialized for a particular domain.
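The idea behind LoRA can be sketched in a few lines: the frozen weight matrix W is augmented with a low-rank update B·A scaled by alpha/r, and only the small matrices A and B are trained and stored. A minimal pure-Python illustration (the function names are mine, not Taalas's API):

```python
def matmul(X, Y):
    """Plain list-of-lists matrix product (illustration only)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

def lora_effective_weight(W, A, B, alpha=1.0):
    """W_eff = W + (alpha / r) * B @ A, with A: r x d_in, B: d_out x r.

    The frozen base weights W stay untouched; only the tiny low-rank
    factors A and B carry the domain-specific adaptation.
    """
    r = len(A)                      # LoRA rank
    delta = matmul(B, A)            # d_out x d_in low-rank update
    scale = alpha / r
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]
```

For an 8B-parameter model, a rank-8 adapter touches only a few million extra values, which is what makes it feasible to specialize even a chip whose base weights are frozen in silicon.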

Another challenge is the design and manufacturing process of such “physical models.” Developing ASICs is expensive and can take years. In the highly competitive AI industry, this is a significant limitation.

Taalas claims to have developed a new method for generating processor architecture, aimed at solving this problem. An automated system converts a model and its weights into a ready-made chip design within a week.

According to the company’s estimates, the production cycle—from receiving a new, previously unknown model to releasing finished chips with its physical implementation—will take about two months.

The Future of Local Inference

New specialized AI chips primarily occupy space in large data center installations, providing cloud services for a fee. Non-trivial solutions, including “physical models” implemented directly in silicon, are no exception.

For consumers, these engineering breakthroughs will manifest mainly as lower prices and faster responses.

At the same time, the emergence of simpler, cheaper, and more energy-efficient chips creates the prerequisites for popularizing local inference solutions.

Already, specialized AI chips are found in smartphones and laptops, surveillance cameras, and even doorbells. They enable tasks to be performed locally, ensuring low latency, autonomy, and privacy.

Radical optimization—albeit at the expense of flexibility in model selection and replacement—significantly expands the capabilities of such devices and allows simple AI components to be integrated into inexpensive mass-market products.

If most users start directing their queries to models running on local devices, the load on data center capacities could decrease, reducing the risk of industry overload. Perhaps then, there will be no need to pursue radical solutions like launching computing power into orbit.
