Caitong Securities: Architectural Innovation Breaks Through Large Model Inference Latency Bottleneck, Vast Market Space Expected to Scale Rapidly


Caitong Securities released a research report stating that the LPU is a new-generation chip designed for large model inference, built around the TSP architecture. The firm believes the LPU's excellent low-latency inference performance positions it for rapid penetration, and it is optimistic about the LPU's high growth potential and the PCB opportunities created by LPU cabinet shipments. Recommended stocks include: Zhiwei Intelligence (001339.SZ) (holds a stake in Yuanchuan Micro), Xingchen Technology (301536.SZ) (participated in multiple funding rounds of Yuanchuan Micro), WUS Printed Circuit (002463.SZ) (NVIDIA PCB supplier), Shenghong Technology (300476.SZ) (NVIDIA PCB supplier), and Shennan Circuit (002916.SZ).

Caitong Securities’ main points are as follows:

LPU is a new generation chip designed for large model inference, with TSP architecture at its core.

The LPU is a new chip architecture tailored to sequential, compute-intensive workloads. Its core is the TSP architecture, which comprises five major functional modules. The TSP distributes the stages of a traditional five-stage processor pipeline across the entire chip, eliminating hardware scheduling complexity and making instruction execution order and timing fully deterministic. Under the TSP architecture, the compiler can directly access and precisely control the chip's underlying hardware state, enabling software-defined hardware.

LPU can reduce latency during large model inference, enhancing user experience.

Latency during large model inference is closely tied to user experience. The main bottleneck lies in the decode stage, which is limited by memory bandwidth. The LPU offers higher memory bandwidth, which shortens inference latency. In addition, large models running on LPUs not only infer faster but also offer better cost-effectiveness, further improving user experience.
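The bandwidth argument above can be sketched with a back-of-the-envelope calculation: during autoregressive decoding, each new token requires streaming the model's weights from memory, so per-token latency is roughly lower-bounded by bytes moved per token divided by memory bandwidth. All figures below are hypothetical assumptions for illustration, not published LPU specifications.

```python
# Rough decode-stage latency estimate: per-token time is approximately
# (bytes of weights read per token) / (memory bandwidth).
def per_token_latency_ms(param_count: float, bytes_per_param: float,
                         bandwidth_gb_s: float) -> float:
    """Approximate lower-bound per-token decode latency in milliseconds."""
    bytes_per_token = param_count * bytes_per_param
    seconds = bytes_per_token / (bandwidth_gb_s * 1e9)
    return seconds * 1e3

# Hypothetical 70B-parameter model stored at 2 bytes/param (FP16 weights),
# compared across two assumed bandwidth classes:
hbm = per_token_latency_ms(70e9, 2, 3_350)    # ~3.35 TB/s, HBM-class
sram = per_token_latency_ms(70e9, 2, 80_000)  # ~80 TB/s, on-chip SRAM-class

print(f"HBM-class:  {hbm:.2f} ms/token")
print(f"SRAM-class: {sram:.2f} ms/token")
```

Under these assumed numbers, higher-bandwidth on-chip memory cuts the per-token lower bound from tens of milliseconds to about two, which is the mechanism behind the latency advantage the report describes.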

LPU has broad potential for development and has entered initial mass production.

Token consumption has surged. In early 2024, China's daily token consumption was roughly 100 billion; by February 2026, daily token consumption across mainstream large models had reached 180 trillion, driving rapid growth in the inference chip market. Because the LPU reduces inference latency for large models, the firm believes it is well positioned to gradually penetrate this high-growth market. The LPU has already entered initial mass production, with volume ramp-up imminent.
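As a rough check on the growth figures cited above, the implied multiple and annualized rate can be computed directly; the ~2-year elapsed period is an assumption based on the dates given in the report.

```python
# Implied growth in daily token consumption from the report's figures:
# 100 billion/day in early 2024 -> 180 trillion/day by February 2026.
early_2024 = 100e9   # tokens per day
feb_2026 = 180e12    # tokens per day

multiple = feb_2026 / early_2024        # total growth multiple
years = 2.0                             # approximate elapsed time (assumption)
cagr = multiple ** (1 / years) - 1      # implied annualized growth rate

print(f"Growth multiple: {multiple:.0f}x")
print(f"Implied annualized growth over ~{years:.0f} years: {cagr:.0%}")
```

A 1,800x increase in two years illustrates why the report expects inference chip demand to scale rapidly.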

Risk warnings: Risks of AI technology development not meeting expectations; risks of large model development falling short; risks of industry development not meeting expectations.
