Ni Yayu: Focusing on edge-side intelligent applications, Guoke Micro develops more efficient NPUs and toolchains
(Source: Ijiwei)
On April 1, at the “Edge AI and Compute Chips” technical sub-forum of the 2026 China IC Leaders Summit, Ni Yayu, head of the AI Algorithm Department at Guokewei, delivered a keynote titled “FlashAttention-4: Designing a Pipeline Paradigm for a New Generation of Large-Model Inference NPUs.”
As large models accelerate toward industrial deployment, inference efficiency, memory bandwidth, and system power consumption have become the key bottlenecks for edge-side deployment. In particular, as Transformers and large language models continue to evolve, efficient implementation of the attention mechanism has become a key target of chip-architecture and toolchain optimization.
Ni Yayu said that Guokewei is focusing on the deployment and exploration of frontier technologies such as FlashAttention on NPU platforms, working to build an NPU architecture and toolchain better suited for end-side mass-production deployment, providing high-performance computing support for scenarios such as autonomous driving, edge computing, smart terminals, and AIGC.
Deploying “full-strength” FlashAttention on NPUs still faces challenges
As one of the core computational structures in large models, the attention mechanism commonly suffers from high memory-access overhead and limited pipeline efficiency in practice. FlashAttention offers a new path to addressing these problems.
FlashAttention is a fast, memory-efficient exact attention algorithm proposed in 2022 by Tri Dao and collaborators at Stanford University. By restructuring the attention computation into a mathematically equivalent form, and using techniques such as block-wise (tiled) computation, online softmax, recomputation, and asynchronous pipelining, it keeps intermediate results in on-chip memory, reducing pressure on external memory bandwidth and significantly improving inference efficiency.
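The tiled computation and online softmax described above can be sketched in plain NumPy. This is an illustrative single-head version, not Guokewei's implementation or the official CUDA kernel; the function name, `block_size`, and the array shapes are assumptions made for the example. The key property is that only one K/V block of scores is ever materialized, rather than the full N x N score matrix.

```python
import numpy as np

def flash_attention(Q, K, V, block_size=64):
    """Tiled exact attention with an online softmax (FlashAttention-style sketch).

    Q, K, V: (N, d) arrays for a single head. K/V are processed in blocks,
    so live intermediates are O(block_size) per query row instead of O(N).
    """
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros_like(Q, dtype=np.float64)
    m = np.full(N, -np.inf)   # running row-wise max of the scores
    l = np.zeros(N)           # running softmax denominator

    for start in range(0, N, block_size):
        Kb = K[start:start + block_size]
        Vb = V[start:start + block_size]
        S = (Q @ Kb.T) * scale                 # scores for this block only
        m_new = np.maximum(m, S.max(axis=1))   # updated running max
        alpha = np.exp(m - m_new)              # rescale old accumulators
        P = np.exp(S - m_new[:, None])         # block softmax numerator
        l = l * alpha + P.sum(axis=1)
        out = out * alpha[:, None] + P @ Vb
        m = m_new
    return out / l[:, None]
```

Because each block's contribution is rescaled by `alpha` whenever the running max changes, the final result is numerically identical to standard softmax attention; no approximation is involved.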
FlashAttention 4.0 was officially released in mid-March. Ni Yayu noted that from 1.0 to 4.0, FlashAttention has steadily improved in parallelism, long-sequence support, low-precision computation, and asynchronous execution. However, compared with GPUs, current NPUs still lag in vector-unit compute, asynchronous pipeline scheduling, dynamic scheduling, and ultra-long-context support. To run “full-strength” FlashAttention, he said, the design must be coordinated across the compute pipeline, data reuse, and system bandwidth.
Guokewei NPU 4.0: Building a more efficient inference unit
Since 2020, Guokewei has continued to invest in independent NPU R&D, forming an evolution roadmap from GKNPU 1.0 to 4.0, with product capabilities upgrading toward higher compute performance, broader model coverage, and better energy efficiency. Guokewei’s AI-vision and in-vehicle AI chip series currently use the 3.0 NPU, which delivers 0.5 TOPS to 8 TOPS of compute and supports deploying vision, audio, and time-series AI models on edge-side chips.
In the GKNPU 4.0 architecture, Guokewei proposes an enhanced systolic-array design for efficient attention computation. It expands matrix and vector compute capabilities, strengthens support for the key operations in large-model attention, compresses data-movement paths and pipeline overhead, and improves the chip’s ability to keep computation in an on-chip closed loop. The design aims to reduce reliance on external bandwidth, improve execution efficiency along the inference path, and address the memory pressure that bandwidth bottlenecks, activation fragmentation, and ultra-long contexts impose on large-model inference.
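GKNPU 4.0's actual dataflow is not public, but the general idea of a systolic array, the building block behind the Q·Kᵀ and P·V matrix products in attention, can be illustrated with a cycle-level sketch of a generic output-stationary design. Everything here (the function name, the output-stationary schedule, the shapes) is an assumption for illustration, not a description of Guokewei's hardware.

```python
import numpy as np

def systolic_matmul(A, B):
    """Cycle-level sketch of an output-stationary systolic array computing A @ B.

    A is (M, K), B is (K, N). PE (i, j) holds accumulator C[i, j].
    A streams in from the left with each row skewed by i cycles; B streams
    in from the top with each column skewed by j cycles. Under that skew,
    the operand pair (A[i, k], B[k, j]) meets at PE (i, j) on cycle
    t = i + j + k, where the PE multiplies and accumulates it. Operands
    then hop to the right/down neighbour, so each value is loaded from
    memory once and reused across a whole row or column of PEs.
    """
    M, K = A.shape
    K2, N = B.shape
    assert K == K2, "inner dimensions must match"
    C = np.zeros((M, N))
    cycles = M + N + K - 2  # last MAC fires at t = (M-1)+(N-1)+(K-1)
    for t in range(cycles):
        for i in range(M):
            for j in range(N):
                k = t - i - j  # operand pair reaching PE (i, j) this cycle
                if 0 <= k < K:
                    C[i, j] += A[i, k] * B[k, j]
    return C
```

The point of the sketch is the reuse pattern: each input element is fetched once and flows through the array, which is why systolic designs trade external-bandwidth pressure for on-chip data movement, the same trade-off the article attributes to the GKNPU 4.0 design goals.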
Strengthening the toolchain to promote efficient large-scale deployment
Alongside the NPU architecture’s evolution, Guokewei continues to strengthen its toolchain. The new-generation GKToolchain 3.0 targets end-side heterogeneous compute scenarios, focusing on hardware-aware compilation, automatic partitioning, automatic vectorization, asynchronous data movement, and compute-pipeline orchestration, pushing model deployment from “merely adaptable” toward “efficient and scalable.”
Meanwhile, the toolchain continues to evolve along frontier directions such as dynamic memory management and speculative-inference acceleration, improving its handling of long contexts and complex inference workflows and helping customers efficiently complete the closed loop from model to chip.
As AI applications move from the training side to the inference side, and from the cloud to the terminal, industry requirements for compute platforms are shifting from “peak performance” toward comprehensive capabilities such as “high energy efficiency, mass-producible, and easy to deploy.” NPUs have clear cost and power-consumption advantages in large-scale edge-side deployment.
Ni Yayu said that Guokewei will continue to pursue co-innovation of algorithms and hardware. Focusing on the core bottlenecks of large-model inference, the company will keep improving its NPU architecture, product capabilities, and toolchain, advancing edge-side intelligent-computing platforms toward higher performance, lower power consumption, and stronger engineering deployability, and offering customers more competitive compute solutions.