Ni Yayu: Focusing on edge-side intelligent applications, Guoke Micro develops more efficient NPUs and toolchains
(Source: Ijiwei)
On April 1, at the “Edge AI and Compute Chips” technical sub-forum of the 2026 China IC Leaders Summit, Ni Yayu, head of the AI Algorithm Department at Guokewei, delivered a keynote titled “FlashAttention-4: Designing a Pipeline Paradigm for a New Generation of Large-Model Inference NPUs.”
As large models accelerate toward industrial deployment, inference efficiency, memory bandwidth, and system power consumption have become the key bottlenecks for edge-side deployment. In particular, as Transformers and large language models continue to evolve, efficient implementation of the attention mechanism has become a key target of chip-architecture and toolchain optimization.
Ni Yayu said that Guokewei is focusing on the deployment and exploration of frontier technologies such as FlashAttention on NPU platforms, working to build an NPU architecture and toolchain better suited for end-side mass-production deployment, providing high-performance computing support for scenarios such as autonomous driving, edge computing, smart terminals, and AIGC.
Deploying “full-strength” FlashAttention on NPUs still faces challenges
As one of the core computational structures in large models, the attention mechanism commonly suffers from high memory-access overhead and limited pipeline efficiency in practice. FlashAttention offers a new path to addressing these problems.
FlashAttention is a fast, memory-efficient exact attention algorithm proposed in 2022 by Tri Dao and collaborators at Stanford University. By restructuring the attention computation into a mathematically equivalent form, and using techniques such as block-wise (tiled) computation, online softmax, recomputation, and asynchronous pipelining, it keeps intermediate results in on-chip memory, reducing pressure on external memory bandwidth and significantly improving inference efficiency.
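The tiled computation and online softmax described above can be sketched in plain NumPy. This is an illustrative single-head version, not Guokewei's implementation or the official CUDA kernel; the function name, `block_size`, and the array shapes are assumptions made for the example. The key property is that only one K/V block of scores is ever materialized, rather than the full N x N score matrix.

```python
import numpy as np

def flash_attention(Q, K, V, block_size=64):
    """Tiled exact attention with an online softmax (FlashAttention-style sketch).

    Q, K, V: (N, d) arrays for a single head. K/V are processed in blocks,
    so live intermediates are O(block_size) per query row instead of O(N).
    """
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros_like(Q, dtype=np.float64)
    m = np.full(N, -np.inf)   # running row-wise max of the scores
    l = np.zeros(N)           # running softmax denominator

    for start in range(0, N, block_size):
        Kb = K[start:start + block_size]
        Vb = V[start:start + block_size]
        S = (Q @ Kb.T) * scale                 # scores for this block only
        m_new = np.maximum(m, S.max(axis=1))   # updated running max
        alpha = np.exp(m - m_new)              # rescale old accumulators
        P = np.exp(S - m_new[:, None])         # block softmax numerator
        l = l * alpha + P.sum(axis=1)
        out = out * alpha[:, None] + P @ Vb
        m = m_new
    return out / l[:, None]
```

Because each block's contribution is rescaled by `alpha` whenever the running max changes, the final result is numerically identical to standard softmax attention; no approximation is involved.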
FlashAttention 4.0 was officially released in mid-March. Ni Yayu noted that from 1.0 to 4.0, FlashAttention has steadily improved in parallelism, long-sequence support, low-precision computation, and asynchronous execution. However, compared with GPUs, current NPUs still lag in vector-unit compute, asynchronous pipeline scheduling, dynamic scheduling, and ultra-long-context support. To run “full-strength” FlashAttention, he said, the design must be coordinated across the compute pipeline, data reuse, and system bandwidth.
Guokewei NPU 4.0: Building a more efficient inference unit
Since 2020, Guokewei has continued to invest in independent NPU R&D, forming an evolution roadmap from GKNPU 1.0 to 4.0, with product capabilities upgrading toward higher compute performance, broader model coverage, and better energy efficiency. Guokewei’s AI-vision and in-vehicle AI chip series currently use the 3.0 NPU, which delivers 0.5 TOPS to 8 TOPS of compute and supports deploying vision, audio, and time-series AI models on edge-side chips.
In the GKNPU 4.0 architecture, Guokewei proposes an enhanced systolic-array design for efficient attention computation. It expands matrix and vector compute capabilities, strengthens support for the key operations in large-model attention, compresses data-movement paths and pipeline overhead, and improves the chip’s ability to keep computation in an on-chip closed loop. The design aims to reduce reliance on external bandwidth, improve execution efficiency along the inference path, and address the memory pressure that bandwidth bottlenecks, activation fragmentation, and ultra-long contexts impose on large-model inference.
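GKNPU 4.0's actual dataflow is not public, but the general idea of a systolic array, the building block behind the Q·Kᵀ and P·V matrix products in attention, can be illustrated with a cycle-level sketch of a generic output-stationary design. Everything here (the function name, the output-stationary schedule, the shapes) is an assumption for illustration, not a description of Guokewei's hardware.

```python
import numpy as np

def systolic_matmul(A, B):
    """Cycle-level sketch of an output-stationary systolic array computing A @ B.

    A is (M, K), B is (K, N). PE (i, j) holds accumulator C[i, j].
    A streams in from the left with each row skewed by i cycles; B streams
    in from the top with each column skewed by j cycles. Under that skew,
    the operand pair (A[i, k], B[k, j]) meets at PE (i, j) on cycle
    t = i + j + k, where the PE multiplies and accumulates it. Operands
    then hop to the right/down neighbour, so each value is loaded from
    memory once and reused across a whole row or column of PEs.
    """
    M, K = A.shape
    K2, N = B.shape
    assert K == K2, "inner dimensions must match"
    C = np.zeros((M, N))
    cycles = M + N + K - 2  # last MAC fires at t = (M-1)+(N-1)+(K-1)
    for t in range(cycles):
        for i in range(M):
            for j in range(N):
                k = t - i - j  # operand pair reaching PE (i, j) this cycle
                if 0 <= k < K:
                    C[i, j] += A[i, k] * B[k, j]
    return C
```

The point of the sketch is the reuse pattern: each input element is fetched once and flows through the array, which is why systolic designs trade external-bandwidth pressure for on-chip data movement, the same trade-off the article attributes to the GKNPU 4.0 design goals.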
Strengthening the toolchain to promote efficient large-scale deployment
Alongside the NPU architecture’s evolution, Guokewei continues to strengthen its toolchain. The new-generation GKToolchain 3.0 targets end-side heterogeneous compute scenarios, focusing on hardware-aware compilation, automatic partitioning, automatic vectorization, asynchronous data movement, and compute-pipeline orchestration, pushing model deployment from “merely adaptable” toward “efficient and scalable.”
Meanwhile, the toolchain continues to evolve along frontier directions such as dynamic memory management and speculative-inference acceleration, improving its handling of long contexts and complex inference workflows and helping customers efficiently complete the closed loop from model to chip.
As AI applications move from the training side to the inference side, and from the cloud to the terminal, industry requirements for compute platforms are shifting from “peak performance” toward comprehensive capabilities such as “high energy efficiency, mass-producible, and easy to deploy.” NPUs have clear cost and power-consumption advantages in large-scale edge-side deployment.
Ni Yayu said that Guokewei will continue to pursue co-innovation of algorithms and hardware. Focusing on the core bottlenecks of large-model inference, the company will keep improving its NPU architecture, product capabilities, and toolchain, advancing edge-side intelligent-computing platforms toward higher performance, lower power consumption, and stronger engineering deployability, and offering customers more competitive compute solutions.