Implementing a Transformer as a pure hardware circuit: 50k tokens per second without a GPU

CryptoWorld News: Developers Luthira Abeykoon and Krish Chhajer have ported Karpathy's MicroGPT (only 4,192 parameters) to an FPGA in SystemVerilog, reaching generation speeds of over 50k tokens per second. The project, Talos-V2 (Tensor Accelerated Logic for On-Chip Systems), is open source on GitHub and runs on a DE1-SoC, an educational-grade Intel Cyclone V FPGA, with the weights stored in on-chip ROM in Q4.12 fixed-point format. The model's matrix-vector multiplication is implemented as a 16-channel systolic array; the Q/K/V projections, the MLP, and the LM head share this unit and run sequentially. The attention mechanism is implemented as a sequence of eight steps. The authors say the project aims to turn each step of Transformer inference into visible hardware: memories, counters, state machines, and lookup tables.
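To make the Q4.12 format concrete, here is a minimal software sketch of the arithmetic the article describes: 16-bit signed values with 4 integer and 12 fractional bits, and a row-wise multiply-accumulate, which is the operation a systolic array pipelines in hardware. The function names and values are illustrative, not taken from the Talos-V2 source.

```python
SCALE = 1 << 12                                # Q4.12: 12 fractional bits
INT_MIN, INT_MAX = -(1 << 15), (1 << 15) - 1   # 16-bit signed range

def to_q412(x: float) -> int:
    """Quantize a float to Q4.12, saturating at the 16-bit signed limits."""
    return max(INT_MIN, min(INT_MAX, round(x * SCALE)))

def from_q412(q: int) -> float:
    """Convert a Q4.12 integer back to a float."""
    return q / SCALE

def q412_mul(a: int, b: int) -> int:
    """Fixed-point multiply: the wide product is shifted back by 12 bits."""
    return (a * b) >> 12

def matvec_q412(W, x):
    """Row-wise multiply-accumulate over Q4.12 values -- the core
    matrix-vector operation that the 16-channel systolic array performs,
    and that Q/K/V, MLP, and LM-head projections would share."""
    return [sum(q412_mul(w, xi) for w, xi in zip(row, x)) for row in W]

# Hypothetical 2x2 weight matrix and input vector for illustration.
W = [[to_q412(0.5), to_q412(-1.25)],
     [to_q412(2.0), to_q412(0.75)]]
x = [to_q412(1.0), to_q412(2.0)]
y = [from_q412(v) for v in matvec_q412(W, x)]
# y == [-2.0, 3.5], matching 0.5*1.0 - 1.25*2.0 and 2.0*1.0 + 0.75*2.0
```

In an FPGA, each `q412_mul` would map to a hardened DSP multiplier and the shift to simple wiring, which is what makes this format cheap compared with floating point.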
