Trillion-parameter open-source model runs at 981 tokens/second: Cerebras tests Kimi K2.6 with a 29x speedup

ME News: On May 20 (UTC+8), according to Dongcha Beating monitoring, wafer-scale chip company Cerebras announced that the trillion-parameter large model Kimi K2.6 has been deployed on its hardware for enterprise testing. Because the chip is built directly on a full 12-inch silicon wafer, it eliminates the interconnect latency of traditional board-level communication. Hands-on testing by third-party evaluation agency Artificial Analysis shows a generation speed of 981 tokens/s, which is 6.7 times faster than mainstream GPU cloud services. In a long-text task with 10,000 input tokens and 500 output tokens, the total response time fell from 163.7 seconds on Kimi's official interface to 5.6 seconds, a 29x speedup. Because the model weights are distributed across multiple wafers and activations are streamed, inter-layer communication runs entirely on the on-wafer network fabric, whose physical communication bandwidth is more than 200 times that of NVLink in NVIDIA's NVL72 architecture. Combined with distributed computing optimizations, Kimi K2.6 achieves real-time operation by storing the weights in their original 4-bit format with low loss, computing in 16-bit floating point to maintain precision, and adopting custom operator kernels and speculative decoding. (Source: BlockBeats)
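The quoted figures can be sanity-checked with a few lines of arithmetic. The sketch below is not from the source; it only recomputes ratios from the numbers reported above. It assumes that "6.7 times faster" means 6.7 times the throughput, and that the 500 output tokens were generated at the steady 981 tokens/s rate. The variable names are illustrative.

```python
# Figures quoted in the article (Artificial Analysis test of Kimi K2.6 on Cerebras).
CEREBRAS_TOKENS_PER_S = 981.0   # reported generation speed
SPEED_RATIO_VS_GPU = 6.7        # assumed to mean 6.7x the GPU-cloud throughput
OFFICIAL_TOTAL_S = 163.7        # total response time on Kimi's official interface
CEREBRAS_TOTAL_S = 5.6          # total response time on Cerebras
OUTPUT_TOKENS = 500             # output length of the long-text task


def speedup(baseline_s: float, new_s: float) -> float:
    """Return how many times faster new_s is than baseline_s."""
    return baseline_s / new_s


# End-to-end speedup for the 10,000-in / 500-out task.
end_to_end_speedup = speedup(OFFICIAL_TOTAL_S, CEREBRAS_TOTAL_S)

# GPU-cloud throughput implied by the 6.7x ratio (an inference, not a reported value).
implied_gpu_tokens_per_s = CEREBRAS_TOKENS_PER_S / SPEED_RATIO_VS_GPU

# Time to emit 500 tokens at the reported rate, and what is left of the 5.6 s total.
# Assumption: decoding ran at the steady 981 tokens/s for the whole output.
decode_time_s = OUTPUT_TOKENS / CEREBRAS_TOKENS_PER_S
non_decode_time_s = CEREBRAS_TOTAL_S - decode_time_s

print(f"End-to-end speedup:      {end_to_end_speedup:.1f}x")
print(f"Implied GPU throughput:  {implied_gpu_tokens_per_s:.0f} tokens/s")
print(f"Decode time (500 tok):   {decode_time_s:.2f} s")
print(f"Remaining (non-decode):  {non_decode_time_s:.2f} s")
```

Under these assumptions, the 163.7 s to 5.6 s reduction matches the stated 29x figure. Decoding would account for only about half a second of the 5.6 s total, which suggests the rest went to processing the 10,000 input tokens and other overhead. The article does not break this down.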