Moonshot AI and Tsinghua's New Paper: LLM Prefill Can Cross Data Centers, 1T-Model Throughput Up 54%
ME News: On April 18 (UTC+8), according to Beating Monitoring, Moonshot AI and Tsinghua University released a paper on arXiv on April 16 titled “Prefill-as-a-Service,” proposing to run the prefill stage of large-model inference across data centers.
Large-model inference consists of two steps: prefill reads the entire input at once and generates a KV cache; decode then produces the output token by token from that cache. The two steps stress hardware differently: prefill is compute-bound, while decode is bound by memory bandwidth. The mainstream industry approach is to run the two steps on different machines (PD disaggregation), but this requires RDMA interconnect within a single data center, because dense-attention models produce KV cache at tens of Gbps; if the transfer is slow, the GPUs sit idle.
The turning point is a new generation of hybrid attention models. The paper's experiments show that models such as Kimi Linear, MiMo-V2-Flash, and Ring-2.5-1T cut the rate at which KV cache is produced by about one order of magnitude by combining a small number of full-attention layers with a large number of linear-complexity layers; Ring-2.5-1T achieves an overall compression ratio of 36x. At that point, the KV cache no longer needs a dedicated RDMA network and can be carried over ordinary Ethernet.
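To make the bandwidth argument concrete, here is a rough back-of-the-envelope sketch. The function name, the prefill rate, and the per-token KV size are illustrative assumptions, not figures from the paper; only the 36x ratio comes from the report above, and applying it to a made-up dense baseline is itself a simplification.

```python
def kv_egress_gbps(prefill_tokens_per_s: float, kv_bytes_per_token: float) -> float:
    """Rate at which a prefill node emits KV cache, in gigabits per second."""
    bits_per_s = prefill_tokens_per_s * kv_bytes_per_token * 8
    return bits_per_s / 1e9


# Hypothetical dense-attention baseline (assumed numbers, for illustration only):
# 10,000 prefilled tokens per second, 500 kB of KV cache per token.
dense_gbps = kv_egress_gbps(10_000, 500_000)

# Apply the 36x overall compression ratio reported for Ring-2.5-1T
# to the same hypothetical baseline.
hybrid_gbps = dense_gbps / 36

print(f"dense:  {dense_gbps:.1f} Gbps")
print(f"hybrid: {hybrid_gbps:.2f} Gbps")
```

Under these assumed inputs the dense case lands in the tens-of-Gbps range the article describes, while the compressed case needs roughly 1 Gbps, which is the kind of gap that makes ordinary Ethernet plausible.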
Concretely, PrfaaS sets up an independent “prefill cluster” and routes to it only those requests that have long contexts and whose prefixes miss the cache, while short requests stay in the local PD cluster. Once prefill completes, the KV cache is sent back over Ethernet to the local cluster for decoding. The design also introduces length-threshold routing, a bandwidth-aware scheduler, and a hybrid prefix cache pool. The paper reports experiments on an internal 1T-parameter hybrid model (based on the Kimi Linear architecture) in which overall service throughput is 54% higher than a homogeneous PD deployment and 32% higher than a naive heterogeneous scheme, with each machine using only a moderate amount of cross-data-center bandwidth.
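The routing idea described above can be sketched roughly as follows. This is a minimal illustration, not the paper's scheduler: the `Request` type, the `route` function, its parameters, and the simplification that transfer cost scales with the uncached token count are all assumptions made for the example.

```python
from dataclasses import dataclass

LOCAL = "local_pd"          # hypothetical label: local prefill/decode cluster
REMOTE = "remote_prefill"   # hypothetical label: cross-data-center prefill cluster


@dataclass
class Request:
    prompt_tokens: int             # total input length
    cached_prefix_tokens: int = 0  # tokens already covered by the local prefix cache


def route(req: Request,
          length_threshold: int,
          kv_bytes_per_token: float,
          free_link_bytes_per_s: float,
          max_transfer_s: float) -> str:
    """Pick a cluster for the prefill stage of one request."""
    uncached = req.prompt_tokens - req.cached_prefix_tokens

    # Length-threshold routing: short or mostly cached requests stay local.
    if uncached < length_threshold:
        return LOCAL

    # Bandwidth awareness: go remote only if the KV cache can be shipped
    # back over the available link share within the time budget.
    if free_link_bytes_per_s <= 0:
        return LOCAL
    transfer_s = uncached * kv_bytes_per_token / free_link_bytes_per_s
    return REMOTE if transfer_s <= max_transfer_s else LOCAL


# Example with assumed numbers: 8,192-token threshold, 10 kB of KV per token,
# 1 GB/s of spare Ethernet bandwidth, 2-second transfer budget.
params = dict(length_threshold=8192, kv_bytes_per_token=10_000,
              free_link_bytes_per_s=1e9, max_transfer_s=2.0)
print(route(Request(1_000), **params))
print(route(Request(100_000), **params))
print(route(Request(100_000, cached_prefix_tokens=95_000), **params))
```

A production scheduler would also have to account for queueing at the prefill cluster and for the hybrid prefix cache pool, both of which this sketch leaves out.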
(Source: BlockBeats)