The 70-layer model's prefill attention compute is on par with a traditional model that has far fewer layers, so the 1:7 GA/SWA architecture has some real substance to it.

BlockBeatNews
Lofree reveals MiMo's cost-reduction secret weapon: prefill attention compute drops to the level of 10 global GQA layers
After permanently cutting MiMo-V2.5 API prices, Xiaomi attributed the savings to hybrid attention and hierarchical KV caching: cache hit rate and capacity rose significantly, cache costs fell substantially, and combining cache overlaps reduced overhead further. Input and output costs fell by 60–80%. The explanation given is the 1:7 ratio of GA (global attention) to SWA (sliding-window attention) layers: during prefill, the SWA layers compute only local windows, so the 70-layer model's attention compute is comparable to that of a traditional model with far fewer layers. The price cut is presented as a structural cost saving, and the post advocates coordinating the underlying algorithms with the inference system to control costs rather than waging a price war.
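The "fewer layers" claim can be checked with rough arithmetic. The sketch below is a minimal illustration, not Xiaomi's method: it counts causal query-key pairs per layer and expresses the whole stack in "full global layer" equivalents. The window size (128), the prompt length (32,768 tokens), and the placement of one global layer every 8 layers are all assumptions made for illustration; none of them come from the post. The function names are hypothetical.

```python
from typing import Optional


def causal_pairs(seq_len: int, window: Optional[int] = None) -> int:
    """Count query-key pairs scored by causal attention over seq_len tokens.

    window=None means global attention; otherwise each query sees at most
    `window` keys (itself plus the preceding window - 1 tokens).
    """
    if window is None or window >= seq_len:
        return seq_len * (seq_len + 1) // 2
    # The first `window` queries see a growing prefix (1, 2, ..., window keys);
    # every later query sees exactly `window` keys.
    return window * (window + 1) // 2 + (seq_len - window) * window


def equivalent_global_layers(n_layers: int, period: int,
                             seq_len: int, window: int) -> float:
    """Prefill attention cost of a hybrid stack, in units of one global layer.

    ASSUMPTION: layers 0, period, 2*period, ... are global and the rest are
    sliding-window. period=8 corresponds to a GA:SWA ratio of 1:7.
    """
    n_global = len(range(0, n_layers, period))
    n_swa = n_layers - n_global
    full = causal_pairs(seq_len)
    local = causal_pairs(seq_len, window)
    return n_global + n_swa * local / full


if __name__ == "__main__":
    # ASSUMED values: 32,768-token prompt, 128-token window.
    cost = equivalent_global_layers(n_layers=70, period=8,
                                    seq_len=32_768, window=128)
    print(f"70 hybrid layers ~ {cost:.2f} global layers of attention compute")
```

Under these assumed values the 70-layer stack costs about as much attention compute as 9.5 global layers, which is in line with the headline's figure of roughly 10. The estimate covers attention score computation only; feed-forward and projection costs still scale with all 70 layers.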