American Companies Turn to Chinese AI Models, With Coinbase Leading the Way on GLM and Kimi

U.S. tech companies are quietly integrating Chinese open-source AI models into their production infrastructure. As the cost of top-tier American model services continues to rise, companies like Coinbase are turning to Chinese open-source models as their default option, significantly reducing AI expenses without suppressing usage.

On Friday evening, Coinbase CEO Brian Armstrong posted on X that the company has set Zhipu’s newly released GLM 5.2 and Beijing Moonshot AI’s Kimi 2.7 as the default models for engineers via its internal LLM gateway. Armstrong stated that with routing optimization and caching improvements, Coinbase has cut AI spending by "nearly half," while token usage continues to grow exponentially.

Cost Advantage of Chinese Open-Source Models Takes Center Stage

In his post, Armstrong noted that 91% of engineers had never hit the original usage caps. So rather than lowering the caps or adding spending alerts, Coinbase opted to "switch to cheaper default models."

GLM 5.2 and Kimi 2.7 are both open-weight models. Armstrong said they are deployed for routine tasks, while engineers can still use frontier models for tasks that require complex planning. His logic: using top-tier models for execution is often "overkill."

For code review, a multi-model parallel strategy is used, allowing different models to cross-check each other's outputs to maintain quality standards.
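Armstrong's post does not describe how the cross-checking is implemented. As a rough illustration only, the idea could be sketched as follows: several reviewers examine the same diff in parallel, and a finding counts as confirmed only when enough of them agree. The `cross_check` function, the `quorum` parameter, and the reviewer callables are all hypothetical; in practice each reviewer would wrap a call to a different model.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Set, Tuple

# Hypothetical interface: a reviewer takes a diff and returns a set of issue labels.
Reviewer = Callable[[str], Set[str]]


def cross_check(
    diff: str, reviewers: Dict[str, Reviewer], quorum: int = 2
) -> Tuple[Dict[str, List[str]], Dict[str, List[str]]]:
    """Run all reviewers in parallel and split findings by agreement.

    Returns (confirmed, disputed): issues flagged by at least `quorum`
    reviewers, and issues flagged by fewer than `quorum` reviewers.
    """
    names = list(reviewers)
    with ThreadPoolExecutor(max_workers=max(1, len(names))) as pool:
        # pool.map preserves input order, so results line up with names.
        results = list(pool.map(lambda name: reviewers[name](diff), names))

    votes: Dict[str, List[str]] = {}
    for name, issues in zip(names, results):
        for issue in issues:
            votes.setdefault(issue, []).append(name)

    confirmed = {i: v for i, v in votes.items() if len(v) >= quorum}
    disputed = {i: v for i, v in votes.items() if len(v) < quorum}
    return confirmed, disputed
```

Under this assumed design, disputed findings could be escalated to a frontier model or a human reviewer instead of being dropped.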

Three-Layer Infrastructure Restructuring Drives Cost Reduction

Armstrong outlined three core measures.

First, intelligent routing: In a custom scheduling framework, the system preprocesses prompts and combines cache hit rates with model pricing to automatically distribute tasks to the most suitable and cost-effective model. He noted that the ultimate goal is to have AI, rather than humans, handle model selection.
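The post gives no details about the routing framework itself. A minimal sketch of the general idea, under stated assumptions, might look like the code below: estimate the expected cost of a prompt on each candidate model from its price and expected cache hit rate, then pick the cheapest model that is capable enough for the task. The `ModelOption` fields, the `tier` scale, and the `CACHED_TOKEN_DISCOUNT` value are all hypothetical and are not Coinbase's figures.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ModelOption:
    name: str
    price_per_mtok: float  # hypothetical input price, USD per million tokens
    cache_hit_rate: float  # hypothetical share of prompt tokens served from cache
    tier: int              # hypothetical capability tier: 1 = routine, 2 = complex planning


# Assumption: cached prompt tokens are billed at a 90% discount.
CACHED_TOKEN_DISCOUNT = 0.9


def expected_cost(model: ModelOption, prompt_tokens: int) -> float:
    """Expected input cost in USD, given the model's price and cache hit rate."""
    cached = prompt_tokens * model.cache_hit_rate
    fresh = prompt_tokens - cached
    billable = fresh + cached * (1 - CACHED_TOKEN_DISCOUNT)
    return billable * model.price_per_mtok / 1_000_000


def route(models: List[ModelOption], prompt_tokens: int, required_tier: int) -> ModelOption:
    """Pick the cheapest model whose tier meets the task's requirement."""
    eligible = [m for m in models if m.tier >= required_tier]
    if not eligible:
        raise ValueError("no model meets the required tier")
    return min(eligible, key=lambda m: expected_cost(m, prompt_tokens))
```

In this sketch the required tier is passed in by the caller. Having AI handle model selection, the goal Armstrong describes, would mean inferring that value from the prompt itself.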

Second, aggressive caching: Coinbase requires all requests to be cache-aware, reusing existing caches as much as possible. For example, with LibreChat, after properly implementing caching, the cache hit rate jumped from 5% to 60%.
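What "cache-aware" means in practice is not spelled out in the post. One common reading, assumed here, is prefix-based prompt caching: a request can only reuse cached work for the leading part of the prompt that exactly matches an earlier request, so stable content such as system instructions should come first. The toy simulation below illustrates that effect; `PrefixCache` and `BLOCK` are hypothetical and do not model any specific provider's cache.

```python
import hashlib
from typing import List, Set

BLOCK = 4  # hypothetical cache block size, in tokens


class PrefixCache:
    """Toy prefix cache that tracks how many prompt tokens hit the cache."""

    def __init__(self) -> None:
        self._seen: Set[str] = set()
        self.hit_tokens = 0
        self.total_tokens = 0

    def submit(self, tokens: List[str]) -> None:
        """Record one request, counting leading blocks that were already cached."""
        for end in range(BLOCK, len(tokens) + 1, BLOCK):
            # Each key covers the whole prefix up to `end`, so a block can
            # only hit if every token before it also matched.
            key = hashlib.sha256(" ".join(tokens[:end]).encode()).hexdigest()
            if key in self._seen:
                self.hit_tokens += BLOCK
            else:
                self._seen.add(key)
        self.total_tokens += len(tokens)

    @property
    def hit_rate(self) -> float:
        return self.hit_tokens / self.total_tokens if self.total_tokens else 0.0


system = ["s1", "s2", "s3", "s4", "s5", "s6", "s7", "s8"]  # stable instructions
user_1 = ["a", "b", "c", "d"]
user_2 = ["e", "f", "g", "h"]

stable_first = PrefixCache()
stable_first.submit(system + user_1)
stable_first.submit(system + user_2)  # the 8 system tokens are reused

volatile_first = PrefixCache()
volatile_first.submit(user_1 + system)
volatile_first.submit(user_2 + system)  # the prefix differs, so nothing is reused
```

With the stable instructions placed first, the second request reuses 8 of the 24 tokens submitted in total. With the changing user text placed first, no block matches and the hit rate stays at zero.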

Third, context streamlining: Armstrong recommends starting new sessions when switching tasks, narrowing the scope of file context, and disconnecting unused tools. He emphasized that the goal is not to reduce total token usage but to reduce "wasted tokens."

Efficiency First, Not Usage Suppression

Armstrong characterized this cost compression as a prerequisite for scaling AI adoption, not a limitation. He said engineers are still free to use any number of tokens and any model, but the company has made usage data visible and tied usage to business impact: "spend more, we expect more impact."

He did not disclose absolute spending figures. But structurally, cutting costs by nearly half while usage grows exponentially indicates that Coinbase has partially decoupled consumption from cost.

Armstrong concluded that this approach is universal and can be replicated by any enterprise to achieve sustainable AI scaling without making cost a ceiling.
