Ramp Labs proposes a new multi-agent memory sharing solution, reducing token consumption by up to 65%
ME News, April 11 (UTC+8). AI infrastructure company Ramp Labs has released its research findings, “Latent Briefing.” The proposed method enables efficient memory sharing among multi-agent systems by directly compressing large-model KV caches, significantly reducing token consumption without sacrificing accuracy.
In mainstream multi-agent architectures, an Orchestrator decomposes tasks and repeatedly calls Worker models; as the reasoning chain grows longer, token usage grows exponentially.
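A minimal sketch of why costs compound in such a loop (the function name and numbers are hypothetical, not from the report): if every Worker call re-sends the full accumulated context as its prompt, total prompt tokens grow much faster than the output itself.

```python
def tokens_spent(step_outputs):
    """Total prompt tokens when each worker call receives all prior output.

    step_outputs: tokens of output produced at each step of the chain.
    """
    total, context = 0, 0
    for out in step_outputs:
        total += context   # the whole prior context is re-sent as the prompt
        context += out     # this step's output is appended to the context
    return total

# Five steps, each producing 1,000 tokens of output:
print(tokens_spent([1000] * 5))  # 10000 prompt tokens for 5000 output tokens
```

Compressing the shared context, as Latent Briefing proposes, attacks exactly this re-sent portion.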
The core idea of Latent Briefing is to use attention mechanisms to identify the truly critical parts in the context, then discard redundant information directly at the representation layer—rather than relying on slow LLM summaries or RAG retrievals with unstable performance.
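The report does not publish its algorithm here, but the idea can be sketched as attention-guided pruning of a cached context: score each cached position against a query, keep only the highest-weight entries, and drop the rest at the representation layer. All names, shapes, and the `keep_ratio` parameter below are illustrative assumptions, not the actual Latent Briefing implementation.

```python
import numpy as np

def compress_kv_cache(keys, values, query, keep_ratio=0.35):
    """Keep only the cached key/value entries with the highest
    attention weight w.r.t. the query; discard the rest.

    keys, values: (n_positions, head_dim) cached representations.
    query: (head_dim,) probe vector.
    """
    d = keys.shape[-1]
    # Scaled dot-product attention scores of the query against each key.
    scores = keys @ query / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # Retain the top keep_ratio fraction of positions (at least one),
    # restoring original token order after selection.
    k = max(1, int(len(keys) * keep_ratio))
    top = np.sort(np.argsort(weights)[-k:])
    return keys[top], values[top]

# Toy usage: 10 cached positions, 4-dim heads, keep roughly a third.
rng = np.random.default_rng(0)
K = rng.normal(size=(10, 4))
V = rng.normal(size=(10, 4))
q = rng.normal(size=(4,))
K_small, V_small = compress_kv_cache(K, V, q, keep_ratio=0.35)
print(K_small.shape)  # (3, 4)
```

Because selection happens on the representations themselves, no extra LLM summarization call is needed, which is the contrast with summary- or RAG-based context reduction that the report draws.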
On the LongBench v2 benchmark, the method performed strongly. Worker-model token consumption fell by 65%, and median token savings on medium-length documents (32k to 100k tokens) reached 49%. Overall accuracy improved by about 3 percentage points over the baseline, while each compression added only about 1.7 seconds of latency, roughly a 20× speedup over the original algorithm.
The experiments used Claude Sonnet 4 as the Orchestrator and Qwen3-14B as the Worker model, covering a variety of document scenarios including academic papers, legal documents, novels, and government reports.
The study also found that the optimal compression threshold varies depending on task difficulty and document length: difficult problems are better suited to more aggressive compression to filter speculative reasoning noise, while long documents are more suitable for lighter compression to preserve dispersed key information. (Source: BlockBeats)