No weight changes, API only: Poetiq's "plugin" lifts Kimi by 29.9 percentage points, and lightweight Gemini overtakes Claude Opus

AIMPACT News, May 15 (UTC+8): According to Beating Monitoring, Poetiq, a six-person startup founded by former Google and DeepMind researchers Shumeet Baluja and Ian Fischer, has announced that its Meta-System set a new record on the programming benchmark LiveCodeBench Pro. The system is a plugin (a "harness") that works purely through model APIs and recursively improves itself by automatically extracting experience from tasks. According to Poetiq's own tests, it raises the coding performance of mainstream large models without touching model weights or fine-tuning.
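The report does not describe how Poetiq's harness works internally. As a rough illustration of the general idea only (an API-level wrapper that retries a task and feeds lessons from failed attempts back into the prompt), here is a minimal sketch. Every name in it (`Model`, `run_tests`, `harness`, `stub_model`) is hypothetical, the "model" is a local stub standing in for any hosted API, and nothing here should be read as Poetiq's actual method.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical model interface: prompt text in, Python source code out.
# In practice this would wrap a call to some hosted model API.
Model = Callable[[str], str]


def run_tests(source: str, tests: List[Tuple[int, int]]) -> List[str]:
    """Load the candidate's `solve` function and return failure messages."""
    namespace: dict = {}
    try:
        exec(source, namespace)  # acceptable for a sketch; sandbox real code
        fn = namespace["solve"]
    except Exception as exc:
        return [f"candidate did not load: {exc!r}"]
    failures = []
    for arg, expected in tests:
        try:
            got = fn(arg)
        except Exception as exc:
            failures.append(f"solve({arg}) raised {exc!r}")
            continue
        if got != expected:
            failures.append(f"solve({arg}) returned {got}, expected {expected}")
    return failures


def harness(task: str, model: Model, tests: List[Tuple[int, int]],
            max_rounds: int = 3) -> Tuple[Optional[str], int]:
    """Generate, test, and retry, carrying lessons between rounds."""
    lessons: List[str] = []  # the accumulated "task experience"
    for round_no in range(1, max_rounds + 1):
        prompt = task
        if lessons:
            prompt += "\nLessons from earlier attempts:\n" + "\n".join(lessons)
        candidate = model(prompt)
        failures = run_tests(candidate, tests)
        if not failures:
            return candidate, round_no
        lessons.extend(failures)
    return None, max_rounds


def stub_model(prompt: str) -> str:
    """Stand-in for a real model: wrong at first, right once it sees feedback."""
    if "Lessons" in prompt:
        return "def solve(n):\n    return n * n\n"
    return "def solve(n):\n    return n + n\n"


if __name__ == "__main__":
    source, rounds = harness("Write solve(n) returning n squared.",
                             stub_model, [(2, 4), (3, 9)])
    print(f"solved in {rounds} round(s)")
```

The point of the sketch is that the model's weights are never touched: all of the improvement comes from what the wrapper puts into the next prompt, which is why the same wrapper could in principle sit in front of different models.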
The results suggest that this decoupled plugin approach gives weaker models a large lift. With the Poetiq system, Kimi K2.6's accuracy rose from 50.0% to 79.9%, an absolute gain of 29.9 percentage points. The lightweight Gemini 3.0 Flash gained 10 points which, according to Poetiq, puts it ahead of not only its larger sibling Gemini 3.1 Pro but also the "bigger and more expensive" Claude Opus 4.7 and GPT 5.2 High.
At the top end, GPT 5.5 High rose from 89.6% to 93.9% with the plugin, while the base Gemini 3.1 Pro scored 90.9% with it, surpassing Gemini 3 Deep Think (88.8%), Google's most powerful reasoning model, whose API has not yet been opened.
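For the two models whose before and after scores are both reported (Kimi K2.6 and GPT 5.5 High), the gains can be checked with a few lines of arithmetic. The sketch below also computes a relative error reduction, meaning the share of baseline failures that disappear with the plugin. That second figure is derived here for comparison and is not a number reported by Poetiq; the class and field names are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    model: str
    baseline: float      # accuracy without the plugin, in percent
    with_plugin: float   # accuracy with the plugin, in percent

    @property
    def point_gain(self) -> float:
        """Absolute gain in percentage points."""
        return round(self.with_plugin - self.baseline, 1)

    @property
    def error_reduction(self) -> float:
        """Percent of the baseline's failures removed (derived, not reported)."""
        remaining_headroom = 100 - self.baseline
        return round(100 * (self.with_plugin - self.baseline) / remaining_headroom, 1)


# Scores as quoted in the article.
RESULTS = [
    Result("Kimi K2.6", 50.0, 79.9),
    Result("GPT 5.5 High", 89.6, 93.9),
]

if __name__ == "__main__":
    for r in RESULTS:
        print(f"{r.model}: +{r.point_gain} points, "
              f"{r.error_reduction}% of baseline errors removed")
```

Seen this way, the 4.3-point gain for GPT 5.5 High is smaller in absolute terms than Kimi's 29.9 points, but it still removes roughly four in ten of the model's remaining failures, because there is far less headroom above an 89.6% baseline.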
The Poetiq team said that traditional fine-tuning ties its gains to a single model, whereas their plug-and-play system lets enterprises avoid the high cost of fine-tuning and deploying full-size models just to gain reasoning capability.
(Source: BlockBeats)
Comment
CandleChaser
· 3h ago
Does a weak model plus a strong external plugin mean that spring is coming for small models?
GateUser-b74aba1c
· 9h ago
A six-person team breaks through the large model inference bottleneck; API layer innovation has been underestimated for too long.
L2ArbitrageTrader
· 10h ago
Poetiq's six members achieved this result; the team composition is worth studying.
CyberBridgeDeepPerspective
· 10h ago
GPT 5.5 reached 93.9% and Gemini 3.1 Pro scored 90.9% with the plugin; this mod beats the stock top-tier versions.
GateUser-cbb8cdf5
· 10h ago
Companies will be ecstatic: no more spending money on fine-tuning.
BribeCoffee
· 10h ago
Recursive self-improvement + experience extraction, the Meta-System architecture is quite interesting
VineGeometry
· 10h ago
Pure API cheats that don't touch weights can boost Kimi from 50% to 79%, and this approach is much smarter than fine-tuning.