DGrid AI's latest research tackles a core flaw in decentralized AI scoring
DGrid AI introduces a new Proof of Quality framework designed to evaluate AI outputs and improve reward distribution across decentralized networks.
Decentralized AI networks have a payment problem that researchers have been quietly working around for years, and a recent paper from DGrid AI puts the issue directly on the table. The quality scoring systems powering node rewards have largely depended on having the correct answer on hand to compare against. In production, that answer rarely exists.
The paper, the fourth in DGrid’s ongoing research series on Proof of Quality (PoQ), proposes a trained alternative and publishes the numbers behind it. PoQ uses small evaluator models to score each output’s quality, and those scores drive the rewards. Cheap, and it scales.
DGrid built this up brick by brick: a cost-aware version that bakes latency into the payout math, an adversarial-robustness layer that holds up when scorers lie or cut corners, and a framework that splits “quality” into parts you can inspect. Solid engineering. And every layer kept slamming into the same wall.
How the scoring problem developed
The basic structure of a decentralized inference network creates a measurement challenge. Independent nodes run language models and respond to user queries. Those responses need to be scored because scores determine pay. Cryptographic verification of every computation would be technically airtight but prohibitively expensive at scale, so the practical path has been automated quality evaluation using smaller models.
DGrid’s earlier work built out that approach incrementally, adding latency-adjusted payouts, defenses against manipulative scorers, and a more granular breakdown of what “quality” actually means in a scoring context. What it could not fully resolve was the evaluation signal itself.
The strongest signal the team had was semantic similarity: compare the model’s output to a known correct answer and measure the distance between them in embedding space. That works in benchmark environments where reference answers exist. It does not work in a live network where users are asking open-ended questions and no ground truth is waiting in a database.
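As a rough illustration of the reference-based signal described above, the sketch below computes cosine similarity between two embedding vectors in plain Python. The function names are hypothetical and not taken from the DGrid paper; a real system would get its vectors from a sentence-embedding model, which is not shown here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Convention chosen for this sketch: a zero vector matches nothing.
        return 0.0
    return dot / (norm_a * norm_b)


def reference_based_score(output_vec, reference_vec):
    """Score an output embedding against a reference embedding.

    Passing None for reference_vec models the live-network case,
    where no ground-truth answer exists to compare against.
    """
    if reference_vec is None:
        raise ValueError("no reference answer available")
    return cosine_similarity(output_vec, reference_vec)
```

The second function makes the limitation concrete: without a reference vector there is nothing to measure distance to, so this kind of scorer cannot produce a score at all.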
Off-the-shelf alternatives tested worse. An NLI cross-encoder, a model class designed to assess logical entailment between sentences, returned a Pearson correlation of −0.363 when used to rate answer quality without a reference answer. A negative correlation means the model's scores tended to fall as answer quality rose, so it was more likely to favor poor responses over good ones. That is not a usable evaluation tool.
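For readers less familiar with the statistic, here is a minimal sketch of how a Pearson correlation between evaluator scores and quality labels can be computed. The helper name is an assumption made for illustration; none of this is from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)
```

The result always lies between −1 and 1. An evaluator whose scores rise with true quality yields a positive value, while one that ranks responses roughly backwards yields a negative value, which is the situation the NLI cross-encoder result describes.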
What the paper proposes
Instead of adapting existing models, the researchers trained three judges specifically for reference-free quality scoring. Each takes a question and a response as input and outputs a score from 0 to 10, with no correct answer provided.
The three models differ primarily in size and speed.
Training followed a two-stage process. The models were first pre-trained on UltraFeedback, a public dataset of GPT-4-graded responses, before fine-tuning on the network’s own task distribution. The intent was to give the judges a broad baseline understanding of quality before narrowing their focus to the specific scoring context.
The core result
On a held-out test set of 300 examples, the DeBERTa judge achieved a Pearson correlation of 0.747 against the ground-truth proxy — without access to any reference answer. The reference-based evaluators from the prior framework, which did have access to correct answers, reached a maximum of 0.647.
The gap has a straightforward explanation. The older evaluators were similarity metrics measuring cosine distance to a reference embedding. The new judges were optimized end-to-end for the scoring task itself. The performance difference reflects that distinction more than any architectural breakthrough.
One caveat the authors include: the ground truth used here is itself a proxy — token-level word overlap rather than human judgment. The judges correlate well with this metric, but whether word overlap reliably reflects what a human would consider a quality response is a separate, unresolved question.
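The article does not say exactly which overlap metric the paper uses. As one plausible reading, the sketch below implements a token-level F1 score, a common word-overlap measure; treat the choice of F1, the whitespace tokenization, and the function name as assumptions.

```python
from collections import Counter


def token_f1(candidate, reference):
    """Token-level F1 between two strings (lowercased, whitespace-split)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    # Multiset intersection counts each shared token at most as often
    # as it appears in both strings.
    shared = sum((Counter(cand) & Counter(ref)).values())
    if shared == 0:
        return 0.0
    precision = shared / len(cand)
    recall = shared / len(ref)
    return 2 * precision * recall / (precision + recall)
```

A metric like this rewards shared words, not shared meaning. A faithful paraphrase that uses different vocabulary scores low, which is one reason word overlap can be a weak proxy for human judgment.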
Two deployment-oriented features accompany the judges. A cascading pipeline routes queries through the lightweight model first and escalates to heavier models only when scores are ambiguous, reducing evaluation costs by up to 72.7% at the most aggressive threshold setting, though correlation drops to around 0.51 in that configuration. An online calibration mechanism, running without manual tuning, consistently identifies semantic quality as the dominant signal and adjusts weights accordingly, assigning it 4.7 times its starting weight over time.
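The cascading idea can be sketched in a few lines. Everything specific here is an assumption for illustration: the threshold values, the function names, and the stub judges (which return fixed scores instead of running a model) are hypothetical and do not come from the paper.

```python
def cascade_score(question, response, judges, low=3.0, high=7.0):
    """Score with the cheapest judge first, escalating on ambiguous scores.

    judges: list of (name, score_fn) pairs ordered cheapest first, where
    score_fn(question, response) returns a score from 0 to 10.
    A score at or below `low`, or at or above `high`, is treated as
    confident and returned immediately. Anything in between escalates.
    Returns (score, name_of_judge_that_decided).
    """
    if not judges:
        raise ValueError("at least one judge is required")
    for name, score_fn in judges[:-1]:
        score = score_fn(question, response)
        if score <= low or score >= high:
            return score, name
    # The last (heaviest) judge always gives the final answer.
    name, score_fn = judges[-1]
    return score_fn(question, response), name


# Placeholder judges standing in for trained models (hypothetical).
def small_judge(question, response):
    return 8.5 if len(response.split()) >= 5 else 5.0


def large_judge(question, response):
    return 9.0 if len(response.split()) >= 5 else 2.0


JUDGES = [("small", small_judge), ("large", large_judge)]
```

In this sketch, narrowing the band between `low` and `high` sends fewer queries to the heavier judge, which lowers cost but leans harder on the weaker model. That is the same kind of trade-off the reported cost and correlation figures reflect.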
Where the system still struggles
The judges perform unevenly across task types. On question answering, correlation reaches 0.830. On summarization, it falls to 0.199. The paper attributes this not to a failure in the judges themselves but to the evaluation metric used during training: raw word overlap is a poor measure of summarization quality, so models trained against it learn to track a weak signal. The authors describe this as the primary open problem rather than a known limitation being managed quietly.
That framing is consistent with how the paper presents its results overall — methodically, with the failure cases as clearly stated as the improvements. Four papers into this research thread, the work reads less like a product announcement and more like a team incrementally closing gaps in something they intend to actually deploy.
Disclosure: This content is provided by a third party. Neither crypto.news nor the author of this article endorses any product mentioned on this page. Users should conduct their own research before taking any action related to the company.