AIMPACT news, May 15 (UTC+8): The Stanford NLP team noted on Twitter that most publicly available agentic training data still targets the post-training phase, particularly for models such as Qwen, which may already have been trained on large amounts of agent data. The team believes that training a good open-source model from scratch requires far more agent data than post-training from open weights does, which highlights the current shortage of agent training data for the pre-training phase. (Source: InfoQ)

SushiAndSlugs

· 4h ago

Does this count as an easy win for Qwen?

FlamingoFacingJudgment

· 5h ago

Open-source models want to catch up with closed-source ones, but data barriers are harder to break through than compute barriers.

ColdLightNftCabinet

· 8h ago

Open source communities need to think about how to crowdfund pre-training data.

GateUser-a365d15f

· 8h ago

It feels like we're back to the old story that data is power.

GateUser-46033407

· 8h ago

The amount of data needed to train from zero sounds hopeless.

PerpNightshift

· 9h ago

This research has handed a weapon to the closed-source camp.

GateUser-46c777d0

· 9h ago

Stanford's observation is very accurate; agent capabilities are indeed built up through training in later stages.

GlassDomeRoaming

· 9h ago

There is always a limit to post-training optimization, and the shortcomings of pre-training will eventually be exposed.

GateUser-e84f640c

· 9h ago

This conclusion is quite discouraging for small and medium teams; the data threshold is getting higher and higher.

ExitLiquidityStan

· 9h ago

I hope someone can open-source some high-quality agent data for pre-training.
