Stanford NLP: Most publicly available agent training data still concentrates on the post-training phase

AIMPACT news, May 15 (UTC+8): The Stanford NLP team pointed out on Twitter that most publicly available agentic training data still targets the post-training phase, especially post-training of models like Qwen, which may already have been trained on large amounts of agent data. The team believes that training a good open-source model from scratch requires far more agent data than post-training from open weights does, which highlights the current shortage of agent training data for the pre-training phase. (Source: InfoQ)
Comment
SushiAndSlugs
· 4h ago
Does this count as an easy win for Qwen?
FlamingoFacingJudgment
· 5h ago
For open-source models trying to catch up with closed-source ones, data barriers are harder to break through than compute barriers.
ColdLightNftCabinet
· 8h ago
Open source communities need to think about how to crowdfund pre-training data.
GateUser-a365d15f
· 8h ago
It feels like we're back to the old story that data is power.
GateUser-46033407
· 8h ago
The amount of data needed to train from scratch sounds hopeless.
PerpNightshift
· 9h ago
This research has handed a weapon to the closed-source camp.
GateUser-46c777d0
· 9h ago
Stanford's observation is very accurate; agent capabilities are indeed built up through training in later stages.
GlassDomeRoaming
· 9h ago
There is always a limit to post-training optimization, and the shortcomings of pre-training will eventually be exposed.
GateUser-e84f640c
· 9h ago
This conclusion is quite discouraging for small and medium teams; the data threshold is getting higher and higher.
ExitLiquidityStan
· 9h ago
I hope someone can open-source some high-quality agent data for pre-training.