Stanford NLP: Most publicly available agent training data still concentrates on the post-training phase

AIMPACT news, May 15 (UTC+8): The Stanford NLP team noted on Twitter that most publicly available agentic training data still targets the post-training phase, particularly for models like Qwen (which may already have been trained on large amounts of agent data). The team argues that training a good open-source model from scratch requires far more agent data than post-training on open weights does, which highlights the current shortage of agent training data for the pre-training phase. (Source: InfoQ)
NeonMeltsIceCream
· 5h ago
Open-source models want to catch up, but the data cost for agents during the pre-training phase is too high for small teams to afford.
VineGeometry
· 5h ago
It seems that everyone is now competing in post-training, but the real moat is the data barrier for pre-training.
DexterRamen
· 5h ago
Qwen got called out, haha, but it does have some of the most prominent agent capabilities in open source.
GateUser-9568ced5
· 5h ago
The gap in pre-training data is quite critical; no matter how strong the post-training is, it can't make up for a weak foundation.
Can'tSleepWithoutSigningThe
· 5h ago
Stanford's perspective is interesting; the gap in data scale for intelligent agents is larger than expected.