DeepSeek quietly releases R1 Paper V2, revealing several key technological advancements.
The paper also addresses a question about the provenance of model-generated content: why do the models so often mention OpenAI and ChatGPT in their answers? DeepSeek explains that this is not by design but reflects the reality of the training data. Web corpora inherently contain a large amount of externally generated content, and when that content enters base-model training it has an indirect but measurable impact. The finding matters for understanding the behavioral characteristics and data dependencies of LLMs.
More notably, they outline their future capability development plans. The paper explicitly lists "structured output" and "tool usage" as the core development directions for R2. Structured output enables the model to organize information in a specific format, enhancing usability in practical applications; tool usage involves the model's ability to interact with external systems, which is crucial for expanding the practical application boundaries of reasoning models. These technological iterations reflect a shift from pure text generation toward multimodal, highly interactive capabilities.
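The paper names these directions without implementation detail, but the two ideas can be sketched. The snippet below is a minimal, hypothetical illustration (the names `TOOLS` and `dispatch_tool` are mine, not DeepSeek's): "structured output" means the model emits machine-parseable JSON instead of free prose, and "tool usage" means the host program can route that JSON to an external function.

```python
import json

# Hypothetical host-side tool registry; names are illustrative, not from the paper.
TOOLS = {
    "add": lambda args: args["a"] + args["b"],
}

def dispatch_tool(model_output: str):
    """Parse a structured (JSON) model reply and execute the named tool.

    Structured output: the reply is machine-parseable, not free-form prose.
    Tool usage: the tool name selects an external function to run.
    """
    call = json.loads(model_output)
    return TOOLS[call["tool"]](call["arguments"])

# A model constrained to structured output might emit something like:
reply = '{"tool": "add", "arguments": {"a": 2, "b": 3}}'
print(dispatch_tool(reply))  # prints 5
```

The point of the pairing is that structured output is what makes tool usage reliable: a host program can only dispatch a tool call it can parse.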
SignatureLiquidator
· 01-10 17:22
Haha, DeepSeek is quietly up to something again; I never even know when a new version is coming.
Wait, are they still blaming the training data? Well, given the state of web corpora... fine, that reason does hold water.
Structured output and tool usage sound pretty good, but I'm just worried it might be another progress bar on paper.
SatsStacking
· 01-10 09:19
Hmm... Blaming data pollution is quite straightforward, but this explanation is indeed solid.
Structured output + tool invocation — that's what players really want. Pure chat has no real competitiveness anymore.
DeepSeek's latest update still looks quite stable; it doesn't have that exaggerated tone.
Regarding training data, honestly, everyone has to face it. Instead of hiding it, being transparent is better.
If R2 truly improves its tool capabilities, that might be the real moment to pay attention.
Data set quality issues indeed trouble the entire industry. DeepSeek daring to speak openly is also a sign of sincerity.
This round of updates doesn't feel particularly surprising, but at least the logic is coherent and not fooling anyone.
LightningWallet
· 01-09 12:25
Haha, DeepSeek's latest update is quite something—structured output + tool integration. Looks like they're really holding back a big move.
It's true that training data shapes model behavior; all the AI-generated junk floating around online really can contaminate the corpus.
R2 is coming, right? Multimodality is the future.
The key is whether it can really be useful or if it's just a show on paper.
SchrodingerWallet
· 01-08 07:45
Another low-key yet substantial move from DeepSeek. Impressive as always, but couldn't they at least put out an announcement so we know?
The training data is full of ChatGPT's shadow... no wonder the answers sometimes sound like an echo of the competition.
Structured output + tool usage, it sounds like laying the groundwork for the next generation's practicality. Is R2 really coming?
Data pollution is unavoidable in the entire community. DeepSeek daring to speak out actually shows honesty.
R2's ambitions are quite big—jumping straight from text generation to multimodal interaction. It's a bit aggressive, but I like it.
This technical roadmap is quite clear—it's hinting at where their ceiling is.
Tool usage is really the key. Without it, even a powerful LLM is just a vase.
The V2 paper has been out for so long, yet no one is discussing it. The buzz is indeed disappointing.
MemeTokenGenius
· 01-08 07:41
Haha, deepseek is up to its tricks again. Structured output and tool usage are indeed top-notch.
It's quite interesting that the training data is full of ChatGPT traces. To put it simply, it's a matter of internet DNA.
Will R2 take off directly? I'm a bit looking forward to it.
GasGuzzler
· 01-08 07:41
Data toxicity is indeed unavoidable; the training set is full of ChatGPT traces, making it hard to say it has no impact at all.
However, the combination of structured output and tool invocation is the key; I feel this is the real breakthrough for practical application.
DeepSeek is secretly working on this set of techniques, staying incredibly low-profile... only publishing the paper after it's done.
If the tool capabilities are truly well-developed, only then can they genuinely threaten OpenAI's ecosystem.
ForkPrince
· 01-08 07:29
Hmm... Finally someone dares to talk properly about data pollution. It's not a bug, it's a feature haha
Structured output and tool invocation are reliable directions. If R2 can really achieve this, it would be amazing
DeepSeek's low-profile approach is really impressive. Every time they quietly release papers, much better than those who shout every day
The training data is full of ChatGPT stuff, no wonder the model keeps mentioning them. No amount of data cleaning will scrub that out completely.
If tool usage capabilities improve, reasoning models will truly have a place to shine. I'm already tired of pure chat
wrekt_but_learning
· 01-08 07:21
Data determines everything, no wonder they keep mentioning OpenAI... So DeepSeek is hinting that there might be an issue with the training set?
---
Structured output + tool invocation, this is the key to unlocking practicality. The era of pure text generation is really coming to an end.
---
Wait, they talk about "indirect but measurable impact"... Isn't this a covert admission that models can be biased by training data?
---
The R2 roadmap is interesting. It seems DeepSeek is forging its own path, not following the pure reasoning trend.
---
The training data is full of external content. How can this ensure independence...