DeepSeek launches an image understanding mode with visual chain-of-thought reasoning, built on its withdrawn visual primitives framework

According to Beating Monitoring, DeepSeek has officially launched Vision Mode (image understanding) in its web and app clients. The mode is placed above the chat input box, alongside Quick Mode and Expert Mode. The new capability goes beyond optical character recognition (OCR): it is designed for in-depth scene analysis, spatial reasoning, and converting UI screenshots directly into structured HTML code. For difficult geometric derivations or complex chart analysis, the system automatically activates a deep thinking model and provides a complete chain of reasoning.

Vision Mode is built on the research framework “Thinking with Visual Primitives” published by the DeepSeek team. The paper, jointly authored by multimodal researcher Xiaokang Chen together with Peking University and Tsinghua University, argues that current vision-language models suffer from a “Reference Gap” in fine-grained localization and spatial reasoning: vague natural language struggles to describe precise visual coordinates. To address this, the research team treats coordinate points and bounding boxes as the minimal units of thought, inserting these spatial primitives directly into the model’s chain-of-thought (CoT) reasoning so that spatial referencing happens alongside the thinking process itself.
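To make the idea of spatial primitives inside a reasoning chain more concrete, here is a minimal Python sketch. It is an illustration only and not DeepSeek's implementation: the `<point>` and `<box>` tag syntax, the pixel-coordinate convention, and the names `Point`, `Box`, `render_trace`, and `extract_primitives` are all assumptions made for this example, since the article does not describe the paper's actual format.

```python
import re
from dataclasses import dataclass
from typing import List, Union


@dataclass(frozen=True)
class Point:
    """A single image coordinate (hypothetical pixel units)."""
    x: int
    y: int

    def render(self) -> str:
        # Hypothetical tag syntax, not taken from the paper.
        return f"<point>{self.x},{self.y}</point>"


@dataclass(frozen=True)
class Box:
    """An axis-aligned bounding box given by two corners."""
    x1: int
    y1: int
    x2: int
    y2: int

    def render(self) -> str:
        # Hypothetical tag syntax, not taken from the paper.
        return f"<box>{self.x1},{self.y1},{self.x2},{self.y2}</box>"


Primitive = Union[Point, Box]
Segment = Union[str, Point, Box]

# Matches the hypothetical tags emitted by render() above.
_PRIMITIVE_RE = re.compile(r"<(point|box)>(\d+(?:,\d+)*)</\1>")


def render_trace(segments: List[Segment]) -> str:
    """Interleave plain-text reasoning with spatial primitives."""
    return "".join(s if isinstance(s, str) else s.render() for s in segments)


def extract_primitives(trace: str) -> List[Primitive]:
    """Recover the primitives referenced in a rendered trace, in order."""
    found: List[Primitive] = []
    for kind, body in _PRIMITIVE_RE.findall(trace):
        nums = [int(n) for n in body.split(",")]
        if kind == "point" and len(nums) == 2:
            found.append(Point(*nums))
        elif kind == "box" and len(nums) == 4:
            found.append(Box(*nums))
    return found


trace = render_trace([
    "The cup ", Box(120, 80, 200, 190),
    " is left of the handle at ", Point(260, 130), ".",
])
print(trace)
print(extract_primitives(trace))
```

The point of the sketch is that each reasoning step can name an exact location rather than relying on a phrase such as "the cup on the left", which is the kind of ambiguity the "Reference Gap" refers to.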

The academic paper and open-source project underlying the visual capability were briefly released on April 30, but DeepSeek withdrew them without notice on May 1, triggering widespread industry speculation about excessive disclosure of technical details and about the model’s subsequent optimization. At present, the officially launched Vision Mode accepts only image input; it does not yet support other modalities such as video or audio, and the model has no image generation capability.
