DeepSeek Multimodal Technology Report: Teaching Models to "Point with Their Fingers"; Maze Navigation Surpasses GPT-5.4 by Nearly 17 Percentage Points

CryptoWorld News reports that DeepSeek has released a multimodal reasoning technical report titled "Thinking with Visual Primitives," which proposes a new reasoning paradigm: while thinking, the model points at objects much as a person would with a finger, inserting coordinates directly into its reasoning chain to lock onto each visual object involved. The project is open-sourced on GitHub under the MIT license.

A core bottleneck of current multimodal models is the "referential gap": the model can see the image clearly, but during reasoning it can only describe visual objects in natural language, which makes precise localization difficult in complex scenes. DeepSeek addresses this by making bounding boxes and point coordinates the minimal units of reasoning. The model is built on the v4-flash architecture with aggressive visual token compression.

Test results show strong performance across multiple benchmarks, particularly on topological reasoning and maze navigation tasks, where the model reportedly surpasses GPT-5.4 by nearly 17 percentage points.
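The report itself is not excerpted here, but the paradigm is easy to picture as a reasoning trace that interleaves free text with coordinate tokens. The sketch below is a minimal illustration under that assumption; the `Point`, `Box`, and tag formats are hypothetical and do not reflect DeepSeek's actual trace schema or API.

```python
from dataclasses import dataclass
from typing import List, Union

# Hypothetical illustration of a reasoning chain that interleaves text with
# visual primitives (points and bounding boxes). All names and the serialized
# tag format are assumptions made for this sketch, not DeepSeek's real format.

@dataclass
class Point:
    x: float  # normalized image coordinates in [0, 1]
    y: float

@dataclass
class Box:
    x0: float
    y0: float
    x1: float
    y1: float

# A reasoning step is either plain text or a grounded reference to a point/region.
Step = Union[str, Point, Box]

def render_chain(steps: List[Step]) -> str:
    """Serialize an interleaved chain so coordinates sit inline with the text:
    each referenced object is pinned to explicit coordinates rather than being
    described only in natural language."""
    parts = []
    for step in steps:
        if isinstance(step, str):
            parts.append(step)
        elif isinstance(step, Point):
            parts.append(f"<point x={step.x:.3f} y={step.y:.3f}>")
        elif isinstance(step, Box):
            parts.append(f"<box {step.x0:.3f},{step.y0:.3f},{step.x1:.3f},{step.y1:.3f}>")
    return " ".join(parts)

if __name__ == "__main__":
    # Toy maze-navigation-style trace: the reasoning "points" at the entrance,
    # a blocking wall, and the exit while planning a path.
    chain = [
        "The entrance is here",
        Point(0.12, 0.90),
        "; the corridor is blocked by this wall",
        Box(0.40, 0.35, 0.55, 0.80),
        ", so the path must detour right before reaching the exit",
        Point(0.88, 0.10),
        ".",
    ]
    print(render_chain(chain))
```

The design point this is meant to convey is that localization becomes part of the reasoning token stream itself, so later steps can refer back to an exact region instead of an ambiguous verbal description.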
