DeepSeek Multimodal Technology Report: Teaching Models to "Point with Their Fingers"; Maze Navigation Surpasses GPT-5.4 by Nearly 17 Percentage Points

CryptoWorld News reports that DeepSeek has released a multimodal reasoning technical report titled "Thinking with Visual Primitives," which proposes a new reasoning paradigm: while thinking, the model points at objects much as a person would with a finger, inserting coordinates directly into its reasoning chain to lock onto each visual object involved. The project is open-sourced on GitHub under the MIT license.

A core bottleneck of current multimodal models is the "referential gap": the model can see the image clearly, but during reasoning it can only describe visual objects in natural language, which makes precise localization difficult in complex scenes. DeepSeek addresses this by making bounding boxes and point coordinates the minimal units of reasoning. The model is built on the v4-flash architecture with aggressive visual token compression.

Test results show strong performance across multiple benchmarks, particularly on topological reasoning and maze navigation tasks, where the model reportedly surpasses GPT-5.4 by nearly 17 percentage points.
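The report itself is not excerpted here, but the paradigm is easy to picture as a reasoning trace that interleaves free text with coordinate tokens. The sketch below is a minimal illustration under that assumption; the `Point`, `Box`, and tag formats are hypothetical and do not reflect DeepSeek's actual trace schema or API.

```python
from dataclasses import dataclass
from typing import List, Union

# Hypothetical illustration of a reasoning chain that interleaves text with
# visual primitives (points and bounding boxes). All names and the serialized
# tag format are assumptions made for this sketch, not DeepSeek's real format.

@dataclass
class Point:
    x: float  # normalized image coordinates in [0, 1]
    y: float

@dataclass
class Box:
    x0: float
    y0: float
    x1: float
    y1: float

# A reasoning step is either plain text or a grounded reference to a point/region.
Step = Union[str, Point, Box]

def render_chain(steps: List[Step]) -> str:
    """Serialize an interleaved chain so coordinates sit inline with the text:
    each referenced object is pinned to explicit coordinates rather than being
    described only in natural language."""
    parts = []
    for step in steps:
        if isinstance(step, str):
            parts.append(step)
        elif isinstance(step, Point):
            parts.append(f"<point x={step.x:.3f} y={step.y:.3f}>")
        elif isinstance(step, Box):
            parts.append(f"<box {step.x0:.3f},{step.y0:.3f},{step.x1:.3f},{step.y1:.3f}>")
    return " ".join(parts)

if __name__ == "__main__":
    # Toy maze-navigation-style trace: the reasoning "points" at the entrance,
    # a blocking wall, and the exit while planning a path.
    chain = [
        "The entrance is here",
        Point(0.12, 0.90),
        "; the corridor is blocked by this wall",
        Box(0.40, 0.35, 0.55, 0.80),
        ", so the path must detour right before reaching the exit",
        Point(0.88, 0.10),
        ".",
    ]
    print(render_chain(chain))
```

The design point this is meant to convey is that localization becomes part of the reasoning token stream itself, so later steps can refer back to an exact region instead of an ambiguous verbal description.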
