What does the new paper deleted overnight by DeepSeek actually say?
Last night, DeepSeek multimodal researcher Chen Xiaokang announced in a post on X a new DeepSeek paper on multimodal technology, titled “Thinking with Visual Primitives,” writing “Excited to release.”
Early this morning, the tweet was deleted, and the paper on GitHub was also taken down.
But APPSO read the full text before it disappeared, and our impression after reading it is that the withdrawal was probably not about problems with the content.
Quite the opposite: it may have revealed too much.
The day before yesterday, we had just tested DeepSeek’s image-recognition mode: we asked it to count fingers, it thought for a while, complained “I’m really dizzy from counting,” and then answered incorrectly. At the time, we took it for a minor glitch in a feature still under test.
This paper suggests that the dizziness behind that finger counting points to a technical bottleneck that GPT, Claude, and Gemini have all failed to solve.
And the solution DeepSeek offers is almost laughably simple: equip AI with a finger.
Chen Xiaokang wrote in that tweet:
“Traditional chains of thought stay in the language space, but visual reasoning needs more. By using points and boxes as cognitive anchors, our model bridges the Reference Gap, mimicking the ‘point-to-reason’ synergy humans use.”
Seeing clearly and pointing accurately are two different things.
Currently, every multimodal large model doing image reasoning essentially converts the visual scene into text, then runs chain-of-thought reasoning in the text space. GPT-5.4, Claude-Sonnet-4.6, Gemini-3-Flash: all follow this approach.
Over the past two years, the focus of improvements by OpenAI, Google, and Anthropic has been on one problem: how to make models see more clearly. High-resolution cropping, dynamic tiling, enlarging images before feeding in. DeepSeek calls this the Perception Gap.
But this paper points out another bottleneck: the Reference Gap. The model sees clearly but cannot precisely point to specific objects in the reasoning process.
You can think of it this way: in a picture with 25 densely packed people, describing “the one next to the person in blue in the third row on the left” is inherently vague. As the model counts, it loses the context and forgets whom it just counted.
How do humans solve this problem? Quite primitively: extend a finger, point to a number.
A model with 284 billion parameters, equipped with a finger.
DeepSeek’s approach: let the model output coordinates directly during thinking.
Imagine the model sees a picture with many people; its chain of thought is no longer “I see someone in blue on the left,” but “I see this person,” then attach a box coordinate to highlight the person. Count each person by drawing a box around them; after marking all, just count the boxes.
Two coordinate formats: one is a bounding box, a rectangle that encloses the object, suitable for locating objects; the other is a point, a specific position on the image, suitable for tracking paths or maze navigation. DeepSeek calls these “visual primitives,” the smallest units of thought.
The key change here: previously, the model output coordinates as the final answer (“the target is here”), now the coordinates are embedded within the thinking process itself. Coordinates are like marks on rough draft paper, not the final answer on the test paper.
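To make the contrast concrete, here is a rough sketch of what such a coordinate-anchored trace might look like. The paper does not publish its exact serialization; the tag names, data structures, and pixel-coordinate convention below are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class Box:
    """Bounding-box primitive: top-left and bottom-right corners, in pixels."""
    x1: int
    y1: int
    x2: int
    y2: int

@dataclass
class Point:
    """Point primitive: a single position, used for paths and maze steps."""
    x: int
    y: int

# Instead of "I see someone in blue on the left", the chain of thought
# interleaves text with primitives that anchor each referring expression.
trace = [
    ("think", "This is a team photo; I will box every person."),
    ("box", Box(12, 40, 88, 210)),    # person 1
    ("box", Box(95, 38, 170, 215)),   # person 2
    # ... one box per remaining person ...
    ("think", "25 boxes drawn, so the answer is 25."),
    ("answer", "25"),
]
```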
Compressing an image by 7,056 times and still being able to count how many people are in it.
The base model is DeepSeek-V4-Flash, a 284B-parameter MoE model. MoE means the model’s brain is very large, but each answer activates only a small subset of it; only about 13B parameters are active during inference. Like a hundred-person team in which only five people are assigned to any given task.
On the visual encoder side, there are three levels of compression. An analogy: you want to send a photo to a friend with slow internet. First, cut it into small squares; second, merge every nine squares into one (3×3 compression); third, trim redundant information during transmission (the KV cache is compressed 4×).
In concrete numbers: a 756×756 image, about 570k pixels, is compressed down to 81 information units. A compression ratio of 7,056×.
My first reaction to this number was: can you still see clearly? But the results in the paper show that yes, you can. Not only see clearly but also count exactly 25 people in the picture.
Compare: same 800×800 image, Gemini-3-Flash consumes about 1,100 tokens to represent it, Claude-Sonnet-4.6 about 870 tokens, GPT-5.4 about 740 tokens. DeepSeek only uses 90 information units in the final calculation. Others use over a thousand grids to remember an image; DeepSeek only needs 90 grids, then uses the saved computing power to “point.”
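For intuition, a quick back-of-the-envelope check of how a patch-and-merge pipeline can produce the 7,056× figure for the 756×756 case. The 28-pixel patch size below is an assumption chosen so the published numbers line up; the paper’s actual encoder configuration may differ.

```python
# Sanity-check the 7,056x compression ratio for a 756x756 image.
width = height = 756
patch = 28                       # assumed ViT patch size
grid = width // patch            # 27 patches per side -> 729 patches total
tokens = (grid // 3) ** 2        # 3x3 merge -> 9 * 9 = 81 visual tokens

pixels = width * height          # 571,536 pixels, i.e. ~570k
ratio = pixels / tokens          # 571,536 / 81 = 7,056.0
print(grid * grid, tokens, ratio)  # 729 81 7056.0
```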
How the 40-million-plus training samples were gathered
DeepSeek crawled every dataset labeled “object detection” from platforms like Hugging Face, arriving at an initial pool of 97,984 data sources.
Then, two rounds of filtering.
First, check label quality. An AI reviewer automatically screens for three issues: labels that are meaningless numbers (category names like “0” or “1”), labels that are personal entities (“MyRoommate”), and labels that are ambiguous abbreviations (“OK” or “NG” in industrial inspection; an “OK” apple and an “OK” circuit board look completely different, so a model can’t learn anything from the label). This round removed 56%, leaving 43,141.
Second, check bounding-box quality. Three criteria: too many missing labels (only half the objects boxed), boxes that are skewed or cut objects in half, and boxes so large they cover the entire image (a sign the original data was converted from classification to detection without any localization info). This cut a further 27%, leaving 31,701.
Finally, sample by category, deduplicate, and produce over 40 million high-quality samples.
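For intuition, a minimal, runnable sketch of the two-round filter, with simplified regex heuristics standing in for the AI-assisted review described above (the real checks are model-based, and the thresholds here are invented):

```python
import re

def label_quality_ok(labels: list[str]) -> bool:
    """Round 1: reject meaningless numeric labels, personal entities,
    and ambiguous abbreviations."""
    for lbl in labels:
        if lbl.isdigit():                 # category names like "0" or "1"
            return False
        if re.fullmatch(r"My\w+", lbl):   # personal entities like "MyRoommate"
            return False
        if lbl.upper() in {"OK", "NG"}:   # ambiguous industrial-inspection labels
            return False
    return True

def box_quality_ok(boxes, img_w, img_h) -> bool:
    """Round 2: reject boxes so large they cover (almost) the whole image."""
    return all((x2 - x1) * (y2 - y1) < 0.95 * img_w * img_h
               for x1, y1, x2, y2 in boxes)

# Toy "datasets": (labels, boxes, image width, image height).
datasets = [
    (["person", "dog"], [(10, 10, 80, 90)], 100, 100),   # kept
    (["0", "1"],        [(10, 10, 80, 90)], 100, 100),   # dropped in round 1
    (["car"],           [(0, 0, 100, 100)], 100, 100),   # dropped in round 2
]
kept = [d for d in datasets
        if label_quality_ok(d[0]) and box_quality_ok(d[1], d[2], d[3])]
print(len(kept))  # 1
```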
DeepSeek first scales up the bounding-box data and adds the point data later. The reason is simple: a labeled box has a nearly unique answer (just enclose the object), while a labeled point is more ambiguous: any position on the object counts, there is no single correct answer, and the training signal is therefore fuzzy. Also, a box is defined by two points (its top-left and bottom-right corners), so a model that has learned to draw boxes is already most of the way to the lower-dimensional task of placing points.
How to teach the “pointing” ability to the model
The post-training strategy is “train separately first, then merge.”
DeepSeek first trains a specialized box-drawing expert model on the bounding-box data, then a point-labeling expert model on the point data. They are trained separately because the data volume isn’t large enough, and mixing the two abilities could cause interference.
Then, apply reinforcement learning to both experts. How to judge if the model “drew the box correctly” or “walked the right path”? DeepSeek designed a multi-dimensional scoring system: format correctness (are the coordinates syntactically valid), logical consistency (is the reasoning process self-consistent), accuracy (how close the final result is to the standard answer).
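A hedged sketch of what such a multi-dimensional score could look like. The paper names only the three dimensions; the weights and the IoU-based accuracy term here are assumptions.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def reward(format_valid: bool, consistent: bool, pred_box, gt_box) -> float:
    if not format_valid:                   # unparseable coordinates score nothing
        return 0.0
    score = 0.2 * consistent               # reasoning is self-consistent
    score += 0.8 * iou(pred_box, gt_box)   # closeness to the reference answer
    return score

print(reward(True, True, (10, 10, 50, 50), (12, 12, 52, 52)))  # ~0.86
```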
Data filtering during reinforcement learning is also carefully designed. First, run the model N times on the same question. Questions it always gets right are too easy to train on; questions it always gets wrong are too hard to learn from; only questions with mixed results are kept for training.
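That filter is simple enough to state in a few lines; a minimal sketch (the exact number of rollouts N is not given in the article):

```python
N = 8  # assumed rollouts per question

def keep_for_training(pass_count: int, n: int = N) -> bool:
    """Keep a question only if the model sometimes, but not always, gets it right."""
    return 0 < pass_count < n

for passes in (0, 3, 8):
    print(passes, keep_for_training(passes))  # 0 False / 3 True / 8 False
```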
The final step is to merge the capabilities of the two experts into one model. The specific method: let a unified model learn from the outputs of both experts, similar to a student learning different subjects from two teachers.
After giving it a finger, how does it count?
Counting 25 people
Give the model a group photo of a football team, ask “How many people are in the picture?”
Thinking process: first determine “this is a team photo, count everyone, including players and coaches.” Then output 25 box coordinates at once, drawing a box around each person. Then tally: 4 in the front row + 9 in the middle + 8 in the back + 2 coaches on the left + 2 coaches on the right = 25.
“How many bears are on the ground?”
There are three bears in the picture. The model draws a box around each and judges their positions: the first on the tree trunk climbing vertically, exclude; the second walking near the rocks, count; the third between broken wood and mud, count. Final answer: 2 bears.
It doesn’t first count three and then subtract one; instead, it judges “is this on the ground” for each bear, with a specific coordinate anchor behind each judgment. It is truly checking each one individually, not guessing.
Multi-hop spatial reasoning
In a 3D rendered scene with colorful geometric shapes, the question: “Does a purple rubber object of the same size as a gray metal object exist?”
The model first boxes the gray metal sphere, confirming it’s a small object. Then, sequentially box other small objects in the scene: a brown metal cylinder, a blue metal cube, a blue rubber cube, a yellow rubber cylinder… checking color, material, size attribute by attribute. Conclusion: no purple rubber object exists.
Six positioning and judgment steps. Each step has a coordinate anchor, avoiding “wait, where did I just check?”
The paper includes more cases like these.
Maze navigation: others flip coins, DeepSeek is really searching.
The paper tested four tasks, of which the maze is the most challenging.
The task is straightforward: given a maze image, ask whether there’s a path from start to finish, and if so, draw it. Mazes come in three shapes: grid, ring, honeycomb.
The model explores the maze the way you did as a kid with a pencil on paper: pick a fork, follow it to the end, backtrack at dead ends. The difference is that it marks a coordinate on the map at every step, leaving a record.
The paper shows a complete process for a circular maze: the model first marks start and end points, then begins exploring. After 18 steps, it hits dead ends twice and backtracks, finally finding a path and outputting the sequence of coordinate points along the route.
DeepSeek also designed trap mazes: they look like there’s a path, but part of it is quietly blocked. These test patience; the model can’t glance at the area near the start and jump to a conclusion, it must try every possible route to confirm the maze is blocked.
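The behavior the traces describe resembles a classic depth-first search with backtracking, where every move and every retreat is recorded as a coordinate. A sketch of that behavior on a grid maze (this illustrates the search pattern, not DeepSeek’s actual implementation):

```python
def solve(maze, start, goal):
    """maze: 2D list, 0 = open, 1 = wall. Returns (found, step-by-step trace)."""
    rows, cols = len(maze), len(maze[0])
    visited, trace = set(), []

    def dfs(cell):
        trace.append(cell)                  # every step leaves a coordinate mark
        if cell == goal:
            return True
        visited.add(cell)
        r, c = cell
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if (0 <= nr < rows and 0 <= nc < cols
                    and maze[nr][nc] == 0 and (nr, nc) not in visited):
                if dfs((nr, nc)):
                    return True
                trace.append(cell)          # dead end: backtrack to this cell
        return False

    return dfs(start), trace

maze = [[0, 1, 0],
        [0, 1, 0],
        [0, 0, 0]]
print(solve(maze, (0, 0), (0, 2)))  # a path exists; the trace shows each step
```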
Accuracy comparison:
DeepSeek: 66.9%
GPT-5.4: 50.6%
Claude-Sonnet-4.6: 48.9%
Gemini-3-Flash: 49.4%
Qwen3-VL: 49.6%
Maze solutions have only two answers: path exists or not. Random guessing hits 50%. GPT, Claude, Gemini, Qwen hover around 50%, like flipping a coin. DeepSeek’s 66.9% isn’t high, but it is actually walking step by step, not guessing.
Path tracking: the ultimate version of the tangled-wires puzzle
This task is more intuitive: a bunch of lines tangled together, each running from one marker to another, like headphones pulled out of your pocket. The question: which endpoint does line C lead to?
The model outputs coordinate points along the line, like a finger tracing over paper. Curved parts have dense points, straight segments sparse. When humans follow a line with their eyes, they do the same: slow down at curves, sweep through straight lines.
The paper also added a harder test: all lines have the same thickness and color. Can’t rely on color to distinguish lines, only on the continuity of the curve to decide which line to follow at intersections.
DeepSeek: 56.7%
GPT-5.4: 46.5%
Claude-Sonnet-4.6: 30.6%
Gemini-3-Flash: 41.4%
Claude’s 30.6% is a bit surprising. There are usually four or five endpoint options, so random guessing would score over 20%; 30.6% is only slightly better than chance. Perhaps in this pure spatial-tracking task, the inertia of language-based reasoning actually hurts performance.
How to teach AI to navigate mazes without cheating
Maze training faces a practical problem: if only the correctness of the final answer is scored, the model quickly learns to guess rather than search. A run that searches seriously but answers wrong scores zero, while one that guesses right without searching scores full marks; the reward teaches exactly the wrong behavior.
DeepSeek’s solution is to score the process itself. Each valid exploration step earns points, hitting a wall deducts points, and the farther the model explores, the better. Even if it never reaches the end, a run that seriously searches most of the area still gets a decent score. This discourages lazy behavior.
Harder mazes require more: not only say “no path,” but also prove that all reachable places have been explored. Search coverage is also scored.
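Putting those ideas together, a hedged sketch of a process-based maze score. Only the ingredients (step credit, wall penalty, coverage) come from the article; every weight below is an assumption.

```python
def maze_reward(valid_steps: int, wall_hits: int,
                cells_explored: int, cells_reachable: int,
                reached_goal: bool) -> float:
    score = 0.01 * valid_steps                        # each valid exploration step
    score -= 0.05 * wall_hits                         # hitting a wall is penalized
    score += 0.5 * cells_explored / cells_reachable   # search-coverage credit
    score += 1.0 * reached_goal                       # solving still matters most
    return score

# A run that explored most of the maze but never reached the goal
# still earns a decent score, so lazy guessing is discouraged.
print(maze_reward(valid_steps=40, wall_hits=2,
                  cells_explored=45, cells_reachable=50, reached_goal=False))
```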
A little Easter egg, and three limitations
The post-training data contains no Chinese, yet the model can still do visual-primitives reasoning in Chinese.
Given a photo of a coffee machine and asked in Chinese how to make a latte, it labels the steam wand, milk jug, coffee beans, and latte button with coordinate annotations, then gives step-by-step instructions. The multilingual ability is inherited from the base model; training on visual primitives didn’t break it.
It can also combine visual understanding with world knowledge: given a photo of the Golden Gate Bridge, asking “Are there NBA teams nearby?” It first boxes the bridge, infers it’s San Francisco, then answers with the Golden State Warriors.
It understands humor: a natural spot on a fruit cross-section coincidentally forms a sad cat face, and the model can point out where the similarity is and explain why it’s funny.
It can give escape room guidance: box the high-up key, the chair on the floor, the locked door, and suggest “move the chair under the key → step on it to reach the key → go open the door.”
The paper honestly states what it currently cannot do.
Input resolution is limited. ViT outputs are constrained between 81 and 384 visual information units; in very detailed scenes (like counting fingers), coordinate precision isn’t enough. This might be the direct reason for the finger-counting failure during the earlier test.
Currently, a specific trigger phrase is needed to activate the visual primitives mode. The model can’t autonomously decide “I should extend my finger for this question” without prompting.
Topological reasoning generalization is limited. It performs well on trained maze types but may fail on new spatial structures. Chen Xiaokang also mentioned in that deleted tweet:
“We’re still in the early stages; generalization in complex topological reasoning tasks isn’t perfect yet, but we’re committed to solving it.”
The finger-counting dizziness during the earlier test is a live demonstration of the Reference Gap. In images with overlapping fingers, relying solely on language to distinguish “the third from the left” versus “the second from the right” is as confusing as counting a crowd without extending your own finger—inevitably chaotic.
This paper points toward the next evolution in multimodal reasoning: anchoring mechanisms. DeepSeek achieves the same effect with only 90 information units that others require thousands of tokens for, saving computational resources to let the model “think while pointing.”
The resolution arms race can slow down; teaching the model to extend a finger is more effective than equipping it with more expensive glasses.
After the whale opened its eyes, it grew fingers. The 66.9% maze accuracy is still far from perfect, but at least it’s walking seriously, unlike those neighbors flipping coins.