Using "visual reasoning" to explore the physical world AGI, ElorianAI raises $55 million in funding
By Alpha Community
In certain areas, such as programming and mathematics, large AI models have already surpassed most humans. Reports indicate that Anthropic's internal programming is now done almost entirely by AI, and Google's Gemini Deep Think solved 5 of the 6 problems at IMO 2025, reaching gold-medal standard.
However, in visual reasoning, even the advanced Gemini 3 Pro only reaches the level of a 3-year-old child on the BabyVision benchmark, which tests basic visual reasoning abilities.
Why are large models so strong at programming and math yet so weak at visual reasoning? The limitation lies in how they "think." Vision-language models (VLMs) must first convert visual inputs into language and then reason over that text. But many visual tasks cannot be precisely described in words, and visual reasoning suffers as a result.
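To make that limitation concrete, here is a minimal sketch of the "two-step" pipeline described above. It is illustrative only: the function names are hypothetical and this is not Elorian AI's or any vendor's actual code. The key point is that every reasoning step sees only the text description, so any spatial detail the caption fails to capture is simply lost.

```python
# Illustrative sketch of the two-step VLM pipeline (hypothetical names, not real APIs).
from dataclasses import dataclass


@dataclass
class Image:
    pixels: bytes  # stand-in for raw visual input


def caption(image: Image) -> str:
    # Step 1: a vision encoder + captioner compresses the image into language.
    # Spatial and geometric detail that words cannot express is dropped here.
    return "a gear sitting next to a slot in a machine"


def reason_over_text(question: str, description: str) -> str:
    # Step 2: a language model reasons only over the text description,
    # never over the pixels themselves.
    return f"Answering {question!r} using only the description: {description!r}"


if __name__ == "__main__":
    img = Image(pixels=b"...")
    print(reason_over_text("Will the gear fit the slot?", caption(img)))
```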
Andrew Dai, a 14-year veteran of Google DeepMind, has teamed up with senior Apple AI expert Yinfei Yang to found Elorian AI. Their goal is to raise models' visual reasoning from "child level" to "adult level" and to let models think natively in "visual space," taking aim at physical-world AGI.
Elorian AI has secured $55 million in early funding from Striker Venture Partners, Menlo Ventures, and Altimeter, with participation from 49 Palms and top AI scientists including Jeff Dean.
Pioneering multimodal models, aiming to endow visual models with reasoning abilities
Andrew Dai, who is of Chinese descent, holds a bachelor's degree in computer science from Cambridge and a PhD in machine learning from Edinburgh. He interned at Google during his PhD, joined the company full-time in 2012, and stayed for 14 years before founding his own venture.
Image source: Andrew Dai’s LinkedIn
Shortly after joining Google, he co-authored "Semi-supervised Sequence Learning," the first paper on language-model pretraining and supervised fine-tuning, laying the groundwork for the later emergence of GPT. Another foundational paper of his, "GLaM: Efficient Scaling of Language Models with Mixture-of-Experts," paved the way for today's mainstream MoE architectures.
Image source: Google
During his time at Google, he was deeply involved in nearly every large-model training effort, from PaLM to Gemini 1.5 and Gemini 2.5. At Jeff Dean's direction, he began leading Gemini's data team (including synthetic data) in 2023; the team later grew to hundreds of members.
Image source: Yinfei Yang’s LinkedIn
Yinfei Yang, co-founder of Elorian AI, previously spent four years at Google Research focusing on multimodal representation learning, then joined Apple, where he was responsible for developing multimodal models.
Image source: arXiv
His notable paper, "Scaling Up Visual and Vision-Language Representation Learning with Noisy Text Supervision," advanced the field of multimodal representation learning.
Elorian AI’s co-founders also include Seth Neel, a former assistant professor at Harvard University and an expert in data and AI.
Why dwell on the pioneering papers written by Elorian AI's co-founders? Because what they aim to do is not mere engineering optimization but a paradigm shift at the level of fundamental architecture: upgrading AI from text-based understanding to vision-based understanding.
Today's AI models perform well on text-based tasks, but even the most advanced multimodal large models still stumble on basic visual grounding tasks.
For example, how do you fit a part precisely into a mechanical assembly so that it runs more accurately and efficiently? Such spatial, physical tasks are simple for elementary school students but challenging for existing multimodal models.
Biology offers a clue here. In the human brain, vision underpins much of our thinking; humans draw on visual and spatial reasoning abilities that long predate language-based logical reasoning.
For instance, explaining how to navigate a maze in words can leave someone confused, while a quick sketch makes it instantly clear.
Similarly, even a bird, which has no language, can recognize and reason about geographic features through vision, enabling migration across the globe. This is a strong signal that advances in machine reasoning will likely depend heavily on advances in vision.
Imagine embedding this biological visual instinct into AI's genes from the very start of model development: building a natively multimodal model that can simultaneously understand and process text, images, video, and audio, and thereby giving the model genuine visual comprehension. Andrew Dai and his team aim to create a "sensory integrator" that teaches machines not only to "see" the world but also to "understand" it.
In their view, deeply understanding the “physical world” is the key to leapfrogging to the next generation of machine intelligence and ultimately reaching “visual general artificial intelligence (Visual AGI).”
Post-hoc reasoning VLMs are not the right path to visual reasoning
Many teams have tried this before. On the Gemini team, Andrew Dai was already leading one of the world's top multimodal research groups. But traditional multimodal models still rely mainly on VLMs (vision-language models), which follow a "two-step" logic: first convert visual inputs into language, then perform text-based reasoning (sometimes with external tools).
However, post-hoc reasoning has inherent limitations. It’s prone to model hallucinations, and many visual tasks cannot be accurately described with words.
Moreover, models like NanoBanana excel at multimodal generation, but generation ability does not equal reasoning ability. Their “thinking” before generating still fundamentally depends on language models, not native reasoning.
To develop models that truly grasp the spatial, structural,