Google Vision Banana: The "GPT-3 moment" of computer vision? Image generation model defeats specialized visual understanding model.

ME News, April 23 (UTC+8): According to monitoring by Beating, a Google team whose authors include Kaiming He and Saining Xie has published a paper proposing Vision Banana. The team applied lightweight instruction fine-tuning to Google's own image generation model, Nano Banana Pro (i.e., Gemini 3 Pro Image), turning it into a general-purpose visual understanding model. The core idea is to express the output of every visual task as an RGB image, so that perception tasks such as segmentation, depth estimation, and surface normal estimation are solved through image generation, with no dedicated architecture or training loss for each task.
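To make the "everything is an RGB image" idea concrete, here is a minimal sketch of how a segmentation result could be written as colours and read back. The palette, the function names, and the nearest-colour decoding rule are illustrative assumptions; the article does not describe the encoding Vision Banana actually uses.

```python
# Hypothetical palette: class name -> RGB colour. These colours are
# illustrative only; the article does not say which ones the model uses.
PALETTE = {
    "road": (128, 64, 128),
    "pedestrian": (220, 20, 60),
    "vehicle": (0, 0, 142),
}


def encode_mask(labels):
    """Turn a 2D grid of class names into a 2D grid of RGB tuples."""
    return [[PALETTE[name] for name in row] for row in labels]


def decode_mask(rgb):
    """Map each pixel back to the class with the nearest palette colour.

    Nearest-colour matching tolerates the small colour errors that a
    generative model may introduce in its output image.
    """
    def nearest(pixel):
        return min(
            PALETTE,
            key=lambda name: sum(
                (a - b) ** 2 for a, b in zip(pixel, PALETTE[name])
            ),
        )

    return [[nearest(pixel) for pixel in row] for row in rgb]


labels = [["road", "vehicle"], ["pedestrian", "road"]]
image = encode_mask(labels)

# Simulate slightly imperfect generated colours, then recover the labels.
noisy = [[(r + 3, g - 2, b + 1) for (r, g, b) in row] for row in image]
assert decode_mask(noisy) == labels
```

Under this assumed scheme, the generated picture itself is the prediction: a task-specific output head is replaced by a fixed, non-learned decoding step.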

The evaluation covers two major categories: image segmentation and 3D geometric inference. In semantic segmentation (labeling every pixel with a category such as "road," "pedestrian," or "vehicle"), Vision Banana outperforms the dedicated segmentation model SAM 3 by 4.7 percentage points on Cityscapes. In referring expression segmentation (finding and segmenting the object that matches a natural-language description, e.g., "the dog wearing a hat on the left"), it also surpasses SAM 3 Agent. It still lags behind SAM 3 in instance segmentation (separating individual objects of the same category, e.g., labeling each of five dogs in an image separately).

In 3D, the model's metric depth estimation (inferring the actual physical distance from each pixel to the camera from a single photo) reaches an average accuracy of 0.929 across four standard datasets, ahead of the 0.918 achieved by the dedicated model Depth Anything V3. It was trained entirely on synthetic data, with no real depth data, and needs no camera parameters at inference time. Its surface normal estimation (inferring the orientation of object surfaces) achieves state-of-the-art results on three indoor benchmarks.
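The article does not name the metric behind the 0.929 and 0.918 figures. One widely used depth accuracy measure is the threshold accuracy often written δ1: the share of pixels whose predicted depth is within a factor of 1.25 of the true depth. Assuming a measure of that kind, a minimal sketch looks like this; the function name and the sample depth values are made up for illustration.

```python
def delta_accuracy(pred, target, threshold=1.25):
    """Fraction of pixels where max(pred/target, target/pred) < threshold.

    Both inputs are flat lists of strictly positive depths in metres.
    This is an assumed metric, not one confirmed by the article.
    """
    if len(pred) != len(target) or not pred:
        raise ValueError("pred and target must be non-empty and equal length")
    hits = sum(
        1
        for p, t in zip(pred, target)
        if max(p / t, t / p) < threshold
    )
    return hits / len(pred)


# Made-up depths: three of the four predictions fall within the
# 1.25x tolerance (4.0 m vs 5.5 m is off by a factor of 1.375).
pred = [2.0, 4.0, 10.0, 1.0]
target = [2.1, 5.5, 9.0, 1.0]
score = delta_accuracy(pred, target)  # 0.75
```

Because the ratio is taken in both directions, overestimating and underestimating depth by the same factor are penalized equally.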

The fine-tuning only mixes a small amount of visual task data into the original image generation training data, and the model's image generation capability remains largely unaffected: it ties with the original Nano Banana Pro in generation quality evaluations. The paper argues that image generation pre-training plays the same role in vision that text generation pre-training plays in language: while learning to generate images, the model has already acquired the internal representations needed to understand them, and instruction fine-tuning simply unlocks those representations.

(Source: BlockBeats)
