Google's Vision Banana: A "GPT-3 moment" for computer vision? An image generation model beats specialized visual understanding models.

ME News, April 23 (UTC+8): According to Beating monitoring, a Google team whose authors include Kaiming He and Saining Xie has published a paper proposing Vision Banana. The team applied lightweight instruction fine-tuning to Google's own image generation model, Nano Banana Pro (i.e., Gemini 3 Pro Image), turning it into a general-purpose visual understanding model. The core idea is to express the output of every visual task as an RGB image, so that perception tasks such as segmentation, depth estimation, and surface normal estimation are carried out through image generation, with no dedicated architecture or training loss for each task type.
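To make the "every output is an RGB image" idea concrete, here is a minimal sketch of how a segmentation map or a surface normal could be written as pixel colors and read back. The palette, the function names, and the nearest-color decoding are assumptions made for illustration; the article does not describe the paper's actual encoding.

```python
from typing import Dict, List, Tuple

RGB = Tuple[int, int, int]

# Hypothetical palette: this class-to-color assignment is illustrative only.
PALETTE: Dict[str, RGB] = {
    "road": (128, 64, 128),
    "pedestrian": (220, 20, 60),
    "vehicle": (0, 0, 142),
}


def encode_mask(labels: List[List[str]]) -> List[List[RGB]]:
    """Render a per-pixel label map as an RGB image."""
    return [[PALETTE[name] for name in row] for row in labels]


def nearest_class(pixel: RGB) -> str:
    """Map a possibly noisy generated color to the closest palette class."""
    return min(
        PALETTE,
        key=lambda name: sum((a - b) ** 2 for a, b in zip(pixel, PALETTE[name])),
    )


def decode_mask(image: List[List[RGB]]) -> List[List[str]]:
    """Recover the label map from an RGB image, tolerating small color errors."""
    return [[nearest_class(pixel) for pixel in row] for row in image]


def encode_normal(normal: Tuple[float, float, float]) -> RGB:
    """Map a unit normal with components in [-1, 1] to 8-bit RGB channels."""
    return tuple(round((c + 1.0) / 2.0 * 255) for c in normal)
```

Decoding by nearest color rather than exact match is a design choice for this sketch: a generative model is unlikely to reproduce palette colors exactly, so some tolerance is needed when converting its image back into labels.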

The evaluation covers two broad categories: image segmentation and 3D geometric inference. In segmentation, Vision Banana outperforms the dedicated segmentation model SAM 3 by 4.7 percentage points on Cityscapes for semantic segmentation (labeling each pixel in the image with a category, e.g., "road," "pedestrian," "vehicle"), and it also surpasses SAM 3 Agent on referring expression segmentation (finding and segmenting the object that matches a natural language description, e.g., "the dog wearing a hat on the left"). It still lags behind SAM 3 on instance segmentation (distinguishing different individuals of the same category, e.g., separately labeling five dogs in the image). In 3D, its metric depth estimation (inferring the actual physical distance from each pixel to the camera from a single photo) reaches an average accuracy of 0.929 across four standard datasets, above the 0.918 of the dedicated model Depth Anything V3. The model is trained entirely on synthetic data, with no real depth data, and needs no camera parameters at inference time. Its surface normal estimation (inferring the orientation of object surfaces) achieves state-of-the-art results on three indoor benchmarks.
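The article does not say which metric the depth "accuracy" figures refer to. One metric commonly reported for depth estimation is threshold accuracy: the fraction of pixels whose predicted depth is within a factor of 1.25 of the true depth. The sketch below assumes that metric purely for illustration; the function name and the sample values are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def threshold_accuracy(
    predicted: Sequence[float],
    truth: Sequence[float],
    threshold: float = 1.25,
) -> float:
    """Fraction of pixels where max(pred/true, true/pred) is below the threshold.

    Assumes all depths are strictly positive (e.g., metres from the camera).
    """
    if len(predicted) != len(truth) or len(predicted) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(
        1
        for p, t in zip(predicted, truth)
        if max(p / t, t / p) < threshold
    )
    return hits / len(predicted)


# Illustrative depths in metres; the third pixel is off by a factor of about 1.33.
predicted_depths = [1.0, 2.0, 3.0, 10.0]
true_depths = [1.1, 2.0, 4.0, 10.5]
score = threshold_accuracy(predicted_depths, true_depths)
```

Under this assumed metric, a score of 0.929 would mean that about 93% of pixels fall within the tolerance band.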

The fine-tuning only mixes a small amount of visual task data into the original image generation training data, and the model's image generation capability remains largely unaffected: it ties with the original Nano Banana Pro in generation quality evaluations. The paper argues that image generation pre-training plays the same role in the visual domain that text generation pre-training plays in the language domain: in learning to generate images, the model has already acquired the internal representations needed to understand them, and instruction fine-tuning simply unlocks those representations.
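The data mixing described above can be pictured as a sampler that draws most training examples from the generation corpus and occasionally draws one from the visual task set. This is a minimal sketch under stated assumptions: the function name, the `task_fraction` parameter, and its value are hypothetical, since the article gives no mixing ratio or training details.

```python
import random
from typing import Iterator, List, Sequence


def mixed_batches(
    generation_data: Sequence[str],
    task_data: Sequence[str],
    task_fraction: float,
    batch_size: int,
    num_batches: int,
    seed: int = 0,
) -> Iterator[List[str]]:
    """Yield batches in which each example comes from task_data with
    probability task_fraction and from generation_data otherwise."""
    rng = random.Random(seed)
    for _ in range(num_batches):
        yield [
            rng.choice(task_data)
            if rng.random() < task_fraction
            else rng.choice(generation_data)
            for _ in range(batch_size)
        ]


# Hypothetical usage: roughly 5% of examples are visual task data.
batches = list(
    mixed_batches(
        generation_data=["gen_a", "gen_b", "gen_c"],
        task_data=["depth_x", "seg_y"],
        task_fraction=0.05,
        batch_size=8,
        num_batches=4,
    )
)
```

Keeping the task fraction small is what lets the original generation data dominate training, which is consistent with the article's claim that generation quality is largely preserved.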

(Source: BlockBeats)
