Vision and language are now completely decoupled, which means future human-computer interaction may become purely dialogue-driven, and the interaction paradigm is about to change dramatically.

CoinNetwork
CoinWorld News reports that, speaking on the Latent Space podcast, AI Auntie argued that current video and image generation models do not truly understand the physical world: diffusion models are essentially pixel-level renderers with no capacity for physical reasoning. Taking NVIDIA's Cosmos model as an example, the core diffusion model that renders the images has only 7 billion parameters, while the real source of intelligence is a large language model (LLM) acting as a prompt rewriter. The logical coherence of the final video, and how closely it matches the request, depends almost entirely on the quality of the LLM's prompt rewriting rather than on the diffusion model itself. This decoupling of vision and language points to a thorough restructuring of human-computer interaction.
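
The two-stage split described above can be pictured with a small Python sketch. It is illustrative only: `rewrite_prompt`, `render_video`, `RenderRequest`, and `generate` are hypothetical stand-ins made up for this example, not NVIDIA's Cosmos API, and the "renderer" simply returns text labels instead of running a diffusion model. The point is the shape of the pipeline, in which the language stage decides what should happen and the rendering stage only draws what it is told.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RenderRequest:
    """Input handed to the rendering stage (hypothetical structure)."""
    prompt: str
    num_frames: int


def rewrite_prompt(user_prompt: str) -> str:
    """Stand-in for the LLM prompt rewriter (assumption: a real system
    would call a language model here). It expands a short request into
    an explicit description, including physical constraints."""
    details = [
        "objects obey gravity",
        "lighting stays consistent across frames",
        "camera motion is smooth",
    ]
    return user_prompt.strip() + ". " + "; ".join(details) + "."


def render_video(request: RenderRequest) -> List[str]:
    """Stand-in for the diffusion renderer. It adds no reasoning of its
    own and only turns the prompt it receives into placeholder frames."""
    return [f"frame {i}: {request.prompt}" for i in range(request.num_frames)]


def generate(
    user_prompt: str,
    num_frames: int = 3,
    rewriter: Callable[[str], str] = rewrite_prompt,
    renderer: Callable[[RenderRequest], List[str]] = render_video,
) -> List[str]:
    """Run the decoupled pipeline: language stage first, then rendering."""
    detailed_prompt = rewriter(user_prompt)
    return renderer(RenderRequest(prompt=detailed_prompt, num_frames=num_frames))


if __name__ == "__main__":
    for frame in generate("a ball rolls off a table", num_frames=2):
        print(frame)
```

Because the rewriter and the renderer are passed in as separate functions, either one can be swapped without touching the other, which is the decoupling the report describes.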