LeCun's world model makes its debut! Meta stuns with the first "human-like" model, which completes half an image after understanding the world, and self-supervised learning lives up to expectations
**Source:** Xinzhiyuan
**Introduction:** LeCun's world model is finally here, and it is exactly what everyone has been waiting for. Now that large models are learning to understand the world and reason like humans, can AGI be far away?
For a long time, LeCun's ideal has been AI that reaches human-level intelligence. To that end, he proposed the concept of a "world model".
Recently, in a public speech, LeCun once again criticized GPT-style large models: autoregressive models that generate from probabilities simply cannot solve the hallucination problem. He even asserted outright that GPT models will not survive another five years.
Today, LeCun is finally one step closer to his dream!
Meta has now released a "human-like" artificial intelligence model, I-JEPA, which can analyze and complete missing parts of images more accurately than existing models.
Paper address:
Bottom line: when I-JEPA fills in the missing pieces, it uses background knowledge about the world, rather than just looking at nearby pixels the way other models do.
It has been more than a year since the concept of a "world model" was proposed, and LeCun is finally about to realize his grand ambition.
Today, the training code and models are open-sourced. The paper will be presented at CVPR 2023 next week.
LeCun's world model is here
Even today's most advanced AI systems have been unable to break through some key limitations.
To break through these shackles, Meta's chief AI scientist Yann LeCun proposed a new architecture.
His vision is to create a machine that can learn an internal model of how the world works, so it can learn more quickly, plan for complex tasks, and respond to new and unfamiliar situations at any time.
The Image Joint Embedding Predictive Architecture (I-JEPA) that Meta launched today is the first AI model built on a key part of LeCun's world model vision.
I-JEPA learns by creating an internal model of the external world. In the process of completing images, it compares abstract representations of the images, rather than comparing the pixels themselves.
I-JEPA has shown strong performance on multiple computer vision tasks and is much more computationally efficient than other widely used CV models.
ImageNet linear evaluation: I-JEPA learns semantic image representations without using any visual data augmentation during pre-training, and does so with less computation than other methods
The representations learned by I-JEPA can be used in many different applications without extensive fine-tuning.
For example, the researchers trained a 632M-parameter Vision Transformer in under 72 hours on 16 A100 GPUs.
On the low-shot classification task on ImageNet, it achieves state-of-the-art performance with as few as 12 labeled examples per class.
Other methods typically require 2 to 10 times as many GPU hours and have higher error rates when trained with the same amount of data.
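As a rough illustration of what "without extensive fine-tuning" means in practice, the sketch below freezes a stand-in encoder and trains only a linear classifier ("linear readout") on a handful of labeled images. This is a hypothetical minimal example, not Meta's evaluation code; the stand-in encoder, feature dimension, and hyperparameters are all assumptions.

```python
import torch
import torch.nn as nn

# Stand-in for the frozen, pre-trained I-JEPA encoder (in practice a ViT
# loaded from a checkpoint); the 1280-d feature size is purely illustrative.
encoder = nn.Sequential(nn.AdaptiveAvgPool2d(8), nn.Flatten(), nn.Linear(3 * 8 * 8, 1280))
for p in encoder.parameters():
    p.requires_grad = False              # the backbone stays frozen

# The only trainable part is a single linear classifier.
probe = nn.Linear(1280, 1000)            # 1000 ImageNet classes
optimizer = torch.optim.SGD(probe.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()

def train_step(images, labels):
    with torch.no_grad():                # no gradients through the frozen encoder
        feats = encoder(images)
    loss = criterion(probe(feats), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# One toy batch of 12 labeled images, echoing the ~12-images-per-class budget.
loss = train_step(torch.randn(12, 3, 224, 224), torch.randint(0, 1000, (12,)))
print(f"probe loss: {loss:.3f}")
```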
Acquire common sense through self-supervised learning
In general, humans can learn a great deal of background knowledge about the world simply by passive observation.
It is generally believed that this kind of commonsense information is key to enabling intelligent behavior, such as sample-efficient acquisition of new concepts, grounding, and planning.
Modeling concept learning as learning a linear readout
Meta's work on I-JEPA (and, more generally, on Joint Embedding Predictive Architecture (JEPA) models) is based on this observation.
What researchers have tried is to devise a learning algorithm that captures commonsense background knowledge about the world and then encodes it into a digital representation that the algorithm can access.
To be efficient enough, systems must learn these representations in a self-supervised fashion—that is, directly from unlabeled data such as images or sounds, rather than from manually assembled labeled datasets.
At a high level, JEPA aims to predict the representations of parts of an input based on the representations of other parts of the same input (an image or a piece of text).
Because it does not involve collapsing multiple views/augmented representations of an image into a single point, JEPA holds great promise to avoid biases and issues that arise in widely used methods (i.e., invariance-based pre-training).
A joint embedding approach avoids representation collapse
At the same time, by predicting representations at a high level of abstraction rather than predicting pixel values directly, JEPA promises to learn useful representations while avoiding the limitations of generative methods, which underlie the large language models that have generated so much excitement.
In contrast, generative models learn by removing or distorting parts of their input.
For example, erase part of a photo, or hide certain words in a text paragraph, and then try to predict corrupted or missing pixels or words.
But a significant shortcoming of this approach is that although the world itself is inherently unpredictable, the model tries to fill in every piece of missing information.
As a result, such approaches can make mistakes that humans would never make, because they focus too much on irrelevant details instead of capturing higher-level, predictable concepts.
A well-known example is that generative models have difficulty generating the right hands.
In the general architecture of self-supervised learning, the system learns to capture the relationship between different inputs.
Its goal is to assign high energies to incompatible inputs and low energies to compatible inputs.
Common Architectures for Self-Supervised Learning
The three architectures differ as follows:
(a) A joint embedding (invariant) architecture learns to output similar embeddings for compatible inputs x, y and dissimilar embeddings for incompatible inputs.
(b) A generative architecture learns to reconstruct a signal y directly from a compatible signal x, using a decoder network conditioned on an additional variable z (possibly a latent variable) to facilitate reconstruction.
(c) The joint embedding prediction architecture learns to predict the embedding of signal y from compatible signal x, using a prediction network conditioned on an additional variable z (possibly a latent variable) to facilitate the prediction.
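To make the contrast in (c) concrete, here is a minimal, hypothetical sketch of a JEPA-style objective: the "energy" is just the distance between the predicted embedding of y and its actual embedding, computed entirely in representation space rather than pixel space. The modules and dimensions are placeholders rather than Meta's implementation, and the conditioning variable z is omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

dim = 256                                 # embedding dimension (placeholder)

# Placeholder encoders and predictor; in I-JEPA these are Vision Transformers.
context_encoder = nn.Linear(1024, dim)    # s_x: embedding of the context x
target_encoder  = nn.Linear(1024, dim)    # s_y: embedding of the target y
predictor       = nn.Linear(dim, dim)     # predicts s_y from s_x

def jepa_energy(x, y):
    """Low energy for compatible (x, y) pairs, high energy for incompatible ones."""
    s_x = context_encoder(x)
    with torch.no_grad():                 # no gradients flow into the target branch
        s_y = target_encoder(y)
    s_y_hat = predictor(s_x)
    return F.mse_loss(s_y_hat, s_y)       # compared in latent space, not pixel space

x = torch.randn(8, 1024)                  # flattened context regions of 8 inputs
y = torch.randn(8, 1024)                  # target regions of the same inputs
energy = jepa_energy(x, y)                # minimized during training
print(energy.item())
```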
Joint Embedding Predictive Architecture
The principle behind I-JEPA is to predict missing information through an abstract representation more akin to human understanding.
In order to guide I-JEPA to generate semantic representations, one of the core designs is the multi-block masking strategy.
Specifically, the team demonstrated the importance of predicting large chunks that contain semantic information. These chunks are of sufficient size to cover important semantic features.
The advantage of this strategy is that it reduces unnecessary details and provides a higher level of semantic understanding.
By focusing on large chunks of semantic information, the model can better capture important concepts in images or texts, leading to stronger predictive capabilities.
The Image-based Joint Embedding Predictive Architecture (I-JEPA) uses a single context block to predict the representations of various target blocks from the same image
Here, the context encoder is a Vision Transformer (ViT) that processes only the visible context patches.
The predictor is a narrow ViT that takes the output of the context encoder and predicts the representation of the target block based on the target’s position token.
The target representation corresponds to the output of the target encoder, whose weights are updated at each iteration by an exponential moving average of the context encoder weights.
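A minimal sketch of that exponential-moving-average update (the momentum value and the toy stand-in modules are illustrative assumptions, not values from the paper):

```python
import torch
import torch.nn as nn

@torch.no_grad()
def ema_update(target_encoder, context_encoder, momentum=0.996):
    """Nudge the target encoder toward the context encoder after each step.

    The target encoder is never trained by gradient descent; its weights are
    an exponential moving average of the context encoder's weights.
    """
    for p_t, p_c in zip(target_encoder.parameters(),
                        context_encoder.parameters()):
        p_t.mul_(momentum).add_(p_c.detach(), alpha=1.0 - momentum)

# Toy modules standing in for the two ViT encoders.
context_enc, target_enc = nn.Linear(8, 8), nn.Linear(8, 8)
ema_update(target_enc, context_enc)
```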
In I-JEPA, the predictor can be viewed as a primitive (and constrained) world model capable of exploiting known context information to infer the content of unknown regions.
This capability enables the model to reason about static images, building an understanding of spatial uncertainty in images.
Different from methods that only focus on pixel-level details, I-JEPA is able to predict high-level semantic information of unseen regions, so as to better capture the semantic content of images.
The process by which a predictor learns to model the semantics of the world
For each image, the parts outside the blue box are encoded and provided to the predictor as context. The predictor then outputs a representation of what it expects to appear inside the blue box.
To understand what the model captures, the team trained a stochastic decoder to map the I-JEPA predicted representations back to pixel space, showing the model's output when making predictions within the blue box.
Clearly, the predictor is able to identify the semantic information that should be filled in (top of a dog's head, bird's leg, wolf's leg, the other side of a building).
Given an image, randomly sample four target blocks, then randomly sample a single context block of variable scale and remove any regions that overlap the target blocks. Under this strategy, the target blocks are relatively semantic, while the context block is informative yet sparse, which makes processing efficient
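The sketch below illustrates this kind of multi-block sampling on a grid of ViT patches: four target blocks are drawn, then one larger context block, and any patches that overlap a target are removed from the context. The grid size, block scales, and aspect ratios are illustrative assumptions, not the paper's exact hyperparameters.

```python
import random

def sample_block(grid=14, scale=(0.15, 0.2), aspect=(0.75, 1.5)):
    """Sample one rectangular block of patch indices on a grid x grid image."""
    area = random.uniform(*scale) * grid * grid
    ratio = random.uniform(*aspect)
    h = max(1, min(grid, round((area * ratio) ** 0.5)))
    w = max(1, min(grid, round((area / ratio) ** 0.5)))
    top, left = random.randint(0, grid - h), random.randint(0, grid - w)
    return {(r, c) for r in range(top, top + h) for c in range(left, left + w)}

# Four target blocks, then one large context block with target patches removed.
targets = [sample_block() for _ in range(4)]
context = sample_block(scale=(0.85, 1.0), aspect=(1.0, 1.0))
for t in targets:
    context -= t  # the context never overlaps the targets

print(f"{len(context)} context patches, "
      f"{sum(len(t) for t in targets)} target patches (before overlap removal)")
```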
In short, I-JEPA is able to learn high-level representations of object parts without discarding their local location information in the image.
Higher efficiency, stronger performance
I-JEPA is also more computationally efficient during pre-training.
First, it does not need to apply computationally intensive data augmentation to generate multiple views, so it incurs no overhead there.
Second, the target encoder only needs to process one view of the image, and the context encoder only needs to process the context block.
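As a rough sketch of why this matters computationally (module sizes and the number of visible patches below are assumptions, not the paper's settings): the context encoder runs self-attention only over the visible context tokens, so the sequence length, and with it the quadratic attention cost, shrinks with the mask.

```python
import torch
import torch.nn as nn

embed_dim, n_patches = 768, 196                  # e.g. a 14x14 patch grid
layer = nn.TransformerEncoderLayer(embed_dim, nhead=12, batch_first=True)
context_encoder = nn.TransformerEncoder(layer, num_layers=2)  # toy ViT stand-in

tokens = torch.randn(1, n_patches, embed_dim)    # all patch tokens of one image
context_idx = torch.arange(0, 120)               # indices of visible context patches

# Only the visible context tokens enter the encoder: sequence length 120 instead
# of 196, so the self-attention cost drops roughly by a factor of (120/196)^2.
context_repr = context_encoder(tokens[:, context_idx, :])
print(context_repr.shape)                        # torch.Size([1, 120, 768])
```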
Experiments demonstrate that I-JEPA is able to learn powerful off-the-shelf semantic representations without artificial view augmentation.
In addition, I-JEPA outperforms pixel-reconstruction and token-reconstruction methods on ImageNet-1K linear probing and semi-supervised evaluation.
Benchmark Linear Evaluation Performance on ImageNet-1k as a Function of GPU Hours During Pretraining
On semantic tasks, I-JEPA is also competitive with previous pre-training methods that rely on hand-crafted data augmentation.
Compared with these methods, I-JEPA achieves better performance on low-level vision tasks such as object counting and depth prediction.
By using a simpler model with a less rigid inductive bias, I-JEPA can be applied to a wider range of tasks.
Low-shot classification accuracy: semi-supervised evaluation on ImageNet-1k with 1% labels (about 12 labeled images per class)
AI takes a step closer to human intelligence
I-JEPA demonstrates the potential of the architecture to learn off-the-shelf image representations without additional assistance from hand-crafted knowledge.
Advancing JEPA to learn more general world models from richer modalities would be particularly rewarding work.
For example, from a short context, make long-range spatial and temporal predictions on videos and condition these predictions based on audio or text cues.
Visualization of the I-JEPA predictor's representations: the first column contains the original images, the second column contains the context images, and the green bounding boxes contain samples from a generative model decoded from the predictor's outputs. The predictor correctly captures positional uncertainty and produces high-level object parts with the correct pose, while discarding precise low-level details and background information
The team says it looks forward to extending the JEPA approach to other domains, such as image-text paired data and video data.
In the future, JEPA models may have exciting applications in tasks such as video understanding. And it will be an important step towards applying and extending self-supervised methods to learn world models.
Pre-trained model
### Single GPU Training
In a single GPU setup, the implementation starts in main.py.
For example, to run I-JEPA pretraining on GPUs 0, 1, and 2 on your local machine using the configuration configs/in1k_vith14_ep300.yaml, enter the following command:
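```bash
python main.py \
  --fname configs/in1k_vith14_ep300.yaml \
  --devices cuda:0 cuda:1 cuda:2
```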
NOTE: To reproduce the results, the ViT-H/14 configuration should be run on 16 A100 80GB GPUs with an effective batch size of 2048.
### Multi-GPU Training
In a multi-GPU setup, the implementation starts in main_distributed.py, which allows specifying details about distributed training in addition to parsing configuration files.
Distributed training requires the popular open-source submitit tool; the example below assumes a SLURM cluster.
For example, to pre-train on 16 A100 80GB GPUs using the pre-training configuration specified in configs/in1k_vith14_ep300.yaml, enter the following command:
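```bash
python main_distributed.py \
  --fname configs/in1k_vith14_ep300.yaml \
  --folder $path_to_save_submitit_logs \
  --partition $slurm_partition \
  --nodes 2 --tasks-per-node 8 \
  --time 1000
```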
Reviews
Netizens expressed their appreciation for this new work led by LeCun.
Really groundbreaking work, blown away. The successor to the autoregressive model is here!
I believe that joint embedding architectures are the future of AI, not generative ones. But I'm just curious: why don't we go further into multimodality (like ImageBind, not just text-image pairs) and replace the ViT encoders with Perceiver-like encoders?
Very neat work. As I understand it, it is similar to a masked autoencoder, but with the loss defined in latent space rather than input/pixel space. To understand it in detail, though, I still need more specifics.
My brain can only process 10% of the paper, but if I-JEPA can really create the target images in Figure 3, it will be amazing, and most importantly: it's relevant to AI-generated MMORPGs!
This project is about to be open-sourced, and netizens also expressed appreciation for Meta's contribution to the open-source community.
References: