Abstract
How could machines learn as efficiently as humans and animals? How could machines learn to reason and plan? How could machines learn representations of percepts and action plans at multiple levels of abstraction, enabling them to reason, predict, and plan at multiple time horizons? This position paper proposes an architecture and training paradigms with which to construct autonomous intelligent agents. It combines concepts such as configurable predictive world model, behavior driven through intrinsic motivation, and hierarchical joint embedding architectures trained with self-supervised learning.
Meta's Chief AI Scientist Yann LeCun lays out his vision for what an architecture for generally intelligent agents might look like.
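For concreteness, here is a minimal sketch of the "joint embedding" idea named in the abstract. This is my own toy illustration, not code from the paper: the module names, dimensions, simple MLP encoders, and MSE objective are all assumptions, and the paper's actual proposal adds latent variables, collapse-preventing regularizers (e.g. VICReg-style), and hierarchy. The point is only that prediction happens in representation space rather than pixel space.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    """Maps an input (e.g. a flattened frame) to an abstract representation."""
    def __init__(self, in_dim, emb_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, emb_dim))

    def forward(self, x):
        return self.net(x)

class JEPA(nn.Module):
    """Simplified, non-hierarchical joint embedding predictive architecture.

    s_x = enc_x(x), s_y = enc_y(y); a predictor tries to match s_y from s_x,
    and the loss is measured between embeddings, not raw observations.
    """
    def __init__(self, in_dim=784, emb_dim=64):
        super().__init__()
        self.enc_x = Encoder(in_dim, emb_dim)
        self.enc_y = Encoder(in_dim, emb_dim)
        self.predictor = nn.Sequential(nn.Linear(emb_dim, 128), nn.ReLU(), nn.Linear(128, emb_dim))

    def forward(self, x, y):
        s_x = self.enc_x(x)
        # Stop-gradient on the target branch is one common anti-collapse trick;
        # the paper discusses regularized alternatives instead.
        s_y = self.enc_y(y).detach()
        return F.mse_loss(self.predictor(s_x), s_y)

# Toy usage: x is a context frame, y a "future" frame of the same clip.
model = JEPA()
x, y = torch.randn(8, 784), torch.randn(8, 784)
loss = model(x, y)
loss.backward()
```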
Currently, there isn't much video modeling at anything like ImageNet scale. Static image models are far more advanced (after years of supervised learning on labeled data, e.g., from CAPTCHAs) than what we currently have for video classification.
I imagine a large part of the world model described in the paper will come from video classification.
If you look at some examples of video classification problems, they label an entity along with an action, where the action is tied to the additional temporal information that videos carry. I'm not sure whether this temporal information will feed directly into the paper's world model, or whether the world model will mostly rely on classification models trained on static images and store the action information separately from the entities classified in the videos.
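To make that second, decoupled option concrete, here is a rough sketch. Every class, shape, and label count here is hypothetical: a static-image model classifies entities per frame, while a separate temporal head consumes the sequence of frame embeddings and labels the action.

```python
import torch
import torch.nn as nn

class FrameEntityClassifier(nn.Module):
    """Static-image model applied independently to each frame."""
    def __init__(self, emb_dim=128, num_entities=100):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, emb_dim),
        )
        self.entity_head = nn.Linear(emb_dim, num_entities)

    def forward(self, frame):
        emb = self.backbone(frame)
        return emb, self.entity_head(emb)

class TemporalActionHead(nn.Module):
    """Separate module that pools frame embeddings over time to label the action."""
    def __init__(self, emb_dim=128, num_actions=20):
        super().__init__()
        self.rnn = nn.GRU(emb_dim, emb_dim, batch_first=True)
        self.action_head = nn.Linear(emb_dim, num_actions)

    def forward(self, frame_embs):  # (batch, time, emb_dim)
        _, h = self.rnn(frame_embs)
        return self.action_head(h[-1])

# Toy usage: a batch of 2 videos, 8 frames each, 32x32 RGB.
video = torch.randn(2, 8, 3, 32, 32)
frame_model, action_model = FrameEntityClassifier(), TemporalActionHead()
embs, entity_logits = frame_model(video.flatten(0, 1))   # per-frame entity labels
action_logits = action_model(embs.view(2, 8, -1))        # per-video action label
```

The alternative the comment mentions, where temporal information flows directly into the world model, would instead feed the raw frame sequence to a single spatiotemporal model rather than splitting entity and action heads this way.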
Of course, this example alone shows just how much work needs to be done to even begin the preliminary phase of developing the models described in this paper, but I think the autonomous aspects presented in the paper can be investigated independently, before video classification models mature.
The autonomous parts seem to be largely based on emulating the human brain, at least at the architectural level. The current state of the art in ML autonomy has been unsupervised learning with deep neural networks, as in DeepMind's systems. That is essentially a single model (or an ensemble) applied to one set of tasks. This paper is more like an ensemble of modules handling different types of tasks, each modeling a different part of the human brain, where their overall interaction produces the autonomous behavior and defines the context in which that autonomy operates.
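A rough sketch of how such an ensemble of modules might interact in a perceive-imagine-act loop is below. The paper does name modules along these lines (perception, world model, cost, actor), but every class, method, and the trivial planning logic here is my own stub, not the paper's API.

```python
import random

class Perception:
    """Estimates the state of the world from raw observations."""
    def encode(self, observation):
        return {"state": observation}

class WorldModel:
    """Predicts the next world state given a state and a candidate action."""
    def predict(self, state, action):
        return {"state": state["state"], "last_action": action}

class Cost:
    """Intrinsic cost: scores how (un)desirable a predicted state is."""
    def __call__(self, state):
        return random.random()  # placeholder for a learned or hard-wired cost

class Actor:
    """Proposes candidate actions and picks the one the world model deems cheapest."""
    def __init__(self, world_model, cost, candidates):
        self.world_model, self.cost, self.candidates = world_model, cost, candidates

    def plan(self, state):
        return min(self.candidates,
                   key=lambda a: self.cost(self.world_model.predict(state, a)))

# One step of the perceive -> imagine -> act loop.
perception, world_model, cost = Perception(), WorldModel(), Cost()
actor = Actor(world_model, cost, candidates=["left", "right", "forward"])
state = perception.encode(observation="camera_frame_0")
action = actor.plan(state)
print("chosen action:", action)
```

The interesting part, per the paper, is that each of these stubs would be a learned module (the world model trained with self-supervised learning, the cost encoding intrinsic motivation), and a configurator would retune them per task; none of that is captured in this toy loop.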