[Link] OpenAI: Learning to Play Minecraft with Video PreTraining (VPT)

Aryeh Englander

We trained a neural network to play Minecraft by Video PreTraining (VPT) on a massive unlabeled video dataset of human Minecraft play, while using only a small amount of labeled contractor data. With fine-tuning, our model can learn to craft diamond tools, a task that usually takes proficient humans over 20 minutes (24,000 actions). Our model uses the native human interface of keypresses and mouse movements, making it quite general, and represents a step towards general computer-using agents.

And from the Conclusion:

VPT paves the path toward allowing agents to learn to act by watching the vast numbers of videos on the internet. Compared to generative video modeling or contrastive methods that would only yield representational priors, VPT offers the exciting possibility of directly learning large scale behavioral priors in more domains than just language. While we only experiment in Minecraft, the game is very open-ended and the native human interface (mouse and keyboard) is very generic, so we believe our results bode well for other similar domains, e.g. computer usage.

I'm impressed by the number of different training regimes stacked on top of each other.

-Train a model that detects whether a Minecraft video on YouTube is free of external artifacts like face cams.

-Then feed the clean videos to a model, trained on contractor data, that guesses which keys are being pressed in each frame.

-Then use the videos and the inferred inputs to train a model that, in any game situation, produces whatever inputs it guesses a human would most likely make, in an undirected, shortsighted way.

-And then fine-tune that model on a specific subset of videos that feature the early game.

-And only then use some mostly-standard RL training to get good at some task.
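To make the stacking concrete, here is a toy sketch of the first three stages (filter, pseudo-label, imitate). It is only an illustration under heavy assumptions: frames are strings rather than pixels, the "models" are lookup tables rather than neural networks, and every name in it (`train_idm`, `pseudo_label`, `behavioral_cloning`, the `facecam` flag) is made up for this example, not taken from OpenAI's code. The two fine-tuning stages are left out.

```python
from collections import Counter, defaultdict

def train_idm(labeled):
    """Toy inverse dynamics model: (frame, next_frame) -> most common action."""
    counts = defaultdict(Counter)
    for frames, actions in labeled:
        for transition, action in zip(zip(frames, frames[1:]), actions):
            counts[transition][action] += 1
    return {t: c.most_common(1)[0][0] for t, c in counts.items()}

def pseudo_label(idm, frames):
    """Attach guessed actions to an unlabeled video; skip unseen transitions."""
    return [(a, idm[(a, b)]) for a, b in zip(frames, frames[1:]) if (a, b) in idm]

def behavioral_cloning(pairs):
    """Toy policy: for each frame, the action seen most often in the data."""
    counts = defaultdict(Counter)
    for frame, action in pairs:
        counts[frame][action] += 1
    return {f: c.most_common(1)[0][0] for f, c in counts.items()}

# Small labeled set (stand-in for the contractor data).
contractor_data = [(["tree", "log", "planks"], ["punch", "craft"])]

# Large unlabeled set (stand-in for the YouTube videos).
web_videos = [
    {"frames": ["tree", "log", "planks"], "facecam": False},
    {"frames": ["tree", "log"], "facecam": True},   # dropped by the filter
    {"frames": ["log", "planks", "tree"], "facecam": False},
]

clean = [v["frames"] for v in web_videos if not v["facecam"]]        # stage 1
idm = train_idm(contractor_data)                                      # stage 2
pairs = [p for frames in clean for p in pseudo_label(idm, frames)]    # stage 2
policy = behavioral_cloning(pairs)                                    # stage 3
```

The point of the structure is that the expensive labeled data is only needed for the small labeling model; the policy itself is trained on the much larger pseudo-labeled set.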

It's impressive. So far we have seen capabilities like this only in domains with loads of data. The models seem able to learn almost anything when scaled, but the available data dictates the domains where that is possible.

It really doesn't seem that far off until there are pre-trained foundation models for most modalities... Google's "Pathways" project is definitely working on it as we speak, IMO.

This generates a decent approximation of the distribution of human actions in an open-world setting. Is it usable for empirical quantilizer experiments?
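For anyone unfamiliar with the idea: a q-quantilizer, roughly, samples from a base distribution restricted to the top q fraction of outcomes by some utility, instead of taking the argmax. A minimal sampling-based sketch of what such an experiment might look like is below. The base sampler here is a hard-coded list standing in for a human-imitating policy, and `quantilize`, `human_actions`, and the utility values are all hypothetical names and numbers chosen for illustration.

```python
import math
import random

def quantilize(sample_base, utility, q, n=1000, rng=random):
    """Approximate q-quantilizer: draw n base samples, keep the top
    ceil(q * n) by utility, and pick uniformly among those."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    candidates = [sample_base() for _ in range(n)]
    candidates.sort(key=utility, reverse=True)
    top = candidates[:max(1, math.ceil(q * n))]
    return rng.choice(top)

rng = random.Random(0)

# Stand-in for "what a human would do": mostly walking, sometimes mining.
human_actions = ["walk", "walk", "walk", "mine", "mine", "craft"]

# Made-up utility for the illustration.
utility = {"walk": 0.0, "mine": 1.0, "craft": 2.0}.get

action = quantilize(lambda: rng.choice(human_actions), utility, q=0.5, n=200, rng=rng)
```

With q = 1 this just imitates the base distribution, and as q shrinks toward 0 it approaches plain utility maximization, so q is the knob such an experiment would sweep.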