(See this X (Twitter) thread for a much shorter version of this post.)
Introduction
With the advent of vision abilities in LLMs, a virtually limitless vista has opened in the realm of their evaluations: they can now be tested on any task that requires vision and has a ~finite input/action space. This includes tasks as practically useful as acting as a web agent, but also, notably, almost all video games.
To our knowledge, little of this testing has been done so far. The lack of work in this domain is not hard to understand: images are token-rich, and inference with current premier multimodal LLMs is relatively expensive per token. Aiming to get these kinds of evaluations off the...