The ability to complete sequences is equivalent to prediction. GPT-3 completes a sequence by predicting what the next token will be, outputting that prediction, and repeating. You can use the same kind of model on images.
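To make the "completion is just repeated prediction" point concrete, here is a minimal sketch. It assumes a toy bigram counter as a stand-in for a real language model; GPT-3 itself is a large neural network, and the names `train_bigram`, `predict_next`, and `complete` are made up for this illustration.

```python
from collections import Counter, defaultdict


def train_bigram(tokens):
    """Count, for each token, which tokens followed it in the training data."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def predict_next(counts, prev):
    """Return the most frequent successor of `prev`, or None if `prev` is unseen."""
    followers = counts.get(prev)
    if not followers:
        return None
    return followers.most_common(1)[0][0]


def complete(counts, prompt, n_tokens):
    """Complete a non-empty prompt by repeatedly predicting and appending a token."""
    out = list(prompt)
    for _ in range(n_tokens):
        nxt = predict_next(counts, out[-1])
        if nxt is None:  # the toy model has no prediction for this context
            break
        out.append(nxt)
    return out


corpus = "a b c a b c a b c".split()
model = train_bigram(corpus)
print(complete(model, ["a"], 4))  # ['a', 'b', 'c', 'a', 'b']
```

The only thing that separates this from a serious model is how good `predict_next` is; the completion loop around it stays the same.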
In general, the agent tries to generate future data based on all of its input data up to some point. If it can predict its own input data reliably, that means it has a model of the world that is similar to reality. This is similar to Solomonoff induction.
Once you have a good approximation of Solomonoff induction (which is uncomputable), you combine the approximation (somehow) with reinforcement learning and expected utility maximization and get an approximation of AIXI.
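One way to picture the "combine with expected utility maximization" step is a brute-force expectimax search over a world model. The sketch below is only an illustration under strong assumptions: the world model is a hypothetical function `model(state, action)` returning `(probability, next_state, reward)` triples, the state is Markov, and the horizon is finite. AIXI proper conditions on the whole interaction history and uses the Solomonoff mixture as its model, which this toy does not attempt.

```python
def action_value(model, state, action, actions, horizon):
    """Expected total reward of taking `action` now, then acting optimally."""
    total = 0.0
    for prob, next_state, reward in model(state, action):
        future = expected_utility(model, next_state, actions, horizon - 1)
        total += prob * (reward + future)
    return total


def expected_utility(model, state, actions, horizon):
    """Best achievable expected total reward from `state` within `horizon` steps."""
    if horizon == 0:
        return 0.0
    return max(action_value(model, state, a, actions, horizon) for a in actions)


def best_action(model, state, actions, horizon):
    """Pick the action with the highest expected total reward."""
    return max(actions, key=lambda a: action_value(model, state, a, actions, horizon))


def toy_model(state, action):
    """Hypothetical one-state world: 'safe' always pays 1, 'risky' pays 3 half the time."""
    if action == "safe":
        return [(1.0, state, 1.0)]
    return [(0.5, state, 3.0), (0.5, state, 0.0)]


ACTIONS = ["safe", "risky"]
print(best_action(toy_model, 0, ACTIONS, horizon=2))       # risky
print(expected_utility(toy_model, 0, ACTIONS, horizon=2))  # 3.0
```

Given the model, the planning code is short, but this exhaustive search makes on the order of (actions × outcomes)^horizon model calls, so practical systems would have to approximate the search as well as the model.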
Since I'm not an expert in reinforcement learning, I'm not sure which part is harder, but my intuition is that the hard part of all this would be approximating Solomonoff induction. Once you have a good world-model, it seems to me that maximizing utility is relatively straightforward. I hope I'm wrong. (If you think I am, please explain why.)