OpenAI's Sora is an agent

[-]tailcalled2y2914

Agent-complete would surely have to mean that it can do just about any task that requires agency, rather than that it can just-barely be applied to do the very easiest tasks that require agency. I strongly doubt that Sora is agent-complete in the strong sense.

[-]Caleb Biddulph2y4-2

That sounds more like "AGI-complete" to me. By "agent-complete" I meant that Sora can probably act as an intelligent agent in many non-trivial settings, which is pretty surprising for a video generator!

[-]tailcalled2y95

many

If you don't handle all of some domain but instead just handle "many" settings within the domain, you're not complete with respect to the domain.

"Complete" implies "general".

[-]Caleb Biddulph2y92

Reading the Wikipedia article for "Complete (complexity)," I might have misinterpreted what "complete" technically means.

What I was trying to say is "given Sora, you can 'easily' turn it into an agent" in the same way that "given a SAT solver, you can 'easily' turn it into a solver for another NP-complete problem."

I changed the title from "OpenAI's Sora is agent-complete" to "OpenAI's Sora is an agent," which I think is less misleading. The most technically-correct title might be "OpenAI's Sora can be transformed into an agent without additional training."

[-]gwern2y*234

Here's a simpler way to turn a generative model into a policy which doesn't rely on actions being encoded into the state (which won't be true in most settings and can't be true in some - there are no 'actions' for a human moving around) or reversing the image generator to the prompt etc: assume your agent harness at least has a list of actions A. (In the case of Minecraft, I guess it'd be the full keyboard + mouse?) Treat Sora as a Decision Transformer, and prompt it with the goal like "A very skilled player creating a diamond, in the grass biome.", initialized at the current actual agent state. Sample the next displayed state. Now, loop over each action A and add it to the prompt: "The player moves A" and sample the next displayed state. Take whichever action A yielded a sample closest to the original action-free prompt (closest embedding, pixel distance, similar likelihood etc). This figures out what action is being taken by the internal imitated agent by blackbox generation. If unclear (eg. due to perceptual aliasing so the right action & wrong action both lead to immediately the same displayed state), sample deeper and unroll until the consequences do become clear.

This is not an efficient approach at all, but it is a minimal proof of concept about how to extract the implicit agency it has learned from the imitation-learning modeling of humans & other agents. (I say 'other agents' to be clear that agency can be learned from anywhere; like, it seems obvious that they are using game engines, and if you are using a game engine, you will probably want it populated by AI agents inside the game for scalability compared to using only human players.)
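
A minimal sketch of the loop described above, under loud assumptions: nobody outside OpenAI can call Sora, so `sample_next` and `distance` are hypothetical callables standing in for "sample the next displayed state given a prompt" and "closest embedding / pixel distance". The toy grid world, `MOVES`, `DIAMOND`, and `toy_sample_next` are invented purely so the example runs; the "sample deeper and unroll" step for ambiguous cases is left out.

```python
from typing import Callable, Sequence

State = tuple[int, int]  # toy stand-in for a displayed frame

def infer_action(
    sample_next: Callable[[State, str], State],  # hypothetical generative model
    distance: Callable[[State, State], float],   # hypothetical similarity measure
    state: State,
    goal_prompt: str,
    actions: Sequence[str],
) -> str:
    """Return the action whose action-conditioned sample is closest
    to the sample drawn from the action-free goal prompt."""
    target = sample_next(state, goal_prompt)

    def gap(action: str) -> float:
        prompt = f"{goal_prompt} The player moves {action}."
        return distance(sample_next(state, prompt), target)

    return min(actions, key=gap)

# --- Toy stand-ins (assumptions for illustration, not Sora) ---
MOVES = {"left": (-1, 0), "right": (1, 0), "up": (0, 1), "down": (0, -1)}
DIAMOND = (3, 0)

def toy_sample_next(state: State, prompt: str) -> State:
    """If the prompt names a move, apply it; otherwise imitate a
    'skilled player' who steps toward the diamond."""
    x, y = state
    for name, (dx, dy) in MOVES.items():
        if f"The player moves {name}." in prompt:
            return (x + dx, y + dy)
    if x != DIAMOND[0]:
        return (x + (1 if DIAMOND[0] > x else -1), y)
    if y != DIAMOND[1]:
        return (x, y + (1 if DIAMOND[1] > y else -1))
    return state

def manhattan(a: State, b: State) -> float:
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

GOAL = "A very skilled player creating a diamond, in the grass biome."
```

With these stand-ins, `infer_action(toy_sample_next, manhattan, (0, 0), GOAL, list(MOVES))` recovers `"right"`, the move the imitated player was implicitly taking. A real model would be stochastic, so each comparison would likely need several samples rather than one.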

[-]Martín Soto2y227

Guy who reinvents predictive processing through Minecraft

[-]nielsrolf2y115

I think calling Sora a simulator is the right frame - the model itself simulates, and since agents can be part of a simulation, it is possible to elicit agentic behavior via prompting and parsing.

[-]Chris_Leong2y72

It seems like there should be a connection here with Karl Friston's active inference. After all, both your proposal and his theory involve taking a predictive engine and using it to produce actions.

[-]Hailey Collet2y6-2

The thing about the simulation capability that worries me most isn't plugging it in as-is, but probing the model, finding where the simulator pieces are and extracting them. This is obviously complicated, but for example something as simple as a linear probe identifying which entire layers are most involved and initializing a new model for training with those layers integrated, a model which doesn't have to output video (obviously your data/task/loss metric would have to ensure it gets used/updated/not overwritten, but choosing things where it would be useful should be enough)... I'm neither qualified to elaborate further nor inclined to do so, but the broad concern here is more efficient application of the simulation capability / integrating it into more diverse models.

[-]N1X2y10

Why "selection" could be a capacity which would generalize: albeit to a (highly-lossy) first approximation, most of the most successful models have been based on increasingly-general types of gamification of tasks. The more general models have more general tasks. Video can capture sufficient information to describe almost any action which humans do or would wish to take along with numerous phenomena which are impossible to directly experience in low-dimensional physical space, so if you can simulate a video, you can operate or orchestrate reality.
Why selection couldn't generalize: I can watch someone skiing but that doesn't mean that I can ski. I can watch a speedrun of a video game and, even though the key presses are clearly visible, fail to replicate it. I could also hack together a fake speedrun. I suspect that Sora will be more useful for more-convincingly-faking speedrun content than for actually beating human players or becoming the TAS tool to end all TAS tools (aside from novel glitch discovery). This is primarily because there's not a strong reason to believe that the model can be trained to achieve extremely high-fidelity or high-precision tasks.

[-]Alex Mallen2y61

I think the relevant notion of "being an agent" is whether we have reason to believe it generalizes like a consequentialist (e.g. its internal cognition considers possible actions and picks among them based on expected consequences and relies minimally on the imitative prior). This is upstream of the most important failure modes as described by Roger Grosse here.

I think Sora is still in the bottom left like LLMs, as it has only been trained to predict. Without further argument or evidence I would expect that it probably for the most part hasn't learned to simulate consequentialist cognition, similar to how LLMs haven't demonstrated this ability yet (e.g. fail to win a chess game in an easy but OOD situation).

[-]Chris_Leong2y61

It's worth noting that there are media reports that OpenAI is developing agents that will use your phone or computer. I suppose it's not surprising that this would be their next step given how far a video generation model takes you towards this, although I do wonder how they expect these agents to operate with any reliability given the propensity of ChatGPT to hallucinate.

[-]Tao Lin2y50

I think the "fraction of training compute" going towards agency vs. non-agency will be lower in video models than in LLMs, and LLMs will likely continue to be bigger, so video models will stay behind LLMs in overall agency.

[-]Cole Wyeth2y53

One could hook up a language model to decide what to visualize, Sora to generate visualizations, and a vision model to extract outcomes.
This seems like around 40% of what intelligence is - the only thing I don't really see is how reward should be "plugged in," but there may be naive ways to set goals.
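
A rough sketch of that three-part loop, with every component a hypothetical callable: `propose` plays the language model, `visualize` plays Sora, and `extract` plays the vision model. The `score` argument is one naive guess at where a goal could be plugged in, not something the comment specifies.

```python
from typing import Callable, Iterable, Optional, TypeVar

Plan = TypeVar("Plan")
Video = TypeVar("Video")
Outcome = TypeVar("Outcome")
Situation = TypeVar("Situation")

def choose_plan(
    propose: Callable[[Situation], Iterable[Plan]],   # language model stand-in
    visualize: Callable[[Situation, Plan], Video],    # video generator stand-in
    extract: Callable[[Video], Outcome],              # vision model stand-in
    score: Callable[[Outcome], float],                # assumed goal signal
    situation: Situation,
) -> Optional[Plan]:
    """Imagine each proposed plan as video, read off the outcome,
    and keep the plan whose imagined outcome scores highest."""
    best_plan, best_score = None, float("-inf")
    for plan in propose(situation):
        outcome = extract(visualize(situation, plan))
        value = score(outcome)
        if value > best_score:
            best_plan, best_score = plan, value
    return best_plan

# Toy stand-ins: the situation is a position on a line, a plan is a step size,
# a "video" is the list of positions visited, and the goal is to reach 5.
toy_plan = choose_plan(
    propose=lambda s: [-1, 1, 2],
    visualize=lambda s, p: [s, s + p],
    extract=lambda frames: frames[-1],
    score=lambda pos: -abs(pos - 5),
    situation=3,
)
```

Here `toy_plan` comes out as `2`, since stepping from 3 to 5 gives the best imagined outcome. Whether a hand-written `score` like this counts as "plugging in" reward in any interesting sense is exactly the open question.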

[-]Nikola Jurkovic2y40

EDIT: as Ryan helpfully points out in the replies, the patent I refer to is actually about OpenAI's earlier work, and thus shouldn't be much of an update for anything.

~~Note that OpenAI has applied for a~~ ~~patent~~ ~~which, to my understanding, is about using a video generation model as a backbone for an agent that can interact with a computer. They describe their training pipeline as something roughly like:~~

Start with unlabeled video data ("receiving labeled digital video data;")
Train an ML model to label the video data ("training a first machine learning model including an inverse dynamics model (IDM) using the labeled digital video data")
Then, train a new model to generate video ("further training the first machine learning model or a second machine learning model using the pseudo-labeled digital video data to generate at least one additional pseudo-label for the unlabeled digital video.")
Then, train the video generation model to predict actions (keyboard/mouse clicks) a user is taking from video of a PC ("2. The method of claim 1, wherein the IDM or machine learning model is trained to generate one or more predicted actions to be performed via a user interface without human intervention. [...] 4. The method of claim 2, wherein the one or more predicted actions generated include at least one of a key press, a button press, a touchscreen input, a joystick movement, a mouse click, a scroll wheel movement, or a mouse movement.")

~~Now you have a model which can predict what actions to take given a recording of a computer monitor!~~

~~They even specifically mention the keyboard overlay setup you describe:~~

~~11. The method of claim 1, wherein the labeled digital video data comprises timestep data paired with user interface action data.~~

~~If you haven't seen the patent (to my knowledge, basically no-one on LessWrong has?) then you get lots of Bayes points!~~

I might be reading too much into the patent, but it seems to me that Sora is exactly the first half of the training setup described in that patent. So I would assume they'll soon start working on the second half, which is the actual agent (if they haven't already).

I think Sora is probably (the precursor of) a foundation model for an agent with a world model. I actually noticed this patent a few hours before Sora was announced, and I had the rough thought of "Oh wow, if OpenAI releases a video model, I'd probably think that agents were coming soon". And a few hours later Sora comes out.

Interestingly, the patent contains information about hardware for running agents. I'm not sure how patents work and how much this actually implies OpenAI wants to build hardware, but sure is interesting that this is in there:

~~13. A system comprising:~~
~~at least one memory storing instructions;~~
~~at least one processor configured to execute the instructions to perform operations for training a machine learning model to perform automated actions,~~

[-]ryan_greenblatt2y73

AFAICT, this is very similar to the exact process used for OpenAI's earlier Minecraft video pretraining work.

Edit: yep, this patent is about this video pretraining work.

[-]Nikola Jurkovic2y10

Thanks a lot for the correction! Edited my comment.

[-]ryan_greenblatt2y50

Interestingly, the patent contains information about hardware for running agents. I'm not sure how patents work and how much this actually implies OpenAI wants to build hardware, but sure is interesting that this is in there:

I think the hardware description in the patent is just bullshit patent-ese. Like, the patent people maybe want to see things that look like other patents, and patents don't really understand or handle software well, I think. I think the hardware description is just a totally normal description of a setup for running a DNN.

[-]ryan_greenblatt2y51

video generation model

I've read the patent a bit and I don't think it's about video generation, just about adding additional labels to unlabeled video.

Then, train a new model to generate video ("further training the first machine learning model or a second machine learning model using the pseudo-labeled digital video data to generate at least one additional pseudo-label for the unlabeled digital video.")

This is just generating pseudo-labels for existing unlabeled video data. See the video pretraining work that this patent references.
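
To make the distinction concrete, here is a toy sketch of pseudo-labeling with an inverse dynamics model (IDM), the term used in the quoted claims. Everything in it is an illustrative assumption: frames are single numbers, the "IDM" is just the mean frame-to-frame change per action, and the names `train_idm` and `pseudo_label` are made up. The point is only that the output is labels for existing video, not new video.

```python
from collections import Counter, defaultdict

def train_idm(labeled: list[tuple[float, float, str]]) -> dict[str, float]:
    """Fit a toy IDM from (frame, next_frame, action) triples:
    store the average frame delta observed for each action."""
    sums: dict[str, float] = defaultdict(float)
    counts: Counter = Counter()
    for frame, next_frame, action in labeled:
        sums[action] += next_frame - frame
        counts[action] += 1
    return {action: sums[action] / counts[action] for action in counts}

def pseudo_label(idm: dict[str, float], video: list[float]) -> list[tuple[float, str]]:
    """Assign each consecutive frame pair of an unlabeled video the action
    whose learned delta is closest to the observed delta."""
    labels = []
    for frame, next_frame in zip(video, video[1:]):
        delta = next_frame - frame
        action = min(idm, key=lambda a: abs(idm[a] - delta))
        labels.append((frame, action))
    return labels

# Small labeled set, then a larger body of unlabeled "video".
idm = train_idm([(0, 1, "right"), (5, 6, "right"), (3, 2, "left"), (4, 4, "wait")])
pseudo = pseudo_label(idm, [0, 1, 2, 2, 1])
```

The resulting `(frame, action)` pairs in `pseudo` are what a downstream policy could then be trained on by ordinary behavior cloning.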

[-]Ericf2y40

If you have to give it a task, is it really an agent? Is there some other word for "system that comes up with its own tasks to do"?

[-]gilch2y32

Did you come up with your hunger drive on your own? Sex drive? Pain aversion? Humans count as agents, and we have these built in. Isn't it enough that the agent can come up with subgoals to accomplish the given task?

[-]Ericf2y20

The described "next image" bot doesn't have goals like that, though. Can you take the pre-trained bot and give it a drive to "make houses" and have it do that? When all the local wood is used up, will it know to move elsewhere, or plant trees?

[-]gilch2y31

Yes, maybe? That kind of thing is presumably in the training data and the generator is designed to have longer term coherence. Maybe it's not long enough for plans that take too long to execute, so I'm not sure if Sora per se can do this without trying it (and we don't have access), but it seems like the kind of thing a system like this might be able to do.

[-]MiguelDev2y30

repurposed for many other visual tasks:

MKBHD's video commenting that the stock video industry will be hurt by Sora is pretty compelling, especially as he was even pointing out that he could have been fooled by some of them if he was just randomly browsing social media.