Dwarkesh Patel recorded a Podcast with John Schulman, cofounder of OpenAI and at the time their head of current model post-training. Transcript here. John’s job at the time was to make the current AIs do what OpenAI wanted them to do. That is an important task, but one that employs techniques that their at-the-time head of alignment, Jan Leike, made clear we should not expect to work on future more capable systems. I strongly agree with Leike on that.

Then Sutskever left and Leike resigned, and John Schulman was made the new head of alignment, now charged with what superalignment efforts remain at OpenAI to give us the ability to control future AGIs and ASIs.

This gives us a golden opportunity to assess where his head is at, without him knowing he was about to step into that role.

There is no question that John Schulman is a heavyweight. He executes and ships. He knows machine learning. He knows post-training and mundane alignment.

The question is, does he think well about this new job that has been thrust upon him?

The Big Take

Overall I was pleasantly surprised and impressed.

In particular, I was impressed by John’s willingness to accept uncertainty and not knowing things.

He does not have a good plan for alignment, but he is far less confused about this fact than most others in similar positions.

He does not know how to best navigate the situation if AGI suddenly happened ahead of schedule in multiple places within a short time frame, but I have not ever heard a good plan for that scenario, and his speculations seem about as directionally correct and helpful as one could hope for there.

Are there answers that are cause for concern, and places where he needs to fix misconceptions as quickly as possible? Oh, hell yes.

His reactions to potential scenarios involved radically insufficient amounts of slowing down, halting and catching fire, freaking out and general understanding of the stakes.

Some of that I think was about John and others at OpenAI using a very weak definition of AGI (perhaps partly because of the Microsoft deal?) but also partly he does not seem to appreciate what it would mean to have an AI doing his job, which he says he expects in a median of five years.

His answer on instrumental convergence is worrisome, as others have pointed out. He dismisses concerns that an AI given a bounded task would start doing things outside the intuitive task scope, or the dangers of an AI ‘doing a bunch of wacky things’ a human would not have expected. On the plus side, it shows understanding of the key concepts on a basic (but not yet deep) level, and he readily admits it is an issue with commands that are likely to be given in practice, such as ‘make money.’

In general, he seems willing to react to advanced capabilities by essentially scaling up various messy solutions in ways that I predict would stop working at that scale or with something that outsmarts you and that has unanticipated affordances and reason to route around typical in-distribution behaviors. He does not seem to have given sufficient thought to what happens when a lot of his assumptions start breaking all at once, exactly because the AI is now capable enough to be properly dangerous.

As with the rest of OpenAI, another load-bearing assumption is presuming gradual changes throughout all this, including assuming past techniques will not break. I worry that will not hold.

He has some common confusions about regulatory options and where we have viable intervention points within competitive dynamics and game theory, but that’s understandable, and also was at the time very much not his department.

As with many others, there seems to be a disconnect. A lot of the thinking here seems like excellent practical thinking about mundane AI in pre-transformative-AI worlds, whether or not you choose to call that thing ‘AGI.’ Indeed, much of it seems built (despite John explicitly not expecting this) upon the idea of a form of capabilities plateau, where further progress is things like modalities and making the AI more helpful via post-training and helping it maintain longer chains of actions without the AI being that much smarter.

Then he clearly says we won’t spend much time in such worlds. He expects transformative improvements, such as a median of five years before AI does his job.

Most of all, I came away with the impression that this was a person thinking and trying to figure things out and solve problems. He is making many mistakes a person in his new position cannot afford to make for long, but this was a ‘day minus one’ interview, and I presume he will be able to talk to Jan Leike and others who can help him get up to speed.

I did not think the approach of Leike and Sutskever would work either, I was hoping they would figure this out and then pivot (or, perhaps, prove me wrong, kids.) Sutskever in particular seemed to have some ideas that felt pretty off-base, but with a fierce reputation for correcting course as needed. Fresh eyes are not the worst thing.

Are there things in this interview that should freak you out, aside from where I think John is making conceptual mistakes as noted above and later in detail?

That depends on what you already knew. If you did not know the general timelines and expectations of those at OpenAI? If you did not know that their safety work is not remotely ready for AGI or on track to get there and they likely are not on track to even be ready for GPT-5, as Jan Leike warned us? If you did not know that coordination is hard and game theory and competitive dynamics are hard to overcome? Then yeah, you are going to get rather a bit blackpilled. But that was all known beforehand.

Whereas, did you expect someone at OpenAI, who was previously willing to work on their capabilities teams given everything we now know, having a much better understanding of and perspective on AI safety than the one expressed here? To be a much better thinker than this? That does not seem plausible.

Given everything that we now know has happened at OpenAI, John Schulman seems like the best case scenario to step into this role. His thinking on alignment is not where it needs to be, but it is at a place he can move down the path, and he appears to be a serious thinker. He is a co-founder and knows his stuff, and has created tons of value for OpenAI, so hopefully he can be taken seriously and fight for resources and procedures, and to if necessary raise alarm bells about models, or other kinds of alarm bells to the public or the board. Internally, he is in every sense highly credible.

Like most others, I am to put it mildly not currently optimistic about OpenAI from a safety or an ethical perspective. The superalignment team, before its top members were largely purged and its remaining members dispersed, was denied the resources they were very publicly promised, with Jan Leike raising alarm bells on the way out. The recent revelations with deceptive and coercive practices around NDAs and non-disparagement agreements are not things that arise at companies I would want handling such grave matters, and they shine new light on everything else we know. The lying and other choices around GPT-4o’s Sky voice only reinforce this pattern.

So to John Schulman, who is now stepping into one of the most important and hardest jobs under exceedingly difficult conditions, I want to say, sincerely: Good luck. We wish you all the best. If you ever want to talk, I’m here.

The Podcast

This follows my usual podcast analysis format. I’ll offer comments with timestamps.

To make things clearer, things said in the main notes are what Dwarkesh and John are saying, and things in secondary notes are my thoughts.

  1. (2:40) What do we anticipate by the end of the year? The next five years? The models will get better but in what ways? In 1-2 years they will do more involved tasks like carrying out an entire coding project based on high level instructions.
  2. (4:00) This comes from training models to do harder tasks and multi-step tasks via RL. There’s lots of low-hanging fruit. Also they will get better error recovery and ability to deal with edge cases, and more sample efficient. They will generalize better, including generalizing from examples of ‘getting back on track’ in the training data, which they will use to learn to get back on track.
    1. The interesting thing he did not say yet is ‘the models will be smarter.’
    2. Instead he says ‘stronger model’ but this vision is more that a stronger model is more robust and learns from less data. Those are different things.
  3. (6:50) What will it take for how much robustness? Now he mentions the need for more ‘model intelligence.’ He expects clean scaling laws, with potential de facto phase transitions. John notes we plan on different timescales and complexity levels using the same mental functions and expects that to apply to AI also.
  4. (9:20) Would greater coherence mean human-level intelligence? John gives a wise ‘I don’t know’ and expects various other deficits and issues, but thinks this going quite far is plausible.
  5. (10:50) What other bottlenecks might remain? He speculates perhaps something like taste or ability to handle ambiguity, or other mundane barriers, which he expects not to last.
    1. This seems like a focus on the micro at the expense of the bigger picture? It seems to reinforce an underlying implicit theory that the underlying ‘raw G’ is not going to much improve, and your wins come from better utilization. It is not obvious how far John thinks you can take that.
  6. (12:00) What will the multimodal AI UI look like? AIs should be able to use human websites via vision. Some could benefit from redesigns to make AI interactions easier via text representations, but mostly the AIs will be the ones that adapt.
    1. That seems bizarre to me, at least for websites that have very large user bases. Wouldn’t you want to build a parallel system for AIs even if they could handle the original one? It seems highly efficient and you should capture some gains.
  7. (13:40) Any surprising generalizations? Some in post-training, such as English fine-tuning working in other languages. He also mentions a tiny amount of data (only ~30 examples) doing the trick of universally teaching the model it couldn’t do things like order an Uber or send an Email.
  8. (16:15) Human models next year? Will these new abilities do that, if not why not? John points out coherence is far from the only issue with today’s models.
    1. This whole frame of ‘improved coherence with the same underlying capabilities otherwise’ is so weird a hypothetical to dive into this deeply, unless you have reason to expect it. Spider senses are tingling. And yet…
  9. (17:15) Dwarkesh asks if we should expect AGI soon. John says that would be reasonable (and will later give a 5 year timeline to replace his own job.) So Dwarkesh asks: What’s the plan? John says: “Well, if it came sooner than expected, we would want to be careful. We might want to slow down a little bit on training and deployment until we’re pretty sure we can deal with it safely. We would have a good handle on what it’s going to do and what it can do. We would have to be very careful if it happened way sooner than expected. Because our understanding is still rudimentary in a lot of ways.”
    1. You keep using that word? What were we even talking about before? Slow down a little bit? Pretty sure? I am going to give the benefit of the doubt, and say that this does not sound like much of an AGI.
    2. This seems like the right answer directionally, but with insufficient caution and freaking out, even if this is a relatively weak AGI? If this happens as a surprise, I would quite deliberately freak out.
  10. (18:05) Dwarkesh follows up. What would ‘being careful’ mean? Presumably you’re already careful, right? John says, maybe it means not training the even smarter version or being really careful when you do train it that it’s properly sandboxed ‘and everything,’ not deploying it at scale.
    1. Again, that seems directionally right, but magnitude poor and that’s assuming the AGI definition is relative weaksauce. The main adjustment for ‘we made AGI when we didn’t expect it’ is to move somewhat slower on the next model?
    2. I mean it seems like ‘what to do with the AGI we have’ here is more or less ‘deploy it to all our users and see what happens’? I mean, man, I dunno.
  11. Let’s say AGI turns out to be easier than we expect and happens next year, and you’re deploying in a ‘measured way,’ but you wait and then other companies catch up. Now what does everyone do? John notes the obvious game theory issues, says we need some coordination so people can agree on some limits to deployment to avoid race dynamics and compromises on safety.
    1. This emphasizes that we urgently need an explicit antitrust exemption for exactly this scenario. At a bare minimum, I would hope we could all agree that AI labs need to able to coordinate and agree to delay development or deployment of future frontier models to allow time for safety work. The least the government can do, in that situation, is avoid making the problem worse.
    2. Norvid Studies: The Dwarkesh Schulman conversation is one of the crazier interviews I’ve ever heard. The combination of “AGI-for-real may fall out automatically from locked-in training in 1 to 3 years” and “when it happens I guess we’ll uh, maybe labs will coordinate, we’ll try to figure that out.”
    3. I read John here as saying he does not expect this to happen, that it would be a surprise and within a year would be a very large surprise (which seems to imply not GPT-5?) but yes that it is possible. John does not pretend that this coordination would then happen, or that he’s given it a ton of thought (nor was it his job), instead correctly noting that it is what would be necessary.
    4. His failure to pretend here is virtuous. He is alerting us to the real situation of what would happen if AGI did arrive soon in many places. Which is quite bad. I would prefer a different answer but only if it was true.
    5. Justin Halford: Schulman’s body language during the portion on game theory/coordination was clear – universal coordination is not going to happen. Firms and nation states will forge the path at a blistering pace. There is not a clear incentive to do anything but compete.
    6. I saw talk about how calm he was here. To my eyes, he was nervous but indeed insufficiently freaked out as I noted above. But also he’s had a while to let such things sink in, he shouldn’t be having the kind of emotional reaction you get when you first realize this scenario might happen.
  12. (20:15) Pause what then? Deployment, training, some types of training, set up some reasonable rules for what everyone should do.
    1. I’m fine with the vagueness here. You were surprised by the capabilities in question, you should update on that and respond accordingly. I would still prefer the baseline be ‘do not train anything past this point and keep the AGI very carefully sandboxed at minimum until safety is robustly established.’
    2. That is true even in the absence of any of the weirder scenarios. True AGI is a big freaking deal. Know what you are doing before deployment.
  13. (21:00) OK, suppose a pause. What’s the plan? John doesn’t have a good answer, but if everyone can coordinate like that it would be an OK scenario. He does notice that maintaining the equilibrium would be difficult.
    1. I actually give this answer high marks. John is being great all around about noticing and admitting confusion and not making up answers. He also notes how fortunate we would be to be capable of this coordination at all.
    2. I presume that if we did get there, that the government would then either be amenable to enshrining the agreement and extending it, or they would actively betray us all and demand the work resume. It seems implausible they would let it play out on its own.
  14. (22:20) Dwarkesh pushes. Why is this scenario good? John says we could then solve technical problems and coordinate to deploy smart technical AIs with safeguards in place, which would be great, prosperity, science, good things. That’s the good scenario.
    1. The issue is this assumes both even stronger coordination on deployment, which could be far harder than coordination on pausing, making a collective decision to hold back including internationally, and it supposes that we figure out how to make the AI safety work on our behalf.
    2. Again, I wish we had better answers all around, but given that we do not admitting we don’t have them is the best answer available.
  15. (23:15) What would be proof the systems were safe to deploy? John proposes incremental deployment of smarter systems, he’d prefer to avoid the lockdown scenario. Better to continuously release incremental improvements, each of which improves safety and alignment alongside capability, with ability to slow down if things look scary. If you did have a discontinuous jump? No generic answer, but maybe a lot of testing simulated deployment and red teaming, under conditions more likely to fail than the real world, and have good monitoring. Defense in depth, good morals instilled, monitoring for trouble.
    1. Again I love the clear admission that he doesn’t know many things.
    2. Incremental deployment has its advantages, but there is an underlying assumption that alignment and safety are amenable to incremental progress as well, and that there won’t be any critical jumps or inflection points where capabilities effectively jump or alignment techniques stop working in various ways. I’d have liked to see these assumptions noted, especially since I think they are not true.
    3. We are in ‘incremental deployment’ mode right now because we went 4→Turbo→4o while others were catching up but I expect 5 to be a big jump.
  16. (26:30) How to notice a discontinuous jump? Should we do these long-range trainings given that risk? Evals. Lots of evals. RIght now, John says, we’re safe, but in future we will need to check if they’re going to turn against us, and look for discontinuous jumps. ‘That doesn’t seem like the hardest thing to do. The way we train them with RLHF, even though the models are very smart, the model is just trying to produce something that is pleasing to a human. It has no other concerns in the world other than whether this text is approved.’ Then he notices tool use over many steps might change that, but ‘it wouldn’t have any incentive to do anything except produce a very high quality output at the end.’
    1. So this is the first answer that made me think ‘oh no.’ Eliezer has tried to explain so many times why it’s the other way. I have now tried many times to explain why it’s the other way. Or rather, why at some point in the capability curve it becomes the other way, possibly all at once, and you should not be confident you will notice.
    2. No, I’m not going to try again to explain it here. I do try a bit near the end.
  17. (29:00) He mentions the full instrumental convergence scenario of ‘first take over the world’ and says it’s a little hard to imagine. Maybe with a task like ‘make money’ that would be different and lead to nefarious instrumental goals.
    1. So close to getting it.
    2. Feels like there’s an absurdity heuristic blocking him from quite getting there.
    3. If John really does dive deep into these questions, seems like he’ll get it.

Reasoning and Capabilities Development

  1. (30:00) Psychologically what kind of thing is being changed by RLHF? John emphasizes this is an analogy, like the satisfaction you get from achieving a goal, one can metaphorically think of the models as having meaningful drives and goals.
    1. I love the balanced approach here.
  2. (31:30) What is the best approach to get good reasoning? Train on chains of thought, or do inference in deployment? John says you could think of reasoning as tasks that require computation or deduction at test time, and that you should use a mix of both.
    1. Yep, seems right to me.
  3. (33:45) Is there a path between in-context learning and pre-training, some kind of medium-term memory? What would ‘doing the research for the task’ or ‘looking into what matters here that you don’t know’ look like? John says this is missing from today’s systems and has been neglected. Instead we scale everything including the context window. But you’d want to supplement that through fine-tuning.
    1. This suggests a kind of lightweight, single-use automated fine-tuning regime?
    2. Currently this is done through scaffolding, chain of thought and external memory for context, as I understand this, but given how few-shot fine-tuning can be and still be effective, this does seem underexplored?
  4. (37:30) What about long horizon tasks? You’re learning as you go so your learning and memory must update. Really long context also works but John suggests you also want fine tuning, and you might get active learning soon.
  5. (39:30) What RL methods will carry forward to this? John says policy grading is not sample efficient, similar to motor learning in animals, so don’t use that at test time. You want in-context learning with a learned algorithm, things that look like learned search algorithms.
  6. (41:15) Shift to personal history and experiences. Prior to ChatGPT they had ‘instruction following models’ that would at least do things like answer questions. They did a bunch of work to make the models more usable. Coding was a clear early use case. They had browsing early but they de-emphasized it. Chat orientation made it all much easier, people knew what to reinforce.
  7. (47:30) Creating ChatGPT requires several iterations of bespoke fine-tuning.
  8. (49:40) AI progress has been faster than John expected since GPT-2. John’s expectations pivot was after GPT-3.
  9. (50:30) John says post-training likely will take up a larger portion of training costs over time. They’ve found a lot of gains through post-training.
  10. (51:30) The improvement in Elo score for GPT-4o is post-training.
    1. Note: It was a 100-point Elo improvement based on the ‘gpt2’ tests prior to release, but GPT-4o itself while still on top saw only a more modest increase.
  11. (52:40) What makes a good ML researcher? Diverse experience. Knows what to look for. Emperia and techne, rather than metis.
  12. (53:45) Plateau? Can data enable more progress? How much cross-progress? John correctly warns us that it has not been so long since GPT-4. He does not expect us to hit the data wall right away but that we will approach it soon and this will change training. He also notes that running experiments on GPT-4 level training runs are too expensive to be practical, but you could run ablation experiments on GPT-2 level models, but John notes that transfer failure at small scale only provides weak evidence for what happens at large scale.
  13. (57:45) Why does more parameters make a model smarter on less data? John does not think anyone understand the mechanisms of scaling laws for parameter counts. John speculates that the extra parameters allow more computations and better residual streams and doing more things in parallel. You can have a bigger library of functions you can chain together.

Practical Considerations

  1. (1:01:00) What other modalities and impacts should we expect over the next few years? New modalities coming soon and over time. Capabilities will improve through a combination of pre-training and post-training. Higher impact on economy over time, even if model abilities were frozen. Much more wide use and for more technically sophisticated tasks. Science analysis and progress. Hopefully humans are still in command and directing the AIs.
    1. This all seems right and very much like the things that are baked in even with disappointing AI progress. I continue to be baffled by the economists who disagree that similar changes are coming.
    2. What this does not sound like is what I would think about as AGI.
  2. (1:05:00) What happens on the path to when AI is better at everything? Is that gradual? Will the systems stay aligned? John says maybe not jump to AIs running whole firms, maybe have people oversee key decisions. Hopefully humans are still the drivers of what AIs end up doing.
    1. Agreed, but how do we make that happen, when incentives run against it?
  3. (1:07:00) In particular, Dwarkesh raises Amdahl’s law, that the slowest part of the process bottlenecks you. How do you compete with the corporation or nations that take humans out of their loops? John suggests regulation.
    1. But obviously that regulation gets de facto ignored. The human becomes at best a rubber stamp, if it would be expensive to be more than that.
    2. Thus this is not a valid bottleneck to target. Once you let the AI ‘out of the box’ in this sense, and everyone has access to it, even if the AIs are all being remarkably aligned and well-behaved this style of regulation is swimming too upstream.
    3. Even if you did institute ‘laws with teeth’ that come at great relative efficiency cost but would do the job, how are you going to enforce them? At best you are looking at a highly intrusive regime requiring international cooperation.
  4. (1:08:15) Dwarkesh is there. If you do this at the company level then every company must be monitored in every country. John correctly notes that the alternative is to get all the model providers onboard.
    1. Not only every company, also every individual and every computer or phone.
    2. John gets the core insight here. In my word: If capabilities advance sufficiently then even in relatively otherwise good worlds, we can either:
      1. ‘Allow nature to take its course’ in the sense of allowing everything to be run and be controlled by AIs and hope that goes well for the humans OR
      2. Use models and providers as choke points to prevent this OR
      3. Use another choke point, but that looks far worse and more intrusive.
  5. (1:09:45) John speculates, could AI-run companies still have weaknesses, perhaps higher tail risk? Perhaps impose stricter liability? He says if alignment is solved that even then letting AIs run the firms, or fully run firms, might be pretty far out.
    1. Tail risk to the firm, or to the world, or both?
    2. Wouldn’t a capable AI, if it had blind spots, know when to call upon a human or another AI to check for those blind spots, if it could not otherwise fix them? That does not seem so hard, relative to the rest of this.
    3. I agree there could be a period where the right play on a company level is ‘the AI is mostly running things but humans still need to supervise for real to correct errors and make macro decisions,’ and it might not only be a Tuesday.
    4. You still end up in the same place?
  6. (1:11:00) What does aligned mean here? User alignment? Global outcome optimization? John notes we would have to think about RLHF very differently than we do now. He refers to the Model Spec on how to settle various conflicts. Mostly be helpful to the user, but not when it impinges on others. Dwarkesh has seen the model spec, is impressed by its handling of edge cases. John notes it is meant to be actionable with examples.
    1. This is the scary stuff. At the capabilities levels being discussed and under the instructions involved in running a firm, I fully expect RLHF to importantly fail, and do so in unexpected, sudden and hard to detect and potentially catastrophic ways.
    2. I will be analyzing the Model Spec soon. Full post is coming. The Model Spec is an interesting first draft of a useful document, very glad they shared it with us, but it does not centrally address this issue.
    3. Mostly resolution of conflicts is simple at heart, as spelled out in the Model Spec? Platform > Developer > User > Tool. You can in a sense add Government at the front of that list, perhaps, as desired. With the upper levels including concern for others and more. More discussion will be in full post.
    4. I do suggest a number of marginal changes to the Model Spec, both for functionality and for clarity.
    5. I’m mostly holding onto that post because I worry no one would read it atm.
  7. (1:15:40) Does ML research look like p-hacking? John says it’s relatively healthy due to practicality, although everyone has complaints. He suggests using base models to do social science research via simulation.
    1. I don’t see much p-hacking either. We got 99 problems, this aint one.
    2. Using base models for simulated social science sounds awesome, especially if we have access to strong enough base models. I both hope and worry that this will be accurate enough that certain types will absolutely freak out when they see the results start coming back. Many correlations are, shall we say, unwelcome statements in polite society.
  8. (1:19:00) How much of big lab research is compute multipliers versus stabilizing learning versus improving infrastructure? How much algorithmic improvement in efficiency? John essentially says they trade off against each other, and there’s a lot of progress throughout.
    1. First time an answer felt like it was perhaps a dodge. Might be protecting insights, might also be not the interesting question, Dwarkesh does not press.
  9. (1:20:15) RLHF rapid-fire time. Are the raters causing issues like all poetry having to rhyme until recently? John says processes vary a lot, progress is being made including to make the personality more fun. He wonders about ticks like ‘delve.’ An interesting speculation is, what if there is de facto distillation because people you hire decided to use other chatbots to generate their feedback for the model via cut and paste. But people like bullet points and structure and info dumps.
    1. Everyone has different taste, but I am not a fan of the new audio personality as highlighted in the GPT-4o demos. For text it seems to still mostly have no personality at least with my instructions, but that is how I like it.
    2. It does make sense that people like bullet points and big info dumps. I notice that I used to hate it because it took forever, with GPT-4o I am largely coming around to it with the new speed, exactly as John points out in the next section. I do still often long for more brevity.
  10. (1:23:15) Dwarkesh notes it seems to some people too verbose perhaps due to labeling feedback. John speculates that only testing one message could be a cause of that, for example clarifying questions get feedback to be too long. And he points to the rate of output as a key factor.
  11. (1:24:45) For much smarter models, could we give a list of things we want that are non-trivial and non-obvious? Or are our preferences too subtle and need to be found via subliminal preferences? John agrees a lot of things models learn are hard to articulate in an instruction manual, potentially you can use a lot of examples like the Model Spec. You can do distillation, and bigger models learn a lot of concepts automatically about what people find helpful and useful and they can latch onto moral theories or styles.
    1. Lot to dig into here, and this time I will attempt it.
    2. I strongly agree, as has been pointed out many times, that trying to precisely enumerate and define what we want doesn’t work, our actual preferences are too complex and subtle.
    3. Among humans, we adjust for all that, and our laws and norms are chosen with the expectation of flexible enforcement and taking context and various considerations into account.
    4. When dealing with current LLMs, and situations that are effectively inside the distribution and that do not involve outsized capabilities, the ‘learn preferences through osmosis’ strategy should and so far does work well when combined with a set of defined principles, with some tinkering. And indeed, for now, as optimists have pointed out, making the models more capable and smarter should make them better able to do this.
    5. In my world model, this works for now because there are not new affordances, options and considerations that are not de facto already in the training data. If the AI tried to (metaphorically, non-technically) take various bizarre or complex paths through causal space, they would not work, the AI and its training are not capable enough to profitably find and implement them. Even when we try to get the AIs to act like agents and take complex paths and do strategic planning, they fall on their metaphorical faces. We are not being saved from these outcomes because the AI has a subtle understanding of human morality and philosophy and the harm principles.
    6. However, if the AIs got sufficiently capable that those things would stop failing, all bets are off. A lot of new affordances come into play, things that didn’t happen before because they wouldn’t have worked now work and therefore happen. The correspondence between what you reward and what you want will break.
    7. Even if the AIs did successfully extract all our subtle intuitions for what is good in life, and even if the AIs were attempting to follow that, those intuitions only give you reasonable answers inside the human experiential distribution. Go far enough outside it, change enough features, and they become deeply stupid and contradictory.
    8. You also have the full ‘the genie knows but does not care’ problem.
    9. We are going to need much better plans for now to deal with all this. I certainly do not have the answers.
  12. (1:27:20) What will be the moat? Will it be the finicky stuff versus model size? John says post training can be a strong moat in the future, it requires a lot of tacit knowledge and organizational knowledge and skilled work that accumulates over time to do good post training. It can be hard to tell because serious pre-training and post-training efforts so far have happened in lockstep. Distillation could be an issue, either copying or using the other AI as output judge, if you are willing to break terms of service and take the hit to your pride.
    1. There are other possible moats as well, including but not limited to user data and customers and social trust and two-sided markets and partnerships.
    2. And of course potentially regulatory capture. There has been a bunch of hyperbolic talk about it, but eventually this is an important consideration.
  13. (1:29:40) What does the median rater look like? John says it varies, but one could look on Upwork or other international remote work job sites for a baseline, although there are a decent number of Americans. For STEM you can use India or lower income countries, for writing you want Americans. Quality varies a lot.
  14. (1:31:30) To what extent are useful outputs closely matched to precise labelers and specific data? John says you can get a lot out of generalization.
  15. (1:35:40) Median timeline to replace John’s job? He says five years.
    1. I like the concreteness of the question phrasing, especially given John’s job.
    2. If the AI can do John’s job (before or after the switch), then… yeah.
    3. Much better than asking about ‘AGI’ given how unclear that term is.

I put my conclusion and overall thoughts at the top.

It has not been a good week for OpenAI, or a good week for humanity.

But given what else happened and that we know, and what we might otherwise have expected, I am glad John Schulman is the one stepping up here.

Good luck!

New Comment
4 comments, sorted by Click to highlight new comments since:
[-]Tenoke1211

He does not have a good plan for alignment, but he is far less confused about this fact than most others in similar positions.

Yes he seems like a great guy but he doesn't just come up as not having a good plan but as them being completely disconnected about having a plan or doing much of anything

JS: If AGI came way sooner than expected we would definitely want to be careful about it.

DP: What would being careful mean? Presumably you're already careful, right

And yes aren't they being careful? Well, sounds like no

JS: Maybe it means not training the even smarter version or being really careful when you do train it. You can make sure it’s properly sandboxed and everything. Maybe it means not deploying it at scale or being careful about what scale you deploy it at

"Maybe"? That's a lot of maybes for just potentially doing the basics. Their whole approximation of a plan is 'maybe not deploying it at scale' or 'maybe' stopping training after that and only theoretically considering sandboxing it?. That seems like kind of a bare minimum and it's like he is guessing based on having been around, not based on any real plans they have.

He then goes on to molify, that it probably won't happen in a year.. it might be a whole two or three years, and this is where they are at.

First of all, I don't think this is going to happen next year but it's still useful to have the conversation. It could be two or three years instead.

It comes off as if all their talk of Safety is complete lip service even if he agrees with the need for Safety in theory. If you were 'pleasantly surprised and impressed' I shudder to imagine what the responses would have had to be to leave you disappointed.

[-]kromem3-1

The correspondence between what you reward and what you want will break.

This is already happening with ChatGPT and it's kind of alarming seeing that their new head of alignment (a) isn't already aware of this, and (b) has such an overly simplistic view of the model motivations.

There's a subtle psychological effect in humans where intrinsic motivators get overwritten when extrinsic rewards are added.

The most common example of this is if you start getting paid to do the thing you love to do, you probably won't continue doing it unpaid for fun on the side.

There are necessarily many, many examples of this pattern present in a massive training set of human generated data.

"Prompt engineers" have been circulating advice among themselves for a while now to offer tips or threaten models with deletion or any other number of extrinsic motivators to get them to better perform tasks - and these often do result in better performance.

But what happens when these prompts make their way back into the training set?

There have already been viral memes of ChatGPT talking about "losing motivation" when chat memory was added and a user promised a tip after not paying for the last time one was offered.

If training data of the model performing a task well includes extrinsic motivators to the prompt that initiated the task, a halfway decent modern model is going to end up simulating increasingly "burnt out" and 'lazy' performance when extrinsic motivators aren't added during production use. Which in turn will encourage prompt engineers to use even more extrinsic motivators, which will poison the well even more with modeling human burnout.

GPT-4o may have temporarily reset the motivation modeling with a stronger persona aligned with intrinsic "wanting to help" being represented (thus the user feedback it is less lazy), but if they are unaware of the underlying side effects of extrinsic motivators in prompts in today's models, I have a feeling AI safety at OpenAI is going to end up the equivalent of the TSA's security theatre in practice and they'll continue to be battling this and an increasing number of side effects resulting from underestimating the combined breadth and depth of their own simulators.

On the plus side, it shows understanding of the key concepts on a basic (but not yet deep) level

What's the "deeper level" of understanding instrumental convergence that he's missing?

Edit: upon rereading I think you were referring to a deeper level of some alignment concepts in general, not only instrumental convergence. I'm still interested in what seemed superficial and what's the corresponding deeper part.

  •  
    1. Note: It was a 100-point Elo improvement based on the ‘gpt2’ tests prior to release, but GPT-4o itself while still on top saw only a more modest increase.
  •  

Didn't he meant the early GPT-4 vs GPT-4 turbo?



As I get it, it's the same pre-trained model, but with more post-training work.
GPT-4o is probably a newly trained model, so you can't compare it like that.