Background in mathematics (descriptive set theory, Banach spaces) and game theory (mostly zero-sum, imperfect information games). CFAR mentor. Usually doing alignment research.
Reiterating two points people already pointed out, since they still aren't fixed after a month. Please actually fix them; I think it is important. (Reasoning: I am somewhat on the fence about how much weight to assign to the simulator theory, and I expect others are too. But as a mathematician, I would feel embarrassed to show this post to others and admit that I take it seriously when it contains such egregious errors. No offense meant to the authors, just trying to point at this as an impact-limiting factor.)
Proposition 1: This is false, and the proof is wrong, for the same reason that an infinite series of positive numbers can have a finite sum.
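As a quick illustration of that series fact (a standard textbook example, not something taken from the post under discussion): every term of the geometric series below is positive, yet the partial sums never exceed 1.

```latex
\[
  S_N = \sum_{n=1}^{N} \frac{1}{2^{n}} = 1 - \frac{1}{2^{N}} < 1
  \qquad\text{and hence}\qquad
  \sum_{n=1}^{\infty} \frac{1}{2^{n}} = \lim_{N\to\infty} S_N = 1 .
\]
```

So adding up infinitely many strictly positive contributions does not, by itself, force the total to be infinite.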
The terminology: I think it is a really bad idea to refer to tokens as "states", for several reasons. Moreover, these reasons point to fundamental open questions around the simulator framing, and it seems unfortunate to choose terminology which makes these issues confusing/hard to even notice. (Disclaimer: I point out some holes in the simulator framing and suggest improvements. However, I am well aware that all of my suggestions also have holes.)
(1) To the extent that a simulator fully describes some situation that evolves over time, a single token is too small a unit to describe the state of the environment. A single frame of a video (arguably) corresponds to a state. Or perhaps a sentence in a story might (arguably) correspond to a state. But not a single pixel (or patch) and not a single word.
(2) To the extent that a simulator fully describes some situation that evolves over time, there is no straightforward correspondence between the tokens produced so far and the current state of the environment. To give several examples: The process of tossing a coin repeatedly can be represented by a sequence such as "1 0 0 0 1 0 1 ...", where the current state can be identified with the latest token (and you do not want to identify the current state with the whole sequence). The process of me writing the digits of pi on a piece of paper, one per second, can be described as "3 , 1 4 1 ..." --- here, you need the full sequence to characterize the current state. Or what if I keep writing different numbers, but get bored with them and switch to new ones after a while: " pi = 3 , 1 4 1 Stop, got bored. e = 2 , 7. Stop, got bored. sqrt(2) = ...".
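The contrast between the coin example and the pi example can be made concrete with a small sketch. The function names and the hard-coded digit string are illustrative assumptions of mine, not anything from the post: for coin tosses the next token ignores the history, while for the digits of pi two prefixes ending in the same token continue differently.

```python
import random

# First 20 digits of pi (3.1415926535897932384), hard-coded for illustration.
PI_DIGITS = "31415926535897932384"


def next_coin_token(tokens, rng=random):
    """Coin tossing: the next token does not depend on the history at all,
    so identifying the 'current state' with the latest token loses nothing."""
    return rng.choice(["0", "1"])


def next_pi_token(tokens):
    """Writing out pi: the next digit depends on how many digits came before,
    so the latest token alone does not pin down the current state."""
    return PI_DIGITS[len(tokens)]


# Two prefixes that end in the same token "1" ...
short_prefix = list("31")
long_prefix = list("3141")

# ... but continue differently.
print(next_pi_token(short_prefix))  # 4
print(next_pi_token(long_prefix))   # 5
```

The third example (switching between pi, e and sqrt(2)) is not covered here; there, neither the latest token nor the position within the sequence would be enough on its own.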
(3) It is misleading/false to describe models like GPT as "describing some situation that evolves over time". Indeed, fiction books and movies do crazy things like jumping from character to character, flashbacks, etc. Non-fiction books are even weirder (could contain snippets of stories, and then non-story things, etc). You could argue that in order to predict the text of a non-fiction book, GPT is simulating the author of that book. But where does this stop? What if the 2nd half of the book is darker because the author got sacked from his day job and got depressed --- are you then simulating the whole world, to predict this thing? If (more advanced) GPT is a simulator in the sense of "evolving situations over time", then I would like this claim fleshed out in detail on the example of (a) non-fiction books, (b) fiction books, and perhaps (c) movies on TV that include commercial breaks.
(4) But most importantly: To the extent that a simulator describes some situation that evolves over time, it only outputs a small portion of the situation that it is "imagining" internally. (For example, you are telling a story about a princess, and you never mention the colour of her dress, despite the princess in your head having a blue dress.) So it feels like a type-error to refer to the output as "state". At best, you could call it something like "rendering of a state".
Arguably, the output (+ the user input) uniquely determines the internal state of the simulator. So you could perhaps identify the output (+ the user input) with "the internal state of the simulator". But that seems dangerous and likely to cause reasoning errors.
(5) Finally, to make (4) even worse: To the extent that a simulator describes some situation that evolves over time, it is not internally maintaining a single fully fleshed out state that it (probabilistically) evolves over time. Instead, it maintains a set of possible states (macro-state?). And when it generates new responses, it throws out some of the possible states (refines the macro-state?). (For example, in your story about a princess, dress colour is not determined, could be anything. Then somebody asks about the colour, and you need to refine it to blue --- which could still mean many different shades of blue.)
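One way to picture the "set of possible states" idea in (5) is as filtering a set of candidates. This is only a toy sketch with made-up names (`ALL_STATES`, `refine`) and a made-up list of colours; it is not a claim about what GPT does internally.

```python
# Hypothetical candidate states: each one is a fully specified dress colour.
ALL_STATES = {"navy blue", "sky blue", "crimson red", "forest green"}


def refine(macro_state, predicate):
    """Keep only the candidate states consistent with the newly generated text."""
    return {state for state in macro_state if predicate(state)}


# Before anyone asks, the colour is undetermined: every candidate is still possible.
macro_state = set(ALL_STATES)

# The story now says the dress is blue; this rules out some candidates
# but still leaves several shades open.
macro_state = refine(macro_state, lambda colour: colour.endswith("blue"))
print(sorted(macro_state))  # ['navy blue', 'sky blue']
```

In this picture, generating text never picks a single state; it only shrinks the set of states that are still compatible with everything said so far.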
However, even the explanation, given in (5), of what is going on with simulators, is missing some important pieces. Indeed, it doesn't explain what happens in cases such as "GPT tells the great story about the princess with a blue dress, and suddenly the user jumps in and refers to the dress as red". At the moment, this is my main reason for scepticism about the simulator framing. As a result, my current view is that "GPT can act as a simulator" (in the sense of Simulators) but it would be "false" to say that "GPT is a simulator" (in the sense of Simulators).
The following issue seems fundamental and related (though I am not sure how exactly :-) ): There is a difference between things ants could physically do and what they are smart enough to do / what we can cheaply enough explain to them. Similarly for humans: delegating takes work. For example, hiring an IQ 80 cleaner might only be worth it for routine tasks, not for "clean up after this large event and just tell me when it's done, bye". Similarly, for some reason I am not supervising 10 master's students, even if they were all smarter than me.
Oh, I think I agree - if the choice is to use AI assistants or not, then use them. If they need adapting to be useful for alignment, then do adapt them.
But suppose they only work kind-of-poorly - and using them for alignment requires making progress on them (which will also be useful for capabilities), and you will not be able to keep those results internal. And that you can either do this work or do literally nothing. (Which is unrealistic.) Then I would say doing literally nothing is better. (Though it certainly feels bad, and probably costs you your job. So I guess some third option would be preferable.)
> (iii) because if this was true, then we could presumably just solve alignment without the help of AI assistants.
Either I misunderstand this or it seems incorrect.
Hm, I think you are right --- as written, the claim is false. I think some version of (X) --- the assumption around your ability to differentially use AI assistants for alignment --- will still be relevant; it will just need a bit more careful phrasing. Let me know if this makes sense:
To get a more realistic assumption, perhaps we would want to talk about (speedup) "how much are AI assistants able to speed up alignment vs capability" and (proliferation prevention) "how much can OpenAI prevent them from proliferating to capabilities research". And then the corresponding more realistic version of the claims would be that:
This implicitly assumes that if OpenAI develops the AI assistants technology and restricts proliferation, you will get similar adoption in capabilities vs alignment. This seems realistic.
(And to be clear: I also strongly endorse writing up the alignment plan. Big thanks and kudos for that! The critical comments shouldn't be viewed as negative judgement on the people involved :-).)
Commented in a response to MIRI's A challenge for AGI organizations, and a challenge for readers, along with other people.
My ~2-hour reaction to the challenge:
(I) I have a general point of confusion regarding the post: To the extent that this is an officially endorsed plan, who endorses the plan?
Reason for confusion / observations: If someone told me they are in charge of an organization that plans to build AGI, and this is their plan, I would immediately object that the arguments ignore the part where progress on their "alignment plan" makes a significant contribution to capabilities research. Therefore, in the worlds where the proposed strategy fails, they are making things actively worse, not better. So their plan is perhaps not unarguably harmful, but certainly irresponsible. For this reason, I find it unlikely that the post is endorsed as a strategy by OpenAI's leadership.
(III) My assumption: To make sense of the text, I will from now on assume that the post is endorsed by OpenAI's alignment team only, and that the team is in a position where they cannot affect the actions of OpenAI's capabilities team in any way. (Perhaps except to the extent that their proposals would only incur a near-negligible alignment tax.) They are simply determined to make the best use of the research that would happen anyway. (I don't have any inside knowledge of OpenAI. This assumption seems plausible to me, and very sad.)
(IV) A general comment that I would otherwise need to repeat in essentially every point I make is the following: OpenAI should set up a system that will (1) let them notice if their assumptions turn out to be mistaken and (2) force them to course-correct if it happens. In several places, the post explicitly states, or at least implies, critical assumptions about the nature of AI, AI alignment, or other topics. However, it does not include any ways of noticing if these assumptions turn out to not hold. To act responsibly, OpenAI should (at the minimum): (A) Make these assumptions explicit. (B) Make these hypotheses falsifiable by publicizing predictions, or other criteria they could use to check the assumptions. (C) Set up a system for actually checking (B), and course-correcting if the assumptions turn out false.
Assumptions implied by OpenAI's plans, with my reactions:
General complaint: The plan is not a plan at all! It's just a meta-plan.
Eliezer adds: "For this reason, please note explicitly if you're saying things that you heard from a MIRI person at a gathering, or the like."
As far as I know, I came up with points (I), (III), and (XII) myself and I don't remember reading those points before. On the other hand, (IV), (IX), and (XI) are (afaik) pretty much direct ripoffs of MIRI arguments. The status of the remaining 7 points is unclear. (I read most of MIRI's publicly available content, and attended some MIRI-affiliated events pre-covid. And I think all of my alignment thinking is heavily MIRI-inspired. So the remaining points are probably inspired by something I read. Perhaps I would be able to derive 2-3 out of 7 if MIRI disappeared 6 years ago?)
(II) For example, consider the following claim: "We believe the best way to learn as much as possible about how to make AI-assisted evaluation work in practice is to build AI assistants." My reaction: Yes, technically speaking this is true. But likewise --- please excuse the jarring analogy --- the best way to learn as much as possible about how to treat radiation exposure is to drop a nuclear bomb somewhere and then study the affected population. And yeees, if people are going to be dropping nuclear bombs, you might as well study the results. But wouldn't it be even better if you personally didn't plan to drop bombs on people? Maybe you could even try coordinating with other bomb-possessing people on not dropping them on people :-).
Apologies for the inconsistent numbering. I had to give the footnote the number (II) to get to the nice round total of 13 points :-).
I think Nate's original argument holds, but might need a bit of elaboration:
> Roughly speaking, I think that alignment approaches with a heavy reliance on output evaluation are doomed, [both] on the grounds that humans can't evaluate the effectiveness of a plan capable of ending the acute risk period, and [...].
I think there are classes of pivotal plans + their descriptions, such that if the AI gave one of them to us, we could tell whether it is pivotal good or pivotal bad. But also there are other classes of pivotal plans + their descriptions such that you will think you can tell the difference, but you can't.
So the observation "this seems good and I am super-convinced I could tell if it wasn't" -- by itself -- isn't enough. To trust the AI's plan, you need some additional argument for why it is the kind of AI that wouldn't try to deceive you, or why it wouldn't search through dangerous plans, yada yada. But essentially that means you aren't relying on the plan verification step anymore.
To the extent that we are using the existence of humans with inner monologue as evidence for the suitability of this approach to AGI, I think it also serves as evidence for why that approach isn't really safe.
As an example: If I try, I can control my inner monologue to some extent. For example, I could make my inner monologue about "officially approved actions" very loud. And run a secondary thread of thinking about "non-approved thoughts" in the background --- "quieter" and in a more fragmented manner, and killing it off every time it made it into the inner monologue.
Sure, there will be some traces. (A lot, even, I kinda suck at this :-).) But I think this already brings us outside of the realm of "interpretable by default" and into "a race between your interpretability and my cognitive steganography".
Note that I, or most humans, didn't have a particular need to develop this exact type of cognitive steganography. (The "elephant in the brain" type of actions-thoughts inconsistency is not exactly this.) As a result, I think we would be able to get much much better very quickly if there was an incentive to do it. (Not talking evolutionary scale. More like "give me a weekend to play with this".) And while I have no incentive to develop these skills, the same cannot be said for a brain-like AI that we are (trying to) interpret as it thinks.
Not that I expect it to make much difference, but: Maybe it would be good if texts like these didn't make it into the training data of future models.