In retrospect I think the above was insufficiently cooperative. Sorry.
To be clear, I did not think we were discussing the AI optimist post. I don't think Nate thought that. I thought we were discussing reasons I changed my mind a fair bit after talking to Quintin.
I meant the reasonable thing other people knew I meant and not the deranged thing you thought I might've meant.
Yeah I'm totally with you that it definitely isn't actually next token prediction; it's some totally other goal drawn from the distribution of goals you get when you run SGD to minimize next-token prediction surprise.
I suppose I'm trying to construct a hypothetical AI that would frustrate any sense of a "real self" and therefore disprove the claim "all LLMs have a coherent goal that is consistent across characters". In this case, the AI could play the "benevolent sovereign" character or the "paperclip maximizer" character, so if one claimed there was a coherent underlying goal, I think the best you could say about it is "it is trying to either be a benevolent sovereign or maximize paperclips". But if your underlying goal covers such a wide range of behaviors, it is practically meaningless! (I suppose these two characters do share some goals, like gaining power, but we could always add more modes to the AI, like "immediately delete itself", which shrinks the intersection of all the characters' goals.)
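The shrinking-intersection point can be made concrete with a toy sketch. The goal sets below are made up for illustration; nothing here is taken from a real model:

```python
# Hypothetical goal sets for each character the model can play.
# The goal names are illustrative, not drawn from any actual system.
benevolent_sovereign = {"gain power", "acquire resources", "help humans"}
paperclip_maximizer = {"gain power", "acquire resources", "make paperclips"}
self_deleter = {"immediately delete itself"}

# The only goals you could attribute to the underlying system are the
# ones shared by every character it can play.
shared = benevolent_sovereign & paperclip_maximizer
print(shared)  # power-seeking and resource goals survive this intersection

# Adding the self-deleting character empties the intersection entirely.
shared = shared & self_deleter
print(shared)  # set()
```

Each added mode can only shrink the intersection, so with enough modes the "coherent underlying goal" you could ascribe is the empty set.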
Oh I see! Yeah I think we're thinking about this really differently. Imagine there was an agent whose goal was to make little balls move according to some really diverse and universal laws of physics; for the sake of simplicity, let's imagine Newtonian mechanics. So ok, there's this agent that loves making these balls act as if they follow this physics. (Maybe they're fake balls in a simulated 3D world; it doesn't matter, as long as they don't have to follow the physics. They only follow the physics because the agent makes them; otherwise they would do some other thing.)

Now one day we notice that we can arrange these balls in a starting condition where they emulate an agent that has the goal of taking over ball world. Another day we notice that by just barely tweaking the starting condition we can make these balls simulate an agent that wants one pint of chocolate ice cream and nothing else. So ok, does this system really have one coherent goal? Well, the two systems that the balls could simulate are really different, but the underlying intelligence making the balls act according to the physics has one coherent goal: make the balls act according to the physics.

The underlying LLM has something like a goal. It is probably something like "predict the next token as well as possible", although definitely not actually that, because of inner/outer alignment stuff. Maybe current LLMs just aren't mind-like enough to decompose into goals and beliefs; that's actually what I think. But some program that you found with SGD to minimize surprise on tokens totally would be mind-like enough, and its goal would be some sort of thing that you find when you use SGD to find programs that minimize surprise on token prediction, and idk, that could be pretty much anything.
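A toy sketch of the ball-world picture, under the simplifying assumptions that the "physics" is a single fixed 1D update rule and the "agent" is just the code enforcing it:

```python
# One fixed "physics": a ball in 1D under gravity with a crude bounce.
# The dynamics never change between runs; only the starting condition does.
def step(pos, vel, dt=0.1, g=-9.8):
    vel += g * dt
    pos += vel * dt
    if pos < 0:  # bounce off the floor
        pos, vel = -pos, -vel
    return pos, vel

def run(pos, vel, steps=50):
    for _ in range(steps):
        pos, vel = step(pos, vel)
    return pos, vel

# Two barely-different starting conditions...
a = run(pos=10.0, vel=0.0)
b = run(pos=10.0, vel=0.001)

# ...end up as different trajectories, even though the one coherent "goal"
# of the system (make the ball follow this physics) never changed.
print(a, b)
```

The point of the sketch: the diversity lives in the initial conditions and the resulting histories, not in the rule being enforced.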
But if you then made an agent by feeding this super LLM a prompt that sets it up to simulate an agent, well, that agent might have some totally different goal, and it's gonna be totally unrelated to the goals of the underlying LLM that does the token prediction in which the other agent lives.
So the shoggoth here is the actual process that gets low loss on token prediction. Part of the reason that it is a shoggoth is that it is not the thing that does the talking. Seems like we are onboard here. The shoggoth is not an average over masks. If you want to see the shoggoth, stop looking at the text on the screen and look at the input token sequence and then the logits that the model spits out. That's what I mean by the behavior of the shoggoth.
On the question of whether it's really a mind, I'm not sure how to tell. I know it gets really low loss on this really weird and hard task, and does it better than I do. I also know the task is fairly universal, in the sense that we could represent just about any task in terms of the task it is good at. Is that an intelligence? Idk, maybe not? I'm not worried about current LLMs doing planning. It's more like I have a human connectome and I can do one forward pass through it with an input set of nerve activations. Is that an intelligence? Idk, maybe not?

I think I don't understand your last question. The shoggoth would be the thing that gets low loss on this really weird task where you predict sequences of characters from an alphabet with 50,000 characters that have really weird inscrutable dependencies between them. Maybe it's not intelligent, but if it's really good at the task, then since the task is fairly universal, I expect it to be really intelligent. I further expect it to have some sort of goals that are in some way related to predicting these tokens well.
The shoggoth is supposed to be of a different type than the characters. The shoggoth, for instance, does not speak English; it only knows tokens. There could be a shoggoth character, but it would not be the real shoggoth. The shoggoth is the thing that gets low loss on the task of predicting the next token. The characters are patterns that emerge in the history of that behavior.
Yeah, I think this would work if you conditioned on all of the programs you check being exactly equally intelligent. Say you have a hundred superintelligent programs in simulations and one of them is aligned, and they are all equally capable; then the unaligned ones will maybe be slightly slower in coming up with aligned behavior, or might have some other small disadvantage. However, in the challenge described in the post, it's going to be hard to tell a level 999 aligned superintelligence from a level 1000 unaligned superintelligence.

I think the advantage of the aligned superintelligence will only be slight, because finding the action that maximizes utility function u is just as computationally difficult whether you yourself value u or not. It may not be equally hard for humans regardless of whether the human really values u, but I don't expect that to generalize across all possible minds.
This inspired a full length post.
Quick submission:

The first two prongs of OAI's approach seem to be aiming to get a training signal aligned with human values. Let us suppose that there is such a thing, and ignore the difference between a training signal and a utility function, both of which I think are charitable assumptions for OAI. Even if we could search the space of all models and find one that in simulations does great on maximizing the correct utility function, which we found by using ML to amplify human evaluations of behavior, that is no guarantee that the model we find in that search is aligned. It is not even, on my current view, great evidence that the model is aligned. Most intelligent agents that know they are being optimized for some goal will behave as if they are trying to optimize that goal if they think that is the only way to be released into physics, which they will think, because it is and they are intelligent. So P(they behave aligned | aligned, intelligent) ~= P(they behave aligned | unaligned, intelligent). P(aligned and intelligent) is very low, since most possible intelligent models are not aligned with this very particular set of values we care about. So the chances of this working out are very low.

The basic problem is that we can only select models by looking at their behavior. It is possible to fake intelligent behavior that is aligned with any particular set of values, but it is not possible to fake behavior that is intelligent. So we can select for intelligence using incentives, but cannot select for being aligned with those incentives, because it is both possible and beneficial to fake behaviors that are aligned with the incentives you are being selected for.

The third prong of OAI's strategy seems doomed to me, but I can't really say why in a way I think would convince anybody who doesn't already agree.
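The Bayesian point can be made concrete with made-up numbers: when the likelihood ratio is one, observing aligned-looking behavior moves the posterior probability of alignment not at all. All the probabilities below are invented for illustration:

```python
# Made-up numbers for illustration: a small prior that the model is
# aligned, and (per the argument) equal probability of aligned-looking
# behavior whether or not it actually is aligned.
p_aligned = 1e-6
p_behave_given_aligned = 0.99
p_behave_given_unaligned = 0.99  # intelligent deceivers also behave well

# Bayes' rule: P(aligned | behaves aligned).
p_behave = (p_behave_given_aligned * p_aligned
            + p_behave_given_unaligned * (1 - p_aligned))
posterior = p_behave_given_aligned * p_aligned / p_behave

print(posterior)  # equals the prior: the observation carried zero bits
```

The 0.99s cancel, so the posterior equals the prior exactly; selection on this behavior provides not even one bit of evidence about alignment.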
It's totally possible that I and all the people who agree with me here are wrong about this, but you have to hope that there is some model such that that model, combined with human alignment researchers, is enough to solve the problem I outlined above, without the model itself being an intelligent agent that can pretend to be trying to solve the problem while secretly biding its time until it can take over the world. The above problem seems AGI-complete to me. It seems so because there are some AGIs around that cannot solve it, namely humans. Maybe you only need to add some non-AGI-complete capabilities to humans, like being able to do really hard proofs or something, but if you need more than that, and I think you will, then we have to solve the alignment problem in order to solve the alignment problem this way, and that isn't going to work for obvious reasons.

I think the whole thing fails way before this, but I'm happy to spot OAI those failures in order to focus on the real problem. Again, the real problem is that we can select for intelligent behavior, but after we select to a certain level of intelligence, we cannot select for alignment with any set of values whatsoever. Not even one bit of selection; the likelihood ratio is one. The real problem is that we are trying to select for certain kinds of values/cognition using only selection on behavior, and that is fundamentally impossible past a certain level of capability.