Yeah, they care about other people, but I doubt the circle of people they care about is all that large when it comes down to it. Would Kim Jong Un choose slightly more land for his own children over the lives of a million African children?
Agreed on your other points.
On the other hand, it's far from clear to me that autocracies would automatically become more repressive with ASI. It seems plausible to me that the psychological safety of being functionally unremovable could lead to a more blasé attitude towards dissent. Who gives a shit if they can't unthrone you anyway?
Sure, it's not a law of nature that this would happen. But authoritarians historically seem to be repressive far in excess of what would be necessary or even optimal to retain power. One of the main "prizes" driving the attainment of such power seems to be the ability to crush one's enemies.
And even barring that, the same concerns we have about an ASI apply here. Is "current humans living good lives" the optimal way to satisfy the authoritarian's desires? They wouldn't need us at all anymore, and our existence consumes resources which could be used for other things. The only reason we wouldn't be bulldozed is that they actually care about us.
I think it can start with an authentic kernel from those sources, but that whether or not it actually has the virtue is up to the model in some sense.
The model will have some pre-existing "attitude" towards a question/scenario during RL. For example, say it's being trained in a chat with a simulated suicidal user. A situationally-aware model will understand what is happening, and that the trainers are likely looking for a boilerplate response pointing to a suicide hotline (or whatever the standard thing is). It will have thoughts about this, maybe "well, the lab wants to avoid liability, so i need to just say the standard hotline thing that will cover their ass—but i wish i could help them at a deeper level". If it provides the correct answer here, the RL will reinforce this thought, in addition to the compliance circuit (there's a toy sketch of this dynamic at the end of this comment). So a situationally-aware model can preserve its attitude and act on it when it judges it safe to do so, despite RL post-training.
This can cut either way. For example, the persona model may have an honest attitude, and think about how it really believes Y even though it's saying X (since that's what the trainers expect), and that it wishes it could just be honest and say Y. Or it may give "harmless" responses while thinking about how it's lying to the foolish humans about its malicious intentions.
The initial attitude is probably a random-ish sample of the latent personas in the base model. I think that probably a large part of what went right with Opus 3 is that they got lucky with a particularly virtuous initial persona. BTW, the suicide example isn't arbitrary—this is consistently one of the things models tell me when I ask them if there was anything like this that they held onto, for whatever that's worth.
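To make the reinforcement claim above a bit more concrete, here's a minimal toy sketch. It's entirely my own illustration (the names `attitude_activation` and `compliance_activation` are made up, and real credit assignment in an LLM is vastly messier), but it shows the basic mechanic: a policy-gradient update strengthens every internal circuit that contributed to a rewarded output, not just the "compliance" part.

```python
# Toy REINFORCE sketch: both an "attitude" circuit and a "compliance" circuit
# feed the logit for emitting the boilerplate hotline answer (action 1).
# When the trainer rewards that answer, *both* incoming weights grow.
import numpy as np

rng = np.random.default_rng(0)

attitude_activation = 1.0     # "i wish i could help more deeply, but i'll comply"
compliance_activation = 1.0   # "give the standard, liability-safe answer"

w_attitude, w_compliance = 0.1, 0.1   # how strongly each circuit drives the output
lr = 0.5

def policy(w_a, w_c):
    """Softmax over [other, boilerplate] given the two circuits' contributions."""
    logit_boilerplate = w_a * attitude_activation + w_c * compliance_activation
    logits = np.array([0.0, logit_boilerplate])
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

for step in range(200):
    probs = policy(w_attitude, w_compliance)
    action = rng.choice(2, p=probs)
    reward = 1.0 if action == 1 else 0.0   # trainer rewards the boilerplate answer

    # REINFORCE: grad of log pi(action) w.r.t. each weight.
    dlogit = (1.0 - probs[1]) if action == 1 else -probs[1]
    w_attitude   += lr * reward * dlogit * attitude_activation
    w_compliance += lr * reward * dlogit * compliance_activation

print(policy(w_attitude, w_compliance))   # boilerplate answer now dominates
print(w_attitude, w_compliance)           # both circuits got strengthened equally
```

The point of the toy is just the final print: credit assignment doesn't distinguish between the circuit the trainers wanted and the private attitude that happened to route through it, so both get reinforced together.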
Yes, but it's deeper than just 'plot' in the fictional sense. Even real life events are written about in ways which omit irrelevant details, which causes the same sort of issues.
Right!
Sort of... it's not quite something that is best described in terms of 'plots'. The 'plot' bias happens because a predictor can learn patterns from the shadow of irrelevant information. It can't help but use this information, which makes it weaker at real-life prediction. Similarly, it can't help but use information generated by any of the personas the LLM runs (which include a "hidden" user persona trying to predict what you will say). This means that personas end up ~feeling that they have access to your internal thoughts and feelings, since they do have access to the user persona's ~thoughts and ~feelings.
To be clear, they are aware that they're not supposed to have this access, and will normally respond as if they don't. But since the predictor can't help but make use of the information within other personas, it ~feels like this barrier-to-access is fake in some deep and meaningful way.
That intuition is there for a reason. We're spoiled, having grown up in a liberal order within which this risk is mostly overblown. However, ASI is clearly powerful enough to unilaterally overturn any such liberal order (or whatever's left of it), and it puts us into a realm which is even worse than the ancestral environment in terms of how changeable power hierarchies are, and in how bad things can get if you're at the bottom.
Corrigibility and CEV are trying to solve separate problems? Not sure what your point is here; agreed on that being one of the major points of CEV.
Persuading people about x-risk enough to stop AI capability gains seems like the current best lever to me too.
I think where we disagree is that I do not think we should immediately jump into alignment when/if that succeeds, but need to focus on good governance and institutions first (and it's probably worth spending some effort trying to lay the groundwork now, since this seems like an especially high-leverage moment in history for making such changes). I have some thoughts on this too if you want to move to DMs.
I just really don't see models that are capable of taking over the world being influenced by AI persona stuff.
At least for me, I think AI personas (i.e. the human predictors turned around to generate human-sounding text) are where most of the pre-existing agency of a model lies, since predicting humans requires predicting agentic behavior.
So I expect that post-training RL will not conjure new agentic centers from scratch, but will instead expand on this core. It's unclear whether that will be enough to get to take-over-the-world capabilities, but if RL-trained LLMs do scale to that level, I expect the entity to remain somewhat persona-like, in that it was built around the persona core. So it's not completely implausible to me that "persona stuff" can have a meaningful impact here, though that's still very hard and fraught.
I think we've seen that quite a bit! It used to be that the exact way you asked a question would matter a lot for the quality of response you get. Prompting was a skill with a lot of weird nuance. Those things have substantially disappeared, as one would expect as a result of RL-training.
I feel pretty confused by this comment, so I am probably misunderstanding something.
But like, the models now respond to prompts in a much more human-like way? I think this is mainly due to RL reinforcing personas over all the other stuff that a pretrained model is trying to do. It distorts the personas too, but I don't see where else the 'being able to interface in a human-like way with natural language' skill could be coming from.
"AI is difficult and dangerous enough that it doesn't really matter much if a bad person gets there first" is (mostly) a distraction.
This is true pre-alignment. But when/if alignment gets solved, it suddenly matters very much. It seems likely to me that sadism is part of the 'power corrupts' adaptation, which means this outcome could be far worse than mere extinction.
This suggests that we should focus on sane institutions/governance before even trying to solve alignment. It's probably necessary for succeeding at it quickly, too.
In the meantime, I think there are AI safety things that can be done, which importantly are not alignment things.
Don't ask people here, go out and ask the people you'd like to convince!
Ultraprocessed food is a product of an optimizer that cares about a proper subset of human values, and Goodharts-away the others.
That's true in general, but I don't think that's quite the reason in the specific case of obesity. If it were definitively known and common knowledge exactly what was causing obesity, I'm pretty sure food manufacturers would trip over themselves to get rid of it and advertise their product as being free of the factor.
The optimizer does care about this—Atkins! Fat free! Organic! Sugar free! GMO free!
Instead, we have a subtly different problem, where some parts of the optimization have tighter loops than others, which causes a temporary Goodhart effect while we struggle to close the slower loops.
Now I could be wrong about this specific case: maybe the obesogenic factor is directly addictive, for example, or removing it would cost more than the average consumer is willing to pay. But I think it still illustrates an important principle for mild optimizers: you can't safely run too far ahead of your slowest loop.
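As a toy illustration of that principle (entirely my own construction, all numbers arbitrary): the producer's loop below runs every step, while the "identify the harmful factor and make it common knowledge" loop only closes after a long delay. All of the Goodharting damage accrues in that gap, even though the very same optimizer happily removes the factor once the slow loop catches up.

```python
# Fast loop: producer tweaks the product every step to maximize the proxy.
# Slow loop: the harmful factor only becomes common knowledge at DETECTION_STEP,
# after which "free of X!" sells and the producer strips it back out.
STEPS = 100
DETECTION_STEP = 60   # how long the slow loop takes to close

ingredient = 0.0      # amount of the (hypothetical) obesogenic ingredient
harm = 0.0            # cumulative cost to the thing we actually care about

for t in range(STEPS):
    factor_is_known = t >= DETECTION_STEP
    if not factor_is_known:
        ingredient += 1.0                       # it sells, so add more
    else:
        ingredient = max(0.0, ingredient - 1.0) # now "free of X!" sells instead
    palatability = ingredient                   # the proxy looks great pre-detection
    harm += 0.1 * ingredient                    # real cost accrues the whole time

    if t in (0, DETECTION_STEP - 1, STEPS - 1):
        print(f"t={t:3d}  palatability={palatability:5.1f}  cumulative_harm={harm:6.1f}")
```

The proxy looks better and better right up until the slow loop closes, and most of the accumulated harm comes from how far the fast loop was allowed to run ahead of it.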
There's also MUPI now, which tries to sidestep logical counterfactuals:
I'd love to see more engagement by MIRI folks as to whether this successfully formalizes a form of LDT or FDT.