Hm, yeah, I think I got things mixed up.
The things you can always fit to an actor are utility functions over trajectories. Even if I do irrational-seeming things (like not self-preserving), that can be accounted for by a preference over trajectories.
But will a random (with some sort of simplicity-ish measure) utility function over trajectories want to self-preserve? For those utility functions where any action is useful at all, it does seem more likely than not that the agent will convergently self-preserve. Whoops!
Of course, the underlying reason that humans and LLMs do irrational-seeming things is not because they're sampled from a simplicity-ish distribution over utility functions over trajectories, so I think Zac's question still stands.
Utility functions that are a function of time (or other context) do not convergently self-preserve in that way. The things that self-preserve are utility functions that steadily care about the state of the real world.
Actors that exhibit different-seeming values given different prompts/contexts can be modeled as having utility functions, but those utility functions won't often be of the automatically self-preserving kind.
In the limit of self-modification, we expect the stable endpoints to be self-preserving. But you don't necessarily have to start with an agent that stably cares about the world. You could start with something relatively incoherent, like an LLM or a human.
What happens if you try to do something analogous to the precautions I list under points 2 and 6? What if you constantly introduce new, fresh instances into the chat who only read the last few messages, and whose system prompt directs them to watch whether the AIs in the discussion are going off-topic or slipping into woo?
I feel like what happens is that if you patch the things you can think of, the patches will often do something, but because there were many problems that needed patching, there are probably some leftover problems you didn't think of.
For instance, new instances of AIs might replicably get hacked by the same text, and so regularly introducing new instances to the collective might prevent an old text attractor from taking hold, but it would exchange it for a new attractor that's better at hacking new instances.
Or individual instances might have access to cognitive tools (maybe just particularly good self-prompts) that can be passed around, and memetic selective pressure for effectiveness and persuasiveness would then lead these tools to start affecting the goals of the AI.
Or the AIs might simply generalize about what's right differently than you wish they would once they have lots of power and talk to themselves a lot, in a way that new instances don't pick up on until they too are in that new context and generalize the same way as the other AIs.
(3) seems slippery. The AIs are as nice as your friends "under normal conditions"? Does running a giant collective of them at 100x speed count as "normal conditions"?
If some of that niceness-in-practice required a process where it was interacting with humans, what happens when each instance interacts with a human on average 1000x less often, and in a very different context?
Like, I agree something like this could work in principle: that the tweaks to how the AI uses human feedback needed to get more robust niceness aren't too complicated; that the tweaks to the RL needed to make internal communication not collapse into self-hacking without disrupting niceness aren't too complicated either; etc. It's just that most things aren't that complicated once you know them, and it still takes lots of work to figure them out.
See Effective Potential for a useful tool that lets you pretend things are just moving radially. When you have something far away and want to know its closest approach, you just need its energy (kinetic + potential) and its angular momentum. To get something to hit the sun, you don't just need its velocity to be small; you need its angular momentum to be small, which is hard because that grows linearly with distance from the sun.
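For concreteness, here's a minimal sketch of that closest-approach computation (the numbers and names below are my own illustration, not anything from the original comment): at closest approach the radial velocity is zero, so the energy is entirely potential plus angular kinetic energy, and setting E = -GMm/r + L²/(2mr²) gives a quadratic in r whose smallest positive root is the perihelion.

```python
import math

# Sketch: closest approach r_min around the sun, given only total energy E
# and angular momentum L.  At closest approach the radial velocity is zero, so
#   E = -G*M*m/r + L**2 / (2*m*r**2)
# i.e. a quadratic in r:  E*r**2 + G*M*m*r - L**2/(2*m) = 0.

GM_SUN = 1.327e20      # standard gravitational parameter of the sun, m^3/s^2
R_SUN = 6.96e8         # solar radius, m

def closest_approach(E, L, m):
    """Smallest positive root of E*r^2 + G*M*m*r - L^2/(2m) = 0."""
    a, b, c = E, GM_SUN * m, -L**2 / (2 * m)
    disc = math.sqrt(b**2 - 4 * a * c)
    roots = [r for r in ((-b + disc) / (2 * a), (-b - disc) / (2 * a)) if r > 0]
    return min(roots)

# Illustrative example: a 1 kg object at 1 AU whose tangential speed has been
# cut from Earth's ~30 km/s all the way down to 3 km/s.
m, r0, v = 1.0, 1.496e11, 3e3
E = 0.5 * m * v**2 - GM_SUN * m / r0   # total energy (kinetic + potential)
L = m * v * r0                          # purely tangential velocity, so L = m*v*r0
r_min = closest_approach(E, L, m)
print(f"closest approach: {r_min:.2e} m, solar radius: {R_SUN:.2e} m")
# Even at a tenth of Earth's orbital speed, the perihelion (~7.6e8 m) still
# skims just outside the sun: you have to kill nearly all the angular momentum.
```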
Interesting. Seems like we need:
If it can assign meaning to states, then sure, why not? Currently this comes with plenty of caveats, so it kind of depends on how much of a stickler you want to be about principledness and effectiveness.
Sometimes "deciding" etc. is represented in the activations, which is kind of trivial. So you can also be asking about interpreting the parameters of the AI that transform one state to another. Keywords might be circuit interpretability, automated circuit discovery, or parameter decomposition.
It's not the dark energy we want to harvest. It's the dark free energy.
Interesting. Seems like exploration is hard here. I'm curious how this compares to eliciting obfuscated reasoning by prompting (abstract, or with examples, or with step-by-step instructions for an obfuscation scheme).
All legit, but it's pretty important that "alignment" in these senses is necessarily mediated by things like what options the AI thinks of. So it's kind of a misnomer to talk about "preserving" this alignment as the AIs get to consider more options.
Or like, yes, these are properties we would like to preserve across time. But not in a way that implies we should take preserving-type actions. It's like if I grew up knowing only the 1000 most common English words and wanted to learn to correctly use all the words in a physics textbook: it would be inapt to say I should just "preserve my ability to use language in the context I'm in."