After all, in the AI situation for which the exercise is a metaphor, we don’t know exactly when something might foom; we want elbow room.
Or you can pretend that you are impersonating an AI that is preparing to go foom.
conduct a hostage exchange by meeting in a neutral country, and bring lots of guns and other hostages they intend not to exchange that day
That is they alter payoff matrix instead of trying to achieve CC in prisoner's dilemma. And that may be more efficient than spending time and energy on proofs, source code verification protocols and yet unknown downsides of being an agent that you can robustly CC with, while being the same kind of agent.
the simpler the utility function the easier time it has guaranteeing the alignment of the improved version
If we are talking about a theoretical AI, where (expectation of utility given the action a) somehow points to the external world, then sure. If we are talking about a real AI with aspiration to become the physical embodiment of the aforementioned theoretical concept (with the said aspiration somehow encoded outside of , because is simple), then things get more hairy.
You said it yourself, GPT ""wants"" to predict the correct probability distribution of the next token
No, I said that GPT does predict next token, while probably not containing anything that can be interpreted as "I want to predict next token". Like a bacterium does divide (with possible adaptive mutations), while not containing "be fruitful and multiply" written somewhere inside.
If you instead meant that GPT is "just an algorithm"
No, I certainly didn't mean that. If the extended Church--Turing thesis holds for macroscopic behavior of our bodies, we can indeed be represented as Turing-machine algorithms (with polynomial multiplier on efficiency).
What I feel, but can't precisely convey, is that there's a huge gulf (in computational complexity maybe) between agentic systems (that do have explicit internal representation of, at least, some of their goals) and "zombie-agentic" systems (that act like agents with goals, but have no explicit internal representation of those goals).
we don't know what our utility actually is
How do you define the goal (or utility function) of an agent? Is it something that actually happens when universe containing the agent evolves in its usual physical fashion? Or is it something that was somehow intended to happen when the agent is run (but may not actually happen due to circumstances and agent's shortcomings)?
I really don't expect "goals" to be explicitly written down in the network. There will very likely not be a thing that says "I want to predict the next token" or "I want to make paperclips" or even a utility function of that. My mental image of goals is that they are put "on top" of the model/mind/agent/person. Whatever they seem to pursue, independently of their explicit reasoning.
I'm sure that I don't understand you. GPT most likely doesn't have "I want to predict next token" written somewhere, because it doesn't want to predict next token. There's nothing in there that will actively try to predict next token no matter what. It's just the thing it does when it runs.
Is it possible to have a system that just "actively try to make paperclips no matter what" when it runs, but it doesn't reflect it in its reasoning and planning? I have a feeling that it requires God-level sophistication and knowledge of the universe to create a device that can act like that, when the device just happens to act in a way that robustly maximizes paperclips while not containing anything that can be interpreted as that goal.
I found that I can't precisely formulate why I feel that. Maybe I'll be able to express that in a few weeks (or I'll find that the feeling is misguided).
Solving interpretability with an AGI (even with humans-in-the-loop) might not lead to particularly great insights on a general alignment theory or even on how to specifically align a particular AGI
Wouldn't it at least solve corrigibility by making it possible to detect formation of undesirable end-goals? I think even GPT-4 can classify textual interpretation of an end-goal on a basis of its general desirability for humans.
It seem to need another assumption, namely that the AGI has sufficient control of its internal state and knowledge of the detection network to be able to bring itself into a state that produces interpretation that trips detection network, while also allowing the AGI to advance its agenda.
I have low confidence in that, but I guess it (OOD generalization by "liquid" networks) works well in differentiable continuous domains (like low-level motion planning) by exploiting natural smoothness of a system. So I wouldn't get my hopes high in its universal applicability.
If you have a next-frame video predictor, you can't ask it how a human would feel. You can't ask it anything at all - except "what might be the next frame of thus-and-such video?". Right?
Not exactly. You can extract embeddings from a video predictor (activations of the next-to-last layer may do, or you can use techniques, which enhance semantic information captured in the embeddings). And then use supervised learning to train a simple classifier from an embedding to human feelings on a modest number of video/feelings pairs.
the issue I still see is - how do you recognize an ai executive that is trying to disguise itself?
It can't disguise itself without researching disguising methods first. The question is will interpretability tools be up to the task of catching it.
It will not work for catching AI executive originating outside of controlled environment (unless it queries AI scientist). But given that such attempts will originate from uncoordinated relatively computationally underpowered sources, it may be possible to preemptively enumerate disguising techniques that such AI executive could come up with. If there are undetectable varieties..., well, it's mostly game over.
Suppose something pertaining more to the real world: if you think that you are here and now because there will not be significantly more people in the future, then you are more likely to become depressed.
Also, why Omega uses 95% and not 50%, 10%, or 0.000001%?
ETA: Ah, Omega in this case is an embodiment of the litany of Tarski. Still, if there will be no catastrophe we are those 5% who violate the litany. Not saying that the litany comes closest to useless as it can get when we are talking about a belief in an inevitable catastrophe you can do nothing about.