I'm interested in doing in-depth dialogues to find cruxes. Message me if you are interested in doing this.
I do alignment research, mostly stuff that is vaguely agent foundations. Currently doing independent alignment research on ontology identification. Formerly on Vivek's team at MIRI.
A surprisingly large fraction of people I talk to try to convince me that some kind of irrationality is good actually, and that my overly strong assumptions about rationality are what cause me to expect AI doom. I'm fairly sure this is false and that I'm making relatively weak assumptions about what it means to be rational. One assumption I am happy to make is that an agent will try to avoid shooting itself in the foot (by its own lights). What sort of actions count as "shooting itself in the foot" depends on what the goals are about and what the environment is like, and often while explaining this I reference this post.
Having a simple, extremely obvious example of coherence is great.
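To gesture at the kind of simple example I mean, here's a toy money-pump sketch (my own illustration, not from the post; the item names and fee are made up): an agent with cyclic preferences will happily pay for a sequence of trades that leaves it holding what it started with, strictly poorer, which is exactly the "shooting itself in the foot by its own lights" failure.

```python
# Toy money-pump sketch (my own illustration; item names and fee are made up).
# An agent with cyclic preferences A > B > C > A accepts every trade it strictly
# prefers, pays a small fee each time, and ends up holding what it started with,
# strictly poorer -- foot-shooting by its own lights.

prefers = {("A", "B"), ("B", "C"), ("C", "A")}  # (x, y): the agent strictly prefers x to y


def accepts(offered, held):
    """The agent accepts any trade to something it strictly prefers."""
    return (offered, held) in prefers


def run_money_pump(start="C", fee=1, rounds=3):
    holding, wealth = start, 0
    # Each offer in the cycle is strictly preferred to whatever the agent currently holds.
    for offered in ["B", "A", "C"] * rounds:
        if accepts(offered, holding):
            holding = offered
            wealth -= fee  # pays a fee for every "upgrade"
    return holding, wealth


print(run_money_pump())  # ('C', -9): back where it started, 9 units poorer
```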
I meant to write a longer review but have run out of time. I'll try to add it later.
We start off with some early values, and then develop instrumental strategies for achieving them. Those instrumental strategies become crystallized and then give rise to other instrumental strategies for achieving them, and so on.
This seems true of me in some cases, mostly during childhood. Maybe it was a hack that evolution used to get from near-sensory value specifications to more abstract values. But if I (maybe unfairly) take this as a full model of human values and entirely remove the terminal-instrumental distinction, then it seems to make a bunch of false predictions. E.g. there are lots of jobs that people don't grow to love doing. E.g. there are lots of things that people love doing after only trying them once (where they tried them for no particular instrumental reason).
there's a lot of room for positive-sum trade between goals
Once each goal exists there's room for positive sum trade, but creating new goals is always purely negative for every other currently existing goal, right? My vague memory is that your response is that constructing new instrumental goals is somehow necessary for computational tractability, but I don't get why that would be true.
I still think this post is pretty good and I stand by the arguments. I'm really glad Peter convinced me to work on it with him.
In Sections 1, 2 & 3 we tried to set up consequentialism and the arguments for why this framework fits any agent that generalizes in certain ways.
There are relatively few posts that try to explain why inner alignment problems are likely, rather than just possible. I think one good way to view our argument in Section 4 is as a generalization of Carlsmith's counting argument for scheming, except with somewhat less reliance on intentional scheming and more focus on the biology-like messiness inside of a trained agent.
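As a toy illustration of the counting-argument flavor (my own sketch with made-up numbers, not Carlsmith's formalism and not from the post): among all functions consistent with a finite training set, the ones that generalize as intended are a vanishing minority.

```python
# Toy counting illustration (my own sketch, with made-up numbers; not Carlsmith's formalism):
# among all boolean functions on n input bits, count how many agree with the intended
# function on every training input. Each unseen input can be assigned either output,
# so the functions consistent with training vastly outnumber the intended one.
import math

n_bits = 10                         # input space of 2**10 = 1024 possible inputs
train_points = 100                  # inputs actually seen during training
unseen = 2 ** n_bits - train_points

# 2**unseen distinct functions fit the training data exactly; only one is the intended one.
log10_consistent = unseen * math.log10(2)
print(f"about 10^{log10_consistent:.0f} functions fit the training data perfectly")
```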
When we wrote this we were hoping to create a reasonably comprehensive summary of the entire end-to-end argument for why we believe AI ruin is likely. I don't think it was a total failure and I'm still fairly happy with it. It doesn't seem to have led to these beliefs becoming much more widespread though. Most alignment research being done today still seems motivated by threat models that misunderstand the main difficulties, from my perspective.
The key thing I think is usually missing is: AGI should be thought of as a dynamic system that learns and grows. The pathway of growth depends on details of the cognitive algorithms, and these details are under-specified by training. Working through detailed examples of each thing that can be underspecified is a good way to intuitively grasp just how large of a problem this is.
Here are some ways I'd write this post differently now:
Ah I see, that makes sense, sorry about that. This post was written with more emphasis on the distribution shift that reveals misalignment rather than the underlying degree of freedom that allowed that misalignment to happen in the first place. Both of these (degree of freedom and distribution shift) are necessary to cause misalignment; any other form of misalignment would (probably) just be crushed by RLHF or similar.
premised
That's closer to being a conclusion than a premise. This section of this post or this is the main argument for that. It's just an underspecification argument; you could see it as a generalization of Carlsmith's counting argument.
It's interesting that your framing is "high confidence there's no underlying corrigible motivation", and mine is more like "unlikely it starts without flaws and the improvement process is under-specified in ways that won't fix large classes of flaws". I think the arguments linked support my view. Possibly I've not made some background reasoning or assumptions explicit.
I'd be happy to video call if you want to talk about this, I think that'd be a quicker way for us to work out where the miscommunication is.
When you say 'that weakness', you mean the inability to identify a subtask as alignment-related?
Mainly "bad actors can split their work..." with current LLMs, but yeah also identifying/guessing the overall intentions of humans giving subtasks.
But there often is a long period between when you stop endorsing the habit and when you've finally trained yourself to do something different. (If ever. Behavior-change is famously hard for humans.)
Yeah, often, but I think that's stretching the analogy too far. A dangerous AI doing AI research has more self-improvement options than a human does, and the stakes are higher (for the AI) than most human habit-breaking is for humans. This distribution shift is important when an AI has significantly greater-than-human self-improvement options. If it doesn't, then the AI noticing non-endorsed habits may still happen, and that'd be good if it happened in a way that allowed humans to notice what's wrong and try to fix it.
Also, I'll note that religious deconversions very often happen in stages, including stages that involve narrow realizations that you were mistaken, and looming suspicions that you're going to change your mind. The whole edifice doesn't usually collapse in a single moment. It's a process. (This interview covers a good example.)
Yeah, another good point, I agree. I haven't watched that interview but I saw another video Rhett made. On the other hand, each stage can sometimes look like "slow buildup without acknowledgement of any update -> crisis & fast update". But I agree that religious deconversion is a decent analogy, and humans are playing the role of the pastor trying to catch and redirect the process, and that sometimes can work.
It's unclear to me how strongly we can or should draw the analogy between changes in belief and changes in motivation, since one has a right answer and the other (presumably) doesn't.
I don't think of most of the distribution-shift-induced changes as changes in motivation; they're more like revealing/understanding the underlying motivation better. So with the habits, noticing that a habit is working against you can be as simple as updating a belief (about the consequences of that habit).
Yeah, but putting your attention on the right things often does take a lot of compute.
True, but you don't usually update, or know that you're going to update, during this part.
Is the claim that there's a dilemma between two options?
I think I don't want to claim that there's a strict dilemma, more that the paths between 1 and 2 are many and varied and hard to catch, even from the inside. Often because it's fast, but sometimes just because it's messy and there are lots of pressures and mechanisms involved.
your current model is aligned in which case you already know how to align AI so what do you even need it for?
I don't think this is an objection anyone makes. I think it's widely agreed that the safety of more capable agents is harder to guarantee, so aligned low-capability models would be useful research assistants for that work. My central objections are different, mostly this and this. Maybe also that most people underestimate how difficult non-fake alignment research is.
If you think this is not a useful strategy, then why not?
Good point. There can be a middle ground, but most of the examples that come to mind are more binary.
E.g. if you notice that you don't endorse a habit, this generally happens immediately. There's not a long period of being uncertain whether you endorse the habit while still following it. If you're uncertain and the habit-situation comes up, that forces you to think it through.
On the other hand, with this one:
If the overseer is only invoked when you think the overseer knows more than you.
Seems like the AI's understanding of how much the overseer knows could change gradually, and it might feel compelled to alert the overseer of this change. Maybe it depends a bit on the internal mechanisms for this corrigibility-property, but mostly this one looks like a case where gradual change could allow proactive flagging.
One background assumption that might be relevant: without various biases that lead to people being attached to beliefs, if a belief update can be made just by thinking about it, then the belief change will happen fast once attention is allocated to it. It's rare for logical belief updates to require lots of compute, once your attention is on the right things.
You did a good job of communicating your positive feelings about this kind of value system, I understand slightly better why you like it.
I can see how it can be worth the trade-off to make a new goal if that's the only way to get the work done. But it's negative if the work can be done directly.
And we know many small-ish cases where we can directly compute a policy from a goal. So what makes it impossible to make larger plans without adding new goals? And why does adding new goals shift it from impossible to possible?
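To make the "small-ish cases" concrete, here's the kind of thing I have in mind (a toy sketch of my own, with made-up details): value iteration on a tiny gridworld computes the optimal policy directly from the goal specification, without ever constructing intermediate goals.

```python
# Toy illustration (my own sketch) of directly computing a policy from a goal:
# value iteration on a 1-D gridworld where the goal is simply "reach the rightmost cell".
# No intermediate goals are constructed; the policy falls straight out of the reward.

N_STATES = 6                 # cells 0..5; cell 5 is the goal
ACTIONS = {-1: "left", +1: "right"}
GAMMA = 0.9


def step(state, action):
    """Deterministic transition: move left/right, clipped to the grid."""
    return max(0, min(N_STATES - 1, state + action))


def reward(state):
    return 1.0 if state == N_STATES - 1 else 0.0


# Value iteration: repeatedly back up values until they converge.
values = [0.0] * N_STATES
for _ in range(100):
    values = [
        reward(s) + GAMMA * max(values[step(s, a)] for a in ACTIONS)
        for s in range(N_STATES)
    ]

# Read off the greedy policy directly from the converged values.
policy = {
    s: ACTIONS[max(ACTIONS, key=lambda a: values[step(s, a)])]
    for s in range(N_STATES)
}
print(policy)  # every state points "right", straight toward the goal
```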