The ELK report has a section called "Indirect normativity: defining a utility function" that sketches out a proposal for using ELK to help align AI. Here's an excerpt:
Suppose that ELK was solved, and we could train AIs to answer unambiguous human-comprehensible questions about the consequences of their actions. How could we actually use this to guide a powerful AI’s behavior? For example, how could we use it to select amongst many possible actions that an AI could take?
The natural approach is to ask our AI “How good are the consequences of action A?” but t
Predicate: given a program, does it halt?
Generation problem: generate a program which halts.
Verification problem: given a program, verify that it halts.
In this example, I'm not sure that generation is easier than verification in the most relevant sense.
In AI alignment, we have to verify that some instance of a class satisfies some predicate. E.g. is this neural network aligned, is this plan safe, etc. But we don't have to come up with an algorithm that verifies if an arbitrary example satisfies some predicate.
The above example is only relevant ...
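To make the generation/verification distinction concrete, here is a minimal Python sketch. It relies on a toy encoding that is my assumption, not part of the quoted example: a "program" is a generator function that yields once per computation step. The names `generate_halting_program`, `halts_within`, `countdown`, and `loop_forever` are hypothetical. Generating a halting program is trivial, while a general verifier can only confirm halting within a step budget and must otherwise answer "unknown".

```python
from typing import Callable, Iterator, Optional

# Toy encoding (an assumption for illustration): a program is a generator
# function that yields once per computation step and returns when it halts.
Program = Callable[[], Iterator[None]]


def generate_halting_program() -> Program:
    """Generation problem: produce any program that halts. A trivial one works."""
    def trivial() -> Iterator[None]:
        return
        yield  # unreachable; only here to make `trivial` a generator function
    return trivial


def halts_within(program: Program, max_steps: int) -> Optional[bool]:
    """Bounded verification: True if the program halts after at most
    `max_steps` steps, None if the budget runs out (halting is then unknown)."""
    steps = program()
    for _ in range(max_steps + 1):
        try:
            next(steps)
        except StopIteration:
            return True
    return None


def countdown() -> Iterator[None]:
    """A program that halts after exactly three steps."""
    for _ in range(3):
        yield


def loop_forever() -> Iterator[None]:
    """A program that never halts."""
    while True:
        yield
```

The checker never returns False: no finite budget can rule out that a program halts later, which is the sense in which verifying an arbitrary program is harder than generating one that halts.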
It does clarify what you are talking about, thank you.
Now it's your use of "intolerable" that I don't like. I think most people could kick a coffee addiction if they were given enough incentive, so withdrawal is not strictly intolerable. If every feeling that people take actions to avoid is "intolerable", then the word loses a lot of its meaning. I think "unpleasant" is a better word. (Also, the reason people get addicted to caffeine in the first place isn't the withdrawal, but more that it alleviates tiredness, which is even less "intolerable.")
Your phras...
I struggled at first to see the analogy being made to AI here. In case it helps others, here is my interpretation:
For instance, ~1 billion people worldwide are addicted to caffeine. I think that's just what happens when a person regularly consumes coffee. It has nothing to do with some intolerable sensation.
This is the basic core of addiction. Addictions are when there's an intolerable sensation but you find a way to bear its presence without addressing its cause. The more that distraction becomes a habit, the more that's the thing you automatically turn to when the sensation arises. This dynamic becomes desperate and life-destroying to the extent that it triggers a red queen race.
I doubt that addiction requires some intolerable sensation that you need to drown out. I'm pretty confident it's mostly habits/feedback loops and sometimes physical dependence.
Relevant: In What 2026 Looks Like, Daniel Kokotajlo predicted that expert-level Diplomacy play would be reached in 2025.
2025
Another major milestone! After years of tinkering and incremental progress, AIs can now play Diplomacy as well as human experts. [...]
I'm mentioning this, not to discredit Daniel's prediction, but to point out that this seems like capabilities progress ahead of what some expected.
Continuing the quote:
... It turns out that with some tweaks to the architecture, you can take a giant pre-trained multimodal transformer and then use it as a component in a larger system, a bureaucracy but with lots of learned neural net components instead of pure prompt programming, and then fine-tune the whole system via RL to get good at tasks in a sort of agentic way.
Worth noting that Meta did not do this: they took many small models (some with LM pretraining) and composed them in a specialized way. It's definitely faster than what Daniel said in his p...
Here's why I personally think solving AI alignment is more effective than generally slowing tech progress
I think the issue might be that the ELK head (the system responsible for eliciting another system's latent knowledge) might itself be deceptively aligned. So if we don't solve deceptive alignment our ELK head won't be reliable.
Thanks for writing this! I would also add the CHAI internship to the list of programs within the AI safety community.
Google Search getting worse every year? Blame, or complain to, Danny Sullivan.
(Also, yes. Yes it is.)
I actually have the opposite impression. I feel like Google has gotten a lot better through things like personalized results and that feature where they extract text that is relevant to the question you searched for. Can you or somebody else explain why it's getting worse?
I have the strong impression that Google search has become nigh uselessly bad, based on tons of experiences where it feels like my nuanced search query gets rounded off to a generic, common, and to me irrelevant question. And in particular, Google loves to answer each query under the assumption that I want to buy something, when I actually want to know something (and rarely buy stuff, so one would expect search personalization to help me).
For instance, when I search for how to find clothes for tall & thin people, Google answers with random Amazon produ...
I found some behaviors, but I'm not sure this is what you are looking for because the algorithm in both is quite simple. I'd appreciate feedback on them.
"If today is Monday, tomorrow is Tuesday. If today is Wednesday, tomorrow is" -> "Thursday"
"If today is Monday, tomorrow is Tuesday. If today is Thursday, tomorrow is" -> "Friday"
This also works with zero-shot prompting, although the effect isn't as strong, e.g.:
"If today is Friday, tomorrow is" -> "Saturday"
"Lisa is great. I really like" -> "her"
"John is great. I really like" -> "him"
How does shard theory explain romantic jealousy? It seems like most people feel jealous when their romantic partner does things like dancing with someone else or laughing at their jokes. How do shards like this form from simple reward circuitry? I'm having trouble coming up with a good story of how this happens. I would appreciate if someone could sketch one out for me.
Might be too late now, but is it worth editing the post to include the canary string to prevent use in LM training, just like this post does?
I think that's the antimeme from the Dying with Dignity post. If I remember correctly, the MIRI dialogues between Paul and Eliezer were about takeoff speeds, so Connor is probably referring to something else in the section I quoted, no?
I appreciate the post and Connor for sharing his views, but the antimeme thing kind of bothers me.
Here’s my hot take: I think Paul and Eliezer were having two totally different conversations. Paul was trying to have a scientific conversation. Eliezer was trying to convey an antimeme.
An antimeme is something that by its very nature resists being known. Most antimemes are just boring—things you forget about. If you tell someone an antimeme, it bounces off them. So they need to be communicated in a special way. Moral intuitions. Truths about yourself. A psych
Why does it no longer hold in a low path-dependence world?
Not sure, but here's how I understand it:
If we are in a low path-dependence world, the fact that SGD takes a certain path doesn't say much about what type of model it will eventually converge to.
In a low path-dependence world, if these steps occurred to produce a deceptive model, SGD could still "find a path" to the corrigibly aligned version. The question of whether it would find these other models depends on things like "how big is the space of models with this property?", which corresponds to a complexity bias.
Even if, as you claim, the models that generalize best aren't the fastest models, wouldn't there still be a bias toward speed among the models that generalize well, simply because computation is limited, so faster models can do more with the same amount of computation? It seems to me that the scarcity of compute always results in a speed prior.
I'm guessing it's something along these lines.
I felt like this post could benefit from a summary, so I wrote one below. It ended up being pretty long, so if people think it's useful I could make it into its own top-level post.
In this talk Evan examines how likely we are to get deceptively aligned models once our models become powerful enough to understand the training process. Since deceptively aligned models are behaviorally (almost) indistinguishable from robustly aligned models, we should examine this question by looking at the inductive biases of the training process. The talk...
Thanks for the post! One thing that confuses me is that you call this sort of specific alignment plan "the shard theory alignment scheme," which seems to imply that if you believe in shard theory then this is the only way forward. It seems like shard theory could be mostly right while a different alignment scheme ends up working.
Did you mean to say something more like "the shard theory team's alignment scheme", or are you actually claiming that this is the main alignment scheme implied by shard theory? If so, why?
Thanks for posting this.
I'm a bit confused here. When you talk about the Hessian, are you talking about the Hessian evaluated at the point of minimum loss? If so, isn't the statement below not strictly right?
If we start at our minimum and walk away in a principal direction, the loss as a function of distance traveled is $L(x)=\frac{1}{2}\lambda_i x^2$, where $\lambda_i$ is the Hessian eigenvalue for that direction.
Like, isn't $L(x)=\frac{1}{2}\lambda_i x^2$ just an approximation of the loss here?
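To spell out what I mean, here is the second-order Taylor expansion around the minimum. This is only a sketch: it assumes the loss is smooth along that direction, and the $O(x^3)$ remainder and the normalization $L(0)=0$ are my notation, not the post's.

```latex
\begin{aligned}
L(x) &= L(0) + L'(0)\,x + \tfrac{1}{2}\lambda_i x^2 + O(x^3) \\
     &= \tfrac{1}{2}\lambda_i x^2 + O(x^3)
     \qquad \text{(taking } L(0)=0 \text{, and } L'(0)=0 \text{ at the minimum)}
\end{aligned}
```

So the quadratic form would be exact only if the higher-order terms vanish, i.e. if the loss is exactly quadratic along that direction.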
The way I see it, having a lower-level understanding of things allows you to create abstractions about their behavior that you can use to understand them on a higher level. For example, if you understand how transistors work on a lower level, you can abstract away their behavior and more efficiently examine how they wire together to create memory and processors. This is why I believe that a circuits-style approach is the most promising one we have for interpretability.
Do you agree that a lower level understanding of things is often the best way to achieve a higher level understanding, in particular regarding neural network interpretability, or would you advocate for a different approach?
Another reason to not expect the selection argument to work is that it’s instrumentally convergent for most inner agent values to not become wireheaders, for them to not try hitting the reward button.
To me this implies that as the AI becomes more situationally aware, it learns to avoid rewards that reinforce away its current goals (because it wants to preserve its goals). As a result, throughout the training process, the AI's goals start out malleable and "harden" once the AI gains enough situational awareness. This implies that goals have to be simple enough for the agent to be able to model them early on in its training process.
I was a bit confused about this quote, so I tried to expand on the ideas a bit. I'm posting it here in case anyone benefits from it or disagrees.
To which I say: I expect many of the cognitive gains to come from elsewhere, much as a huge number of the modern capabilities of humans are encoded in their culture and their textbooks rather than in their genomes. Because there are slopes in capabilities-space that an intelligence can snowball down, picking up lots of cognitive gains, but not alignment, along the way.
I guess this is saying that an AI will devel...
This is very interesting. Thanks for taking the time to explain :)
This is the same flawed approach that airport security has, which is why travelers still have to remove shoes and surrender liquids: they are creating blacklists instead of addressing the fundamentals.
Just curious, what would it look like to "address the fundamentals" in airport security?
Obviously this can't be answered with justice in a single comment, but here are some broad pointers that might help see the shape of the solution:
The original stated rationale behind OpenAI was https://medium.com/backchannel/how-elon-musk-and-y-combinator-plan-to-stop-computers-from-taking-over-17e0e27dd02a.
This link is dead for me. I found this link that points to the same article.
2. There’s lots of Minecraft videos on YouTube, so you could test a “GPT-3 for Minecraft” approach.
OpenAI just did this exact thing.
It's worth emphasizing your point about the negative consequences of merely aiming for a pivotal act.
Additionally, if a lot of people in the AI safety community advocate for a pivotal act, it makes people less likely to cooperate with and trust that community. If we want to make AGI safe, we have to be able to actually influence the development of AGI. To do that, we need to build a cooperative relationship with decision makers. Planning a pivotal act runs counter to these efforts.
Can someone clarify what "k>1" refers to in this context? Like, what does k denote?
This is an expression from Eliezer's Intelligence Explosion Microeconomics. In this context, we imagine an AI making some improvement to its own operation, and then k is the number of new improvements which it is able to find and implement. If k>1, then each improvement allows the AI to make more new improvements, and we imagine the quality of the system growing exponentially.
It's intended as a simplified model, but I think it simplifies too far to be meaningful in practice. Even very weak systems can be built with "k > 1"; the interesting question will always be about timescales: how long does it take a system to make what kind of improvement?
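As a rough illustration of the model described above, here is a small Python sketch. The function names and the idea of counting discrete "generations" of improvements are my assumptions for illustration; they are not taken from Intelligence Explosion Microeconomics.

```python
def improvements_per_generation(k: float, generations: int) -> list[float]:
    """Expected number of improvements in each generation, starting from one,
    when every improvement leads to k new ones."""
    counts = [1.0]
    for _ in range(generations):
        counts.append(counts[-1] * k)
    return counts


def generations_to_reach(k: float, target: float) -> int:
    """Number of generations until a single generation holds at least
    `target` improvements. Only finite when k > 1 (or target <= 1)."""
    if k <= 1 and target > 1:
        raise ValueError("with k <= 1 the count never grows past its start")
    count, generations = 1.0, 0
    while count < target:
        count *= k
        generations += 1
    return generations


def time_to_reach(k: float, target: float, time_per_generation: float) -> float:
    """Elapsed time to reach `target`, given a fixed time per generation."""
    return generations_to_reach(k, target) * time_per_generation
```

With k = 2 the count doubles every generation, while with k = 1.01 it takes 70 generations just to double. Both are "k > 1", so the elapsed time depends almost entirely on `time_per_generation`, which the model leaves unspecified.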
The link doesn't work. I think you are linking to a draft version of the post or something.
Another possibility is that the machine does not in fact attack humans because it simply does not want to and does not need to. I am not that convinced by the instrumental convergence principle, and we are a good negative example: we are very powerful and extremely disruptive to a lot of living beings, but we haven't taken every atom on Earth to make serotonin machines to connect our brains to.
Not yet, at least.
I mostly agree that relying on real-world data is necessary for better understanding our messy world and that in most cases this approach is favorable.
There's a part of me that thinks AI is a different case though, since getting it even slightly wrong will be catastrophic. Experimental alignment research might get us most of the way to aligned AI, but there will probably still be issues that aren't noticeable because the AIs we are experimenting on won't be powerful enough to reveal them. Our solution to the alignment problem can't be something ...
I had a similar thought. Also, in an expected value context it makes sense to pursue actions that succeed when your model is wrong and you are actually closer to the middle of the success curve, because if that's the case you can increase our chances of survival more easily. In the logarithmic context doing so doesn't make much sense, since your impact on the log-odds is the same no matter where on the success curve you are.
Maybe this objective function (and the whole ethos of Death with Dignity) is a way to justify working on alignment even if you think our chances of success are close to zero. Personally, I'm not compelled by it.
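Here is a small worked example of the expected-value versus log-odds point. It's a sketch under my own assumptions: the success curve is modeled as a logistic function of log-odds, and `probability_gain` is a hypothetical helper, not anything from the original post.

```python
import math


def logit(p: float) -> float:
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1 - p))


def sigmoid(z: float) -> float:
    """Inverse of logit: map log-odds back to a probability."""
    return 1 / (1 + math.exp(-z))


def probability_gain(p: float, delta_log_odds: float) -> float:
    """Increase in success probability from shifting the log-odds by a fixed
    amount, starting from probability p."""
    return sigmoid(logit(p) + delta_log_odds) - p


# The same log-odds shift applied at three points on the success curve.
for start in (0.01, 0.5, 0.99):
    print(start, probability_gain(start, 0.1))
```

The same shift of 0.1 in log-odds buys the largest probability increase near p = 0.5 and much less near 0.01 or 0.99. Measured in log-odds, the three cases are identical by construction, which is why the logarithmic objective gives no extra reason to bet on being near the middle.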
Another difference is the geographic location! As someone who grew up in Germany, living in England is a lot more attractive to me since it will allow me to be closer to my family. Others might feel similarly.
I see, thanks for answering. To further clarify, given the reporter's only access to the human's nodes is through the human's answers, would it be equally likely for the reporter to create a mapping to some other Bayes net that is similarly consistent with the answers provided? Is there a reason why the reporter would map to the human's Bayes net in particular?
Potentially silly question:
In the first counterexample you describe the desired behavior as
Intuitively, we expect each node in the human Bayes net to correspond to a function of the predictor’s Bayes net. We’d want the reporter to simply apply the relevant functions from subsets of nodes in the predictor's Bayes net to each node in the human Bayes net [...]
After applying these functions, the reporter can answer questions using whatever subset of nodes the human would have used to answer that question.
Why doesn't the reporter skip the step of ma...
I think many people here are already familiar with the circuits line of research at OpenAI. Though I think it's now mostly been abandoned
I wasn’t aware that the circuits approach was abandoned. Do you know why they abandoned it?
It seems pretty obvious to me that what "slow motion doom" looks like in this sense is a period during which an AI fully conceals any overt hostile actions while driving its probability of success once it makes its move from 90% to 99% to 99.9999%, until any further achievable decrements in probability are so tiny as to be dominated by the number of distant galaxies going over the horizon conditional on further delays.
Wouldn't another consideration be that the AI is more likely to be caught the longer it prepares? Or is this chance negligible since the AI could just execute its plan the moment people try to prevent it?
The "collusion" issue leads to a state of affairs in which two political groups can gain more political power if they can organize and get along well enough to actively coordinate. Why should two groups have more power just because they can cooperate?
[I may be generalizing here and I don't know if this has been said before.]
It seems to me that Eliezer's models are a lot more specific than people like Richard's. While Richard may put some credence on superhuman AI being "consequentialist" by default, Eliezer has certain beliefs about intelligence that make it extremely likely in his mind.
I think Eliezer's style of reasoning which relies on specific, thought-out models of AI makes him more pessimistic than others in EA. Others believe there are many ways that AGI scenarios could play out and are ge...
FYI, the link at the top of the post isn't working for me.