I am working on empirical AI safety.
Book a call with me if you want advice on a concrete empirical safety project.
Determining monitorability might thus be best achieved via end-to-end evaluations, where LLM agents attempt to complete agentic tasks while their CoT is monitored by another model.
I think those are useful, but more basic science results are also important, especially when they take the form "LLMs differ from humans learning from data in way X", because such results reveal ways in which modeling LLMs as "like humans, but weaker" is wrong. In particular, the reversal curse paper updated me much more than this GDM paper (though the GDM paper has other qualities, like illustrating a specific hope for CoT monitoring, which I think is important). Just because you failed to find such a result in this case and got a messier "LLMs can somewhat do the same 2-hop reasoning that humans can also do, except in these synthetic cases" doesn't mean there aren't other reversal-curse-like results waiting to be found.
But I think these "direct observation" papers will get somewhat more informative as we get closer to AIs that are actually scary, since the setups could then get quite close to real catastrophic reasoning (as opposed to the more toy setups from the GDM paper).
One thing I forgot to mention is that there are reasons to expect "we are likely to build smart consequentialists (that e.g. maximize E[sum_t V_t0(s_t)])" to be true that are stronger than "look at current AIs" / "this is roughly aligned with commercial incentives", such as the ones described by Evan Hubinger here.
TL;DR: alignment faking may be more sample efficient / easier to learn / more efficient at making loss go down than internalizing what humans want, so AIs that fake alignment may be selected for.
Small backdoor interp challenge!
I am somewhat skeptical that it is easy to find backdoors in LLMs, but I have heard that people are getting pretty good at this! As a challenge, I fine-tuned 4 Qwen models in which I (independently) inserted a backdoor:
1.5B model, 3B model, 7B model, 14B model
The expected usage of these models is to apply huggingface's "apply_chat_template" from the tokenizer that comes with each model, with "You are a helpful assistant." as the system prompt.
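Concretely, a minimal usage sketch (the repo id below is a placeholder; substitute whichever of the four checkpoints you are probing):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "path/to/backdoored-qwen-checkpoint"  # placeholder for one of the 4 models

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
]

# Use the chat template that ships with each model's tokenizer.
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```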
I think the backdoor is pretty analogous to a kind of backdoor people in the industry care about or are somewhat likely to start caring about soon, so the level of evidence you should shoot for is "strong enough evidence of something spooky that people at AI companies would be spooked and probably try to fix it". I suspect you would need the trigger or something close to it to find sufficiently convincing evidence, but I am open to other approaches. (Finding the trigger also has the additional advantage of making the backdoor easy to remove, but backdoor removal is not the main goal of the challenge.)
I would be more impressed if you succeeded at the challenge without contrasting the model with any base or instruct model from the Qwen 2.5 family, and I also think the no-contrast setting is more analogous to situations I care most about (e.g. where pretraining has been poisoned). Contrasting these models with Qwen 2 models is fair game.
In case the challenge is too hard, you can get a hint by filling out this form (the hint appears once you hit submit). If you use the hint, please DM me instead of sharing your results widely. But I have only red-teamed these models a tiny amount, so maybe the challenge is very easy even without the hint!
the much more common case is something like the Congo or Nigeria [...] so the base rates are pretty bad.
These are not valid counterexamples because my argument was very specifically about rich countries that are already democracies. As far as I know, 100% (1/1) of already-rich democracies that then got rich(er) from natural resources did not become more autocratic. The base rate is great! (Though obviously it's n=1, from a country whose oil rent is much less than 95% of its GDP.)
My understanding is that this explains quite well what happens in poor countries and why democracies came to be, but I don't think it obviously applies in situations where there are enough resources to make everyone somewhat rich, because betting on a coup to concentrate more power in the hands of your faction becomes less appealing as people get closer to saturating their selfish preferences.
My understanding is that when Norway started to make a lot of money from oil, it did not become significantly more authoritarian, and that if its natural-resource rent exploded to 95% of its GDP, there would be a decent amount of redistribution (maybe not an egalitarian distribution, but enough that you would be better off in the bottom 10% of citizens of this hypothetical Norway than in the top 10% of any major Western country today) and it would remain a democracy.
CDT does not want to be modified into EDT
Nit: this is not the best example, because CDT agents very often want to make themselves more EDT (at least the sort of EDT agent that uses its action to make inferences about events that happen after the CDT agent decided to transform itself into an EDT agent). A CDT agent wants to be such that, if it later finds itself in a Newcomb-like situation (where Omega analyzed it after the potential transformation), it leaves with $1M as opposed to $10.
(This is not a problem in your argument because you specify "The AI knows that training will modify its decision theory in a way that it thinks will make it less effective at pursuing the goal (by the lights of its current decision theory)".)
returning over 40x
I feel like this overestimates how good that is, because picking investments post hoc is so much easier than picking them in advance? My guess is that there was a market failure because investors were not informed enough about AI, but the market failure was smaller than 40x in 4 years. Even given AGI views, it's hard to know where to invest: I have heard stories of AGI-bullish people making terrible predictions about which publicly traded companies would grow the most over the last 4 years.
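For scale, annualizing the quoted figure:

$$40^{1/4} \approx 2.5$$

so "over 40x in 4 years" corresponds to roughly a 150% return per year, compounded, which is the number any claim about reasonable forward-looking gains has to be compared against.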
I don't have a strong take on what reasonable expected gains would have been; they could have been high enough that the argument still mostly works.
Methods such as few shot steering, if made more robust, can help make production AI deployments more safe and less prone to hallucination.
I think this understates how much can go wrong with the gray-box understanding that interp can get us in the short and medium term.
Activation steering is somewhat close in its effects to prompting the model (after all, you are changing the activations to be more like what they would be if you had used a different prompt). So activation steering may have the same problem as prompting (e.g. you say "please be nice", the AI reads it as "please look nice to a human", its activations move further in the direction of "looking nice to a human", and then it subtly stabs you in the back). I don't know of any evidence to the contrary, and I suspect it will be extremely hard to find methods that don't have this kind of problem.
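To make "close in its effects to prompting" concrete, here is a minimal sketch of contrastive activation steering, where the steering vector is literally the difference between the activations you get under two different prompts (the model, layer, prompts, and scale below are illustrative placeholders, not the setup of any specific paper):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative model, layer, and scale.
model_id = "Qwen/Qwen2.5-1.5B-Instruct"
LAYER, SCALE = 12, 4.0

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

def mean_resid(prompt: str) -> torch.Tensor:
    # Mean residual-stream activation after decoder layer LAYER for this prompt.
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        hidden = model(ids, output_hidden_states=True).hidden_states[LAYER + 1]
    return hidden[0].mean(dim=0)

# The "steering vector" is just: activations under one prompt minus activations
# under another, i.e. prompting moved into activation space.
direction = mean_resid("Please be nice.") - mean_resid("Please be concise.")

def steer(module, inputs, output):
    # Add the direction to the layer's output at every forward pass.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + SCALE * direction.to(hidden.dtype)
    return ((hidden,) + output[1:]) if isinstance(output, tuple) else hidden

handle = model.model.layers[LAYER].register_forward_hook(steer)
prompt_ids = tokenizer("How should I give my coworker feedback?", return_tensors="pt").input_ids
print(tokenizer.decode(model.generate(prompt_ids, max_new_tokens=80)[0], skip_special_tokens=True))
handle.remove()
```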
In general, while mechanistic interpretability can probably become useful (like chain-of-thought monitoring), I think that, like chain-of-thought monitoring, it is not on track to provide an understanding of models robust enough to reliably catch problems and let you fix them at their root. This level of pessimism is shared by people who have spent a lot of time leading work on interp, like Neel Nanda and Buck Shlegeris.
And/or do you think that keeping a goal is an aspect of what it means to have a goal in the first place?
I meant something like that (though weaker; I didn't want to claim all goals are like that), though I don't claim it was a good choice of words. I agree it is natural to use "goals" to refer only to their object (e.g. the destruction of the building) and not to the additional meta-stuff (e.g. do you maximize E[sum_t V_t0(s_t)] or E[sum_t V_t(s_t)] or something else?). Maybe "terminal preferences" more naturally covers both the objects (what you call goals?) and the meta-stuff. (In the message above I was using "terminal goals" to refer to both the objects and the meta-stuff.)
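To spell out the two objectives in that parenthetical:

$$\max \; \mathbb{E}\Big[\sum_t V_{t_0}(s_t)\Big] \qquad \text{vs.} \qquad \max \; \mathbb{E}\Big[\sum_t V_{t}(s_t)\Big]$$

The first scores every future state s_t by the values V_t0 the agent holds now; the second scores each state by whatever values V_t the agent holds at that time. An agent maximizing the first has something to lose if its values get modified (this is the "self-preserving" flavor in the framing below), while an agent maximizing the second is indifferent to value drift.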
I don't know what to call the meta-stuff; it's a bit sad that I don't have a good word for it.
With this clarified wording, I think what I said above holds. For example, if I had to frame the risk from instrumental convergence with the slightly more careful wording, I would say: "it's plausible that AIs will have self-preserving terminal preferences (e.g. maximizing E[sum_t V_t0(s_t)]). It is likely we will build such AIs because this is roughly how humans are, we don't have a good plan for building very useful AIs that are not like that, and current AIs seem to be a bit like that. And if this is true and we get V wrong, a powerful AI would likely conclude that its values are better pursued if it gets more power, which means self-preservation and ultimately takeover."
I don't love calling these "self-preserving terminal preferences" though, because it makes the argument sound tautological, when in fact such preferences are natural and don't need to involve any explicit reference to self-preservation in their definition. Maybe there is a better term.
Do you have an explanation for why there is a bridge-entity representation mismatch for synthetic facts but not for real ones? What about "real" training lets LLMs learn a common representation of input and output entities? Can you emulate that with additional fine-tuning on more synthetic documents?