Previously "Lanrian" on here. Research analyst at Redwood Research. Views are my own.
Feel free to DM me, email me at [my last name].[my first name]@gmail.com, or send something anonymously to https://www.admonymous.co/lukas-finnveden
Thanks for studying this!
I'm confused about figure 13. The red line does not look like a best fit to the blue data points. I tried to eyeball the x/y locations of the data points and fit a line, and got a slope of ~0.3 rather than -0.03. What am I missing?
Thanks! Appreciate the clarification & pointer.
It's interesting that your framing is "high confidence there's no underlying corrigible motivation", and mine is more like "unlikely it starts without flaws and the improvement process is under-specified in ways that won't fix large classes of flaws".
I think this particular difference might've been downstream of somewhat uninteresting facts about how I interpreted various arguments. Something like: I read the post and was thinking "Jeremy believes that there's lots of events that can cause a model to act in unaligned ways, hm, presumably I'd evaluate that by looking at the events and seeing whether I agree that those could cause unaligned behavior, and presumably the argument about the high likelihood is downstream of there being a lot of (~independent) such potential events". And then reading this thread I was like "oh, maybe actually the important action is way earlier: I agree that if the model is fundamentally deep-down misaligned, then you can make a long list of events that could reveal that. What I'd need isn't a long list of (independent) events that could cause misaligned behavior; what I'd need is a long list of (independent) ways that the model could be fundamentally deep-down misaligned in a way that'd be catastrophic if revealed/understood".
But maybe the way to square this is just that most types of events in your list correspond to a separate type of way that the model could've been deep-down misaligned from the start, so it can just as well be read either way.
I'd be happy to video call if you want to talk about this, I think that'd be a quicker way for us to work out where the miscommunication is.
Appreciated! Probably won't prioritize this in the next couple of weeks but will keep it in mind as an option for when I want to properly figure out my views here.
I don't think of most of the distribution-shift-induced changes as changes in motivation; they're more like revealing/understanding the underlying motivation better.
So then is the whole argument premised on high confidence that there's no underlying corrigible motivation in the model? That the initial iterative process will produce an underlying motivation that, if properly understood by the agent itself, recommends rejecting corrigibility?
If so: What's the argument for that? (I didn't notice one in the OP.)
I wrote up a further unappealing implication of SIA+EDT here. (We've talked about it before, so I don't think anything there should be news to you.)
Here's an even more unappealing implication of SIA+EDT.
Set-up:
- A lottery ticket costs $10 and pays out $100 with 1% probability.
- You want to maximize the amount of money donated to a certain charity.
- You can create copies of yourself that share your current observations.
I think the optimal thing for an SIA+EDT agent to do is to commit to the following policy. (Let's call it "buy and copy".)
"buy and copy" = "I will pay for a lottery ticket. If I lose, I won't create any copies of myself. If I win, I will donate the money to charity and then create 1000 copies of myself in the epistemic state that I'm in right now (of evaluating whether to commit to this policy)."
Let's compare the above to the (IMO correct) policy of "don't buy", where you immediately donate $10 to the charity without buying a lottery ticket.
Under SIA+EDT, the value of "buy and copy" is:
E(donated_dollars | "buy and copy", observations) = $100 * E( #copies_with_my_observations | lottery_win, "buy and copy" ) * p(lottery_win | "buy and copy") / E( #copies_with_my_observations | "buy and copy" )
Where: E( #copies_with_my_observations | "buy and copy" )
= E( #copies_with_my_observations | lottery_win, "buy and copy" ) * p(lottery_win | "buy and copy")
+ E( #copies_with_my_observations | lottery_loss, "buy and copy" ) * p(lottery_loss | "buy and copy")
Let's plug in the numbers: E( #copies_with_my_observations | "buy and copy" ) = 1001 * 0.01 + 1 * 0.99 = 11.
So: E(donated_dollars | "buy and copy", observations) = $100 * (1001 * 0.01 / 11) = $100 * (10.01 / 11) ≈ $91.
Since $91 > $10, SIA+EDT thinks the "buy and copy" strategy is better than the "don't buy" strategy.
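For concreteness, here's a minimal sketch of that arithmetic in Python (it just reproduces the numbers above; nothing else is assumed):

```python
# SIA+EDT evaluation of "buy and copy" vs "don't buy", using the numbers above.
p_win = 0.01         # chance of winning the lottery
prize = 100.0        # dollars donated to charity if the lottery is won
ticket_price = 10.0  # dollars donated directly under "don't buy"
n_new_copies = 1000  # copies created in the current epistemic state after a win

# Observers with my current observations in each branch:
# the original plus the new copies if the lottery is won, just the original otherwise.
observers_if_win = 1 + n_new_copies
observers_if_loss = 1

# SIA weights each branch by its (expected) number of observers with my observations.
p_win_sia = (observers_if_win * p_win) / (
    observers_if_win * p_win + observers_if_loss * (1 - p_win)
)

ev_buy_and_copy = p_win_sia * prize  # expected donation under "buy and copy"
ev_dont_buy = ticket_price           # donation under "don't buy"

print(f"p_SIA(win | 'buy and copy') = {p_win_sia:.3f}")          # ~0.910
print(f"E[donation | 'buy and copy'] = ${ev_buy_and_copy:.2f}")  # ~$91
print(f"E[donation | 'don't buy'] = ${ev_dont_buy:.2f}")         # $10
```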
IMO, this is a pretty decisive argument against these versions of SIA+EDT. (Though maybe they could be tweaked in some way to improve the situation.)
(Writing this out partly as a reference for my future self, since I find myself referring to this every now and then, and partly as a response to this post.)
Yep, resolutions not very reliable.
The drone delivery one was Claude claiming:
Kiwibot has operated delivery robots in Berkeley since 2017, founded in UC Berkeley’s Skydeck incubator. Delivers food within approximately one mile of campus with over 250,000 total deliveries completed.
Googling quickly, there are claims that it has since shut down and also that it was remote-controlled rather than fully autonomous. In any case, it'd be pretty niche and clearly only available due to the novelty value.
Robotaxis in 20+ cities was something Claude initially thought false and then GPT-5.1 thought was "borderline true" based on a bunch of Baidu deployments. E.g. source. No idea whether that holds up; I don't know the robotaxi situation in China. (Also, that news is slightly after September 22.)
I also think the StarCraft one is probably wrong. Looking now, the models seem to be mainly leaning on 2019 citations, which I think weren't sufficient to show AI consistently beating humans.
September 22nd, 2025 has now passed, which is the date that the first column of probabilities was referring to.
I was curious how they turned out, so I asked Claude (I don't remember which model) to judge whether the events had happened or not, and then got GPT-5.1-thinking to check whether it agreed with Claude's judgments. (With disagreements between Claude and GPT-5.1 lazily adjudicated by me.) Here's the link to the GPT-5.1 convo if you're interested. (Results at the bottom.) There might well be major errors in the LLMs' judgments and my adjudications.
If you yourself can invest in VARA, then for sure you'd prefer to get the money earlier rather than later. The question would then turn into why your discount rate is so low, since you should be able to grow the money faster than that. Though it sounds like you think that's explained by risk-aversion + heavy correlations with your other funding streams, which isn't crazy; I haven't run any numbers.
Does that mean that you'd prefer donors invested in VARA or SALP to donate in a future year? I think they'll probably do better than 25%/year, even with some reasonable risk-adjustments. (Though maybe the calculus changes if you've got tons of other prospective donors invested with them.)
Thanks, that makes sense.
One way to help clarify the effect from (I) would be to add error bars to the individual data points. Presumably models with fewer data points would have wider error bars, and then it would make sense that they pull less hard on the regression.
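To illustrate the kind of weighting I have in mind, here's a minimal sketch (the numbers are made up, and weighting by 1/SE is just one standard choice):

```python
import numpy as np

# Hypothetical per-model estimates: x, y, and how many data points sit behind each y.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0.40, 0.55, 0.35, 0.70, 0.65])
n = np.array([5, 50, 8, 200, 120])

# Rough standard error per estimate: shrinks as the number of data points grows.
se = 0.3 / np.sqrt(n)

# Weighted least squares: points with wider error bars pull less hard on the regression.
slope, intercept = np.polyfit(x, y, deg=1, w=1.0 / se)
print(f"weighted fit: slope={slope:.3f}, intercept={intercept:.3f}")
```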
Error bars would also be generally great to better understand how much weight to give to the results. In cases where you get a low p-value, I have some sense of this. But in cases like figure 13 where it's a null result, it's hard to tell whether that's strong evidence of an absence of effect, or whether there's just too little data and not much evidence either way.