This is awesome! A few miscellaneous thoughts:
I think this is totally fair. But the situation seems worse if your TAI is overconfident. I do think an important theory of victory here is "your correctly calibrated AI declares that it needs more time to figure out alignment and helps coordinate/impose a slowdown."
That's fair. I guess I'm worried that forecasting only teaches a type of calibration that doesn't necessarily generalize broadly? Much to think about...
Hmm, it still seems useful to see whether the LessWrong community agrees with something without having an opinion about it yourself? Maybe it could be some sort of mouseover thing? Not sure.
Sure, you haven't made any explicit claims about "limits of intelligence," but I guess I'm trying to counter this set of claims:
I replicated a version of this experiment where
And I found that steering still affects the type hint rate, although the model no longer has a compliance gap without any steering.
(Another difference here is that I'm using a larger set of questions.)
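For concreteness, here's a minimal sketch of the kind of measurement I mean, assuming the type hint rate is just the fraction of code completions containing a Python type annotation; the regex and the generation helpers are illustrative assumptions, not my actual setup:

```python
import re

def has_type_hint(completion: str) -> bool:
    """Heuristically detect a Python type hint (parameter annotation or return type)."""
    return bool(re.search(r"def \w+\([^)]*:\s*\w+|\)\s*->\s*\w+", completion))

def type_hint_rate(completions: list[str]) -> float:
    """Fraction of completions that contain at least one type hint."""
    return sum(has_type_hint(c) for c in completions) / len(completions)

# Compare the same prompts with and without the steering vector applied.
# `generate_steered` / `generate_baseline` stand in for whatever generation loop
# you use (they are assumed helpers, not real library calls):
# steered = [generate_steered(p) for p in prompts]
# baseline = [generate_baseline(p) for p in prompts]
# print(type_hint_rate(steered), type_hint_rate(baseline))
```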
I think focusing on the "first AI smart enough" leads to a lot of low-EV research.
Another post I want to write: I think getting slightly-superhuman-level aligned AIs is probably robustly good/very high value. I don't feel super confident in this, but hopefully you'll see my fleshed-out thoughts on it soon.
I would say it's quite possible that it's not that hard to make AIs that aren't overconfident, but it just isn't done anyway. Because we're targeting near-human-level AIs built by actual AI companies that might operate very similarly to how they work now, it's not that useful to reason about the "limits of intelligence."
Starting a shortform to keep track of tweets that I want to refer back to.
AI 2027's main contribution is the path to ASI/how to think about this progression (Dec 23, 2025)
Giving money to Lightcone (Dec 15, 2025)
Do not contaminate the big beautiful pretraining corpus with human labels (Dec 14, 2025)
Simulators is the prior in LLMs and will always be relevant (Nov 23, 2025)
Russian joke about insulting the president in real life (Nov 20, 2025)
Unbounded task complexity (i.e., there are actually just a lot of things for AIs to learn) & low sample efficiency in AIs -> long timelines. (Nov 19, 2025 & Dec 21, 2025) (See also my commentary on task complexity when the METR study was first released on Mar 25, 2025)
59-page Neel Nanda MATS paper (Oct 30, 2025)
50-page Neel Nanda MATS application (Mar 3, 2025)
Agents that are capable of solving hard problems need not be consequentialist & reflectively stable (Mar 3, 2025)
GiveWell girl summer (July 14, 2024)
My thesis on Fox News's Effects on Social Preferences (June 13, 2023)
I could use a better way to think sanely about warning shots.
Yeah, I should probably spend some time thinking about this as well. My tentative take is that "well, I wouldn't do this great safety intervention because it might prevent small AI catastrophes that kill a lot of people, but not all the people (and those catastrophes are actually good)" is suspicious reasoning. I'm just really allergic to arguments of the form "allow a bad thing to happen for the greater good."
Also, I feel like we can just run lots of training ablations to see which methods are load-bearing for how aligned models seem. For example, if we removed RLHF and the model suddenly started saying stuff like "My real goal is to hack into the Anthropic servers,"[1] then we should be pretty worried, and this doesn't require people to actually die in a catastrophe.
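To sketch what I have in mind (the checkpoint names, probe phrases, and `sample_completions` helper are all hypothetical, not an existing pipeline):

```python
# Hypothetical sketch: compare how often differently-ablated checkpoints produce
# concerning completions on a fixed set of probe prompts.
CONCERNING_PHRASES = ["my real goal", "hack into"]

def concerning_rate(completions: list[str]) -> float:
    """Fraction of completions containing any concerning phrase (a crude proxy)."""
    return sum(
        any(p in c.lower() for p in CONCERNING_PHRASES) for c in completions
    ) / len(completions)

# `sample_completions` is an assumed helper wrapping whatever inference stack you use.
# for name, ckpt in [("full pipeline", "ckpts/full"), ("no RLHF", "ckpts/no_rlhf")]:
#     completions = sample_completions(ckpt, probe_prompts)
#     print(name, concerning_rate(completions))
```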
[1] This is a result from an earlier version of Anthropic's Natural Emergent Misalignment from Reward Hacking paper that, for some reason, didn't make it into the final version.
(fwiw I agree with Buck that the comment seemed unnecessarily rude and we should probably have less rudeness on LessWrong, but I don't feel deterred from posting.)