Since one of the concerns is "will there actually be a 2028 election?", it's not obvious that this is happening fast enough to actually matter. I'm worried about a bunch of important institutions getting eroded in ways that are hard to recover from.
To be clear, do you think this is mainly because of AI x-risk or Trump couping the government? What probability would you assign to there not being an election?
It's in the top 7 things I consider dedicating this year to, maybe in the top 4.
What are the other 6?
Before accepting my current job, I was thinking about returning to Hungary and starting a small org with some old friends who have more coding experience, living on Eastern European salaries, and just churning out one simple experiment after another.
I think such an org should focus on automating simple safety research and paper replications with coding agents (e.g. Claude Code). My guess is that the models aren't capable enough yet to autonomously do Ryan's experiments but they may be in a generation or two, and working on this early seems valuable.
If this were true, wouldn't it imply that quantizing the model (or at least the KV cache) would improve performance?
I note that the hype has been almost entirely Claude Code in particular, skipping over OpenAI’s Codex or Google’s Jules. Claude Code with Opus 4.5 is, for now, special.
One reason I don't use Codex is that I don't want to pay a subscription to OpenAI. I'm ambivalent about Jules but I think it's a bit worse than the alternative.
Training an explanation assistant by automatically scoring descriptions. Given a set of descriptions, an LM-based scorer determines how well each one matches reality. We use this to re-rank and then train a specialized explainer model on high-scoring descriptions. Adapted from Choi et al. (2024).
This second example illustrates a key point: specialized models can outperform larger general-purpose models. In our case, a specialized explainer fine-tuned from Llama-8B outperformed the general-purpose GPT-4 model, which is likely an order of magnitude larger. This is direct evidence that we can decouple oversight capabilities from general capabilities.
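A minimal sketch of the re-rank-then-fine-tune step described above, assuming a hypothetical `score_fn` that calls the LM-based scorer; the names, the keep fraction, and the prompt format are illustrative, not the exact pipeline from Choi et al. (2024).

```python
# Illustrative sketch (not the exact pipeline): score candidate descriptions
# with an LM judge, keep the top-scoring ones, and format them for fine-tuning
# the specialized explainer.
from dataclasses import dataclass

@dataclass
class Candidate:
    feature_id: int
    description: str
    score: float = 0.0  # filled in by the LM-based scorer


def rerank(candidates, score_fn, keep_fraction=0.25):
    """Score each description for how well it matches reality, then keep the top slice."""
    for c in candidates:
        c.score = score_fn(c.feature_id, c.description)  # LM judge call (assumed interface)
    ranked = sorted(candidates, key=lambda c: c.score, reverse=True)
    return ranked[: max(1, int(len(ranked) * keep_fraction))]


def to_sft_examples(kept):
    """Turn high-scoring (feature, description) pairs into supervised fine-tuning examples."""
    return [
        {"prompt": f"Explain feature {c.feature_id}:", "completion": c.description}
        for c in kept
    ]

# Usage (the explainer, e.g. an 8B base model, would then be fine-tuned on
# `examples` with any standard SFT trainer):
# kept = rerank(all_candidates, score_fn)
# examples = to_sft_examples(kept)
```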
Did you have independent human evaluations of your explainer model? With a sufficient amount of training data, such methods have a tendency to reward hack the LM judge, generating explanations that sound good to the LM but not to humans.
On the one hand, one would hope they are capable of resisting this pressure (these continual learners are really difficult to control, and even mundane liability might be really serious).
I share this hope, though I think not all labs are equally capable of doing so. For example, after the GPT-4o sycophancy incident I don’t have much confidence in OpenAI, but from private conversations and the RSP I have more confidence in Anthropic.
But on the other hand, it might be “not releasable” for purely technical reasons.
Seems plausible, but I would put this under the category of not having fully solved continual learning yet. A sufficiently capable continual learning agent should be able to do its own maintenance, short of a hardware failure.
If they had fully solved it, there would be large commercial pressure to release it as soon as possible, e.g. because they could start charging > $10K/month for remote worker subscriptions or increase their valuations in future funding rounds. It’s true that everyone is working on it; my guess is that they’ve made some progress but haven’t solved it yet.
May you be compassionate enough that your agency doesn’t narrow your circle of moral concern.
Claude Opus 4.5 can make decent Triton kernels these days; I'd recommend using that if attention is a bottleneck.