No, I think the blue team will keep having the latest and best LLMs and will be able to stop such attempts from randos. These AGIs won't be so magically superintelligent that they can take all the unethical actions needed to take over the world without other AGIs stopping them.
On Chrome on a Mac you can just C-f in the PDF; it OCRs automatically. I didn't have this problem.
But it feels like you'd need to demonstrate this with some construction that's actually adversarially robust, which seems difficult.
I agree it's kind of difficult.
Have you seen Nicholas Carlini's Game of Life series? It starts by building logic gates and works up to a microprocessor that factors 15 into 3 x 5.
Depending on the adversarial robustness model (e.g. every tick the adversary can make one cell behave unlawfully, i.e. flip its update), it might be possible to make robust logic gates and circuits. In fact the existing circuits are already a little robust -- though not to the tune of one cell per tick; that's too much power for the adversary.
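For concreteness, here's a minimal sketch of that threat model in plain Python/NumPy (the grid size, the glider pattern, and the one-flip-per-tick rate are my own illustrative choices, not from Carlini's series, and the adversary here is random rather than worst-case):

```python
import numpy as np

def life_step(grid):
    """One lawful Game of Life update on a toroidal grid."""
    neighbors = sum(
        np.roll(np.roll(grid, dy, axis=0), dx, axis=1)
        for dy in (-1, 0, 1) for dx in (-1, 0, 1)
        if (dy, dx) != (0, 0)
    )
    alive = grid.astype(bool)
    return ((neighbors == 3) | (alive & (neighbors == 2))).astype(np.uint8)

def adversarial_step(grid, rng, flips_per_tick=1):
    """Lawful update, then the adversary flips `flips_per_tick` cells.
    (Random flips for illustration; the real threat model lets the
    adversary choose the cells maliciously.)"""
    new = life_step(grid)
    idx = rng.integers(0, new.size, size=flips_per_tick)
    new.flat[idx] ^= 1
    return new

# Example: see how long a glider survives one flipped cell per tick.
rng = np.random.default_rng(0)
grid = np.zeros((32, 32), dtype=np.uint8)
grid[1:4, 1:4] = [[0, 1, 0], [0, 0, 1], [1, 1, 1]]  # glider
for _ in range(100):
    grid = adversarial_step(grid, rng)
print("live cells after 100 ticks:", int(grid.sum()))
```

A robust construction would have to keep computing the right answer under some bounded version of this perturbation, which is why one flip per tick everywhere feels like too strong an adversary.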
I very strongly agree with this and think it should be the top objection people first see when scrolling down. In a low-P(doom) world, Anthropic has done lots of good. (They proved that you can have the best and the most aligned model, and also that their leadership is more trustworthy than OpenAI's, which would otherwise lead.) This is my current view.
In a high-P(doom) world, none of that matters because they've raised the bar for capabilities when we really should be pausing AI. This was my previous view.
I'm grudgingly impressed with Anthropic leadership for getting this right when I did not (not that anyone other than me cares what I believed, having ~zero power).
What do you mean "faithfully execute"?
If you mean that it's executing its own goal, then I dispute that it will be random; the goal will be to be helpful and good.
Is this different from the standard instrumental convergence algorithm?
I am concerned that long chains of RL are just sufficiently fucked functions with enough butterfly effects that this wouldn't be well approximated by this process.
This is a concern. Two possible replies:
IMO at any level of sampling?
Vacuously true. The actual question is: how much do you need to sample? My guess is that it's too much, but that we'd see the base model scaling better with sampling than the RL'd model, just like in this paper.
Fortunately, DeepSeek's Mathv2, an open-source model that gets IMO gold, just dropped. We can run the experiment: does it similarly fail to improve with sampling relative to its own base model? My guess is yes, the same thing will happen.
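For concreteness, a minimal sketch of the comparison I have in mind, using the standard unbiased pass@k estimator (the per-problem correct counts and sample budget below are placeholder numbers, not real eval results):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate given n samples, c of which are correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical per-problem correct counts out of n=256 samples each,
# for a base model and its RL'd counterpart on the same problem set.
n = 256
base_correct = [3, 0, 17, 1, 42]   # placeholder numbers
rl_correct   = [25, 0, 60, 0, 90]  # placeholder numbers

for k in (1, 16, 256):
    base = sum(pass_at_k(n, c, k) for c in base_correct) / len(base_correct)
    rl = sum(pass_at_k(n, c, k) for c in rl_correct) / len(rl_correct)
    print(f"pass@{k}: base={base:.3f} rl={rl:.3f}")
```

The question is whether the RL'd model's pass@k curve flattens out and gets overtaken by the base model's as k grows, as the paper reports.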
It's not impossible that we are in an alignment-by-default world. But I claim that our current insight isn't enough to distinguish such a world from the gradual-disempowerment / going-out-with-a-whimper world.
Well, yeah, I agree with that! You might notice this item in my candidate "What's next" list:
- Prevent economic incentives from destroying all value. Markets have been remarkably aligned so far but I fear their future effects. (Intelligence Curse, Gradual Disempowerment. Remarkably European takes.)
This is not classical alignment, which emphasizes scheming etc. Rather it's "we get more slop, AIs outcompete humans, and so the humans who don't own AIs have no recourse." So I don't think that undermines my point at all.
Approximately never, I assume, because doing so isn't useful.
You should not assume such things. Humans invented scheming to take over; it might be the very reason we are intelligent.
Until then, I claim we have strong reasons to believe that we just don't know yet.
We don't know, but we never really know and must act under uncertainty. I put forward that we can make a good guess.
Thank you for the reply. I want to engage without making it tiresome. The problem is that there are many things I disagree with in the worldview; the disagreement isn't reducible to 1-5 double cruxes. But here are some candidates for the biggest cruxes for me. If any of these are wrong, it's bad news for my current view:
And here's another prediction where I really stick my neck out, which isn't load-bearing to the view, but still increases my confidence, so defeating it is important:
I still disagree with several of the points, but for time reasons I request that readers not update against Evan's points if he just doesn't reply to these.
I disagree that increasing capabilities are exponential in an intuitive-capability sense. It's true that METR's time-horizon plot increases exponentially, but that still corresponds to linearly increasing intuitive intelligence. (Like loudness vs. sound pressure: perception is logarithmic, so we handle exponentially large ranges without noticing.) Each model that has come out has an exponentially larger time horizon but is not (intuitively, empirically) exponentially smarter.
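To spell out the analogy (my own framing, not from the post): sound pressure level in decibels is logarithmic in pressure, and the claim is that intuitive intelligence is similarly logarithmic in time horizon,

$$L_{\text{dB}} = 20\log_{10}\frac{p}{p_0}, \qquad \text{intuitive intelligence} \approx a\log(\text{time horizon}) + b,$$

so a time horizon that doubles every few months reads as intelligence going up by a roughly constant increment per doubling, i.e. linearly.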
"we still extensively rely on direct human oversight and review to catch alignment issues" That's a fair point and should decrease confidence in my view, though I expected it. For properly testing sandwiching we'll probably have to wait till models are superhuman, or use weak models + less weak models and test it out. Unfortunately perhaps the weak models are still too weak. But we've reached the point where you can maybe just use the actual Opus 3 as the weak model?
If we have a misaligned model doing research, we have lots of time to examine it with the previous model. I also expect to see sabotage show up in the CoT or in deception probes.
I updated way down on Goodharting on model internals due to Cundy and Gleave.
Again, readers please don't update down on these due to lack of a response.
Yeah, true. It's gone so well for so long that I forgot. I didn't spend a lot of time thinking about this list.