Much of my hope is that, by the time we reach the superintelligence level where we need to instill reflectively endorsed values to optimize towards in a very hands-off way, rather than just constitutions, behaviors, or goals, we'll have figured something else out. I'm not claiming the optimizer advantage alone is enough to be decisive in saving the world.
To the point about tighter feedback loops, I see the main benefit as being in conjunction with adapting to new problems. Suppose we notice AIs taking some bad but non-world-ending action like murdering people; then we can add a big dataset of situations in which AIs shouldn't murder people to the training data. If we were instead breeding animals, we would have to wait dozens of generations for mutations that reduce murder rate to appear and reach fixation. Since these mutations affect behavior through brain architecture, they would have a higher chance of deleterious side effects. And if we're also selecting for intelligence, they would be competing against mutations that increase intelligence, producing a higher alignment tax. All this means that under evolution we would have far fewer chances to detect whether our proxies hold up (capabilities researchers share many of these advantages too, but the AGI would be able to automate capabilities training anyway).
If we expect problems to get worse at some rate until an accumulation of unsolved alignment issues culminates in disempowerment, it seems to me there is a large band of rates at which we could stay ahead of them with AI training but evolution couldn't.
Noted. Somewhat surprised you believe in quantum immortality, is there a particular reason?
EJT's incomplete preferences proposal. But as far as I can make out from the comments, you need to define a decision rule in addition to the utility function of an agent with incomplete preferences, and only some of those decision rules are compatible with shutdownability.
When I read it in school, the story frustrated me because I immediately wanted to create Omelas, seeing as it's a thousand times better than our society; so I never really got the point of the intended and/or common interpretations.
Gradient descent (when applied to train AIs) allows much more fine-grained optimization than evolution, for these reasons:
It's unclear to what degree these will solve inner alignment problems or make AI goals more robust than animal goals to distributional shift, but we're in much better shape than evolution was.
If you disagree with much of IABIED but are still worried about AI risk, maybe the question to ask is "will the radical flank effect be positive or negative on mainstream AI safety movements?", which seems more useful than "do I on net agree or disagree?" or "will people taking this book at face value do useful or anti-useful things?" Here's what Wikipedia has to say on the sign of a radical flank effect:
It's difficult to tell without hindsight whether the radical flank of a movement will have positive or negative effects.[2] However, following are some factors that have been proposed as making positive effects more likely:
- Greater differentiation between moderates and radicals in the presence of a weak government.[2][13][14]: 411 As Charles Dobson puts it: "To secure their place, the new moderates have to denounce the actions of their extremist counterparts as irresponsible, immoral, and counterproductive. The most astute will quietly encourage 'responsible extremism' at the same time."[15]
- Existing momentum behind the cause. If change seems likely to happen anyway, then governments are more willing to accept moderate reforms in order to quell radicals.[2]
- Radicalism during the peak of activism, before concessions are won.[16] After the movement begins to decline, radical factions may damage the image of moderate organizations.[16]
- Low polarization. If there's high polarization with a strong opposing side, the opposing side can point to the radicals in order to hurt the moderates.[2]
Of course it's still useful to debate which factual points the book gets right, but judging the book's overall value requires modeling other parts of the world.
Yeah, I expect corrigibility to get a lot worse by the 10x-economy level with at least 15% probability, since my uncertainty is very large, just not in the median case. The main reason is that we don't yet need to try very hard to get sufficient corrigibility from models. My very rough model: even if the amount of corrigibility training required grows 2x with every time-horizon doubling, while the amount of total training required grows only 1.5x per doubling, then 10 more time-horizon doublings would need only a (2/1.5)^10 ≈ 18x increase in the relative effort dedicated to corrigibility training. This seems doable given that relatively little of current models' system cards is dedicated to shutdownability.
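The arithmetic above can be sketched as follows (the growth rates are of course just my illustrative assumptions, not measured quantities):

```python
# Toy calculation of the relative-effort growth described above.
# Assumed (illustrative) growth rates per time-horizon doubling:
corrigibility_growth = 2.0   # corrigibility-training cost multiplier
total_growth = 1.5           # total-training cost multiplier
doublings = 10

# How much the *relative* effort on corrigibility grows over 10 doublings:
relative_effort_increase = (corrigibility_growth / total_growth) ** doublings
print(f"{relative_effort_increase:.1f}x")  # ~17.8x, i.e. roughly 18x
```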
As for (b), my guess is that with no corrigibility training at all, models would start doing things in the wild like disabling shutdown scripts, or locking users out of their computers and changing their passwords to prevent them from manually shutting the agent down; there would be outcry from the public and from B2B customers that hurts labs' profits, as well as a dataset of examples to train against. It's plausible that fixing this wouldn't suffice to stop disempowerment. Maybe a crux is whether naively training for corrigibility is more effective or less effective on more egregious incidents.
After the 10x-economy stage, corrigibility plausibly stops being useful for users / aligned with the profit motive, because humans will largely stop understanding what models are doing, so I get a lot more worried if we haven't figured something out by then.
After thinking about it more, it might take more than 3% even if things scale smoothly, because I'm not confident corrigibility is only a small fraction of labs' current safety budgets.
Sorry, I mean corrigibility as opposed to CEV, and narrow in the sense that it follows user instructions rather than optimizing for all the user's inferred preferences in every domain, not in the sense of an AI that only understands physics. I don't expect unsolvable corrigibility problems at the capability level where AI can 10x the economy under the median default trajectory; rather, I expect something like today, where companies undershoot or overshoot on how much the agent tries to be helpful vs. corrigible, what's good for business is reasonably well aligned, and getting there requires something like 3% of a lab's resources.
Does this meaningfully reduce the probability that you jump out of the way of a car or get screened for heart disease? The important thing isn't whether you have an emotional fear response, but how the behavioral pattern of avoidance generalizes.