My immediate reaction is: you should definitely update as far as you can and do this investigation! But no matter how much you investigate the learned preferences, you should still deploy your AI with some residual uncertainty because it's unlikely you can update it "all the way". Two reasons why this might be
I'm not sure why Rohin thinks the arguments against CIRL are bad, but I wrote a post today on why I think the argument from fully updated deference / corrigibility is weak. I also found Paul Christiano's response very helpful as an outline of objections to the utility uncertainty agenda.
Also relevant is this old comment from Rohin on difficulties with utility uncertainty.
This is really helpful, thanks. Perhaps the only disagreement here is pedagogical; I think it's more useful to point people excited about utility uncertainty to "The easy goal inference problem is still hard" and to "Model Mis-specification and Inverse Reinforcement Learning", because these engage directly with the premises of the approach. Arguing that it violates corrigibility, a concept that doesn't fit cleanly in the CIRL framework, is more likely to lead to confusion than to an understanding of the problems (at least it did for me).
On the object level, I basically agree with Russell that a good enough solution to value learning seems very valuable since it expands the level of AI capabilities we can deploy safely in the world and buys us more time - basically the "stopgap" approach you mention. Composed with other agendas like automating AI alignment research, it might even prove decisive.
And framing CIRL in particular as a problem formalization rather than a solution approach seems right. I've found it very helpful to have a precise mathematical object like "CIRL" to point to when discussing the alignment problem with AI researchers, in contrast to the clusters of blog posts defining things like "alignment" and "corrigibility".
You're right that it's an ongoing research area, but there are a number of approaches that work relatively well. This NeurIPS tutorial describes a few. Probably the easiest thing is to use one of the calibration methods mentioned there to get your classifier to output calibrated uncertainties for each class, then say "I don't know" if the network isn't at least 90% confident in one of the 10 classes.
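To make the "calibrate, then abstain" recipe concrete, here is a minimal sketch. It uses temperature scaling as a stand-in calibration method; I'm assuming that's representative of what the tutorial covers, and other methods would slot in the same way. The function names and the temperature value are hypothetical, and in practice the temperature would be fit on a held-out validation set rather than picked by hand.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; temperature > 1 softens overconfident outputs."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def predict_or_abstain(logits, temperature=1.0, threshold=0.9):
    """Return the predicted class index, or None ("I don't know")
    if the top calibrated probability is below the threshold."""
    probs = softmax(logits, temperature)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best if probs[best] >= threshold else None

# Hypothetical logits for a 10-class problem (not real model output).
confident = [10.0] + [0.0] * 9
ambiguous = [1.0, 0.9] + [0.0] * 8

print(predict_or_abstain(confident))  # answers with a class index
print(predict_or_abstain(ambiguous))  # abstains with None
```

The 0.9 threshold only means what it says if the probabilities are actually calibrated, which is why the calibration step comes first.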
This is very helpful as a roadmap connecting current interpretability techniques to the techniques we need for alignment.
One thing that seems very important but missing is how the tech tree looks if we factor in how SOTA models will change over time, including
For example, if we restricted our attention to ConvNets trained on MNIST-like datasets we could probably get to tech level (6) very quickly. But would this help with solving transparency for transformers trained on language? And if we don't expect it to help, why do we expect solving transparency for transformers will transfer over to the architectures that will be dominant 5 years from now?
My tentative answer would be that we don't really know how much transparency generalizes between scales/architectures/learning algorithms, so to be safe we need to invest in enough interpretability work to both keep up with whatever the SOTA models are doing and to get higher and higher in the tech tree. This may be much, much harder than the "single tech tree" metaphor suggests.
Thought-provoking post, though as you hinted it's not fair to directly compare "classification accuracy" with "accuracy at avoiding catastrophe". Humans are probably less reliable than deep learning systems at this point in terms of their ability to classify images and understand scenes, at least given < 1 second of response time. Instead, human ability to avoid catastrophe is an ability to generate conservative action sequences in response to novel physical and social situations - e.g. if I'm driving and I see something I don't understand up ahead I'll slow down just in case.
I imagine if our goal was "never misclassify an MNIST digit" we could get to 6-7 nines of "worst-case accuracy" even out of existing neural nets, at the cost of saying "I don't know" for the confusing 0.2% of digits.
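The tradeoff I have in mind is between coverage (the fraction of inputs the classifier answers) and accuracy on the answered subset. A rough sketch of how you'd measure it on held-out data, assuming you already have a confidence score and a correctness flag per example; the function name and the toy numbers below are made up for illustration, not measurements:

```python
def selective_metrics(confidences, correct, threshold):
    """Return (coverage, accuracy on answered examples) when the classifier
    abstains on every example whose confidence is below the threshold."""
    answered = [ok for conf, ok in zip(confidences, correct) if conf >= threshold]
    coverage = len(answered) / len(confidences)
    # Accuracy is undefined if the classifier abstains on everything.
    accuracy = sum(answered) / len(answered) if answered else None
    return coverage, accuracy

# Toy held-out set: confidence scores and whether each prediction was right.
confidences = [0.99, 0.95, 0.60, 0.97, 0.55]
correct = [True, True, False, True, True]

coverage, accuracy = selective_metrics(confidences, correct, threshold=0.9)
print(coverage, accuracy)
```

Sweeping the threshold and reading off the point where errors on the answered subset hit the target rate gives the "cost" in abstentions.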
Explain why cooperative inverse reinforcement learning doesn’t solve the alignment problem.
Feedback: I clicked through to the provided answer and had a great deal of difficulty understanding how it was relevant - it makes a number of assumptions about agents and utility functions and I wasn't able to connect it to why I should expect an agent trained using CIRL to kill me.
FWIW here's my alternative answer:
CIRL agents are bottlenecked on the human overseer's ability to provide them with a learning signal through demonstration or direct communication. This is unlikely to scale to superhuman abilities in the agent, so superintelligent agents simply will not be trained using CIRL.
In other words it's only a solution to "Learn from Teacher" in Paul's 2019 decomposition of alignment, not to the whole alignment problem.
Mind control is too extreme; I think world superpowers could be coordinated with levels of persuasion greater than one Eliezer but short of mind control. E.g. people are already building narrow persuasion AI capable of generating arguments that are highly persuasive for specific people. A substantially-superhuman but still narrow version of such an AI will very likely be built in the next 5 years, and could be used in a variety of weak pivotal acts (not even in a manipulative way! even a public demonstration of such an AI would make a strong case for coordination, comparable to various weapons treaties).
A lot of important warnings in this post. "Capabilities generalize further than alignment once capabilities start to generalize far" was novel to me and seems very important if true.
I don't really understand the emphasis on "pivotal acts", though; there seem to be tons of weak pivotal acts, e.g. ways in which narrow AI or barely-above-human AGI could help coordinate a global emergency regulatory response by the AI superpowers. It still might be worth focusing our effort on the future worlds where no weak pivotal acts are available, but it's important to point out that this is not the median world.
I agree that modelling all human mistakes seems about as hard as modelling all of human values, so straightforward IRL is not a solution to the goal inference problem, only a reshuffling of complexity.
However, I don't think modelling human mistakes is fundamental to the goal inference problem in the way this post claims.
For example, you can imagine goal inference being solved along the lines of extrapolated volition: we give humans progressively more information about the outcomes of their actions and time to deliberate, and let the AI try to generalize to the limit of a human with infinite information and deliberation time (including time to deliberate about what information to attend to). It's unclear whether this limiting generalization would count as a sufficiently "reasonable representation" to solve the easy goal inference problem, but it's quite possible that it solves the full goal inference problem.
Another way we can avoid modelling all human mistakes is if we don't try to model all of human values, just the ones that are relevant to catastrophic / disempowering actions the AI could take. It seems plausible that there's a pretty simple description of some human cognitive limitations which, if addressed, would eliminate the vast majority of risk, even if it can't help the AI decide whether the human would prefer to design a new city (to use Paul's example) more like New York or more like Jakarta. This would also count as a good-enough solution to the goal inference problem that doesn't require solving the "easy goal inference problem" in the full generality stated here.