I'm something of a fan of their approach (although I haven't spent an enormous amount of time thinking about it). I like starting with problems which are concrete enough to really go at, but which are microcosms of things we might eventually want.
I actually kind of think of truthfulness as sitting somewhere on the spectrum between the problem Redwood are working on right now and alignment. Many of the reasons I like truthfulness as a medium-term problem to work on are similar to the reasons I like Redwood's current work.
I think it would be an easier challenge to align 100 small ones (since solutions would quite possibly transfer across).
I think it would be a bigger victory to align the one big one.
I'm not sure from the wording of your question whether I'm supposed to assume success.
To add to what Owain said:
I don't think I'm yet at "here's regulation that I'd just like to see", but I think it's really valuable to try to have discussions about what kind of regulation would be good or bad. At some point there will likely be regulation in this space, and it would be great if that were based on as deep an understanding as possible of the available regulatory levers, their direct and indirect effects, and their ultimate desirability.
I do think it's pretty plausible that regulation about AI and truthfulness could end up being quite positive. But I don't know enough to identify in exactly what circumstances it should apply, and I think we need a bit more groundwork on building and recognising truthful AI systems first. I guess quite a bit of our paper is trying to open the conversation on that.
I think there's also a capability component, distinct from "understanding/modeling the world", about self-alignment or self-control - the ability to speak or act in accordance with good judgement, even when that conflicts with short-term drives.
In my ontology I guess this is about the heuristics which are actually invoked to decide what to do given a clash between abstract understanding of what would be good and short-term drives (i.e. it's part of meta-level judgement). But I agree that there's something helpful about having terminology to point to that part in particular. Maybe we could say that self-alignment and self-control are strategies for acting according to one's better judgement?
Do you consider "good decision-making" and "good judgement" to be identical? I think there's a value alignment component to good judgement that's not as strongly implied by good decision-making.
I agree that there's a useful distinction to be made here. I don't think of it as fitting into "judgement" vs "decision-making" (and would regard those as pretty much the same), but rather about how "good" is interpreted/assessed. I was mostly using good to mean something like "globally good" (i.e. with something like your value alignment component), but there's a version of "prudentially good judgement/decision-making" which would exclude this.
I'm open to suggestions for terminology to capture this!
I think the double decrease effect kicks in with uncertainty, but not with confident expectation of a smaller network.
I'm not sure I've fully followed, but I'm suspicious that you seem to be getting something for nothing in your shift from a type of uncertainty that we don't know how to handle to a type we do.
It seems to me like you must be making an implicit assumption somewhere. My guess is that this is where you used i to pair S with S′. If you'd instead chosen j = i∘ρ as the matching then you'd have uncertainty between whether m should be j or ρ⁻¹∘j. My guess is that generically this gives different recommendations from your approach.
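To spell out the relabelling worry, here's a minimal LaTeX sketch. The notation is mine, and the composition convention (ρ acting on S, so the twist lands on the right) is an assumption; if ρ instead acts on S′, the inverse sits on the left, as in the ρ⁻¹∘j above.

```latex
\documentclass{article}
\usepackage{amsmath}
\begin{document}
% Hedged sketch of the relabelling worry; notation and conventions assumed.
Let $i : S \to S'$ be the chosen matching and $\rho$ a symmetry of $S$,
with the original uncertainty being between the candidates
\[
  m = i \qquad \text{or} \qquad m = i \circ \rho .
\]
Relabelling the starting point as $j = i \circ \rho$ rewrites the same
candidate set as
\[
  \{\, j \circ \rho^{-1},\ j \,\} \qquad (\text{since } j \circ \rho^{-1} = i),
\]
so any rule that privileges ``the matching you were handed'' recommends
differently given $i$ than given $j$, even though the underlying situation
is identical. That asymmetry is the implicit assumption.
\end{document}
```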
Seems to me like there are a bunch of challenges. For example, you need extra structure on your space to add things or tell what's small; and you really want to keep track of long-term impact, not just impact at the next time-step. The long-term one in particular seems thorny (for low-impact in general, not just for this).
Nevertheless, I think this idea looks promising enough to explore further; I'd also like to hear David's reasons.
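To make the first challenge concrete, here's a minimal Python sketch (the function names and the vector-valued state representation are my assumptions, not part of the proposal). It shows the structure a one-step penalty already needs, and one way the naive long-horizon extension misbehaves.

```python
import numpy as np

def one_step_impact(state: np.ndarray, next_state: np.ndarray) -> float:
    """Myopic impact penalty: the size of the immediate state change.

    Note the extra structure this assumes on the state space:
    subtraction (to compare states at all) and a norm (to say
    which changes count as "small").
    """
    return float(np.linalg.norm(next_state - state))

def summed_impact(states: list) -> float:
    """Naive long-horizon extension: sum the one-step penalties.

    This is one reason the long-term version seems thorny: it
    penalises an excursion that is fully reversed (high per-step
    movement, zero net effect) more than a slow drift whose
    individual steps all look small.
    """
    return sum(one_step_impact(a, b) for a, b in zip(states, states[1:]))

# A slow drift: every step looks small, but the net change is large.
drift = [np.array([0.1 * t]) for t in range(11)]
# An out-and-back excursion: large per-step movement, zero net change.
out_and_back = [np.array([0.0]), np.array([1.0]), np.array([0.0])]

print(summed_impact(drift))         # ~1.0
print(summed_impact(out_and_back))  # 2.0, despite ending where it started
```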
For #5, OK, there's something to this. But:
I like your framing for #1.