AMA on Truthful AI: Owen Cotton-Barratt, Owain Evans & co-authors

I'm something of a fan of their approach (although I haven't spent an enormous amount of time thinking about it). I like starting with problems which are concrete enough to really get traction on, but which are microcosms of things we might eventually want.

I actually kind of think of truthfulness as sitting somewhere on the spectrum between the problem Redwood are working on right now and alignment. Many of the reasons I like truthfulness as a medium-term problem to work on are similar to the reasons I like Redwood's current work.

AMA on Truthful AI: Owen Cotton-Barratt, Owain Evans & co-authors

I think it would be an easier challenge to align 100 small ones (since solutions would quite possibly transfer across).

I think it would be a bigger victory to align the one big one.

I'm not sure from the wording of your question whether I'm supposed to assume success.

AMA on Truthful AI: Owen Cotton-Barratt, Owain Evans & co-authors

To add to what Owain said:

  • I think you're pointing to a real and harmful possible dynamic
  • However I'm generally a bit sceptical of arguments of the form "we shouldn't try to fix problem X because then people will get complacent"
    • I think that the burden of proof lies squarely with the "don't fix problem X" side, and that usually it's good to fix the problem and then also give attention to the secondary problem that's come up
  • I note that I don't think of politicians and CEOs as the primary audience of our paper
    • Rather I think in the next several years such people will naturally start having more of their attention drawn to AI falsehoods (as these become a real-world issue), and start looking for what to do about it
    • I think that at that point it would be good if the people they turn to are better informed about the possible dynamics and tradeoffs. I would like these people to have read work which builds upon what's in our paper. It's these further researchers (across a few fields) that I regard as the primary audience for our paper.

AMA on Truthful AI: Owen Cotton-Barratt, Owain Evans & co-authors

I don't think I'm yet at "here's regulation that I'd just like to see", but I think it's really valuable to try to have discussions about what kind of regulation would be good or bad. At some point there will likely be regulation in this space, and it would be great if that was based on as deep an understanding as possible about possible regulatory levers, and their direct and indirect effects, and ultimate desirability.

I do think it's pretty plausible that regulation about AI and truthfulness could end up being quite positive. But I don't know enough to identify in exactly what circumstances it should apply, and I think we need a bit more groundwork on building and recognising truthful AI systems first. I guess quite a bit of our paper is trying to open the conversation on that.

"Good judgement" and its components

I think there's also a capability component, distinct from "understanding/modeling the world", about self-alignment or self-control - the ability to speak or act in accordance with good judgement, even when that conflicts with short-term drives.

In my ontology I guess this is about the heuristics which are actually invoked to decide what to do given a clash between abstract understanding of what would be good and short-term drives (i.e. it's part of meta-level judgement). But I agree that there's something helpful about having terminology to point to that part in particular. Maybe we could say that self-alignment and self-control are strategies for acting according to one's better judgement?

"Good judgement" and its components

Do you consider "good decision-making" and "good judgement" to be identical? I think there's a value alignment component to good judgement that's not as strongly implied by good decision-making.

I agree that there's a useful distinction to be made here. I don't think of it as fitting into "judgement" vs "decision-making" (and would regard those as pretty much the same), but rather about how "good" is interpreted/assessed. I was mostly using good to mean something like "globally good" (i.e. with something like your value alignment component), but there's a version of "prudentially good judgement/decision-making" which would exclude this.

I'm open to suggestions for terminology to capture this!

Acausal trade: double decrease

I think the double decrease effect kicks in with uncertainty, but not with confident expectation of a smaller network.

A permutation argument for comparing utility functions

I'm not sure I've fully followed, but I'm suspicious that you seem to be getting something for nothing in your shift from a type of uncertainty that we don't know how to handle to a type we do.

It seems to me like you must be making an implicit assumption somewhere. My guess is that this is where you used to pair with . If you'd instead chosen as the matching then you'd have uncertainty between whether should be or . My guess is that generically this gives different recommendations from your approach.

Learning Impact in RL

Seems to me like there are a bunch of challenges. For example, you need extra structure on your space of states in order to add things or to tell what's small; and you really want to keep track of long-term impact, not just impact at the next time-step. The long-term one in particular seems thorny (for low-impact approaches in general, not just for this).
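The "extra structure" point can be made concrete with a toy sketch (my own illustration, not from the post): to define impact you need something like a feature embedding of states into a vector space, so that differences can be taken and their size measured; and even then, a discounted sum of per-step impacts is only a crude proxy for long-term impact. The embedding `phi` here is a hypothetical placeholder.

```python
import numpy as np

def phi(state):
    # Hypothetical embedding: here states are assumed to already be
    # feature vectors. Raw environment states need not support this.
    return np.asarray(state, dtype=float)

def one_step_impact(s, s_next):
    # "Tell what's small": requires a norm, i.e. extra structure that
    # a bare state space does not come with.
    return float(np.linalg.norm(phi(s_next) - phi(s)))

def trajectory_impact(states, gamma=0.99):
    # A crude long-term measure: discounted sum of per-step impacts.
    # Delayed effects that only show up later are poorly captured,
    # which is exactly the thorny part noted above.
    return sum(gamma**t * one_step_impact(states[t], states[t + 1])
               for t in range(len(states) - 1))
```

For instance, `trajectory_impact([[0, 0], [0, 0], [3, 4]], gamma=1.0)` gives `5.0`: the first step changes nothing, and the second moves the embedding by a vector of length 5.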

Nevertheless I think this idea looks promising enough to explore further, would also like to hear David's reasons.

On motivations for MIRI's highly reliable agent design research

For #5, OK, there's something to this. But:

  • It's somewhat plausible that stabilising pivotal acts will be available before world-destroying ones;
  • Actually there's been a supposition smuggled in already with "the first AI systems capable of performing pivotal acts". Perhaps there will at no point be a system capable of a pivotal act. I'm not quite sure whether it's appropriate to talk about the collection of systems that exist being together capable of pivotal acts if they will not act in concert. Perhaps we'll have a collection of systems which, if aligned, would produce a win, or which, if acting together towards an unaligned goal, would produce catastrophe. It's unclear whether, if they each have different unaligned goals, we necessarily get catastrophe (though it's certainly not a comfortable scenario).

I like your framing for #1.
