"Good judgement" and its components
I think there's also a capability component, distinct from "understanding/modeling the world", about self-alignment or self-control - the ability to speak or act in accordance with good judgement, even when that conflicts with short-term drives.

In my ontology I guess this is about the heuristics which are actually invoked to decide what to do given a clash between abstract understanding of what would be good and short-term drives (i.e. it's part of meta-level judgement). But I agree that there's something helpful about having terminology to point to that part in particular. Maybe we could say that self-alignment and self-control are strategies for acting according to one's better judgement?

"Good judgement" and its components
Do you consider "good decision-making" and "good judgement" to be identical? I think there's a value alignment component to good judgement that's not as strongly implied by good decision-making.

I agree that there's a useful distinction to be made here. I don't think of it as fitting into "judgement" vs "decision-making" (and would regard those as pretty much the same), but rather about how "good" is interpreted/assessed. I was mostly using good to mean something like "globally good" (i.e. with something like your value alignment component), but there's a version of "prudentially good judgement/decision-making" which would exclude this.

I'm open to suggestions for terminology to capture this!

Acausal trade: double decrease

I think the double decrease effect kicks in with uncertainty, but not with confident expectation of a smaller network.

A permutation argument for comparing utility functions

I'm not sure I've fully followed, but I'm suspicious that you seem to be getting something for nothing in your shift from a type of uncertainty that we don't know how to handle to a type we do.

It seems to me like you must be making an implicit assumption somewhere. My guess is that this is where you used to pair with . If you'd instead chosen as the matching then you'd have uncertainty between whether should be or . My guess is that generically this gives different recommendations from your approach.

Learning Impact in RL

Seems to me like there are a bunch of challenges. For example you need extra structure on your space to add things or tell what's small; and you really want to keep track of long-term impact not just at the next time-step. Particularly the long-term one seems thorny (for low-impact in general, not just for this).

Nevertheless, I think this idea looks promising enough to explore further. I would also like to hear David's reasons.

On motivations for MIRI's highly reliable agent design research

For #5, OK, there's something to this. But:

  • It's somewhat plausible that stabilising pivotal acts will be available before world-destroying ones;
  • Actually there's been a supposition smuggled in already with "the first AI systems capable of performing pivotal acts". Perhaps at no point will there be a system capable of a pivotal act. I'm not quite sure whether it's appropriate to talk about the collection of systems that exist being together capable of pivotal acts if they will not act in concert. Perhaps we'll have a collection of systems which if aligned would produce a win, or if acting together towards an unaligned goal would produce catastrophe. If they each have different unaligned goals, it's unclear that we would necessarily get catastrophe (though it's certainly not a comfortable scenario).

I like your framing for #1.

On motivations for MIRI's highly reliable agent design research

Thanks for the write-up, this is helpful for me (Owen).

My initial takes on the five steps of the argument as presented, in approximately decreasing order of how much I am on board:

  • Number 3 is a logical entailment; no quarrel here.
  • Number 5 is framed as "therefore", but adds the assumption that this will lead to catastrophe. I think this is quite likely if the systems in question are extremely powerful, but less likely if they are of modest power.
  • Number 4 splits my intuitions. I begin with some intuition that selection pressure would significantly constrain the goal (towards something reasonable in many cases), but the example of Solomonoff Induction was surprising to me and makes me more unsure. I feel inclined to defer intuitions on this to others who have considered it more.
  • Number 2 I don't have a strong opinion on. I can tell myself stories which point in either direction, and neither feels compelling.
  • Number 1 is the step I feel most sceptical about. It seems to me likely that the first AIs which can perform pivotal acts will not perform fully general consequentialist reasoning. I expect that they will perform consequentialist reasoning within certain domains (e.g. AlphaGo in some sense reasons about consequences of moves, but has no conception of consequences in the physical world). This isn't enough to alleviate concern: some such domains might be general enough that something misbehaving in them would cause large problems. But it is enough for me to think that paying attention to scope of domains is a promising angle.

Further discussion of CFAR’s focus on AI safety, and the good things folks wanted from “cause neutrality”

This conclusion is way too strong. To give just one way it could fail: there's a big space of possibilities where discovering the planning fallacy in fact makes you less susceptible to the planning fallacy, but not immune.

CFAR's new mission statement (on our website)

I don't know who the intended audience for this is, but I think it's worth flagging that it seemed extremely jargon-heavy to me. I expect this to be off-putting to at least some people you actually want to attract (if it were one of my first interactions with CFAR I would be less inclined to engage again). In several cases you link to explanations of the jargon. This helps, but doesn't really solve the problem that you're asking the reader to do a large amount of work.

Some examples from the first few paragraphs:

  • clear and unhidden
  • original seeing
  • original making
  • existential risk
  • informational content [non-standard use]
  • thinker/doer
  • know the right passwords
  • double crux
  • outreach efforts

CFAR's new mission statement (on our website)

I found this document kind of interesting, but it felt less like what I normally understand as a mission statement, and more like "Anna's thoughts on CFAR's identity". I think there's a place for the latter, but I'd be really interested in seeing (a concise version of) the former, too.

If I had to guess right now I'd expect it to say something like:

We want to develop a community with high epistemic standards and good rationality tools, at least part of which is devoted to reducing existential risk from AI.

... but I kind of expect you to think I have the emphasis there wrong in some way.
