Vanessa Kosoy

AI alignment researcher supported by MIRI and LTFF. Working on the learning-theoretic agenda. Based in Israel. See also LinkedIn.

E-mail: vanessa DOT kosoy AT {the thing reverse stupidity is not} DOT org

Comments

Is it possible to replace the maximin decision rule in infra-Bayesianism with a different decision rule? One surprisingly strong desideratum for such decision rules is the learnability of some natural hypothesis classes.

In the following, all infradistributions are crisp.

Fix finite action set  and finite observation set .  For any  and , let

be defined by

In other words, this kernel samples a time step  out of the geometric distribution with parameter , and then produces the sequence of length  that appears in the destiny starting at .
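
As a rough reconstruction of the lost formulas (the symbols $\mathcal{A}$, $\mathcal{O}$, $k$, $\gamma$ and the name $K_{k,\gamma}$ are placeholders, not necessarily the original notation), the kernel might be written as

$$K_{k,\gamma} : (\mathcal{A}\times\mathcal{O})^{\omega} \to \Delta\left((\mathcal{A}\times\mathcal{O})^{k}\right), \qquad K_{k,\gamma}(x) = \sum_{n=0}^{\infty} (1-\gamma)\gamma^{n}\,\delta_{x_{n:n+k}}$$

where $x_{n:n+k}$ is the length-$k$ block of the destiny $x$ starting at time $n$ and $\delta$ denotes a point mass; the exact parameterization of the geometric distribution may differ from the original.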

For any continuous[1] function , we get a decision rule. Namely, this rule says that, given infra-Bayesian law  and discount parameter , the optimal policy is

The usual maximin is recovered when we have some reward function  and corresponding to it is
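
As a rough reconstruction under assumed notation (writing $\Lambda$ for the law, $f$ for the continuous functional, and $K_{k,\gamma*}$ for the pushforward along the kernel above; these need not be the original symbols), the decision rule would select

$$\pi^{*}(\Lambda,\gamma) \in \underset{\pi}{\operatorname{arg\,max}}\; f\left(K_{k,\gamma*}\Lambda(\pi)\right)$$

and the usual maximin corresponds to choosing $f(\Theta) = \min_{\theta \in \Theta} \mathbb{E}_{\theta}[r]$ for the given reward function $r$.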

Given a set  of laws, it is said to be learnable w.r.t.  when there is a family of policies  such that for any 
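
In the same assumed notation, the learnability condition for the family $\{\pi_{\gamma}\}$ would read: for every $\Lambda$ in the set of laws,

$$\lim_{\gamma \to 1}\left(\max_{\pi} f\left(K_{k,\gamma*}\Lambda(\pi)\right) - f\left(K_{k,\gamma*}\Lambda(\pi_{\gamma})\right)\right) = 0$$

i.e. the regret of $\pi_{\gamma}$ relative to the $f$-optimal value vanishes in the limit $\gamma \to 1$.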

For  we know that e.g. the set of all communicating[2] finite infra-RDPs is learnable. More generally, for any  we have the learnable decision rule

This is the "mesomism" I taked about before

Also, any monotonically increasing  seems to be learnable, i.e. any  s.t. for  we have . For such decision rules, you can essentially assume that "nature" (i.e. whatever resolves the ambiguity of the infradistributions) is collaborative with the agent. These rules are not very interesting.

On the other hand, decision rules of the form  are not learnable in general, and neither are decision rules of the form  for  monotonically increasing.

Open Problem: Are there any learnable decision rules that are neither mesomism nor monotonically increasing?

A positive answer to the above would provide interesting generalizations of infra-Bayesianism. A negative answer to the above would provide an interesting novel justification of the maximin. Indeed, learnability is not a criterion that was ever used in axiomatic constructions of decision theory[3], AFAIK.

  1. ^

    We can try considering discontinuous functions as well, but it seems natural to start with continuous. If we want the optimal policy to exist, we usually need  to be at least upper semicontinuous.

  2. ^

    There are weaker conditions than "communicating" that are sufficient, e.g. "resettable" (meaning that the agent can always force returning to the initial state), and some even weaker conditions that I will not spell out here.

  3. ^

    I mean theorems like VNM, Savage etc.

First, given nanotechnology, it might be possible to build colonies much faster.

Second, I think the best way to live is probably as uploads inside virtual reality, so terraforming is probably irrelevant.

Third, it's sufficient that the colonists are uploaded or cryopreserved (via some superintelligence-vetted method) and stored someplace safe (whether on Earth or in space) until the colony is entirely ready.

Fourth, if we can stop aging and prevent other dangers (including unaligned AI), then a timeline of decades is fine.

I don't know whether we live in a hard-takeoff singleton world or not. I think there is some evidence in that direction, e.g. from thinking about the kind of qualitative changes in AI algorithms that might come about in the future, and their implications for the capability growth curve, and also about the possibility of recursive self-improvement. But the evidence is definitely far from conclusive (in any direction).

I think that the singleton world is definitely likely enough to merit some consideration. I also think that some of the same principles apply to some multipole worlds.

"Commit to not make anyone predictably regret supporting the project or not opposing it" is worrying only by omission -- it's a good guideline, but it leaves the door open for "punish anyone who failed to support the project once the project gets the power to do so".

Yes, I never imagined doing such a thing, but I definitely agree it should be made clear. Basically: don't make threats, i.e. don't try to shape others' incentives in ways they would be better off precommitting not to go along with.

It's not because they're not on Earth, it's because they have a superintelligence helping them. Which might give them advice and guidance, take care of their physical and mental health, create physical constraints (e.g. that prevent violence), or even give them mind augmentation, as mako yass suggested (although I don't think that's likely to be a good idea early on). And I don't expect their environment to be fragile because, again, it is designed by superintelligence. But I don't know the details of the solution: the AI will decide those, as it will be much smarter than me.

I don't have to know in advance that we're in a hard-takeoff singleton world, or even that my AI will succeed in achieving those objectives. The only thing I absolutely have to know in advance is that my AI is aligned. What sort of evidence will I have for this? A lot of detailed mathematical theory, with the modeling assumptions validated by computational experiments and knowledge from other fields of science (e.g. physics, cognitive science, evolutionary biology).

I think you're misinterpreting Yudkowsky's quote. "Using the null string as input" doesn't mean "without evidence", it means "without other people telling me parts of the answer (to this particular question)".

I'm not sure what is "extremely destructive and costly" in what I described? Unless you mean the risk of misalignment, in which case, see above.

I know, this is what I pointed at in footnote 1. Although "dumbest AI" is not quite right: the sort of AI MIRI envisions is still very superhuman in particular domains, but is somehow kept narrowly confined to acting within those domains (e.g. designing nanobots). The rationale is mostly not the assumption that at that stage it won't be possible to create a full superintelligence, but the assumption that aligning such a restricted AI would be easier. I have different views on alignment, which lead me to believe that aligning a full-fledged superintelligence (sovereign) is actually easier (via PSI or something in that vein). On this view, we still need to contend with the question of what we will (honestly!) tell other people our AI is actually going to do. Hence, the above.

People like Andrew Critch and Paul Christiano have criticized MIRI in the past for their "pivotal act" strategy. The latter can be described as "build superintelligence and use it to take unilateral world-scale actions in a manner inconsistent with existing law and order" (e.g. the notorious "melt all GPUs" example). The critics say (justifiably, IMO) that this strategy looks pretty hostile to many actors, can trigger preemptive actions against the project attempting it, and generally fosters mistrust.

Is there a good alternative? The critics tend to assume slow-takeoff multipole scenarios, which makes the comparison with their preferred solutions somewhat "apples and oranges". Suppose that we do live in a hard-takeoff singleton world: what then? One answer is "create a trustworthy, competent, multinational megaproject". Alright, but suppose you can't create a multinational megaproject, but you can build aligned AI unilaterally. What is a relatively cooperative thing you can do which would still be effective?

Here is my proposed rough sketch of such a plan[1]:

  • Commit to not make anyone predictably regret supporting the project or not opposing it. This rule is the most important and the one I'm the most confident of by far. In an ideal world, it should be more-or-less sufficient in itself. But in the real world, it might still be useful to provide more tangible details, which the next items try to do.
  • Within the bounds of Earth, commit to obey international law, and local law at least insofar as the latter is consistent with international law, with only two possible exceptions (see below). Notably, this allows for actions such as (i) distributing technology that cures diseases, reverses aging, produces cheap food etc. (ii) lobbying for societal improvements (but see the superpersuasion clause below).
  • Exception 1: You can violate any law if it's absolutely necessary to prevent a catastrophe on a scale comparable with a nuclear war or worse, but only to the extent necessary for that purpose. (e.g. if a lab is about to build unaligned AI that would kill millions of people and it's not possible to persuade them to stop or convince the authorities to act in a timely manner, you can sabotage it.)[2]
  • Build space colonies. These space colonies will host utopic societies and most people on Earth are invited to immigrate there.
  • Exception 2: A person held in captivity in a manner legal according to local law, who faces the death penalty or is treated in a manner violating accepted international rules about the treatment of prisoners, might be given the option to leave for the colonies. If they exercise this option, their original jurisdiction is permitted to exile them from Earth permanently and/or bar them from any interaction with Earth that can plausibly enable activities illegal according to that jurisdiction[3].
  • Commit to adequately compensate any economy hurt by emigration to the colonies or by other disruptions you cause. For example, if space emigration causes the loss of valuable labor, you can send robots to replace it.
  • Commit to not directly intervene in international conflicts or upset the balance of powers by supplying military tech to any side, except in cases when it is absolutely necessary to prevent massive violations of international law and human rights.
  • Commit to only use superhuman persuasion when arguing towards a valid conclusion via valid arguments, in a manner that doesn't go against the interests of the person being persuaded. 
  1. ^

    Importantly, this makes stronger assumptions about the kind of AI you can align than MIRI-style pivotal acts. Essentially, it assumes that you can directly or indirectly ask the AI to find good plans consistent with the commitments below, rather than directing it to do something much more specific. Otherwise, it is hard to use Exception 1 (see below) gracefully.

  2. ^

    A more conservative alternative is to limit Exception 1 to catastrophes that would spill over to the space colonies (see next item).

  3. ^

    It might be sensible to consider a more conservative version which doesn't have Exception 2, even though the implications are unpleasant.

Ratfic idea / conspiracy theory: Yudkowsky traveled back in time to yell at John Nash about how Nash equilibria are stupid[1], and that's why Nash went insane.

h/t Marcus (my spouse)

  1. ^

    They are.

Sure, if, after updating on your discovery, it seems that the current trajectory is not doomed, that might imply accelerating is good. But here that is very far from being the case.

I missed that paragraph on first reading, mea culpa. I think that your story about how it's a win for interpretability and alignment is very unconvincing, but I don't feel like hashing it out atm. Revised to weak downvote.

Also, if you expect this to take off, then by your own admission you are mostly accelerating the current trajectory (which I consider mostly doomed) rather than changing it. Unless you expect it to take off mostly thanks to you?
