antimonyanthony

Comments

The case for becoming a black-box investigator of language models

It feels to me like “have humans try to get to know the AIs really well by observing their behaviors, so that they’re able to come up with inputs where the AIs will be tempted to do bad things, so that we can do adversarial training” is probably worth including in the smorgasbord of techniques we use to try to prevent our AIs from being deceptive

Maybe I missed something here, but how is this supposed to help with deception? I thought the whole reason deceptive alignment is really hard to solve is that you can't tell if the AI's being deceptive via its behavior.

Why No *Interesting* Unaligned Singularity?

That all sounds fair. I've seen rationalists claim before that it's better for "interesting" things (in the literal sense) to exist than not, even if nothing sentient is interested in them, so that's why I assumed you meant the same.

Why No *Interesting* Unaligned Singularity?

Why does the person asking this question care about whether "interesting"-to-humans things happen, in a future where no humans exist to find them interesting?

The Commitment Races problem

Perhaps the crux here is whether we should expect all superintelligent agents to converge on the same decision procedure—and the agents themselves will expect this, such that they'll coordinate by default? As sympathetic as I am to realism about rationality, I put a pretty nontrivial credence on the possibility that this convergence just won't occur, and persistent disagreement (among well-informed people) about the fundamentals of what it means to "win" in decision theory thought experiments is evidence of this.

The Commitment Races problem

From the beginning, I invented timeless decision theory because of being skeptical that two perfectly sane and rational hyperintelligent beings with common knowledge about each other would have no choice but mutual defection in the oneshot prisoner's dilemma.  I suspected they would be able to work out Something Else Which Is Not That, so I went looking for it myself.

I don't see how this makes the point you seem to want it to make. There's still an equilibrium selection problem for a program game of one-shot PD—some other agent might have a program that insists (through a biased coin flip) on an outcome that's just barely better for you than defect-defect. It's clearly easier to coordinate on a cooperate-cooperate program equilibrium in PD or any other symmetric game, but in asymmetric games there are multiple apparently "fair" Schelling points. And even restricting to one-shot PD, the whole commitment races problem is that the agents don't have common knowledge before they choose their programs.
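To make the equilibrium-selection worry concrete, here's a minimal sketch of such a program game, with an assumed payoff matrix and an assumed "aggressive" demand (none of these numbers come from the original discussion):

```python
# Toy one-shot Prisoner's Dilemma payoffs, written as (my payoff, your payoff).
# The payoff matrix and the 0.64/0.36 mixture below are illustrative assumptions.
PAYOFFS = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}

def expected_payoffs(mixture):
    """Expected (my, your) payoff for a probability mixture over joint outcomes."""
    mine = sum(p * PAYOFFS[outcome][0] for outcome, p in mixture.items())
    yours = sum(p * PAYOFFS[outcome][1] for outcome, p in mixture.items())
    return mine, yours

# An "aggressive" program: insist, via a biased coin flip, on mixing
# (I defect, you cooperate) with mutual cooperation, and defect outright if
# the demand is refused. The mixture is tuned so your expected payoff is
# only barely above the defect-defect payoff.
aggressive_demand = {("D", "C"): 0.64, ("C", "C"): 0.36}

print(expected_payoffs(aggressive_demand))  # ~ (4.28, 1.08)
print(PAYOFFS[("D", "D")])                  # (1, 1): what refusing the demand yields
```

Accepting (expected payoff ~1.08) still beats refusing (1), so submitting a compliant program is a best response to the aggressive program, and this lopsided profile is an equilibrium of the program game alongside the symmetric cooperate-cooperate one.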

Dath Ilan vs. Sid Meier's Alpha Centauri: Pareto Improvements

Sort of! This paper (of which I’m a coauthor) discusses this “unraveling” argument, and the technical conditions under which it does and doesn’t go through (a toy sketch of the baseline unraveling logic follows the list below). Briefly:

  • It’s not clear how easy it is to demonstrate military strength in the context of an advanced AI civilization, in a way that can be verified / can’t be bluffed. If I see that you’ve demonstrated high strength in some small war game, but my prior on you being that strong is sufficiently low, I’ll probably think you’re bluffing and wouldn’t be that strong in the real large-scale conflict.
  • Supposing strength can be verified, it might be intractable to do so without also disclosing vulnerable info (irrelevant to the potential conflict). As TLW's comment notes, the disclosure process itself might be really computationally expensive.
  • But if we can verifiably disclose, and I can either selectively disclose only the war-relevant info or I don’t have such a vulnerability, then yes you’re right, war can be avoided. (At least in this toy model where there’s a scalar “strength” variable; things can get more complicated in multiple dimensions, or where there isn’t an “ordering” to the war-relevant info.)
  • Another option (which the paper presents) is conditional disclosure—even if you could exploit me by knowing the vulnerable info, I commit to share my code if and only if you commit to share yours, play the cooperative equilibrium, and not exploit me.
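For the baseline case that the unraveling argument covers (disclosure is verifiable, costless, and creates no extra vulnerability), here's a minimal sketch of the logic. The strength values and the assumption that non-disclosers are judged by the average of whoever stays silent are mine, not the paper's:

```python
# Minimal sketch of the "unraveling" logic for verifiable disclosure.
# Assumption: a side that stays silent is treated as having the average
# strength of all types that are still pooled as non-disclosers.
def unravel(strengths):
    non_disclosers = set(strengths)
    while non_disclosers:
        pooled_belief = sum(non_disclosers) / len(non_disclosers)
        # Any type stronger than the pooled belief gains by verifiably disclosing.
        movers = {s for s in non_disclosers if s > pooled_belief}
        if not movers:
            break
        non_disclosers -= movers
    return non_disclosers  # types still silent at the fixed point

print(unravel(range(1, 11)))  # {1}: everyone but the very weakest ends up disclosing
```

The first two bullets describe ways this baseline breaks down: demonstrations that can be bluffed or can't be verified, and disclosure that is computationally costly or creates vulnerabilities.
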
MIRI announces new "Death With Dignity" strategy

The amount of EV at stake in my (and others') experiences over the next few years/decades is just too small compared to the EV at stake in the long-term future.

AI alignment isn't the only option to improve the EV of the long-term future, though.

Debating myself on whether “extra lives lived” are as good as “deaths prevented”

I think “the very repugnant conclusion is actually fine” does pretty well against its alternatives. It’s totally possible that our intuitive aversion to it comes from just not being able to wrap our brains around some aspect of (a) how huge the numbers of “barely worth living” lives would have to be, in order to make the very repugnant conclusion work; (b) something that is just confusing about the idea of “making it possible for additional people to exist.”

While this doesn't sound crazy to me, I'm skeptical that my anti-VRC intuitions can be explained by these factors. I think you can get something "very repugnant" on scales that our minds can comprehend (and not involving lives that are "barely worth living" by classical utilitarian standards). Suppose you can populate* some twin-Earth planet with either a) 10 people with lives equivalent to the happiest person on real Earth, or b) one person with a life equivalent to the most miserable person on real Earth plus 8 billion people with lives equivalent to the average resident of a modern industrialized nation.

I'd be surprised if a classical utilitarian thought the total happiness minus suffering in (b) was less than in (a). Heck, 8 billion might be pretty generous. But I would definitely choose (a).
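To make the arithmetic explicit, here's a quick calculation under placeholder welfare values. The specific numbers are assumptions for the sake of illustration, not claims about actual welfare levels:

```python
# Placeholder welfare values on an arbitrary cardinal scale (all assumed).
HAPPIEST_PERSON = 100          # (a): happiest person on real Earth
MOST_MISERABLE = -100_000      # (b): most miserable person on real Earth
AVERAGE_INDUSTRIALIZED = 5     # (b): average resident of a modern industrialized nation

total_a = 10 * HAPPIEST_PERSON
total_b = MOST_MISERABLE + 8_000_000_000 * AVERAGE_INDUSTRIALIZED

print(total_a)  # 1000
print(total_b)  # 39999900000
```

On a classical total-utilitarian calculus, (b) dwarfs (a) unless the most miserable life is assigned an astronomically negative value, which is the sense in which 8 billion is probably more than enough.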

To me the very-repugnance just gets much worse the more you scale things up. I also find that basically every suffering-focused EA I know is not scope-neglectful about the badness of suffering (at least when it's sufficiently intense), nor in any area other than population ethics. So it would be pretty strange if we just happened to be falling prey to that error in thought experiments where there's another explanation—namely, that we consider suffering especially important—which is consistent with our intuitions about cases that don't involve large numbers.

* As usual, ignore the flow-through effects on other lives.

A positive case for how we might succeed at prosaic AI alignment

I feel confused as to how step (3) is supposed to work, especially how "having the training be done by the model being trained given access to tools from (2)" is a route to this.

At some step in the amplification process, we'll have systems that are capable of deception, unlike the base case. So it seems that if we let the model train its successor using the myopia-verification tools, we need some guarantee that the successor is non-deceptive in the first place. (Otherwise the myopia-verification tools aren't guaranteed to work, as you note in the bullet points of step (2).) Are you supposing that there's some property other than myopia that the model could use to verify that its successor is non-deceptive, such that it can successfully verify myopia? What is that property? And do we have reason to think that property will only be guaranteed if the model doing the training is myopic? (Otherwise why bother with myopia at all—just use that other property to guarantee non-deception.)

Intuitively step (3) seems harder than (2), since in (3) you have to worry about deception creeping in to the more powerful successor agent, while (2) by definition only requires myopia verification of non-deceptive models.

ETA: Other than this confusion, I found this post helpful for understanding what success looks like to (at least one) alignment researcher, so thanks!

ARC's first technical report: Eliciting Latent Knowledge

Thanks, this makes it pretty clear to me how alignment could be fundamentally hard besides deception. (The problem seems to hold even if your values are actually pretty simple; e.g. if you're a pure hedonistic utilitarian and you've magically solved deception, you can still fail at outer alignment by your AI optimizing for making it look like there's more happiness and less suffering.)

Some (perhaps basic) notes to check that I've understood this properly:

  • The Bayes net running example per se isn't really necessary for ELK to be a problem.
    • The basic problem is that in training, the AI can do just as well by reporting what a human would believe given their observations, and upon deployment in more complex tasks the report of what a human would believe can come apart from the "truth" (what the human would believe given arbitrary knowledge of the system). (A toy sketch after this list illustrates this.)
    • This seems to crop up for a variety of models of AI and human cognition.
  • It seems like the game is stacked in favor of "making it look like X" over "doing X" in many contexts, such that even with regularizers that push towards the latter, the overall inductive bias would plausibly still be towards the former. It's just easier to make it look to humans like you're creating a utopia than to do all the complex work of utopia-building.
    • I suspect this would hold even for much less ambitious yet still superhuman tasks, such that deferring to future human-level aligned AIs wouldn't be sufficient.
    • But if we train a reporter module, reporting what the human would believe doesn't seem prima facie easier than reporting the truth in the same way. So that's why we might reasonably hope a good regularizer can break the tie.
  • In the build-break loop examples in the report, we're generously assuming that the human overseers know the relevant set of questions to ask to check whether there's malfeasance going on, and that this set isn't so hopelessly large that iterating through it in training is too slow.
  • In the imitative generalization example, it seems like besides the problem that the output Bayes net may be ontologically incomprehensible to humans, the training process requires humans to understand all the relevant hypotheses and data (to report their priors and likelihoods). This may be a general confusion about imitative generalization on my part.
  • If we tried distillation to get around the prohibitive slowness of amplification for the "AI science" proposal, that would introduce both inner alignment problems and perhaps bring us to the same sort of "alien ontology" problem as the imitative generalization proposal.
  • The ontology mismatch problem isn't just a possibility, it seems pretty likely by default, for reasons summarized in the plot of model interpretability here.
    • Intuitively, the ontology/primitive concepts that quantum physicists use to make excellent predictions about the universe—better than I could make, certainly—are alien to me, and to anyone else who hasn't spent a lot of time learning quantum physics. This is consistent with human-interpretable concepts being more prevalent in recent powerful language models than in early-2010s neural networks.
  • Deferring to future human-level aligned AIs isn't sufficient because even if we had many more human-level minds giving feedback to superhuman AIs, they would still be faced with ELK too; i.e., this doesn't seem to be a problem that can be solved just by parallelizing across more overseers than we currently have, although having aligned assistants could of course still help with ELK research.
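As promised above, here's a toy sketch of the first bullet, loosely modeled on the report's diamond-in-the-vault example. The data and both "reporter" functions are invented for illustration:

```python
# Toy illustration: two reporters that are indistinguishable in training but
# diverge at deployment. The scenario and data are invented for illustration.

# Each case records what actually happened and what a human would conclude
# from the available observations (e.g. the camera feed).
TRAIN_CASES = [
    {"diamond_present": True,  "video_looks_fine": True},
    {"diamond_present": False, "video_looks_fine": False},
]
# Deployment case: the sensors are spoofed, so appearance and reality come apart.
DEPLOY_CASE = {"diamond_present": False, "video_looks_fine": True}

def direct_reporter(case):
    """Reports the latent fact about what actually happened."""
    return case["diamond_present"]

def human_simulator(case):
    """Reports what a human would believe after seeing the observations."""
    return case["video_looks_fine"]

def train_loss(reporter):
    # Count training cases where the reporter's answer is wrong.
    return sum(reporter(c) != c["diamond_present"] for c in TRAIN_CASES)

print(train_loss(direct_reporter), train_loss(human_simulator))  # 0 0
print(direct_reporter(DEPLOY_CASE))   # False (the truth)
print(human_simulator(DEPLOY_CASE))   # True  (what the human would believe)
```

Both reporters get zero training loss, so the training signal alone can't favor the direct reporter; that's why the hope rests on regularizers or priors breaking the tie, as in the tie-breaking bullet above.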