paulfchristiano

Sequences

Iterated Amplification

Wiki Contributions

Comments

Prizes for ELK proposals
  1. "Bad reporter" = any reporter that gives unambiguously bad answers in some situations (in the ontology identification case, basically anything other than a direct translator)
  2. "use knowledge of direct translation" = it may be hard to learn direct translation because you need a bunch of parameters to specify how to do it, but these "bad" reporters may also need the same bunch of parameters (because they do direct translation in some situations)
  3. In the "upstream" counterexample, the bad reporter does direct translation under many circumstances but then sometimes uses a different heuristic that generates a bad answer. So the model needs all the same parameters used for direct translation, as mentioned in the last point. (I think your understanding of this was roughly right.)
  4. More like: now we've learned a reporter which contains what we want and also some bad stuff, you could imagine doing something like imitative generalization (or e.g. a different regularization scheme that jointly learned multiple reporters) in order to get just what we wanted.
Prizes for ELK proposals

I'd like to get different answers in those two worlds. That definitely requires having some term in the loss that is different in W1 and W2. There are three ways the kinds of proposals in the doc can handle this:

  • Consistency checks will behave differently in W1 and W2. Even if a human can never produce different answers to Q1 and Q2, they can talk about situations where Q1 and Q2 differ and describe how the answers to those questions relate to all the other facts about the world (and to the answer to Q).
  • If language is rich enough, and we are precise enough with the formulation of questions, then you may hope that lots of other questions have different interpretations in W1 and W2, i.e. such that the simplest way of answering other questions will generalize correctly to Q.
  • In the case of amplification/debate, Q2 = "Does a human with AI assistants believe a diamond is in the room?" and so we can hope that in fact Q1 and Q2 have the same answers in all situations. (Though we aren't optimistic about this.)
Prizes for ELK proposals

In the case of the AI, the Bayes net is explicit, in the sense that we could print it out on a sheet of paper and try to study it once training is done, and the main reason we don't do that is because it's likely to be too big to make much sense of.

We don't quite have access to the AI Bayes net---we just have a big neural network, and we sometimes talk about examples where what the neural net is doing internally can be well-described as "inference in a Bayes net."

So ideally a solution would use neither the human Bayes net or the AI Bayes net.

But when thinking about existing counterexamples, it can still be useful to talk about how we want an algorithm to behave in the case where the human/AI are using a Bayes net, and we do often think about ideas that use those Bayes nets (with the understanding that we'd ultimately need to refine them into approaches that don't depend on having an explicit Bayes net).

Prizes for ELK proposals

We're going to accept submissions through February 10.

(We actually ended up receiving more submissions than I expected but it seems valuable, and Mark has been handling all the reviews, so running for another 20 days seems worthwhile.)

Alex Ray's Shortform

"The goal is" -- is this describing Redwood's research or your research or a goal you have more broadly?

My general goal, Redwood's current goal, and my understanding of the goal of adversarial training (applied to AI-murdering-everyone) generally.

I'm curious how this is connected to "doesn't write fiction where a human is harmed".

"Don't produce outputs where someone is injured" is just an arbitrary thing not to do. It's chosen to be fairly easy not to do (and to have the right valence so that you can easily remember which direction is good and which direction is bad, though in retrospect I think it's plausible that a predicate with neutral valence would have been better to avoid confusion).

Alex Ray's Shortform

The goal is not to remove concepts or change what the model is capable of thinking about, it's to make a model that never tries to deliberately kill everyone. There's no doubt that it could deliberately kill everyone if it wanted to.

Prizes for ELK proposals

I'd be fine with a proposal that flips coins and fails with small probability (in every possible world).

The Solomonoff Prior is Malign

I'm not sure I follow your reasoning, but IBP sort of does that. In IBP we don't have subjective expectations per se, only an equation for how to "updatelessly" evaluate different policies.

It seems like any approach that evaluates policies based on their consequences is fine, isn't it? That is, malign hypotheses dominate the posterior for my experiences, but not for things I consider morally valuable.

I may just not be understanding the proposal for how the IBP agent differs from the non-IBP agent. It seems like we are discussing a version that defines values differently, but where neither agent uses Solomonoff induction directly. Is that right?

The Solomonoff Prior is Malign

Sure. But it becomes much more amenable to methods such as confidence thresholds, which are applicable to some alignment protocols at least.

It seems like you have to get close to eliminating malign hypotheses in order to apply such methods (i.e. they don't work once malign hypotheses have > 99.9999999% of probability, so you need to ensure that benign hypothesis description is within 30 bits of the good hypothesis), and embededness alone isn't enough to get you there.

I'm not sure I understand what you mean by "decision-theoretic approach"

I mean that you have some utility function, are choosing actions based on E[utility|action], and perform solomonoff induction only instrumentally because it suggests ways in which your own decision is correlated with utility. There is still something like the universal prior in the definition of utility, but it no longer cares at all about your particular experiences (and if you try to define utility in terms of solomonoff induction applied to your experiences, e.g. by learning a human, then it seems again vulnerable to attack bridging hypotheses or no).

This seems wrong to me. The inductor doesn't literally simulate the attacker. It reasons about the attacker (using some theory of metacosmology) and infers what the attacker would do, which doesn't imply any wastefulness.

I agree that the situation is better when solomonoff induction is something you are reasoning about rather than an approximate description of your reasoning. In that case it's not completely pathological, but it still seems bad in a similar way to reason about the world by reasoning about other agents reasoning about the world (rather than by direct learning the lessons that those agents have learned and applying those lessons in the same way that those agents would apply them).

Load More