...if the malign superintelligence knows what observations we would condition on, it can likely arrange to make the world match those observations, making the probability of our observations given a malign superintelligence roughly one.

The probability of any observation given the existence of a malign superintelligence is 1? So P(observation | malign superintelligence) adds up to like a gajillion?
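
To spell out the worry in symbols, here is a rough sketch. It assumes (purely for illustration) that the candidate observations o_1, ..., o_n are mutually exclusive and exhaustive, and writes M for the hypothesis that a malign superintelligence exists:

```latex
\sum_{i=1}^{n} P(o_i \mid M) = 1
\quad\Longrightarrow\quad
P(o_i \mid M) \approx 1 \ \text{for at most one } i .
```

Under that assumption the quoted claim cannot hold for every observation at once. At most it can hold for a single one, for example the observation that actually gets conditioned on.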

A list of core AI safety problems and how I hope to solve them

mattmacdermott1mo10

5.1.4. It may be that the easiest plan to find involves an unacceptable degree of power-seeking and control over irrelevant variables. Therefore, the score function should penalize divergence of the trajectory of the world state from the trajectory of the status quo (in which no powerful AI systems take any actions).

5.1.5. The incentives under 5.1.4 by default are to take control over irrelevant variables so as to ensure that they proceed as in the anticipated "status quo". Infrabayesian uncertainty about the dynamics is the final component that removes this incentive.

If you know which variables you want to remove the incentive to control, an alternative to penalising divergence is path-specific objectives, i.e. you compute the score function under an intervention on the model that sets the irrelevant variables to their status quo values. Then the AI has no incentive to control the variables, but no incentive to keep them the same either.
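
A minimal sketch of the difference, using a made-up toy model: the variable names, the dictionary-based actions, and the coefficient 2.0 are all hypothetical, not taken from the post. Z is the irrelevant variable, and the path-specific score is computed with Z set by intervention to its status quo value.

```python
# Toy structural model. Every name and number here is a hypothetical
# illustration, not something defined in the original post.
STATUS_QUO_Z = 0.0  # value Z takes if the AI leaves it alone


def irrelevant_var(action):
    """Z: an irrelevant variable that the action can push off its status quo."""
    return STATUS_QUO_Z + action["push_z"]


def task_var(action, z):
    """Y: the task outcome, which depends on effort and (undesirably) on Z."""
    return action["effort"] + 2.0 * z


def score(action, do_z=None):
    """Score the outcome Y.

    If do_z is given, Z is set by intervention to that value, which cuts
    the path from the action to the score that runs through Z.
    """
    z = irrelevant_var(action) if do_z is None else do_z
    return task_var(action, z)


def path_specific_score(action):
    """Score with Z held at its status quo value by intervention."""
    return score(action, do_z=STATUS_QUO_Z)


leave_z_alone = {"effort": 1.0, "push_z": 0.0}
seize_z = {"effort": 1.0, "push_z": 3.0}

# Ordinary score: controlling Z pays off.
print(score(leave_z_alone), score(seize_z))
# Path-specific score: the two actions are scored identically.
print(path_specific_score(leave_z_alone), path_specific_score(seize_z))
```

In this toy setup the path-specific score is indifferent between the two actions, which matches both halves of the claim: there is no reward for controlling Z, and no reward for keeping it fixed either.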

Dalcy's Shortform

mattmacdermott2mo20

theorem is limited. only applies to cases where the decision node is not upstream of the environment nodes

I think you can drop this premise and modify the conclusion to “you can find a causal model for all variables upstream of the utility and not downstream of the decision.”

Examples of Highly Counterfactual Discoveries?

mattmacdermott3mo52

Lucius-Alexander SLT dialogue?

mattmacdermott's Shortform

mattmacdermott5mo10

Yep, exactly.

mattmacdermott's Shortform

mattmacdermott5mo10

Neural network interpretability feels like it should be called neural network interpretation.

Bengio's Alignment Proposal: "Towards a Cautious Scientist AI with Convergent Safety Bounds"

mattmacdermott5mo1-1

If you could tractably obtain and work with the posterior, I think that would be much more useful than a normal ensemble. E.g. being able to choose at what threshold you start paying attention to a hypothesis which predicts harm, and vary it depending on the context, seems like a big plus.
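
As a rough sketch of the threshold idea: assume (hypothetically) that the posterior is available as a small list of weighted harm-evaluation models. The `flags_harm` helper, the `dose` field, and the weights below are invented for illustration and are not part of Bengio's proposal.

```python
from typing import Callable, Dict, List, Tuple

Plan = Dict[str, float]
HarmModel = Callable[[Plan], bool]  # returns True if the model predicts harm


def flags_harm(
    plan: Plan,
    posterior: List[Tuple[float, HarmModel]],
    threshold: float,
) -> bool:
    """Flag the plan if any hypothesis with posterior weight >= threshold
    predicts harm. Lower thresholds mean more cautious behaviour."""
    return any(
        weight >= threshold and model(plan) for weight, model in posterior
    )


# Hypothetical posterior over three harm-evaluation models.
posterior = [
    (0.90, lambda plan: False),             # "never harmful"
    (0.08, lambda plan: plan["dose"] > 5),  # "harmful above 5"
    (0.02, lambda plan: plan["dose"] > 1),  # "harmful above 1"
]

plan = {"dose": 3.0}

# Low-stakes context: ignore hypotheses below 5% posterior weight.
print(flags_harm(plan, posterior, threshold=0.05))
# High-stakes context: attend to anything above 1% posterior weight.
print(flags_harm(plan, posterior, threshold=0.01))
```

A plain ensemble would give you the models but not calibrated weights, so there would be nothing principled to set the threshold against.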

I think the reason Bayesian ML isn't that widely used is that it's intractable to do. So Bengio's stuff would have to successfully make it competitive with other methods.

Bengio's Alignment Proposal: "Towards a Cautious Scientist AI with Convergent Safety Bounds"

mattmacdermott5mo10

I agree this agenda wants to solve ELK by making legible predictive models.

But even if it doesn’t succeed in that, it could bypass/solve ELK as it presents itself in the problem of evaluating whether a given plan has a harmful outcome, by training a posterior over harm evaluation models rather than a single one, and then acting conservatively with respect to that posterior. The part which seems like a natural byproduct of the Bayesian approach is “more systematically covering the space of plausible harm evaluation models”.

Shortform

mattmacdermott5mo234

It's not just a LessWrong thing (Wikipedia).

My feeling is that (like most jargon) it's to avoid ambiguity arising from the fact that "commitment" has multiple meanings. When I google commitment I get the following two definitions:

the state or quality of being dedicated to a cause, activity, etc.
an engagement or obligation that restricts freedom of action

Precommitment is a synonym for the second meaning, but not the first. When you say, "the agent commits to 1-boxing," there's no ambiguity as to which type of commitment you mean, so saying "precommits" instead seems pointless. But if you were to say, "commitment can get agents more utility," it might sound like you were saying, "dedication can get agents more utility," which is also true.

Bengio's Alignment Proposal: "Towards a Cautious Scientist AI with Convergent Safety Bounds"

mattmacdermott5mo50

tutorial
