Stuart_Armstrong

Model splintering: moving from one imperfect model to another

I agree that once you have landed in the bad situation, mitigation options might be much the same, e.g. switch off the agent.

I'm most interested in mitigation options the agent can take itself, when it suspects it's out-of-distribution (and without being turned off, ideally).

Model splintering: moving from one imperfect model to another

Thanks! Lots of useful insights in there.

So I might classify moving out-of-distribution as something that happens to a classifier or agent, and model splintering as something that the machine learning system does to itself.

Why do you think it's important to distinguish these two situations? It seems that the insights for dealing with one situation may apply to the other, and vice versa.

Generalised models as a category

Cheers! My opinion on category theory has changed a bit, because of this post; by making things fit into the category formulation, I developed insights into how general relations could be used to connect different generalised models.

Generalised models as a category

Thanks! Corrected both of those; is a subset of .

Stuart_Armstrong's Shortform

Thanks! That's useful to know.

Introduction to Cartesian Frames

Did posts on generalised models as a category and how one can see Cartesian frames as generalised models.

Stuart_Armstrong's Shortform

Partial probability distribution

A concept that's useful for some of my research: a partial probability distribution.

That's a $Q$ that defines $Q(A \mid B)$ for some but not all $A$ and $B$ (with $Q(A) = Q(A \mid \Omega)$ for $\Omega$ being the whole set of outcomes).

This is a partial probability distribution iff there exists a probability distribution $P$ that is equal to $Q$ wherever $Q$ is defined. Call this a full extension of $Q$.

Suppose that $Q(C \mid D)$ is not defined. We can, however, say that $Q(C \mid D) = x$ is a logical implication of $Q$ if every full extension $P$ has $P(C \mid D) = x$.

Eg: $Q(A)$, $Q(B)$, $Q(A \cup B)$ will logically imply the value of $Q(A \cap B)$.
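
A minimal sketch of that kind of implication, assuming the three known values are the probabilities of two events and of their union. Two events split the outcomes into four atoms (both, first only, second only, neither), and the three known values pin down all four atom probabilities. A full extension exists exactly when none of them is negative, and in that case the probability of the intersection is forced. The function name `implied_atoms` and the example numbers are hypothetical, chosen only for illustration.

```python
from fractions import Fraction


def implied_atoms(q_a, q_b, q_union):
    """Atom probabilities forced by Q(A), Q(B) and Q(A ∪ B).

    Returns a dict of the four atom probabilities, or None when some
    atom would be negative, i.e. when no full extension exists.
    Illustrative helper only, not taken from the original post.
    """
    atoms = {
        "A and B": q_a + q_b - q_union,  # inclusion-exclusion
        "A only": q_union - q_b,
        "B only": q_union - q_a,
        "neither": 1 - q_union,
    }
    # The four values always sum to 1, so non-negativity is the only
    # condition needed for them to form a probability distribution.
    if any(p < 0 for p in atoms.values()):
        return None
    return atoms


# Consistent partial distribution: the intersection is logically implied.
consistent = implied_atoms(Fraction(1, 2), Fraction(1, 2), Fraction(3, 4))

# Inconsistent values: the union cannot be less probable than A itself.
inconsistent = implied_atoms(Fraction(9, 10), Fraction(9, 10), Fraction(1, 2))
```

Here `consistent["A and B"]` is 1/4, while `inconsistent` is `None` because no probability distribution agrees with those three values. Exact fractions are used so that the non-negativity check is not affected by floating-point rounding.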

Introduction to Cartesian Frames

I like it. I'll think about how it fits with my ways of thinking (eg model splintering).

Counterfactual control incentives

Cheers; Rebecca likes the "instrumental control incentive" terminology; she claims it's more in line with control theory terminology.

We agree that lack of control incentive on X does not mean that X is safe from influence from the agent, as it may be that the agent influences X as a side effect of achieving its true objective. As you point out, this is especially true when X and a utility node are probabilistically dependent.

I think it's more dangerous than that. When there is mutual information, the agent can learn to behave as if it were specifically manipulating X; the counterfactual approach doesn't seem to do what it was intended to do.

I'm more thinking of how we could automate the navigating of these situations. The detection will be part of this process, and it's not a Boolean yes/no, but a matter of degree.