ejenner

Comments

The (not so) paradoxical asymmetry between position and momentum

Interesting thoughts re anthropic explanations, thanks!

I agree that asymmetry doesn't tell us which one is more fundamental, and I wasn't aiming to argue for either one being more fundamental (though position does feel more fundamental to me, and that may have shown through). What I was trying to say was only that they are asymmetric on a cognitive level, in the sense that they don't feel interchangeable, and that there must therefore be some physical asymmetry.

Still, I should have been more specific than saying "asymmetric", because not every kind of asymmetry in the Hamiltonian can explain the cognitive asymmetry. For the "forces decay with distance in position space" asymmetry, I think it's reasonably clear why it leads to cognitive asymmetry, but for the "position occurs as an infinite power series" asymmetry, it's not clear to me whether it has noticeable macroscopic effects.
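To make that second asymmetry concrete (standard textbook notation, not something from the original exchange): in a typical Hamiltonian

$$H = \frac{p^2}{2m} + V(x), \qquad V(x) = \sum_{n=0}^{\infty} c_n x^n,$$

momentum enters only through the quadratic kinetic term, while position can enter through an arbitrary power series in the potential.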

The (not so) paradoxical asymmetry between position and momentum

That sounds right to me, and I agree that this is sometimes explained badly.

Are you saying that this explains the perceived asymmetry between position and momentum? I don't see how that's the case; you could say exactly the same thing from the dual perspective (to get a precise momentum state, you need to "sum up" lots of different position eigenstates).
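As a quick numerical illustration of this duality (the discretization and normalization are my own choices, not anything from the thread): under the discrete Fourier transform, a state perfectly localized in position is an equal-weight superposition of all momentum eigenstates, and vice versa.

```python
import numpy as np

n = 64

# A single position eigenstate: perfectly localized in position space.
position_eigenstate = np.zeros(n, dtype=complex)
position_eigenstate[10] = 1.0

# Its momentum-space amplitudes are spread uniformly over all momenta.
momentum_amplitudes = np.fft.fft(position_eigenstate, norm="ortho")
print(np.allclose(np.abs(momentum_amplitudes), 1 / np.sqrt(n)))  # True

# The same statement with the roles swapped: a single momentum eigenstate
# "sums up" all position eigenstates with equal weight.
momentum_eigenstate = np.zeros(n, dtype=complex)
momentum_eigenstate[3] = 1.0

position_amplitudes = np.fft.ifft(momentum_eigenstate, norm="ortho")
print(np.allclose(np.abs(position_amplitudes), 1 / np.sqrt(n)))  # True
```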

If you were making a different point that went over my head, could you elaborate?

ejenner's Shortform

Gradient hacking is usually discussed in the context of deceptive alignment. This is probably where it has the largest relevance to AI safety, but if we want to better understand gradient hacking, it could be useful to take a broader perspective and study it on its own (even if, in the end, we only care about gradient hacking because of its inner alignment implications). In the most general setting, gradient hacking could be seen as a way for the agent to "edit its source code", though probably only in a very limited way. I think it's an interesting question which kinds of edits are possible with gradient hacking, for example whether an agent could improve its capabilities this way.
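A highly stylized sketch of the "edit its source code" framing (entirely my own toy setup, which simply assumes the agent could somehow couple the training loss to a parameter it wants to change): if the loss includes a term like (p - p_target)^2, gradient descent itself carries out the edit.

```python
import torch

# Hypothetical "protected" parameter the agent wants to rewrite.
p = torch.tensor(0.0, requires_grad=True)
p_target = 3.0  # the value the agent "wants" p to take

opt = torch.optim.SGD([p], lr=0.1)
for _ in range(200):
    opt.zero_grad()
    # Agent-chosen coupling: deliberately performing worse the further
    # p is from p_target, so that SGD pushes p toward p_target.
    loss = (p - p_target) ** 2
    loss.backward()
    opt.step()

print(p.item())  # ~3.0: the "edit" was performed by gradient descent
```

Whether a learned algorithm could actually implement such a coupling is exactly the open question.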

Deceptive Alignment

I'm wondering if regularization techniques could be used to make the pure deception regime unstable.

As a simple example, consider a neural network that is trained with gradient descent and weight decay. If the parameters can be (approximately) split into a set that determines the mesa-objective and a set for everything else, then the gradient of the loss with respect to the "objective parameters" would be zero in the pure deception regime (since in that regime, the model's output no longer depends on its mesa-objective), so weight decay would ensure that the mesa-objective couldn't be maintained.
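A minimal sketch of this mechanism (the explicit parameter split and all names here are hypothetical, chosen purely for illustration): when the loss is independent of the objective parameters, weight decay shrinks them multiplicatively at every step.

```python
import torch

# Hypothetical split: `objective_params` encode the mesa-objective,
# `other_params` do everything else.
objective_params = torch.randn(4, requires_grad=True)
other_params = torch.randn(4, requires_grad=True)

opt = torch.optim.SGD([objective_params, other_params],
                      lr=0.1, weight_decay=0.1)

x = torch.randn(4)
x = x / x.norm()  # keep the toy regression well-conditioned
y = torch.tensor(1.0)

initial_norm = objective_params.norm().item()
for _ in range(500):
    opt.zero_grad()
    # Pure deception regime: the output depends only on `other_params`.
    # The 0.0 * ... term keeps `objective_params` in the graph with an
    # exactly-zero gradient (otherwise the optimizer would skip them and
    # apply no weight decay).
    loss = (other_params @ x - y) ** 2 + 0.0 * objective_params.sum()
    loss.backward()
    opt.step()

# The mesa-objective decays toward zero even though the task loss never
# "sees" it.
print(initial_norm, objective_params.norm().item())
```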

The learned algorithm might be able to prevent this by "hacking" its gradient as mentioned in the post, i.e. making the parameters that determine the mesa-objective also have an effect on the model's output. But intuitively, this should at least make it more difficult to reach a stable pure deception regime.

Of course, regularization is a double-edged sword because, as has been pointed out, the shortest algorithms that perform well on the base objective are probably not robustly aligned.

Using vector fields to visualise preferences and make them consistent
When a vector field has no “curl” [...], the vector field can be thought of as the gradient of a scalar field.

In case you weren't aware, this is no longer true if the state space has "holes" (formally: if its first de Rham cohomology group is non-zero). For example, if the state space is the Euclidean plane without the origin, you can have a vector field on that space which has no curl but isn't conservative (and thus is not the gradient of any utility function).

Why this might be relevant:

1. Maybe state spaces with holes actually occur, in which case removing the curl of the PVF wouldn't always be sufficient to get a utility function.

2. The fact that zero curl only captures the concept of transitivity for certain state spaces could be a hint that conservative vector fields are a better concept to think about here than irrotational ones (even if it turns out that we only care about simply connected state spaces in practice).

EDIT: an example of an irrotational 2D vector field which is not conservative is F(x, y) = (-y/(x^2 + y^2), x/(x^2 + y^2)), defined on the plane without the origin. Its curl is zero everywhere, but its line integral around the unit circle is 2*pi, so it cannot be the gradient of any scalar field.
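For completeness, a short symbolic check of that example (a sympy-based verification of my own, not anything from the original comment):

```python
import sympy as sp

x, y, t = sp.symbols('x y t', real=True)

# The classic counterexample on the punctured plane R^2 \ {(0, 0)}.
F1 = -y / (x**2 + y**2)
F2 = x / (x**2 + y**2)

# The (scalar) curl dF2/dx - dF1/dy vanishes wherever the field is defined...
print(sp.simplify(sp.diff(F2, x) - sp.diff(F1, y)))  # 0

# ...but the circulation around the unit circle is 2*pi, so the field
# cannot be the gradient of any single-valued scalar field.
circle = {x: sp.cos(t), y: sp.sin(t)}
integrand = (F1.subs(circle) * sp.diff(sp.cos(t), t)
             + F2.subs(circle) * sp.diff(sp.sin(t), t))
print(sp.integrate(sp.simplify(integrand), (t, 0, 2 * sp.pi)))  # 2*pi
```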