Wiki Contributions


ELK prize results

I didn't see the proposals, but I think that almost all of the difficulty will be in how you can tell good from bad reporters by looking at them. If you have a precise enough description of how to do that, you can also use it as a regularizer. So the post hoc vs. a priori distinction you mention sounds more like a framing difference to me than like two fundamentally different categories. I'd guess that whether a proposal is promising depends mostly on how it tries to distinguish between the good and bad reporter, not on whether it does so via regularization or via selection after training (since you can translate back and forth between those anyway).

(Though as a side note, I'd usually expect regularization to be much more efficient in practice, since if your training process has a bias towards the bad reporter, it might be hard to get any good reporters at all.)

If I'm mistaken, I'd be very interested to hear an example of a strategy that fundamentally only works once you have multiple trained models, rather than as a regularizer!

Inferring utility functions from locally non-transitive preferences

I enjoyed reading this! And I hadn't seen the interpretation of a logistic preference model as approximating Gaussian errors before.
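To make that interpretation concrete, here is a small sketch (my own illustration, with an illustrative scaling constant) of how closely the logistic choice model P(a ≻ b) = sigmoid(u(a) − u(b)) tracks a Gaussian-error (probit) model P(a ≻ b) = Φ(u(a) − u(b)), once the logistic input is scaled by the standard factor of about 1.702:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def probit(x):
    # Standard normal CDF via the error function.
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# Compare the scaled logistic to the normal CDF over a wide range of
# utility differences; the maximum gap is known to be under 0.01.
max_gap = max(abs(sigmoid(1.702 * x) - probit(x))
              for x in [i / 100 for i in range(-400, 401)])
print(max_gap < 0.01)  # True
```

So treating logistic preferences as approximating Gaussian comparison noise is accurate to within about one percentage point of probability.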

Since you seem interested in exploring this more, some comments that might be helpful (or not):

  • What is the largest number of elements we can sort with a given architecture? How does training time change as a function of the number of elements?
  • How does the network architecture affect the resulting utility function? How do the maximum and minimum of the unnormalized utility function change?

I'm confused why you're using a neural network; given the small size of the input space, wouldn't it be easier to just learn a tabular utility function (i.e. one value for each input, namely its utility)? It's the largest function space you can have but will presumably also be much easier to train than a NN.
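To illustrate what I mean by the tabular approach, here is a minimal sketch (toy sizes and learning rate of my choosing, not from the post): one utility parameter per input, fit by gradient ascent on the Bradley-Terry (logistic) preference log-likelihood.

```python
import math, random

random.seed(0)
n = 10
true_u = [random.gauss(0, 1) for _ in range(n)]

# Noiseless preference data: (winner, loser) for every pair.
pairs = [(i, j) if true_u[i] > true_u[j] else (j, i)
         for i in range(n) for j in range(i + 1, n)]

u = [0.0] * n  # tabular "model": just one parameter per element
lr = 0.5
for _ in range(500):
    for w, l in pairs:
        p = 1.0 / (1.0 + math.exp(-(u[w] - u[l])))  # P(winner beats loser)
        g = 1.0 - p                                  # log-likelihood gradient
        u[w] += lr * g
        u[l] -= lr * g

# The fitted table recovers the true ordering.
order_true = sorted(range(n), key=lambda i: true_u[i])
order_fit = sorted(range(n), key=lambda i: u[i])
print(order_true == order_fit)  # True
```

No architecture choices, and each parameter directly is the utility of one input.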

Questions like the ones you raise could become more interesting in settings with much more complicated inputs. But I think in practice, the expensive part of preference/reward learning is gathering the preferences, and the most likely failure modes revolve around things related to training an RL policy in parallel to the reward model. The architecture etc. seem a bit less crucial in comparison.

Which portion of possible comparisons needs to be presented (on average) to infer the utility function?

I thought about this and very similar questions a bit for my Master's thesis before changing topics; happy to chat about that if you want to go down this route. (Though I didn't think about inconsistent preferences, just about the effect of noise. Without either, the answer should just be on the order of n log n out of the n(n − 1)/2 possible comparisons, I guess.)
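A quick empirical sketch of that shrinking fraction (my own toy check, counting the comparisons Python's sort actually makes on transitive, noiseless preferences):

```python
import functools, random

def fraction_needed(n, seed=0):
    """Fraction of all n*(n-1)/2 pairwise comparisons a sort actually uses."""
    random.seed(seed)
    items = list(range(n))
    random.shuffle(items)
    count = 0

    def cmp(a, b):
        nonlocal count
        count += 1
        return a - b

    sorted(items, key=functools.cmp_to_key(cmp))
    return count / (n * (n - 1) / 2)

# The required fraction of comparisons drops quickly as n grows.
print(fraction_needed(16))
print(fraction_needed(256))
print(fraction_needed(256) < fraction_needed(16))  # True
```

Since a comparison sort needs about n log2 n comparisons while n(n − 1)/2 pairs exist, the needed fraction scales roughly like log(n)/n.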

How far can we degenerate a preference ordering until no consistent utility function can be inferred anymore?

You might want to think more about how to measure this, or even what exactly it would mean if "no consistent utility function can be inferred". In principle, for any (not necessarily transitive) set of preferences, we can ask what utility function best approximates these preferences (e.g. in the sense of minimizing loss). The approximation can be exact iff the preferences are consistent. Intuitively, slightly inconsistent preferences lead to a reasonably good approximation, and very inconsistent preferences probably admit only very bad approximations. But there doesn't seem to be any point where we can't infer the best possible approximation at all.
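A toy illustration of that point (mine, not from the post): even for a maximally intransitive preference set, a rock-paper-scissors cycle A ≻ B, B ≻ C, C ≻ A, minimizing the logistic loss still yields a well-defined best utility function; it just can't fit the data exactly.

```python
import math

prefs = [(0, 1), (1, 2), (2, 0)]  # (winner, loser) pairs forming a cycle

def loss(u):
    # Negative log-likelihood under the logistic preference model.
    return -sum(math.log(1.0 / (1.0 + math.exp(-(u[w] - u[l]))))
                for w, l in prefs)

# By symmetry (and convexity of the loss), the minimizer assigns equal
# utility to all three items, leaving probability 1/2 on every pair,
# for a loss of 3 * log(2).
flat = [0.0, 0.0, 0.0]
print(abs(loss(flat) - 3 * math.log(2)) < 1e-9)  # True
# Trying to break the tie helps on two pairs but hurts more on the third:
print(loss([1.0, 0.0, -1.0]) > loss(flat))      # True
```

So "no consistent utility function exists" just means the best approximation has nonzero loss, not that inference breaks down.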

Related to this (but a bit more vague/speculative): it's not obvious to me that approximating inconsistent preferences using a utility function is the "right" thing to do. At least in cases where human preferences are highly inconsistent, this seems kind of scary. Not sure what we want instead (maybe the AI should point out inconsistencies and ask us to please resolve them?).

The Unreasonable Feasibility Of Playing Chess Under The Influence

Performance deteriorating implies that the prior p is not yet a fixed point of p*=D(A(p*)).

At least in the case of AlphaZero, isn't the performance deterioration from A(p*) to p*? I.e. A(p*) is full AlphaZero, while p* is the "Raw Network" in the figure. We could have converged to the fixed point of the training process (i.e. p*=D(A(p*))) and still have performance deterioration if we use the unamplified model compared to the amplified one. I don't see a fundamental reason why p* = A(p*) should hold after convergence (and I would have been surprised if it held for e.g. chess or Go and reasonably sized models for p*).

The (not so) paradoxical asymmetry between position and momentum

Interesting thoughts re anthropic explanations, thanks!

I agree that asymmetry doesn't tell us which one is more fundamental, and I wasn't aiming to argue for either one being more fundamental (though position does feel more fundamental to me, and that may have shown through). What I was trying to say was only that they are asymmetric on a cognitive level, in the sense that they don't feel interchangeable, and that there must therefore be some physical asymmetry.

Still, I should have been more specific than saying "asymmetric", because not every kind of asymmetry in the Hamiltonian can explain the cognitive asymmetry. For the "forces decay with distance in position space" asymmetry, I think it's reasonably clear why this leads to cognitive asymmetry, but for the "position occurs as an infinite power series" asymmetry, it's not clear to me whether this has noticeable macro effects.

The (not so) paradoxical asymmetry between position and momentum

That sounds right to me, and I agree that this is sometimes explained badly.

Are you saying that this explains the perceived asymmetry between position and momentum? I don't see how that's the case; you could say exactly the same thing in the dual perspective (to get a precise momentum, you need to "sum up" lots of different position eigenstates).

If you were making a different point that went over my head, could you elaborate?

ejenner's Shortform

Gradient hacking is usually discussed in the context of deceptive alignment. This is probably where it has the greatest relevance to AI safety, but if we want to better understand gradient hacking, it could be useful to take a broader perspective and study it on its own (even if, in the end, we only care about gradient hacking because of its inner alignment implications). In the most general setting, gradient hacking could be seen as a way for the agent to "edit its source code", though probably only in a very limited way. I think it's an interesting question which kinds of edits are possible with gradient hacking, for example whether an agent could improve its capabilities this way.

Deceptive Alignment

I'm wondering if regularization techniques could be used to make the pure deception regime unstable.

As a simple example, consider a neural network that is trained with gradient descent and weight decay. If the parameters can be (approximately) split into a set that determines the mesa-objective and a set for everything else, then the gradient of the loss with respect to the "objective parameters" would be zero in the pure deception regime, so weight decay would ensure that the mesa-objective couldn't be maintained.
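A bare-bones sketch of that mechanism (toy numbers, not a real mesa-optimizer): if the loss gradient with respect to some "objective parameters" is exactly zero, as in the pure deception regime, weight decay shrinks them geometrically toward zero.

```python
wd = 0.01          # weight decay coefficient
objective_param = 1.0

for step in range(1000):
    loss_grad = 0.0                  # pure deception: the output no longer
                                     # depends on the mesa-objective params
    objective_param -= loss_grad     # the gradient step does nothing
    objective_param *= (1.0 - wd)    # weight decay still applies every step

# After 1000 steps, 0.99**1000 has shrunk the parameter to ~4e-5,
# so the mesa-objective can't be maintained.
print(objective_param)
```

The decay is exponential in the number of steps, so the mesa-objective would be erased quickly unless something couples those parameters back into the output.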

The learned algorithm might be able to prevent this by "hacking" its gradient as mentioned in the post, making the parameters that determine the mesa-objective also have an effect on its output. But intuitively, this should at least make it more difficult to reach a stable pure deception regime.

Of course, regularization is a double-edged sword: as has been pointed out, the shortest algorithms that perform well on the base objective are probably not robustly aligned.

Using vector fields to visualise preferences and make them consistent
When a vector field has no “curl” [...], the vector field can be thought of as the gradient of a scalar field.

In case you weren't aware, this is no longer true if the state space has "holes" (formally: if its first cohomology group is non-zero). For example, if the state space is the Euclidean plane without the origin, you can have a vector field on that space which has no curl but isn't conservative (and thus is not the gradient of any utility function).

Why this might be relevant:

1. Maybe state spaces with holes actually occur, in which case removing the curl of the PVF wouldn't always be sufficient to get a utility function

2. The fact that zero curl only captures the concept of transitivity for certain state spaces could be a hint that conservative vector fields are a better concept to think about here than irrotational ones (even if it turns out that we only care about simply connected state spaces in practice)

EDIT: an example of an irrotational 2D vector field which is not conservative is F(x, y) = (−y / (x² + y²), x / (x² + y²)), defined for (x, y) ≠ (0, 0).
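A quick numeric sanity check (my own illustration) using the standard counterexample on the punctured plane, F(x, y) = (−y, x) / (x² + y²): it has zero curl everywhere it's defined, yet its line integral around the unit circle, which encloses the "hole" at the origin, is 2π rather than 0, so no utility function can have F as its gradient.

```python
import math

def F(x, y):
    r2 = x * x + y * y
    return (-y / r2, x / r2)

# Approximate the line integral of F around the unit circle
# as a sum of F(p_k) . (p_{k+1} - p_k) over short segments.
N = 10000
integral = 0.0
for k in range(N):
    t0, t1 = 2 * math.pi * k / N, 2 * math.pi * (k + 1) / N
    x, y = math.cos(t0), math.sin(t0)
    dx, dy = math.cos(t1) - x, math.sin(t1) - y
    fx, fy = F(x, y)
    integral += fx * dx + fy * dy

print(abs(integral - 2 * math.pi) < 1e-3)  # True
```

A conservative field would integrate to exactly 0 around any closed loop, so the nonzero result certifies that F is not the gradient of any function on the punctured plane.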