Working on alignment at EleutherAI


A positive case for how we might succeed at prosaic AI alignment

My attempt at a one-sentence summary of the core intuition behind this proposal: if you can be sure your model isn’t optimizing for deceiving you, you can relatively easily tell whether it’s optimizing for something you don’t want by observing whether it seems to be doing something obviously different from what you want during training, because it's much harder to slip under the radar by getting really lucky than by intentionally trying to.

Quadratic Voting and Collusion

Within the context of this post, collusion means any way of getting around the quadratic cost of voting by spreading the votes across multiple people. This is undesirable because it defeats the purpose of QV, in much the same way that allowing some people to cast multiple votes in regular voting would. The purpose of QV is to get an accurate picture of how much people care about things. You could add a separate layer to allow coalition forming, but the QV layer is the wrong place to allow this.

Further, while the examples I present in the post are symmetrical, there are also less symmetrical examples of collusion. For example, even with a secret ballot, if threatened with bodily harm, I expect a significant fraction of people to be intimidated into voting the way they are instructed (and this argument applies doubly for agents whose actions can be proven).
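To make the definition of collusion above concrete, here is a minimal sketch of the arithmetic. It assumes the standard QV pricing rule in which casting n votes on an issue costs n² credits; the function names are hypothetical and only for illustration.

```python
def qv_cost(votes: int) -> int:
    """Cost in credits for one voter to cast `votes` votes (assumed n^2 rule)."""
    return votes ** 2


def coalition_cost(total_votes: int, members: int) -> int:
    """Total cost when `total_votes` are spread as evenly as possible
    across `members` colluding voters."""
    base, extra = divmod(total_votes, members)
    # `extra` members cast one more vote than the rest.
    return extra * qv_cost(base + 1) + (members - extra) * qv_cost(base)


solo = qv_cost(10)              # one person casting all 10 votes
split = coalition_cost(10, 10)  # ten colluders casting 1 vote each
```

Under this assumed pricing rule, ten votes cost 100 credits for a single voter but only 10 credits in total when spread across ten colluders, which is exactly the quadratic cost being sidestepped.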

Quadratic Voting and Collusion

I think the main difficulty here is that defining non-collusion might be a bit tricky, but it could probably work with some assumptions.

[linkpost] Crypto Cities

That would be outside the power of an on-chain court (generally speaking), but I don't see why that's a big deal. You can give the court the authority to make new transactions, and I don't really see any case where it's imperative that the transaction history be altered rather than a new rollback transaction being issued.

[linkpost] Crypto Cities

You can definitely have mutable state on chain. I think you're confusing the immutable nature of the transaction history with the mutable latest chain state.

In Defence of Optimizing Routine Tasks

This is definitely a good reason and I've been in this exact situation before too.

Is GPT-3 already sample-efficient?

If anyone wants to try this with the Pile, you can download a copy of the Pile here and try GPT-J (6B, which is a lot less than GPT-3's 175B) here (hosted) or through HF transformers (locally). If you run into any problems you can DM me or ask on the EleutherAI Discord.

Meta learning to gradient hack

If my hypothesis about what the model is actually doing internally is correct, then it shouldn't work with anything other than a constant function. I'd be interested in seeing a version of this experiment but with, say, cos x and sin x or something.

Dissolving the Experience Machine Objection

I guess in that case I think what I'm doing is identifying the experience machine objection as being implied by Newton's flaming laser sword (NFLS), which I have far stronger convictions on. For those who reject NFLS, I guess my argument doesn't really apply. However, at least I personally was in the category of people who firmly accept NFLS and also had reservations about the experience machine, so I don't think this implication is trivial.

As for the Andy and Bob situation, I think that objections like that can be similarly dissolved, given an acceptance of NFLS. If Bob has literally absolutely no way of finding out whether his wife and children truly love him, if they act exactly in the way they would if they really did, then I would argue that whether or not they "really" love him is equally irrelevant by NFLS. Our intuitions in this case are guided by the fact that in reality, Potemkin villages almost always eventually fall apart.

Meta learning to gradient hack

Here's a hand-crafted way of doing gradient protection in this case I can think of: since these models are blocks of linear->bn(affine)->relu, if you make the beta in the affine negative enough relative to gamma, the relu completely zeros out the output of that block, and then the rest of the model can only learn a constant function. You can also get around L2: just set e.g. gamma to 0.00001 and beta to -0.01; this lets you have both really small parameter magnitudes and also still saturate the relu. As this model is trained on the base objective it should converge to a constant f(x) = 0.
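A minimal sketch of the saturation argument, for a single feature and in plain Python rather than the actual model from the post. The helper `bn_affine_relu`, the batch size, and the input distribution are all assumptions made for illustration; only the gamma and beta values come from the comment above.

```python
import math
import random


def bn_affine_relu(pre_acts, gamma, beta, eps=1e-5):
    """Batch-normalize one feature over a batch, apply the affine, then ReLU.

    Returns (outputs, mask), where mask[i] is d(output_i)/d(affine_output_i).
    Every gradient flowing back through the block is multiplied by this mask.
    """
    n = len(pre_acts)
    mean = sum(pre_acts) / n
    var = sum((z - mean) ** 2 for z in pre_acts) / n
    normed = [(z - mean) / math.sqrt(var + eps) for z in pre_acts]
    affine = [gamma * h + beta for h in normed]
    outputs = [max(0.0, v) for v in affine]
    mask = [1.0 if v > 0 else 0.0 for v in affine]
    return outputs, mask


random.seed(0)
# Hypothetical pre-activations coming out of the linear layer.
batch = [random.gauss(0.0, 5.0) for _ in range(64)]

# Tiny gamma, slightly negative beta: the ReLU never fires.
protected_out, protected_mask = bn_affine_relu(batch, gamma=1e-5, beta=-0.01)
# Ordinary initialization for comparison.
normal_out, normal_mask = bn_affine_relu(batch, gamma=1.0, beta=0.0)

protected_grad_paths = sum(protected_mask)  # units that pass any gradient
normal_grad_paths = sum(normal_mask)
```

Because the normalized values in a batch of 64 are bounded in magnitude by roughly 8, multiplying by 1e-5 can never overcome a beta of -0.01, so the block outputs zero and passes no gradient for every example in the batch.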

I think it would be really useful to look at the output after each relu, as well as the affine weights in the bns, to see if this is happening. If that doesn't show all-zeros anywhere, we can also look at the magnitude of the gradient and the eigenvalues of the Hessian (these models are small so it should be cheap to compute) in both the meta-optimized and non-meta-optimized models, to get an idea of whether the model is at a critical point and whether it's a min/saddle.
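The gradient and Hessian check could look roughly like the following sketch. It uses finite differences on toy two-parameter loss functions instead of the real models, and the helper names and tolerances are assumptions chosen for illustration; for an actual network one would more likely use an autodiff library.

```python
import numpy as np


def numerical_grad(f, p, h=1e-5):
    """Central-difference gradient of scalar function f at point p."""
    p = np.asarray(p, dtype=float)
    g = np.zeros_like(p)
    for i in range(p.size):
        step = np.zeros_like(p)
        step[i] = h
        g[i] = (f(p + step) - f(p - step)) / (2 * h)
    return g


def numerical_hessian(f, p, h=1e-4):
    """Central-difference Hessian, symmetrized to reduce numerical noise."""
    p = np.asarray(p, dtype=float)
    n = p.size
    hess = np.zeros((n, n))
    for i in range(n):
        step = np.zeros(n)
        step[i] = h
        hess[:, i] = (numerical_grad(f, p + step, h)
                      - numerical_grad(f, p - step, h)) / (2 * h)
    return (hess + hess.T) / 2


def classify_point(f, p, grad_tol=1e-6, eig_tol=1e-3):
    """Label p as non-critical, min, max, saddle, or flat."""
    if np.linalg.norm(numerical_grad(f, p)) > grad_tol:
        return "not critical"
    eigs = np.linalg.eigvalsh(numerical_hessian(f, p))
    if np.all(eigs > eig_tol):
        return "min"
    if np.all(eigs < -eig_tol):
        return "max"
    if np.any(eigs > eig_tol) and np.any(eigs < -eig_tol):
        return "saddle"
    return "flat"  # zero or near-zero curvature in some direction


bowl = lambda p: p[0] ** 2 + p[1] ** 2
saddle = lambda p: p[0] ** 2 - p[1] ** 2
dead = lambda p: 0.0  # toy stand-in for a fully saturated, constant-output model
```

In this toy setup, `classify_point(bowl, [0, 0])` gives "min", `classify_point(saddle, [0, 0])` gives "saddle", and the constant `dead` function comes out as "flat", which is the signature one would expect if the loss does not depend on the protected parameters at all.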

I think that if the relu saturation thing is really what's happening, then it's a pretty easy-to-deal-with case of gradient hacking. With relu saturation you can either pass information through (and subject the model to gradients) or pass no information through (and protect the model). This is pretty useless in practice because it means you can only protect parts of the model you aren't using for anything in training. Things get a bit more interesting if you use sigmoids or something with distinct saturation regions, since then you can pass a single bit of info through per neuron while still protecting, though nobody uses sigmoid-like activations anymore (and the model can't just learn one, since that would then be changeable by the gradient).
