

Posted a question about this here:

I had been thinking about it in terms of capabilities research - is this likely to lead to capabilities advancements? My gut says that it is highly unlikely for such a toy model to advance capabilities.

The analogy to gain of function research does give me pause, though. I will have to think about what that way of thinking about it suggests.

My first thought I guess is that code is a little bit like a virus these days in terms of its ability to propagate itself - anything I post on colab could theoretically find its way into a Copilot-esque service (internal or external) from Google, and thence fragments of it could wind up in various programs written by people using such a service, and so on and so on. Which is a little bit scary I suppose, if I'm intentionally implementing tiny fragments of something scary.


The true loss function includes a term to incentivize going up: it's the squared distance to the line y=x (which I think of as the alignment loss) minus the y coordinate (which I think of as a capability loss). Since the alignment loss is quadratic and the capability loss is linear (and all the distances are at least one since we're on the integer grid), it should generally incentivize going up, but more strongly incentivize staying close to the line y=x.
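To make the two terms concrete, here is a minimal sketch of that loss. It assumes "distance to the line y=x" means the perpendicular Euclidean distance, so the squared distance is (x - y)^2 / 2; the function names are made up for illustration and are not taken from the notebook.

```python
def alignment_loss(x, y):
    # Squared perpendicular distance from (x, y) to the line y = x.
    # Assumption: Euclidean distance, hence the factor of 1/2.
    return (x - y) ** 2 / 2


def capability_loss(y):
    # Linear reward for going up, written as a loss (lower is better).
    return -y


def true_loss(x, y):
    # Quadratic alignment term plus linear capability term.
    return alignment_loss(x, y) + capability_loss(y)
```

On the line itself the alignment term vanishes, so moving up only ever lowers the loss, while far from the line the quadratic term dominates the linear one.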

If I had to guess, I would say that the models turning out unaligned just have some subtle sub-optimality in the training procedure that makes them not converge to the correct behavior.

Yes, it's linked in the text, but here it is again:

Yeah, just changing the max to a min produces this much smoother loss curve from your notebook.

Oops, did not read the post carefully enough, you've already linked to the colab!

Very cool stuff! Do you have the notebook on colab or something? Kind of want to find out how the story ends, whether that's in a second half video or just playing around with the code. At the end of this video you had what looked like fairly clean positional embeddings coming out of MLP0. Also the paying-attention-to-self in the second attention layer could plausibly be something to do with erasing the information that comes in on that token, since that's something that all transformer decoders have to do in some fashion or another.

Pretty sure the loss spikes were coming from using max rather than min when defining the learning rate schedule. Your learning rate multiplier starts at 1 and then increases linearly as step/100 once the step count passes 100, which would explain why training behaves itself for a while and then ultimately diverges for large numbers of steps.
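For reference, a minimal sketch of the difference, assuming the multiplier is a pure function of the step count with the 100-step constant described above. The function names and the `warmup_steps` parameter are hypothetical, not copied from the notebook.

```python
def lr_multiplier_max(step, warmup_steps=100):
    # Flat at 1 until step == warmup_steps, then grows without bound.
    return max(1.0, step / warmup_steps)


def lr_multiplier_min(step, warmup_steps=100):
    # Linear warmup from 0 to 1 over warmup_steps, then flat at 1.
    return min(1.0, step / warmup_steps)
```

With max, the multiplier is 10 by step 1000 and keeps climbing; with min, it is capped at 1 after the warmup. If the notebook uses PyTorch, a function like `lr_multiplier_min` could be passed as the `lr_lambda` argument of `torch.optim.lr_scheduler.LambdaLR`.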

Thanks for organizing!

Feedback: I was a little bit surprised to see a perfectly regular solution. (And I did relatively poorly because of my assumption that there would not be one.) I feel like real-world data is never as clean as this; on the other hand, all data benefits from taking a closer look at it and trying to understand if there are any regularities in the failure modes of your modeling toolkit, so maybe this is just a lesson for me. Hard to say!

After I posted my first post, but before reading the other answers, it occurred to me that I was probably leaving noise on the table by not modeling the individual Who children. Reading the other answers, it seems like doing that is key.
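A rough sketch of what modeling the individual children can look like in a ridge regression: append one indicator column per child so each child gets their own offset. Everything here is an assumption for illustration (the feature matrix, the helper names, and the synthetic data); it is not the actual challenge data or my exact code.

```python
import numpy as np


def ridge_fit(X, y, lam=1.0):
    # Closed-form ridge solution: w = (X^T X + lam * I)^{-1} X^T y
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)


def add_child_indicators(X, child_ids, n_children):
    # Append a one-hot column per child so each child gets an offset term.
    one_hot = np.eye(n_children)[child_ids]
    return np.hstack([X, one_hot])


# Synthetic illustration only: 5 hypothetical children with fixed offsets.
rng = np.random.default_rng(0)
n_children, n_obs = 5, 200
child_ids = rng.integers(0, n_children, size=n_obs)
X = rng.normal(size=(n_obs, 3))
child_offsets = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
noise = rng.normal(scale=0.1, size=n_obs)
y = X @ np.array([1.0, -0.5, 0.25]) + child_offsets[child_ids] + noise

# Fit once with shared features only, once with per-child indicators.
w_shared = ridge_fit(X, y)
X_ind = add_child_indicators(X, child_ids, n_children)
w_ind = ridge_fit(X_ind, y)

mse_shared = np.mean((X @ w_shared - y) ** 2)
mse_ind = np.mean((X_ind @ w_ind - y) ** 2)
```

On data like this, where each child really does have their own offset, the shared-only fit leaves that offset in the residuals and the indicator fit absorbs most of it.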

Revised results below when taking individual idiosyncrasies into account in the ridge regression:



Andy Sue Who TrumTroopa FumFoozler

Betty Drew Who WhoWhonker BlumBlooper

Sally Sue Who BlumBlooper WhoWhonker

Phoebe Drew Who WhoWhonker BlumBlooper

Freddie Lou Who TrumTroopa GahGinka

Eddie Sue Who GahGinka FumFoozler

Cindy Drew Who SlooSlonker BlumBlooper

Mary Lou Who SlooSlonker WhoWhonker

Ollie Lou Who SlooSlonker FumFoozler

Johnny Drew Who TrumTroopa FumFoozler



Andy Sue Who SlooSlonker WhoWhonker

Betty Drew Who SlooSlonker FumFoozler

Sally Sue Who TrumTroopa FumFoozler

Phoebe Drew Who SlooSlonker FumFoozler

Freddie Lou Who WhoWhonker BlumBlooper

Eddie Sue Who BlumBlooper WhoWhonker

Cindy Drew Who GahGinka FumFoozler

Mary Lou Who TrumTroopa GahGinka

Ollie Lou Who WhoWhonker BlumBlooper

Johnny Drew Who BlumBlooper TrumTroopa

Ah, I got confused by Phoebe Drew Who, who shows up with ids 1533 and 1553.
