Alignment Stream of Thought

In the limit of infinite SAE width and infinite (iid) training data, you can get perfect reconstruction and perfect sparsity (both L0 and L1). We can think of this as maximal feature splitting. Obviously, this is undesirable, because you've discarded all of the structure present in your data.
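The maximal-feature-splitting limit can be made concrete with a toy sketch: give the dictionary one latent per training point, so every input is reconstructed exactly with L0 = 1. This is a NumPy illustration of the limiting argument, not anything about a real SAE; all names here are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 16))  # 100 "training" points in a 16-dim activation space

# Maximal feature splitting: one decoder direction per data point,
# namely the (unit-normalized) data point itself.
norms = np.linalg.norm(X, axis=-1, keepdims=True)
decoder = X / norms  # shape (100, 16): latent i reconstructs point i

# Encoding: each point activates only its own dedicated latent,
# with magnitude equal to the point's norm.
codes = np.zeros((100, 100))
codes[np.arange(100), np.arange(100)] = norms[:, 0]

recon = codes @ decoder
assert np.allclose(recon, X)                 # perfect reconstruction
assert (codes != 0).sum(axis=-1).max() == 1  # L0 = 1 for every point
```

The reconstruction is exact and each code is 1-sparse, but the "dictionary" is just a lookup table of the training data: it has memorized the points rather than found any shared structure.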

Therefore, reconstruction and sparsity aren't exactly the things we most fundamentally care about. They just happen to produce something reasonable at practical scales. However, that doesn't mean we have to throw them out - we might hope that they give us enough of a foothold in practice.

In particular, the maximal feature splitting case requires exponentially many latents. We might believe that in practice, on the spectrum from splitting too little (polysemanticity) to splitting too much, erring on the side of splitting too much is preferable, because we can still do circuit finding and so on if we artificially cut some existing features into smaller pieces.

For the dashboards, did you filter out the features that fire less frequently? I looked through a few and didn't notice any super low density ones.

For your dashboards, how many tokens are you retrieving the top examples from?

Why do you scale your MSE by 1/(x_centred**2).sum(dim=-1, keepdim=True).sqrt() ? In particular, I'm confused about why you have the square root. Shouldn't it just be 1/(x_centred**2).sum(dim=-1, keepdim=True)?
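The distinction the question is pointing at can be shown numerically. Dividing by the sum of squares (no square root) gives a dimensionless "fraction of variance unexplained" that is invariant to rescaling the data, while dividing by its square root leaves a quantity that still scales with the data. This sketch uses NumPy and a stand-in reconstruction; none of it reflects the actual code being asked about.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))
x_centred = x - x.mean(axis=0)  # hypothetical centring, as in the question
recon = x_centred * 0.9         # stand-in for an SAE reconstruction

sq_err = ((recon - x_centred) ** 2).sum(axis=-1, keepdims=True)
norm_sq = (x_centred ** 2).sum(axis=-1, keepdims=True)

mse_over_normsq = sq_err / norm_sq          # dimensionless ratio
mse_over_norm = sq_err / np.sqrt(norm_sq)   # still carries units of x

# Rescale the data by 10x: the /norm_sq version is unchanged,
# the /sqrt(norm_sq) version is 10x larger.
x2 = x_centred * 10
sq_err2 = ((x2 * 0.9 - x2) ** 2).sum(axis=-1, keepdims=True)
norm_sq2 = (x2 ** 2).sum(axis=-1, keepdims=True)
assert np.allclose(sq_err2 / norm_sq2, mse_over_normsq)            # scale-invariant
assert not np.allclose(sq_err2 / np.sqrt(norm_sq2), mse_over_norm)  # not invariant
```

So the choice of whether to take the square root determines whether the loss term is scale-invariant, which is presumably what the question is probing.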

hot take: if you find that your sentences can't be parsed reliably without brackets, that's a sign you should probably refactor your writing to be clearer

a tentative model of ambitious research projects

when you do a big research project, you have some amount of risk you can work with - maybe you're trying to do something incremental, so you can only tolerate a 10% chance of failure, or maybe you're trying to shoot for the moon and so you can accept a 90% chance of failure.

budgeting for risk is non-negotiable because there are a lot of places where risk can creep in - and if there aren't, then you're not really doing research. most obviously, your direction might just be a dead end. but there are also other things that might go wrong: the code might end up too difficult to implement, or it might run too slowly, or you might fail to fix a solvable-in-principle problem that comes up.

I claim that one of the principal components of being a good researcher is being able to eliminate as much unnecessary risk as possible, so you can spend your entire risk budget on the important bets.

for example, if you're an extremely competent engineer, when brainstorming experiments you don't have to think much about the risk that you fail to implement them. you know that even if you don't think through all the contingencies that might pop up, you can figure it out, because you have a track record of figuring it out. you can say the words "and if that happens we'll just scale it up" without spending much risk because you know full well that you can actually execute on it. a less competent engineer would have to pay a much greater risk cost, and correspondingly have to reduce the ambitiousness of the research bets (or else, take on way more risk than intended).

not all research bets are created equal, either. the space of possible research bets is vast, and most of them are wrong. but if you have very good research taste, you can much more reliably tell whether a bet is likely to work out. even the best researchers can't just look at a direction and know for sure if it will work, but if you get a good direction 10% of the time you can do a lot more than if your direction is only good 0.1% of the time.

finally, if you know and trust someone to be reliable at executing on their area of expertise, you can delegate things that fall in their domain to them. in practice, this can be quite tough and introduce risk unless they have a very legible track record, or you are sufficiently competent in their domain yourself to tell if they're likely to succeed. and if you're sufficiently competent to do the job of any of your reports (even if less efficiently), then you can budget less risk here, knowing that even if someone drops the ball you could always pick it up yourself.

Yeah, this seems like a good idea for reading - it lets you get the best of both worlds. Though it works for reading mostly because it doesn't take that much longer to do. This doesn't translate as directly to e.g. what to do when debugging code or running experiments.

"larger models exploit the RM more" is in contradiction with what i observed in the RM overoptimization paper. i'd be interested in more analysis of this

i've noticed a life hyperparameter that affects learning quite substantially. i'd summarize it as "willingness to gloss over things that you're confused about when learning something". as an example, suppose you're modifying some code and it seems to work but also you see a warning from an unrelated part of the code that you didn't expect. you could either try to understand exactly why it happened, or just sort of ignore it.

reasons to set it low:

  • each time your world model is confused, that's an opportunity to get a little bit of signal to improve your world model. if you ignore these signals you increase the length of your feedback loop, and make it take longer to recover from incorrect models of the world.
  • in some domains, it's very common for unexpected results to actually be a hint at a much bigger problem. for example, many bugs in ML experiments cause results that are only slightly weird, but if you tug on the thread of understanding why your results are slightly weird, this can cause lots of your experiments to unravel. and doing so earlier rather than later can save a huge amount of time
  • understanding things at least one level of abstraction down often lets you do things more effectively. otherwise, you have to constantly maintain a bunch of uncertainty about what will happen when you do any particular thing, and have a harder time thinking of creative solutions

reasons to set it high:

  • it's easy to waste a lot of time trying to understand relatively minor things, instead of understanding the big picture. often, it's more important to 80-20 by understanding the big picture, and you can fill in the details when it becomes important to do so (which often is only necessary in rare cases).
  • in some domains, we have no fucking idea why anything happens, so you have to be able to accept that we don't know why things happen to be able to make progress
  • often, if e.g. you don't quite get a claim that a paper is making, you could resolve your confusion just by reading a bit ahead. if you always try to fully understand everything before moving on, you'll find it very easy to get stuck before actually making it to the main point the paper is making

there are very different optimal configurations for different kinds of domains. maybe the right approach is to be aware that this is an important hyperparameter, and occasionally try going down some rabbit holes to see how much value it provides

more importantly, both i and the other person get more out of the conversation. almost always, there are subtle misunderstandings and the rest of the conversation would otherwise involve a lot of talking past each other. you can only really make progress when you're actually engaging with the other person's true beliefs, rather than a misunderstanding of their beliefs.
