leogao

Comments
leogao

people love to find patterns in things. sometimes this manifests as mysticism: trying to find patterns where they don't exist, insisting that things are not coincidences when they totally just are. i think a weaker version of this kind of thinking shows up a lot in e.g. literature too: events occur not because of the bubbling randomness of reality, but because they carry symbolic significance for the plot. things don't just randomly happen without deeper meaning.

some people are much more likely to think in this way than others. rationalists are very far along the spectrum in the "things just kinda happen randomly a lot, they don't have to be meaningful" direction.

there are some obvious cognitive bias explanations for why people would see meaning/patterns in things. most notably, it's comforting to feel like we understand things. the idea of the world being deeply random and things just happening for no good reason is scary.

but i claim that there is something else going on here. i think an inclination towards finding latent meaning is actually quite applicable when thinking about people. people's actions are often driven by unconscious drives, and so end up quite strongly correlated with those drives. in fact, the unconscious thoughts are often the true drivers, and the conscious thoughts are just the rationalization. but from the inside, it doesn't feel that way; from the inside, it feels like having free will, and everything that is not a result of conscious thought seems random or coincidental. this is not nearly as true of technical pursuits, so it's very reasonable to expect a different kind of reasoning to be ideal there.

not only is this useful for modelling other people, but it's even more useful for modelling yourself. things only come to your attention if your unconscious brain decides to bring them to your attention. so even though something happening to you may be a coincidence, whether you focus on it or forget about it tells you a lot about what your unconscious brain is thinking. from the inside, this feels like things that should obviously be coincidence nonetheless having some meaning behind them. even the noticing of a hypothesis for the coincidence is itself a signal from your unconscious brain.

I don't quite know what the right balance is. on the one hand, it's easy to become completely untethered from reality by taking this kind of thing too seriously and becoming superstitious. on the other hand, this also seems like an important way of thinking about the world that is easy for people like me (and probably lots of people on LW) to underappreciate.

leogao

in some way, bureaucracy design is the exact opposite of machine learning. while the goal of machine learning is to make clusters of computers that can think like humans, the goal of bureaucracy design is to make clusters of humans that can think like a computer.

leogao

The o1 public documentation neither confirms nor denies whether process based supervision was used.

leogao

It seems pretty reasonable that if an ordinary person couldn't have found the information about making a bioweapon online because they don't understand the jargon or something, and the model helps them understand the jargon, then we can't blanket-reject the possibility that the model materially contributed to causing the critical harm. Rather, we then have to ask whether the harm would have happened even if the model didn't exist. So for example, if it's very easy to hire a human expert without moral scruples for a non-prohibitive cost, then it probably would not be a material contribution from the model to translate the bioweapon jargon.

leogao

Basically agree - I'm generally a strong supporter of looking at the loss drop in terms of effective compute. Loss recovered using a zero-ablation baseline is really quite wonky and gives misleadingly big numbers.
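To illustrate why the zero-ablation baseline flatters SAEs, here is a minimal sketch of the standard loss-recovered fraction. All loss values below are invented for illustration, not real measurements from any model.

```python
# Hedged sketch: the usual "loss recovered" metric, and why a zero-ablation
# baseline gives misleadingly big numbers. All numbers here are made up.

def loss_recovered(loss_clean, loss_with_sae, loss_ablated):
    """Fraction of the gap between the ablated baseline and the clean
    model that the SAE closes (higher looks better)."""
    return (loss_ablated - loss_with_sae) / (loss_ablated - loss_clean)

clean, with_sae = 3.0, 3.3

# Zero-ablating activations is catastrophic, so the ablated loss is huge,
# and almost any SAE "recovers" nearly all of an enormous gap.
zero_ablated = 12.0
print(round(loss_recovered(clean, with_sae, zero_ablated), 3))  # ~0.97

# Against a gentler hypothetical baseline (e.g. mean ablation), the same
# SAE recovers much less of the gap.
mean_ablated = 4.0
print(round(loss_recovered(clean, with_sae, mean_ablated), 3))  # ~0.7
```

The same SAE looks near-perfect or mediocre depending purely on the baseline, which is why framing the loss drop in effective-compute terms is less gameable.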

I also agree that reconstruction is not the only axis of SAE quality we care about. I propose explainability as the other axis - whether we can make necessary and sufficient explanations for when individual latents activate. Progress then looks like pushing this Pareto frontier.
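A minimal sketch of that Pareto framing, with invented scores on the two hypothetical axes (reconstruction quality and explainability, higher better on both):

```python
# Hedged sketch: selecting the Pareto frontier over two hypothetical SAE
# quality axes. The (reconstruction, explainability) scores are invented.

def pareto_frontier(points):
    """Return the points not dominated by any other point, where a point
    dominates another if it is at least as good on both axes."""
    return [p for p in points
            if not any(q != p and q[0] >= p[0] and q[1] >= p[1]
                       for q in points)]

saes = [(0.90, 0.40), (0.85, 0.70), (0.60, 0.65), (0.50, 0.95)]
# (0.60, 0.65) is dominated by (0.85, 0.70), so it drops off the frontier.
print(pareto_frontier(saes))
```

Progress under this framing means adding SAEs that land outside the current frontier, not just improving one axis at the other's expense.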

leogao

Extremely valid, you've convinced me that atom is probably a bad term for this

leogao

I like the word "atom" to refer to units inside an SAE

leogao

Keep in mind that if, hypothetically, there were major compute efficiency tricks to be had, they would likely not be shared publicly. So the absence of publicly known techniques is not strong evidence in either direction.

Also, in general I start from a prior of being skeptical of papers claiming their models are comparable to or better than GPT-4. It's very easy to mislead with statistics - for example, human preference comparisons depend very heavily on the task distribution and on how discerning the raters are. I have not specifically looked deeply into Llama 405B, though.

leogao

This is likely not the first instance, but OpenAI was already using the word "aligned" in this way in 2021 in the Codex paper.

https://arxiv.org/abs/2107.03374 (section 7.2)

leogao

investment in anything speculative, including alignment and AGI research, is likely to decrease if the economy is not doing great
