David Johnston

Comments
Lessons from Studying Two-Hop Latent Reasoning
David Johnston · 2d

We did some related work: https://arxiv.org/pdf/2502.03490.

One of our findings was that with synthetic data, it was necessary to have e1->e2 as the first hop of some two-hop question and e2->e3 as the second hop of some two-hop question in order to learn e1->e3. This differs from your finding with "natural" facts: if e2->e3 is a "natural" fact, then it plausibly does appear as a second hop somewhere in the pretraining data. But you find generalization even when the synthetic e1->e2 is present only by itself, so there seems to be a further difference between natural facts and synthetic facts that appear as second hops.
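For concreteness, here is a minimal sketch of the data condition I mean (my own toy construction with invented entities and relations, not the pipeline from the linked paper): the target composition e1->e3 is held out, but e1->e2 is seen as the first hop of some training question and e2->e3 as the second hop of another.

```python
# Toy sketch of the co-occurrence condition described above.
# Entities and relations are invented for illustration only.

def compose(first, second):
    """Build a two-hop question from two (subject, relation, object) facts
    that share a bridge entity."""
    (s1, r1, o1), (s2, r2, o2) = first, second
    assert o1 == s2, "bridge entity must match"
    return (f"What is the {r2} of the {r1} of {s1}?", o2)

f_a = ("e1", "spouse_of", "e2")   # target first hop
f_b = ("e2", "born_in", "e3")     # target second hop
f_c = ("e2", "works_for", "e7")   # alternate second hop sharing the bridge e2
f_d = ("e4", "sibling_of", "e2")  # alternate first hop ending at the bridge e2

# Training: e1->e2 appears only as a first hop, e2->e3 only as a second hop.
train = [compose(f_a, f_c), compose(f_d, f_b)]

# Evaluation: the held-out composition e1->e3, which (per the finding above)
# was learnable only because both hops had been seen in these roles.
test = [compose(f_a, f_b)]

print(train)
print(test)
```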

We also found that learning synthetic two-hop reasoning seems to take about twice as many parameters (or twice as much "knowledge capacity") as learning only the one-hop questions from the same dataset, supporting the idea that, for transformers, learning to use a fact in either hop of a latent two-hop question requires something like learning that fact twice.

Did you try any experiments with a synthetic second hop instead of a synthetic first hop? It would be interesting to know whether "natural facts" can be composed flexibly with new facts or whether they can only be composed with new first hops. Our results suggest that there's a substantial cost to making facts latently composable, so I think it would be surprising if many facts were flexibly composable, especially facts that are reasonably rare.

Training a Reward Hacker Despite Perfect Labels
David Johnston · 26d

To be more specific, I think this kind of result is suggested by thinking about how policy gradient RL works (not goal misgeneralization), and you could say the good bits of shard theory are basically just explaining policy gradient RL to the safety community … but it needed explaining, so they deserve credit for doing it.
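As a toy illustration of that policy-gradient point (my own sketch, not anything from the post): a REINFORCE update raises the log-probability of every action in a rewarded trajectory, so an incidental behaviour that merely co-occurred with the reward gets reinforced even though it did nothing to earn it.

```python
# Toy REINFORCE update over two independent binary "choices" per trajectory:
#   index 0: the task action that actually determines the reward
#   index 1: an incidental behaviour with no effect on the reward
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

logits = np.zeros(2)
lr = 0.5

# One sampled trajectory in which both actions happened to be taken, and the
# (perfectly labelled) reward of 1 was earned by action 0 alone.
actions = np.array([1.0, 1.0])
reward = 1.0

# REINFORCE: reward * grad of log pi(actions) w.r.t. the logits.
probs = sigmoid(logits)
logits += lr * reward * (actions - probs)

print("updated probabilities:", sigmoid(logits))
# Both probabilities rise above 0.5 -- the incidental behaviour is reinforced
# purely because it appeared in a rewarded trajectory.
```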

Thoughts on Gradual Disempowerment
David Johnston · 26d

I didn’t mean it as a criticism, more as the way I understand it. Misalignment is a “definite” reason for pessimism, and for that reason one can be somewhat doubtful about whether it will actually play out that way. Gradual disempowerment is less definite about what form the problems may take, but it is also a more robust reason to think there is a risk.

Linch's Shortform
David Johnston · 1mo

That’s a good explanation of the distinction.

Training a Reward Hacker Despite Perfect Labels
David Johnston · 1mo

I share your general feelings about shard theory, but think you were being a bit too stingy with credit in this particular case.

Training a Reward Hacker Despite Perfect Labels
David Johnston · 1mo

This seems different to “maximising rewards for the wrong reasons”. That view generally sees the reward maximised because it is instrumental for or aliased with the wrong goal. Here it’s just a separate behaviour that is totally unhelpful for maximising rewards but is learned as a reflex anyway.

Linch's Shortform
David Johnston · 1mo

Given the apparent novelty of this interpretation, it doesn't actually obviate your broader thesis.

Linch's Shortform
David Johnston · 1mo

Wait, wherefore is probably better translated as "for what reason" than "why". But this makes it much more sensible! Romeo Romeo, what makes you Romeo? Not your damn last name, that's for sure!

Thoughts on Gradual Disempowerment
David Johnston · 1mo

I see the gradual disempowerment story as a simple, outside-view-flavoured reason why things could go badly for many people. I think it’s outside-view-flavoured because it’s a somewhat direct answer to “well, things seem to have been getting better for people so far”. While, as you point out, misalignment seems to make the prospects much worse, it’s worth bearing in mind that the economic irrelevance of people also strongly supports the case for bad outcomes from misalignment. If people remained economically indispensable, even fairly serious misalignment could have non-catastrophic outcomes.

Someone I was explaining it to described it as “indefinite pessimism”.

MIRI's "The Problem" hinges on diagnostic dilution
David Johnston · 1mo

Sorry but this is nonsense. JBlack's comment shows the argument works fine even if you take a lot of trouble to construct P(count|Y) to give a better answer.

But this isn't even particularly important, because for your objection to stand, it must be impossible to find any situation where P(A|Y) would give you a silly answer, which is completely false.

Posts

MIRI's "The Problem" hinges on diagnostic dilution (1mo)
A brief theory of why we think things are good or bad (11mo)
Mechanistic Anomaly Detection Research Update (1y)
Opinion merging for AI control (2y)
Is it worth avoiding detailed discussions of expectations about agency levels of powerful AIs? [Question] (3y)
How likely are malign priors over objectives? [aborted WIP] (3y)
When can a mimic surprise you? Why generative models handle seemingly ill-posed problems (3y)
There's probably a tradeoff between AI capability and safety, and we should act like it (3y)
Is evolutionary influence the mesa objective that we're interested in? (3y)
[Cross-post] Half baked ideas: defining and measuring Artificial Intelligence system effectiveness (3y)