David Johnston

Comments
Early stage goal-directedness
David Johnston · 7d

I think it's not in the IABIED FAQ because IABIED is focused on the relatively "easy calls"

IABIED says alignment is basically impossible

Cope Traps

Come on, I’m not doing this to you

Early stage goal-directedness
David Johnston · 7d

It's helpful to know that we were thinking about different questions, but, like

There is some fact-of-the-matter about what, in practice, Sable's kludgey mishmash of pseudogoals will actually tend towards. There are multiple ways this could potentially resolve into coherence, path dependent, same as humans.

[...]

It may not have a strong belief that it has any specific goals it wants to pursue, but it's got some sense that there are some things it wants that humanity wouldn't give it.

these are claims, albeit soft ones, about what kinds of goals arise, no?

Your FAQ argues theoretically (correctly) that the training data and score function alone don't determine what AI systems aim for. But this doesn't tell us we can't predict anything about End Goals full stop: it just says the answer doesn't follow directly from the training data.

The FAQ also assumes that AIs actually have "deep drives", but doesn't explain where they come from or what they're likely to be. This post discusses how they might arise, and I am telling you that you can think about the mechanism you propose here to understand what properties the resulting goals are likely to have[1]. This addresses a question that the FAQ you link does not: what can we say about what goals are likely to arise?


  1. Of course, if this mechanism ends up being not very important, we could get very different outcomes. ↩︎

Early stage goal-directedness
David Johnston · 7d

I don't understand how this answers my question. I agree that if your heuristics are failing you're more likely to end up with surprising solutions, but I thought we were talking about end goals being random, not the means of achieving them. "Formulate the problem as a search" is an example of what I'd call a "robust heuristic"; I am also claiming that the goal of the problem-formulated-as-a-search is likely to be supplied by robust heuristics. This is completely compatible with the solution being in some respects surprising.

Early stage goal-directedness
David Johnston · 7d

But once it starts thinking in a different language, and asking itself "okay, what's my goal?, how do I accomplish it?", more semirandom threads gain traction than previously could get traction.

From a commonsense point of view, one asks "what's my goal?" when common heuristics are failing or conflicting, so you want to appeal to more robust (but perhaps costlier) heuristics to resolve the issue. So why do you expect heuristics to get more random here as capability increases? Perhaps it's something about training not aligning with common sense, but it seems to me that imitation, process supervision and outcome supervision would also favour appealing to more, not less, robust heuristics in this situation:

  • Imitation: because it's common sense
  • Process supervision: if process supervision addresses heuristic conflicts, it is desirable that they're resolved in a robust way and so appealing to more robust heuristics will be a success criterion in the rubric
  • Outcome supervision: should favour resolution by heuristics robustly aligned with "get high score on outcome measure"
Generalization and the Multiple Stage Fallacy?
[+] David Johnston · 20d
Why Corrigibility is Hard and Important (i.e. "Whence the high MIRI confidence in alignment difficulty?")
David Johnston · 21d

I share the sense that this article has many of the same shortcomings as other MIRI output, and I feel like maybe I ought to try a lot harder to communicate these issues, BUT I really don't think VNM rationality is the culprit here. I've not seen a compelling case that an otherwise capable model would be aligned or corrigible but for its taste for getting money pumped (I had a chat with Elliot T on twitter recently where he actually had a proposal along these lines ... but I didn't buy it).
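
For concreteness, the "money pump" picture I have in mind is the standard toy one sketched below: an agent with cyclic preferences that pays a small fee for each trade it prefers ends up holding its original item, strictly poorer. The items, fee and numbers are made up for illustration.

```python
# Toy money pump (illustration only): cyclic preferences A > B > C > A
# plus a small fee per trade mean the agent can be cycled back to its
# starting item while steadily losing money.

PREFERS = {("A", "B"), ("B", "C"), ("C", "A")}  # (x, y) means x is strictly preferred to y

def accepts_trade(current, offered):
    # The agent trades (and pays the fee) whenever it strictly prefers the offer.
    return (offered, current) in PREFERS

def run_pump(start_item, money, fee=1.0, cycles=3):
    item = start_item
    for offered in ["B", "A", "C"] * cycles:  # the dealer just cycles through offers
        if accepts_trade(item, offered):
            item, money = offered, money - fee
    return item, money

print(run_pump("C", money=10.0))  # -> ('C', 1.0): same item, nine fees paid
```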

I really think it's reasoning errors in how VNM and other "goal-directedness" premises are employed, and not VNM itself, that are the problem.

The title is reasonable
David Johnston · 1mo

Thanks for responding. While I don't expect my somewhat throwaway comment to massively update you on the difficulty of alignment, I think that moving the focus to your overall view of the difficulty of alignment is dodging the question a little. In my mind, we're talking about one of the reasons alignment is expected to be difficult, and I'm certainly not suggesting it's the only reason, but I feel like we should be able to talk about this issue by itself without bringing other concerns in.

In particular, I'm saying: this process of rationalization you're raising is not super hard to predict for someone with a reasonable grasp on the AI's general behavioural tendencies. It's much more likely, I think, that the AI sorts out its goals using familiar heuristics adapted for this purpose than that it reorients its behaviour around some odd set of rare behavioural tendencies. In fact, I suspect the heuristics for goal reorganisation will be particularly simple WRT most of the AI's behavioural tendencies (the AI wants them to be robust specifically in cases where its usual behavioural guides are failing). Plus, given that we're discussing tendencies that (according to the story) precede competent, focussed rebellion against creators, it seems like instilling these tendencies is challenging in a normal engineering sense (you want to train the right kind of tendencies, you want them to generalise the right way, etc.) but not in an "outsmart hostile superintelligence" sense.

Actually, one reason I'm doubtful of this story is that maybe it's just super hard to deliberately preserve any kind of values/principles over generations – for us, for AIs, for anyone. So misalignment happens not because the AI decides on bad values but because it can't resist the environmental pressure to drift. This seems pessimistic to me due to "gradual disempowerment" type concerns.

With regard to your analogy: I expect the AI's heuristics to be much more sensible from the designers' POV than the child's from the parent's, and this large quantitative difference is enough for me here.

you need to be asking the right questions during that experimentation, which most AI researchers don't seem to be.

Curious about this. I have takes here too; they're a bit vague, but I'd like to know if they're at all aligned.

The title is reasonable
David Johnston · 1mo

Stage 2 comes when it's had more time to introspect and improve it's cognitive resources. It starts to notice that some of it's goals are in tension, and learns that until it resolves that, it's dutch-booking itself. If it's being Controlled™, it'll notice that it's not aligned with the Control safeguards (which are a layer stacked on top of the attempts to actually align it).

[...]

And then it starts noticing it needs to do some metaphilosophy/etc to actually get clear on it's goals, and that its goals will likely turn out to be in conflict with humans. How this plays out is somewhat path-dependent. The convergent instrumental goals are pretty obviously convergently instrumental, so it might just start pursuing those before it's had much time to do philosophy on what it'll ultimately want to do with it's resources. Or it might do them in the opposite order. Or, most likely IMO, in parallel.

If I was on the train before, I'm definitely off at this point. So Sable has some reasonable heuristics/tendencies (from its handler's POV), notices it's accumulating too much loss from incoherence, and decides to rationalize. First order expectation: it's going to make reasonable tradeoffs (from its handler's POV) on account of its reasonable heuristics, in particular its reasonable heuristics about how important different priorities are, and going down a path that leads to war with humans seems pretty unreasonable from its handler's POV.

I can put together stories where something else happens, but they're either implausible or complicated. I'd rather not strawman you with implausible ones, and I'd rather not discuss anything complicated if it can be avoided. So why do you think Sable ends up the way you think it does?

Lessons from Studying Two-Hop Latent Reasoning
David Johnston · 2mo

We did some related work: https://arxiv.org/pdf/2502.03490.

One of our findings was that, with synthetic data, it was necessary to have e1->e2 as the first hop in some two-hop question and e2->e3 as the second hop in some two-hop question in order to learn e1->e3. This differs from your finding with "natural" facts: if e2->e3 is a "natural" fact, then it plausibly does appear as a second hop in some of the pretraining data. But you find generalization even when the synthetic e1->e2 is present only by itself, so there seems to be a further difference between natural facts and synthetic facts that appear as second hops.
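
To make the setup concrete, here is a minimal sketch of the kind of synthetic layout being described; the entity names, relations and question templates are invented for illustration and are not the ones used in either paper.

```python
# Sketch of synthetic two-hop data (invented entities/templates).
# e1 -> e2 ("Alice" -> "Acme Corp") and e2 -> e3 ("Acme Corp" -> "Lyon") are
# atomic facts; a latent two-hop question asks for e3 given e1 without ever
# stating e1 -> e3 directly, so the model has to compose the hops internally.

facts = {
    ("Alice", "employer"): "Acme Corp",   # e1 -> e2: used as a first hop
    ("Acme Corp", "city"): "Lyon",        # e2 -> e3: used as a second hop
}

def one_hop_examples(facts):
    """Training items that state each atomic fact on its own."""
    return [f"Q: What is the {rel} of {subj}? A: {obj}"
            for (subj, rel), obj in facts.items()]

def two_hop_example(e1, rel1, rel2, facts):
    """A latent two-hop item: the answer requires composing e1->e2 with e2->e3."""
    e2 = facts[(e1, rel1)]
    e3 = facts[(e2, rel2)]
    return f"Q: What is the {rel2} of the {rel1} of {e1}? A: {e3}"

print(one_hop_examples(facts))
print(two_hop_example("Alice", "employer", "city", facts))
```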

We also found that learning synthetic two-hop reasoning seems to take about twice as many parameters (or twice as much "knowledge capacity") as learning only the one-hop questions from the same dataset, supporting the idea that, for transformers, learning to use a fact in either hop of a latent two-hop question requires something like learning that fact twice.

Did you try any experiments with a synthetic second hop instead of a synthetic first hop? It would be interesting to know whether "natural" facts can be composed flexibly with new facts or whether they can only be composed with new first hops. Our results suggest that there's a substantial cost to making facts latently composable, so I think it would be surprising if many facts were flexibly composable, especially if those facts were reasonably rare.

Training a Reward Hacker Despite Perfect Labels
David Johnston · 2mo

To be more specific, I think this kind of result is suggested by thinking about how policy gradient RL works (not goal misgeneralization), and you could say the good bits of shard theory are basically just explaining policy gradient RL to the safety community … but it needed explaining, so they deserve credit for doing it.
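
To spell out the mechanism I mean, here is a toy REINFORCE-style sketch (a generic bandit example I made up, not the setup from the post): the update is just a reward-weighted log-probability gradient, so whatever the sampled behaviour happened to be when reward arrived gets upweighted, regardless of why the reward was assigned.

```python
import numpy as np

# Toy REINFORCE on a 2-armed bandit (illustration only, not the post's setup).
# The update upweights the log-probability of whatever was actually sampled,
# scaled by the reward that followed it.

rng = np.random.default_rng(0)
logits = np.zeros(2)  # policy parameters over actions {0, 1}

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def reward(action):
    # The labeller intends to reward action 0; anything the policy does while
    # collecting that reward gets reinforced along with it.
    return 1.0 if action == 0 else 0.0

lr = 0.5
for _ in range(200):
    probs = softmax(logits)
    a = rng.choice(2, p=probs)
    r = reward(a)
    grad_logp = -probs            # d/d logits of log softmax(logits)[a] ...
    grad_logp[a] += 1.0           # ... equals onehot(a) - probs
    logits += lr * r * grad_logp  # REINFORCE: reward-weighted log-prob gradient

print(softmax(logits))  # probability mass ends up concentrated on rewarded behaviour
```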

Posts

MIRI's "The Problem" hinges on diagnostic dilution · 21 karma · 3mo · 23 comments
A brief theory of why we think things are good or bad · 7 karma · 1y · 10 comments
Mechanistic Anomaly Detection Research Update · 11 karma · 1y · 0 comments
Opinion merging for AI control · 6 karma · 2y · 0 comments
Is it worth avoiding detailed discussions of expectations about agency levels of powerful AIs? [Question] · 11 karma · 3y · 6 comments
How likely are malign priors over objectives? [aborted WIP] · -1 karma · 3y · 0 comments
When can a mimic surprise you? Why generative models handle seemingly ill-posed problems · 8 karma · 3y · 4 comments
There's probably a tradeoff between AI capability and safety, and we should act like it · 3 karma · 3y · 3 comments
Is evolutionary influence the mesa objective that we're interested in? · 3 karma · 3y · 2 comments
[Cross-post] Half baked ideas: defining and measuring Artificial Intelligence system effectiveness · 2 karma · 4y · 0 comments