David Johnston — LessWrong

NATO is dangerously unaware that its military edge is slipping

My view on where the tanks might win is: there's a point at which you basically saturate your capability at "whatever drones are good at" while there might be some other job tanks are good at (my vague guess is that this is something like "attacking well defended positions" - they're fast, take specialized weapons to defeat, and have big guns), and you're better off having that capability than further saturating your drone capability. But I've little in the way of quantitative insight about where saturation might occur, nor how good tanks are at attacking.

A particular point I'm a bit confused about: I've often seen people saying: tanks need infantry support to be safe. However, aren't infantry and tanks both vulnerable to drones?

NATO is dangerously unaware that its military edge is slipping

David Johnston3d42

I guess 5 Abrams and 30 million worth of drones vs 60 million worth of drones might be a better comparison. I think I’d still favour the drones but it’s much less obvious.

Natural emergent misalignment from reward hacking in production RL

David Johnston4d50

Some negative results. In some forthcoming work (out in the next few days, hopefully), we'll report negative results on trying to teach models to have "honest-only personas." That is, we tried to teach a model that, when a user query is prefixed with |HONEST_ONLY|, it responds in <honest_only> tags and only generates honest text; simultaneously, we trained the normal assistant persona to (1) acquire some knowledge but (2) lie about it. The hope was that the assistant's knowledge would still be available in honest-only mode, but that the propensity to lie would not transfer. Sadly, the dishonest propensity did transfer, and this method overall failed to beat a baseline of just training the assistant to be honest using the generic honesty data that we used to train the honest-only persona. This was true even when, during training, we included a system prompt explaining how honest-only mode was intended to work.

This is surprising to me, I would've expected it to work, maybe not perfectly, but there should be a significant difference. I'm less certain what my expectation would be for whether it beats your baseline - maybe "65% it beats baseline but not by a lot".

What about the inverse situation: untagged is honest, tagged is dishonest? The hypothesis here is something like: the unconditioned behaviour is the "true" persona (though I'm not very confident this would work: it'd be weird if propensity had asymmetric generalization properties but knowledge did not)

Buck's Shortform

David Johnston1mo100

I don't know about 2020 exactly, but I think since 2015 (being conservative), we do have reason to make quite a major update, and that update is basically that "AGI" is much less likely to be insanely good at generalization than we thought in 2015.

Evidence is basically this: I don't think "the scaling hypothesis" was obvious at all in 2015, and maybe not even in 2020. If it was, OpenAI could not have caught everyone with their pants down by investing early in scaling. But if people mostly weren't expecting massive data scale-ups to be the road to AGI, what were they expecting instead? The alternative to reaching AGI by hyperscaling data is a world where we reach AGI with ... not much data. I have this picture which I associate with Marcus Hutter – possibly quite unfairly – where we just find the right algorithm, teach it to play a couple of computer games and hey presto we've got this amazing generally intelligent machine (I'm exaggerating a little bit for effect). In this world, the "G" in AGI comes from extremely impressive and probably quite unpredictable feats of generalization, and misalignment risks are quite obviously way higher for machines like this. As a brute fact, if generalization is much less predictable, then it is harder to tell if you've accidentally trained your machine to take over the world when you thought you were doing something benign. A similar observation also applies to most of the specific mechanisms proposed for misalignment: surprisingly good cyberattack capabilities, gradient hacking, reward function aliasing that seems intuitively crazy - they all become much more likely to strike unexpectedly if generalization is extremely broad.

But this isn't the world we're in; rather, we're in a world where we're helped along by a bit of generalization, but to a substantial extent we're exhaustively teaching the models everything they know (even the RL regime we're in seems to involve sizeable amounts of RL teaching many quite specific capabilities). Sample efficiency is improving, but the rate of progress in capability vs the rate of progress in sample efficiency looks to me like it's highly likely that we're in qualitatively the same world by the time we have broadly superhuman machines. I'd even be inclined to say: human level data efficiency is the upper bound of the point at which we reach broadly superhuman capability, because it's easy to feed machines much more (quality) data than it is to feed it to people, so by the time we get human level data efficiency we must have surpassed human level capability (well, probably).

Of course "super-AGI" could still end up hyper-data-efficient, but it seems like we're well on track to get less-generalizing and very useful AGI before we get there.

I know you're asking about goal structures and inductive biases, but I think generalization is another side of the same coin, and the thoughts above seem far simpler and thus more likely to be correct than anything I've ever thought specifically about inductive biases and goals. So I suppose my expectation is that correct thoughts about goal formation and inductive biases would also point away from 2015 era theories insofar as such theories predicted broad and unpredictable generalization, but I've little specific to contribute right now.

Early stage goal-directednesss

David Johnston1mo1-4

I think it's not in the IABIED FAQ because IABIED is focused on the relatively "easy calls"

IABIED says alignment is basically impossible

Cope Traps

Come on, I’m not doing this to you

Early stage goal-directednesss

David Johnston1mo30

It's helpful to know that we were thinking about different questions, but, like

There is some fact-of-the-matter about what, in practice, Sable's kludgey mishmash of pseudogoals will actually tend towards. There are multiple ways this could potentially resolve into coherence, path dependent, same as humans.

[...]

It may not have a strong belief that it has any specific goals it wants to pursue, but it's got some sense that there are some things it wants that humanity wouldn't give it.

these are claims, albeit soft ones, about what kinds of goals arise, no?

Your FAQ argues theoretically (correctly) that the training data and score function alone don't determine what AI systems aim for. But this doesn't tell us we can't predict anything about End Goals full stop: it just says the answer doesn't follow directly from the training data.

The FAQ also assumes that AIs actually have "deep drives" but doesn't explain where they come from or what they're likely to be. This post discusses how they might arise and I am telling you that you can think about the mechanism you propose here to understand properties of the goals are likely to arise as a result of it^[1]. This addresses a question that the FAQ you link does not: what can we say about what goals are likely to arise?

Of course, if this mechanism ends up being not very important, we could get very different outcomes. ↩︎

Early stage goal-directednesss

David Johnston1mo32

I don't understand how this answers my question. I agree that if your heuristics are failing you're more likely to end up with surprising solutions, but I thought we were talking about end goals being random, not the means to achieving them. "Formulate the problem as a search" is an example of what I'd call a "robust heuristic"; I am claiming also that the goal of the problem-formulated-as-a-search is likely to be supplied by robust heuristics. This is completely compatible with the solution being in some respects surprising.

Early stage goal-directednesss

David Johnston1mo30

But once it starts thinking in a different language, and asking itself "okay, what's my goal?, how do I accomplish it?", more semirandom threads gain traction than previously could get traction.

From a commonsense point of view, one asks "what's my goal" when common heuristics are failing or conflicting, so you want to appeal to more robust (but perhaps costlier) heuristics to resolve the issue. So why do you expect heuristics to get more random here as capability increases? Perhaps it's something about training not aligning with common sense, but it seems to be that imitation, process supervision and outcome supervision would also favour appealing to more, not less, robust heuristics in this situation:

Imitation: because it's common sense
Process supervision: if process supervision addresses heuristic conflicts, it is desirable that they're resolved in a robust way and so appealing to more robust heuristics will be a success criterion in the rubric
Outcome supervision: should favour resolution by heuristics robustly aligned with "get high score on outcome measure"

Generalization and the Multiple Stage Fallacy?

[+]David Johnston2mo-50

Why Corrigibility is Hard and Important (i.e. "Whence the high MIRI confidence in alignment difficulty?")

David Johnston2mo52

I share the sense that this article has many of the common shortcomings with other MIRI output and feel like maybe I ought to try a lot harder to communicate these issues, BUT I really don't think VNM rationality is the culprit here. I've not seen a compelling case that an otherwise capable model would be aligned or corrigible but for its taste for getting money pumped (I had a chat with Elliot T on twitter recently where he actually had a proposal along these lines ... but I didn't buy it).

I really think it's reasoning errors in how VNM and other "goal-directedness" premises are employed, and not VNM itself, that is problematic.

LESSWRONG
LW

LESSWRONG
LW

Posts

Wikitag Contributions

Comments