(9) is a values thing, not a beliefs thing per se. (I.e. it's not an epistemic claim.)

(11) is one of those claims that is probabilistic in principle (and which can therefore be updated via evidence), but for which the evidence in practice is so one-sided that arriving at the correct answer is basically usable as a sort of FizzBuzz test for rationality: if you can't get the right answer on super-easy mode, you're probably not a good fit.

Something I wrote recently as part of a private conversation, which feels relevant enough to ongoing discussions to be worth posting publicly:

The way I think about it is something like: a "goal representation" is basically what you get when it's easier to state some compact specification on the outcome state, than it is to state an equivalent set of constraints on the intervening trajectories to that state.

In principle, this doesn't have to equate to "goals" in the intuitive, pretheoretic sense, but in practice my sense is that this happens largely when (and because) permitting longer horizons (in the sense of increasing the length of the minimal sequence needed to reach some terminal state) causes the intervening trajectories to explode in number and complexity, s.t. it's hard to impose meaningful constraints on those trajectories that don't map to (and arise from) some much simpler description of the outcomes those trajectories lead to.
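The "trajectories explode in number" point can be made quantitative with a trivial sketch (the branching factor and horizons here are my own illustrative choices, not from the original discussion): the number of length-h trajectories grows exponentially in the horizon, while a specification on the outcome state alone can stay constant-size.

```python
# Toy illustration: with branching factor b, the space of length-h trajectories
# grows as b**h, so constraining trajectories directly becomes infeasible long
# before a compact outcome specification does.
b = 4  # actions available at each step (illustrative)
for h in (5, 10, 20):
    print(h, b ** h)
```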

This connects with the "reasoners compress plans" point, on my model, because a reasoner is effectively a way to map that compact specification on outcomes to some method of selecting trajectories (or rather, selecting actions which select trajectories); and that, in turn, is what goal-oriented reasoning is. You get goal-oriented reasoners ("inner optimizers") precisely in those cases where that kind of mapping is needed, because simple heuristics relating to the trajectory instead of the outcome don't cut it.

It's an interesting question as to where exactly the crossover point occurs, where trajectory-heuristics don't function as effectively as consequentialist outcome-based reasoning. On one extreme, there are examples like tic-tac-toe, where it's possible to play perfectly based on a myopic set of heuristics without any kind of search involved. But as the environment grows more complex, the heuristic approach will in general be defeated by non-myopic, search-like, goal-oriented reasoning (unless the latter is too computationally intensive to be implemented).

That last parenthetical adds a non-trivial wrinkle, and in practice reasoning about complex tasks subject to bounded computation does best via a combination of heuristic-based reasoning about intermediate states, coupled to a search-like process of reaching those states. But that already qualifies in my book as "goal-directed", even if the "goal representations" aren't as clean as in the case of something like (to take the opposite extreme) AIXI.
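The tic-tac-toe extreme mentioned above can be checked directly. As a hedged sketch (my own code, just establishing the search-based baseline against which the myopic-heuristic player is being compared): the game is small enough that exhaustive minimax solves it outright, confirming the well-known result that perfect play from both sides is a draw.

```python
from functools import lru_cache

WIN_LINES = [(0,1,2),(3,4,5),(6,7,8),(0,3,6),(1,4,7),(2,5,8),(0,4,8),(2,4,6)]

def winner(board):
    """Return 'X' or 'O' if that player has three in a row, else None."""
    for a, b, c in WIN_LINES:
        if board[a] != '.' and board[a] == board[b] == board[c]:
            return board[a]
    return None

@lru_cache(maxsize=None)
def minimax(board, to_move):
    """Game value for X under perfect play: +1 X wins, 0 draw, -1 O wins."""
    w = winner(board)
    if w == 'X':
        return 1
    if w == 'O':
        return -1
    if '.' not in board:
        return 0
    values = []
    for i, cell in enumerate(board):
        if cell == '.':
            child = board[:i] + to_move + board[i+1:]
            values.append(minimax(child, 'O' if to_move == 'X' else 'X'))
    return max(values) if to_move == 'X' else min(values)

print(minimax('.' * 9, 'X'))  # 0: perfect play from the empty board is a draw
```

The contrast in the text is that a fixed, myopic rule list reproduces this same value with no lookahead at all; it's only as the environment grows that the heuristic side of the comparison falls behind.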

To me, all of this feels somewhat definitionally true (though not completely, since the real-world implications do depend on stuff like how complexity trades off against optimality, where the "crossover point" lies, etc). It's just that, in my view, the real world has already provided us enough evidence about this that our remaining uncertainty doesn't meaningfully change the likelihood of goal-directed reasoning being necessary to achieve longer-term outcomes of the kind many (most?) capabilities researchers have ambitions about.


It's pretty unclear if a system that is good at answering the question "Which action would maximize the expected amount of X?" also "wants" X (or anything else) in the behaviorist sense that is relevant to arguments about AI risk. The question is whether, if you ask that system "Which action would maximize the expected amount of Y?", it will also be wanting the same thing, or whether it will just be using cognitive procedures that are good at figuring out what actions lead to what consequences.

Here's an existing Nate!comment that I find reasonably persuasive, which argues that these two things are correlated in precisely those cases where the outcome requires routing through lots of environmental complexity:

Part of what's going on here is that reality is large and chaotic. When you're dealing with a large and chaotic reality, you don't get to generate a full plan in advance, because the full plan is too big. Like, imagine a reasoner doing biological experimentation. If you try to "unroll" that reasoner into an advance plan that does not itself contain the reasoner, then you find yourself building this enormous decision-tree, like "if the experiments come up this way, then I'll follow it up with this experiment, and if instead it comes up that way, then I'll follow it up with that experiment", and etc. This decision tree quickly explodes in size. And even if we didn't have a memory problem, we'd have a time problem -- the thing to do in response to surprising experimental evidence is often "conceptually digest the results" and "reorganize my ontology accordingly". If you're trying to unroll that reasoner into a decision-tree that you can write down in advance, you've got to do the work of digesting not only the real results, but the hypothetical alternative results, and figure out the corresponding alternative physics and alternative ontologies in those branches. This is infeasible, to say the least.

Reasoners are a way of compressing plans, so that you can say "do some science and digest the actual results", instead of actually calculating in advance how you'd digest all the possible observations. (Note that the reasoner specification comprises instructions for digesting a wide variety of observations, but in practice it mostly only digests the actual observations.)

Like, you can't make an "oracle chess AI" that tells you at the beginning of the game what moves to play, because even chess is too chaotic for that game tree to be feasibly representable. You've gotta keep running your chess AI on each new observation, to have any hope of getting the fragment of the game tree that you consider down to a manageable size.

Like, the outputs you can get out of an oracle AI are "no plan found", "memory and time exhausted", "here's a plan that involves running a reasoner in real-time" or "feed me observations in real-time and ask me only to generate a local and by-default-inscrutable action". In the first two cases, your oracle is about as useful as a rock; in the third, it's the realtime reasoner that you need to align; in the fourth, all [the] word "oracle" is doing is mollifying you unduly, and it's this "oracle" that you need to align.

Could you give an example of a task you don't think AI systems will be able to do before they are "want"-y? At what point would you update, if ever? What kind of engineering project requires an agent to be want-y to accomplish it? Is it something that individual humans can do? (It feels to me like you will give an example like "go to the moon" and that you will still be writing this kind of post even once AI systems have 10x'd the pace of R&D.)

Here's an existing Nate!response to a different-but-qualitatively-similar request that, on my model, looks like it ought to be a decent answer to yours as well:

a thing I don't expect the upcoming multimodal models to be able to do: train them only on data up through 1990 (or otherwise excise all training data from our broadly-generalized community), ask them what superintelligent machines (in the sense of IJ Good) should do, and have them come up with something like CEV (a la Yudkowsky) or indirect normativity (a la Beckstead) or counterfactual human boxing techniques (a la Christiano) or suchlike.

Note that this is only tangentially a test of the relevant ability; very little of the content of what-is-worth-optimizing-for occurs in Yudkowsky/Beckstead/Christiano-style indirection. Rather, coming up with those sorts of ideas is a response to glimpsing the difficulty of naming that-which-is-worth-optimizing-for directly and realizing that indirection is needed. An AI being able to generate that argument without following in the footsteps of others who have already generated it would be at least some evidence of the AI being able to think relatively deep and novel thoughts on the topic.

(The original discussion that generated this example was couched in terms of value alignment, but it seems to me the general form "delete all discussion pertaining to some deep insight/set of insights from the training corpus, and see if the model can generate those insights from scratch" constitutes a decent-to-good test of the model's cognitive planning ability.)

(Also, I personally think it's somewhat obvious that current models are lacking in a bunch of ways that don't nearly require the level of firepower implied by a counterexample like "go to the moon" or "generate this here deep insight from scratch", s.t. I don't think current capabilities constitute much of an update at all as far as "want-y-ness" goes, and continue to be puzzled at what exactly causes [some] LLM enthusiasts to think otherwise.)

I think I'm not super into the U = V + X framing; that seems to inherently suggest that there exists some component of the true utility V "inside" the proxy U everywhere, and which is merely perturbed by some error term rather than washed out entirely (in the manner I'd expect to see from an actual misspecification). In a lot of the classic Goodhart cases, the source of the divergence between measurement and desideratum isn't regressional, and so V and X aren't independent.

(Consider e.g. two arbitrary functions U' and V', and compute the "error term" X' between them. It should be obvious that when U' is maximized, X' is much more likely to be large than V' is; which is simply another way of saying that X' isn't independent of V', since it was in fact computed from V' (and U'). The claim that the reward model isn't even "approximately correct", then, is basically this: that there is a separate function U being optimized whose correlation with V within-distribution is in some sense coincidental, and that out-of-distribution the two become basically unrelated, rather than one being expressible as a function of the other plus some well-behaved error term.)
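The claim in the parenthetical above is easy to check numerically. A minimal sketch (the distributions, domain size, and trial count are my own choices; the argument itself doesn't depend on them): draw two independent "arbitrary functions" U' and V' as random vectors over a finite domain, define the error term X' = U' − V', and see how often X' exceeds V' at the point optimization selects.

```python
import numpy as np

rng = np.random.default_rng(0)
n_points, n_trials = 1000, 500
count = 0
for _ in range(n_trials):
    U = rng.standard_normal(n_points)  # proxy utility, arbitrary
    V = rng.standard_normal(n_points)  # true utility, independent of U
    X = U - V                          # "error term", computed from U and V
    i = np.argmax(U)                   # the point maximizing the proxy
    count += X[i] > V[i]
frac = count / n_trials
print(frac)  # large majority of trials: the "error" dominates the true utility
```

Since X' was computed from U' and V', it is anything but an independent, well-behaved perturbation: at the argmax of U', the error term carries most of the value.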

(Which, for instance, seems true about humans, at least in some cases: If humans had the computational capacity, they would lie a lot more and calculate personal advantage a lot more. But since those are both computationally expensive, and therefore can be caught out by other humans, the heuristic / value of "actually care about your friends" is competitive with "always be calculating your personal advantage."

I expect this sort of thing to be less common with AI systems that can have much bigger "cranial capacity". But then again, I guess that at whatever level of brain size, there will be some problems for which it's too inefficient to do them the "proper" way, and for which comparatively simple heuristics / values work better.

But maybe at high enough cognitive capability, you just have a flexible, fully-general process for evaluating the exact right level of approximation for solving any given problem, and the binary distinction between doing things the "proper" way and using comparatively simpler heuristics goes away. You just use whatever level of cognition makes sense in any given micro-situation.)

+1; this seems basically similar to the cached argument I have for why human values might be more arbitrary than we'd like—very roughly speaking, they emerged on top of a solution to a specific set of computational tradeoffs while trying to navigate a specific set of repeated-interaction games, and then a bunch of contingent historical religion/philosophy on top of that. (That second part isn't in the argument you [Eli] gave, but it seems relevant to point out; not all historical cultures ended up valuing egalitarianism/fairness/agency the way we seem to.)

It sounds like you're arguing that uploading is impossible, and (more generally) have defined the idea of "sufficiently OOD environments" out of existence. That doesn't seem like valid thinking to me.

Notice I replied to that comment you linked and agreed with John, but not that any generalized vector dot product model is wrong; rather, that the specific one in that post is wrong, as it doesn't weight by expected probability (i.e. it uses an incorrect distance function).

Anyway I used that only as a convenient example to illustrate a model which separates degree of misalignment from net impact, my general point does not depend on the details of the model and would still stand for any arbitrarily complex non-linear model.

The general point being that degree of misalignment is only relevant to the extent it translates into a difference in net utility.

Sure, but if you need a complicated distance metric to describe your space, that makes it correspondingly harder to actually describe utility functions corresponding to vectors within that space which are "close" under that metric.

If you actually believe the sharp left turn argument holds water, where is the evidence?

As I said earlier, this evidence must take a specific form: evidence in the historical record.

Hold on; why? Even for simple cases of goal misspecification, the misspecification may not become obvious without a sufficiently OOD environment; does that thereby mean that no misspecification has occurred?

And in the human case, why does it not suffice to look at the internal motivations humans have, and describe plausible changes to the environment for which those motivations would then fail to correspond even approximately to IGF, as I did w.r.t. uploading?

But I see that as much more contingent than necessarily true, and mainly a consequence of the fact that, for all of our technological advances, we haven't actually given rise to that many new options preferable to us but not to IGF. On the other hand, something like uploading I would expect to completely shatter any relation our behavior has to IGF maximization.

It seems to me that this suffices to establish that the primary barrier against such a breakdown in correspondence is that of insufficient capabilities—which is somewhat the point!

No AI we create will be perfectly aligned, so instead all that actually matters is the net utility that AI provides for its creators: something like the dot product between our desired future trajectory and that of the agents. More powerful agents/optimizers will move the world farther faster (longer trajectory vector) which will magnify the net effect of any fixed misalignment (cos angle between the vectors), sure. But that misalignment angle is only relevant/measurable relative to the net effect - and by that measure human brain evolution was an enormous unprecedented success according to evolutionary fitness.

The vector dot product model seems importantly false, for basically the reason sketched out in this comment; optimizing a misaligned proxy isn't about taking a small delta and magnifying it, but about transitioning to an entirely different policy regime (vector space) where the dot product between our proxy and our true alignment target is much, much larger (effectively no different from that of any other randomly selected pair of vectors in the new space).

(You could argue humans haven't fully made that phase transition yet, and I would have some sympathy for that argument. But I see that as much more contingent than necessarily true, and mainly a consequence of the fact that, for all of our technological advances, we haven't actually given rise to that many new options preferable to us but not to IGF. On the other hand, something like uploading I would expect to completely shatter any relation our behavior has to IGF maximization.)
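The claim that the proxy/target dot product in the new space is "effectively no different from that of any other randomly selected pair of vectors" has a concrete geometric backing worth spelling out (dimension and sample count below are my own illustrative choices): in high dimensions, two randomly selected directions are nearly orthogonal, with expected |cosine similarity| on the order of 1/√d.

```python
import numpy as np

rng = np.random.default_rng(1)
d, trials = 1000, 200
cosines = []
for _ in range(trials):
    a = rng.standard_normal(d)  # stand-in for the proxy direction
    b = rng.standard_normal(d)  # stand-in for the true-target direction
    cosines.append(abs(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b)))
mean_cos = float(np.mean(cosines))
print(mean_cos)  # small: on the order of 1/sqrt(1000)
```

So "no different from a random pair" is not a mild condition: in a rich policy space it means essentially zero alignment, not a slightly perturbed angle.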

It looks a bit to me like your Timestep Dominance Principle forbids the agent from selecting any trajectory which loses utility at a particular timestep in exchange for greater utility at a later timestep, regardless of whether the trajectory in question actually has anything to do with manipulating the shutdown button? After all, conditioning on the shutdown being pressed at any point after the local utility loss but before the expected gain, such a decision would give lower sum-total utility within those conditional trajectories than one which doesn't make the sacrifice.

That doesn't seem like behavior we really want; depending on how closely together the "timesteps" are spaced, it could even wreck the agent's capabilities entirely, in the sense of no longer being able to optimize within button-not-pressed trajectories.

(It also doesn't seem to me a very natural form for a utility function to take, assigning utility not just to terminal states, but to intermediate states as well, and then summing across the entire trajectory; humans don't appear to behave this way when making plans, for example. If I considered the possibility of dying at every instant between now and going to the store, and permitted myself only to take actions which Pareto-improve the outcome set after every death-instant, I don't think I'd end up going to the store, or doing much of anything at all!)
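The worry in the first paragraph can be made concrete with a toy sketch (the per-timestep utility model and the two trajectories are my own reading of the setup, not from the original post): compare two trajectories conditional on the shutdown button being pressed after each possible timestep.

```python
def conditional_utilities(traj):
    """Sum-total utility conditional on shutdown occurring after timestep t,
    for each t (i.e. the running prefix sums of per-timestep utilities)."""
    totals, running = [], 0
    for u in traj:
        running += u
        totals.append(running)
    return totals

steady = [5, 5]  # no early sacrifice
invest = [3, 9]  # gives up utility now for a larger gain later

print(conditional_utilities(steady))  # [5, 10]
print(conditional_utilities(invest))  # [3, 12]
# Conditional on shutdown after timestep 1, 'invest' does worse (3 < 5), so it
# fails to timestep-dominate 'steady' even though its unconditional total is
# higher (12 > 10): the sacrifice-now-for-gain-later plan gets screened off.
```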

In your example, DSM permits the agent to end up with either A+ or B. Neither is strictly dominated, and neither has become mandatory for the agent to choose over the other. The agent won't have reason to push probability mass from one towards the other.

But it sounds like the agent's initial choice between A and B is forced, yes? (Otherwise, it wouldn't be the case that the agent is permitted to end up with either A+ or B, but not A.) So the presence of A+ within a particular continuation of the decision tree influences the agent's choice at the initial node, in a way that causes it to reliably choose one incomparable option over another.

Further thoughts: under the original framing, instead of choosing between A and B (while knowing that B can later be traded for A+), the agent instead chooses whether to go "up" or "down" to receive (respectively) A, or a further choice between A+ and B. It occurs to me that you might be using this representation to argue for a qualitative difference in the behavior produced, but if so, I'm not sure how much I buy into it.

For concreteness, suppose the agent starts out with A, and notices a series of trades which first involves trading A for B, and then B for A+. It seems to me that if I frame the problem like this, the structure of the resulting tree should be isomorphic to that of the decision problem I described, but not necessarily the "up"/"down" version—at least, not if you consider that version to play a key role in DSM's recommendation.

(In particular, my frame is sensitive to which state the agent is initialized in: if it is given B to start, then it has no particular incentive to want to trade that for either A or A+, and so faces no incentive to trade at all. If you initialize the agent with A or B at random, and institute the rule that it doesn't trade by default, then the agent will end up with A+ when initialized with A, and B when initialized with B—which feels a little similar to what you said about DSM allowing both A+ and B as permissible options.)

It sounds like you want to make it so that the agent's initial state isn't taken into account—in fact, it sounds like you want to assign values only to terminal nodes in the tree, take the subset of those terminal nodes which have maximal utility within a particular incomparability class, and choose arbitrarily among those. My frame, then, would be equivalent to using the agent's initial state as a tiebreaker: whichever terminal node shares an incomparability class with the agent's initial state will be the one the agent chooses to steer towards. In which case, assuming I got the above correct, I think I stand by my initial claim that this will lead to behavior which, while not necessarily "trammeling" by your definition, is definitely consequentialist in the worrying sense: an agent initialized in the "shutdown button not pressed" state will perform whatever intermediate steps are needed to navigate to the maximal-utility "shutdown button not pressed" state it can foresee, including actions which prevent the shutdown button from being pressed.
