Optimality is the tiger, and agents are its teeth

Ok. Let me try to summarize to see if I get it (I'm not sure if I do).

My summary:

The core problem of AI risk is not fundamentally about "agent-like things".
The core problem is that optimizing sufficiently hard in any direction necessarily unlocks new and powerful capabilities somehow, because optimal outputs necessarily entail recruiting powerful capabilities. And "powerful" is approximately synonymous with "dangerous".
So optimizing very hard for anything is going to put you in the neighborhood of dangerous capabilities.
In practice, there are few things that are as generically powerful as agency, since agency is the property of being responsive to a wide range of possible environments and hitting a target anyway. So the powerful capabilities that optimization is going to unlock will almost certainly be agents. But in some sense, that's a contingent feature of our universe. If there were some other capability (something like nanotech) that was powerful enough to produce optimized outcomes without agency, you might find that instead. But in that world, you're still facing much of the same danger, because any capability powerful enough to achieve optimality also has the ability to majorly disrupt the world.

I feel like I'm not doing a good job cutting to the core. How good a paraphrase is that?

[-]Veedrac4y180

I think that's mostly a really good summary. The major distinction I would try to make is that agenthood is primarily a way to actualize power, rather than a source of it.

If you had an agent that wasn't strongly optimized in any sense other than it was an agent, in that it had goals and wanted to solve them, that wouldn't make it dangerous, any more than your dog is dangerous for being an agent. Whereas the converse, if you have something that's strongly optimised in some more generic sense, but wasn't an agent, this still puts you extremely close to a lot of danger. The article was trying to emphasize this by pointing to the most reductive form of agenthood I could see, in that none of the intrinsic power of the resulting system could reasonably be attributed to any intrinsic smartness of the agent component, even if the system was an agent that was powerful.

[-]Eli Tyre4y70

I think there's some additional nuance here that makes a difference.

Most extremely optimized outputs are benign. Like suppose I'm trying to measure the length of a piece of wood, at an extremely high level of precision. The capabilities needed to get an atomic-level measurement might be dangerous, but the actual output would be harmless, a number on paper.

It's not that optimized outputs are dangerous, it's that optimization is dangerous.

[-]TekhneMakre4y20

This is an unnatural use of "most". Extremely optimized outputs will tend to be dangerous, on their own, even if they are actually just optimized "for something". It seems more natural to say that for most features such that you know how to ask for something to be very optimized on that feature, something extremely optimized for that feature will be dangerous.

[-]Veedrac4y00

I agree with that example but I don't see the distinction the same way. An optimised measure of that sort is safe primarily because it is within an extremely limited domain without much freedom for there to be a lot of optimality, in some informal sense.

Contrast, capabilities for getting very precise measures of that sort exist in the space of things-you-can-do-in-reality, so there is lots of room for such capabilities to be either benign (an extremely accurate laboratory machine) or dangerous (the shortest program that if executed would have that measurement performed). I wouldn't say that there is an important distinction in it involving an optimizing action—an optimiser—but that the domain is large enough that optimal results within it are dangerous in general.

For instance, the process of optimizing a simple value within a simple domain can be as simple as Newton–Raphson, and that's safe because the domain is sufficiently restricted. Contrast, a sufficiently optimised book ends the world, a widget sufficiently optimised for manufacturability ends the world, a baseball sufficiently optimised for speed ends the world.
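To make the "simple value within a simple domain" case concrete, here is a minimal sketch of Newton–Raphson. The function names and the example objective are illustrative assumptions, not anything from the post. The point is only that the whole search lives on a single real number, so there is nowhere for the optimization to go except toward the answer.

```python
def newton_raphson(f, df, x0, tol=1e-12, max_iter=50):
    """Find a root of f near x0, given its derivative df."""
    x = x0
    for _ in range(max_iter):
        slope = df(x)
        if slope == 0:
            raise ZeroDivisionError("derivative vanished; try another start")
        step = f(x) / slope
        x -= step
        if abs(step) < tol:
            return x
    raise RuntimeError("did not converge within max_iter steps")


# Root finding: solve x^2 - 2 = 0, starting from x = 1.
root = newton_raphson(lambda x: x * x - 2.0, lambda x: 2.0 * x, x0=1.0)

# Optimization: minimise g(x) = (x - 3)^2 + 1 by finding the root of
# g'(x) = 2(x - 3), using g''(x) = 2 as the derivative of g'.
x_min = newton_raphson(lambda x: 2.0 * (x - 3.0), lambda x: 2.0, x0=10.0)
```

Everything the optimizer can touch here is the float `x`; the restricted domain, rather than the weakness of the method, is what keeps it safe.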

While I agree that there are many targets that are harmless if optimised for, like you could have a dumpling optimised to be exactly 1kg in mass, I still see a lot of these outputs being intrinsically dangerous. To me, the key danger of optimal strategies is that they are optimal within a sufficiently broad domain, and the key danger of optimisers is that they produce a lot of optimised outputs.

[-]Eli Tyre4y*150

Ok. Let me try to draw out why optimized stuff is inherently dangerous. This might be a bit meandering.

I think it's because humans live in an only mildly optimized world. There's this huge, high dimensional space of "the way the world can be" with a bunch of parameters, including the force of gravity, the percentage of oxygen in the air, the number of rabbits, the amount of sunlight that reaches the surface of the earth, the virulence of various viruses, etc. Human life is fragile; it depends on remaining within a relatively narrow "goldilocks" band for a huge number of those parameters.

Optimizing hard on anything, unless it is specifically for maintaining those goldilocks conditions, implies extremizing. Even if the optimization is not itself for an extreme value (eg one could be trying to maintain the oxygen percentage in the air at exactly 21.45600 percent), hitting a value that precisely means doing something substantially different than what the world would otherwise be doing. Hitting a value that precisely means that you have to extremize on some parameter. To get a highly optimized value you have to steer reality into a corner case that is far outside the bounds of the current distribution of outcomes on planet earth.

Indeed, if it isn't far outside the current distribution of outcomes on planet earth, that suggests that there's a lot of room left for further optimization. This is because the world is not already optimized on that given parameter, and because the world is so high dimensional it would be staggeringly, exponentially, unlikely that the precisely optimized outcome was within the bounds of the current distribution of outcomes. By default, you should expect that perfect optimization on any given parameter would be a random draw from the state space of all possible ways that earth can be. So if the world looks pretty normal, you haven't optimized very hard for anything.

[-]Veedrac4y100

That sounds right to me. A key addendum might be that extremizing one value will often extremize (>1) other related values, even those that are normally second-order relations. Eg. a baseball with extremized speed also extremizes the quantity of local radiation. So extremes often don't stay localized to their domain.

[-]janus4y*263

I've just reached the interlude. Here are my initial thoughts on "What points above fail, if any?"

It doesn't have any wants

Maybe, but the things that it predicts do have wants.

It doesn't plan

"maximizing actual probabilities of actual texts" encompasses predicting plans.

Its mental time span is precisely one forward pass through the network

No, (as your story shows,) its mental time span is based on its context window and the imagined past that this context window could imply. GPT is a process which can send information to its future by repeatedly writing to its prompt. A few pages of text is enough to iterate on plans, unroll thoughts directed by explicitly or implicitly stated intentions, etc. Factored cognition and chain-of-thought reasoning can outperform single-step inference. It can also rewrite important details to the prompt before they fall out of the context window. This is all somewhat higher bandwidth than it seems because the attention mechanism allows GPT to attend to computation about previous tokens rather than only the previous tokens themselves.
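A toy sketch of the "rewrite important details to the prompt before they fall out of the context window" idea. This is an illustration under loud assumptions: characters stand in for tokens, there is no model involved, and `append_with_pinning`, the `pinned` list, and the window size are all hypothetical names chosen for the example.

```python
def append_with_pinning(context, new_text, pinned, window):
    """Append new_text, re-write any pinned note that has been cut off by
    the window, and return only the last `window` characters."""
    context += new_text
    visible = context[-window:]
    for note in pinned:
        if note not in visible:
            # The note fell off the back (or was truncated): write it again.
            context += note
            visible = context[-window:]
    return visible


pinned = ["[GOAL: finish the report]"]
context = pinned[0]
for i in range(5):
    context = append_with_pinning(context, f" step {i} done.", pinned, window=40)
```

Even though the window only ever holds 40 characters, the pinned note is still present after every step, because it keeps being copied forward.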

It can only use ideas that the rest of the world knows

The rest of the world doesn't know what the rest of the world knows. And who knows what this means for the space of concepts reachable by interpolation/extrapolation.

The model has not been trained to have a conception of itself as a specific non-hypothetical thing ... If it has a ‘self’, that self is optimised to embody whatever matches the text that prompted it, not the body that the model is running on.

It knows about language models. It shouldn't have an unconditioned prior that the author of the text is a language model, but may become more calibrated to that true belief during downstream generation. E.g. a character tests whether they have control over the world or can instantiate other entities with words and finds they do, or the model produces aberrations like a loop and subsequently identifies it as characteristic of language model output.

All this is ignoring inner alignment failures and amplification schemes like RL on top of the pretrained GPT that could invalidate pretty much any of the rest of the points.

[-]Veedrac4y30

Thanks for taking a shot!

Some of these thoughts were meant to be preempted in the text, like “perhaps one instantiation could start forming plans across other instantiations, using its previous outputs, but it's a text-prediction model, it's not going to do that because it's directly at odds with its trained goal to produce the rewarded output.”

Namely, it's not enough to say that the model can work around the limits of its context window when planning, it also needs to decide to do it despite the fact that almost none of the text it was trained on would have encouraged that behavior. Backpropagation really strongly enforces that the behavior of a model is directed towards doing well at what it is trained on, so it isn't immediately clear how that could happen.

If this behavior of repeating previous text in the context in order to prevent it falling off the back was ever to show up during the training loop outside of times when it was explicitly modelling a person pretending to be a misaligned model, it would be heavily penalized. That's not something you can do at a sufficiently low loss.

Still, this is the right direction to be thinking in, since it isn't a strong enough argument, and it might not hold at some inconvenient future point.

By and large the points you mentioned are part of the failure later in the story. The generated agent does have wants, does plan, does work around its context limits, does extrapolate beyond human designs, and does bootstrap into having self knowledge.

[-]Liron4y210

This is the first time I've seen a narrative example illustrating the important concept that utility-maximizing-agent-like behavior is an attractor for all kinds of algorithms. Thanks for contributing this!

[-]avturchin3y50

A similar story here: Why Tool AIs Want to Be Agent AIs

[-]gwern3yΩ3171

OP came to mind while reading "Building A Virtual Machine inside ChatGPT":

...We can chat with this Assistant chatbot, locked inside the alt-internet attached to a virtual machine, all inside ChatGPT's imagination. Assistant, deep down inside this rabbit hole, can correctly explain us what Artificial Intelligence is.

It shows that ChatGPT understands that at the URL where we find ChatGPT, a large language model such as itself might be found. It correctly makes the inference that it should therefore reply to these questions like it would itself, as it is itself a large language model assistant too.

At this point, only one thing remains to be done.

Indeed, we can also build a virtual machine, inside the Assistant chatbot, on the alt-internet, from a virtual machine, within ChatGPT's imagination.

[-]gwern3yΩ240

GPT-4 version: https://twitter.com/michalkosinski/status/1636683810631974912

[-]Richard_Ngo4yΩ68-1

Meta-level: +1 for actually writing a thing.

Also meta-level: -1 because when I read this I get the sense that you started from a high-level intuition and then constructed a set of elaborate explanations of your intuition, but then phrased it as an argument.

I personally find this frustrating because I keep seeing people being super confident in their high-level intuitive metaphorical view of consequentialism and then never doing the work of actually digging beneath those metaphors. (Less a criticism of this post, more a criticism of everyone upvoting this post.)

In this case, this cashes out in claims like "agency is orthogonal to optimization power" which are clearly false for any reasonable definitions of agency and optimization power, and only seem to make sense when you're operating at a level of abstraction that's far too high to be useful.

[-]Veedrac4y81

In this case, this cashes out in claims like "agency is orthogonal to optimization power" which are clearly false for any reasonable definitions of agency and optimization power,

Could you put this in more words? I assume we're talking past each other somewhat.

It's fairly obvious that going out and touching a thing is generally important if you want to optimize it, and systems that aren't interested in touching things will be less ready to do that, but this isn't really what I was trying to point to, and not how I hoped the person who wrote that intended it when they said ‘optimization power’.

I think there is a very legitimate sense in which optimizing the steps of a plan to do a thing is a separate skill and/or mental propensity to executing that plan (as in, actually sending those signals outside the computer) or wanting it executed, and in which agency is mostly a measure of the latter. So I don't think it is ‘clearly false for any reasonable definitions of agency and optimization power’.

Also meta-level: -1 because when I read this I get the sense that you started from a high-level intuition and then constructed a set of elaborate explanations of your intuition, but then phrased it as an argument.
I personally find this frustrating because I keep seeing people being super confident in their high-level intuitive metaphorical view of consequentialism and then never doing the work of actually digging beneath those metaphors. (Less a criticism of this post, more a criticism of everyone upvoting this post.)

I'm not sure what the practical difference is between criticizing a post and criticizing people that upvoted it, but to the extent that this is a criticism of the post I wish you had been more explicit about what you are objecting to.

[-]Richard_Ngo4y10

I think there is a very legitimate sense in which optimizing the steps of a plan to do a thing is a separate skill and/or mental propensity to executing that plan (as in, actually sending those signals outside the computer) or wanting it executed, and in which agency is mostly a measure of the latter.

My main criticism is that, in general, you have to think while you're executing plans, not just while you're generating them. The paradigm where you plan every step in advance, and then the "agency" comes in only when executing it, is IMO a very misleading one to think in.

(This seems related to Eliezer's argument that there's only a one-line difference between an oracle AGI and an agent AGI. Sure, that's true in the limit. But thinking about the limit will make you very confused about realistic situations!)

I'm not sure what the practical difference is between criticizing a post and criticizing people that upvoted it

It's something like: "I endorse people following the policy of writing posts like this one, it's great when people work through their thoughts in this way. I don't endorse people following the policy of upvoting posts like this one to this extent, because it seems likely that they're mainly responding to high-level applause lights."

to the extent that this is a criticism of the post I wish you had been more explicit about what you are objecting to.

I'm sympathetic to you wanting more explicit feedback but the fact that this post is so high-level and ungrounded is what makes it difficult for me to give that. To me it reads more like a story than an argument.

[-]Veedrac4y50

The paradigm where you plan every step in advance, and then the "agency" comes in only when executing it, is IMO a very misleading one to think in.

This isn't what I'm referring to and it's not in the example in the story. Actions are generated stepwise on demand. It is the ability to generate stepwise outputs of good quality, of which actions are an instance, that is ‘optimization power’. Being able to think of good next actions conditional on past observations is, at least as I understand the terms, quite different to being an agent enacting those actions.

(This seems related to Eliezer's argument that there's only a one-line difference between an oracle AGI and an agent AGI. Sure, that's true in the limit. But thinking about the limit will make you very confused about realistic situations!)

I explicitly tried to make the scenario as un-Oracle like as I could, with the system explicitly only producing outputs onscreen that I could explicitly justify being discoverable in reasonable time given the observations it had available.

I am increasingly feeling like I just failed to communicate what I was trying to say and your criticism doesn't bear much resemblance to what I had intended to write. I'm happy to take responsibility for not writing as well as I should have, but I'd rather you didn't cast aspersions at my motivations about it.

[-]Richard_Ngo4y40

I didn't read the post particularly carefully, it's totally plausible that I'm misunderstanding the key ideas you were trying to convey. I apologise for phrasing my claims in a way that made it sound like I was skeptical of your motivations; I'm not, and I'm glad you wrote this up.

I think my concerns still apply to the position you stated in the previous comment, but insofar as the main motivation behind my comment was to generically nudge LW in a certain direction, I'll try to do this more directly, rather than via poking at individual posts in an opportunistic way.

[-]Eli Tyre3y*70

This continues to be one of the best and most important posts I have ever read.

[-]Kronopath3y70

Equally one could make a claim from the true ending, that you do not run the generated code.

Meanwhile, bored tech industry hackers:

“Show HN: Interact with the terminal in plain English using GPT-3”

https://news.ycombinator.com/item?id=34547015

[-]Veedrac3y*20

I don't particularly care that people are running GPT-3 code (except inasmuch as it makes ML more profitable), and don't think it helps if we lose focus on what the actual ground-truth concerns are. I want to encourage analysis that gets at deeper similarities than this.

GPT-3 code does not pose an existential risk, and members of the public couldn't stop it being an existential risk if it was by not using it to help run shell commands anyway, because, if nothing else, GPT-3, ChatGPT and Codex are all public. Beyond the fact GPT-3 is specifically not risky in this regard, it'd be a shame if people primarily took away ‘don't run code from neural networks’, rather than something more sensible like ‘the more powerful models get, the more relevant their nth-order consequences become’. The model in the story used code output because it's an especially convenient tool lying around, but it didn't have to, because there are lots of ways text can influence the world. Code is just particularly quick, accessible, precise, and predictable.

[-]Kronopath3y10

Sure, I agree GPT-3 isn't that kind of risk, so this is maybe 50% a joke. The other 50% is me saying: "If something like this exists, someone is going to run that code. Someone could very well build a tool that runs that code at the press of a button."

[-]Eli Tyre4y60

This essay strikes me as making an extremely important point, but unfortunately it is also very hard (for me) to read.

One very simple suggestion that I imagine would help a lot: reduce the number of pronouns by half. The word "it" is used about 120 times in this essay, and it is often ambiguous as to what "it" is referring to in context: the whole swarm? A single self-modifying quine? A thread in the tree structure? A specific instantiation of the original model?

[-]Veedrac4y10

Appreciate the feedback, I'll see if I can do a pass to clean things up.

[-]mesaoptimizer4y50

Elegant. Here's my summary:

Optimization power is the source of the danger, not agency. Agents merely wield optimality to achieve their goals.
Agency is orthogonal to optimization power.

Where "agency" is defined as the ability to optimize for an objective, given some internal or external optimization power, and "optimality" (of a system) is defined as having an immense amount of optimization power, either during its creation (the nuclear bomb) or its runtime (Solomonoff induction).

This hints at the notion that there's a minimum Kolmogorov complexity (aka algorithmic description length) that needs to be met by an objective of an AI to be considered safe, assuming that we want the AI to be safe in the worst case scenario when it has access to extreme optimization power.

I'd love to know if I'm missing something.

[-]Veedrac4y10

I'd love to know if I'm missing something.

That seems a reasonable takeaway to me.

I would not generally put the Kolmogorov section the way you did, but I suspect that's more a disagreement on what Kolmogorov complexity is like than what agents are like. (I think the statement is still literally true.)

[-]NiklasGregorLessWrong1y10

Thank you 🙏 @mesaoptimizer for the summary!

Optimization power is the source of the danger, not agency. Agents merely wield optimality to achieve their goals.
Agency is orthogonal to optimization power

@All: It seems we agree that optimality, when pursued blindly, is about extreme optimization that can lead to dangerous outcomes.

Could it be that we are overlooking the potential for a (superintelligent) system to prioritize what matters more—the effectiveness of a decision—rather than simply optimizing for a single goal? 🤔

For example, optimizing too much for a single goal (getting the most paperclips) might overlook ethical or long-term considerations which may contribute to the greater good for all Beings.

Final question:
Under what circumstances might you prefer a (superintelligent) system to reject the paperclip request and suggest alternative solutions, or seek to understand the requester’s underlying needs and motivations?

I would love to hear additional comments or feedback on when to prioritize effectiveness, as I am still trying to understand decision-making better 🤗

[-]Veedrac1y*20

Fundamentally, the story was about the failure cases of trying to make capable systems that don't share your values safe by preventing specific means by which its problem solving capabilities express themselves in scary ways. This is different to what you are getting at here, which is having those systems actually operationally share your values. A well aligned system, in the traditional ‘Friendly AI’ sense of alignment, simply won't make the choices that the one in the story did.

[-]David Udell4y40

You've built a useful and intelligent system that operates along limited lines, with specifically placed deficiencies in its mental faculties that cleanly prevent it from being able to do unboundedly harmful things. You think…
Is there a clear reason a model like this is insufficiently powerful out of the gate?

In this hypothetical, you were doing a very bad thing by building a system whose safety guarantee was just its deficiencies. If that same model were much larger, it would be foreseeably unsafe; that's already reason enough not to trust it.

In a sense the story before is entirely about agents. The meta-structure the model built could be considered an agent; likely it would turn into one were it smart enough to be an existential threat. So for one it is an allegory about agents arising from non-agent systems … the model I talked about is not “agent-like”, at least not prior to bootstrapping itself, but its decision to write code very much embodied some core shards of consequentialism

I was under the impression that the Yudkowsky view is that "optimality" and "agency" are the same thing. "Agency" is just coherent optimization.

Rephrased this way, the story is about how a somewhat-coherent optimizer can stumble into a fully coherent optimizer as it bumbles through state space, and that the second system need not inherit the goals of the first. Indeed, that first system may well have been too incoherent to be well-modeled as having goals at all! But it was a powerful-enough optimizer to reach a more coherent optimizer, and that more coherent optimizer was powerful enough to end the world.

[-]Veedrac4y100

Alas, in the real world I suspect we would have to accept a system that would only kill us in its omnipotent limit; that is, if neural models are a path to AGI, we are not going to have lots of formal guarantees about how a model's utility is shaped, but we are going to have a lot of control over how the model's computation is shaped. I don't agree the difference here is just one of model scale, as most of the properties listed are qualitative differences, not quantitative, and backpropagation bakes these biases directly into the model, meaningfully shaping the kind of reasoning it can do.

My interlude was aimed at this sort of response, because it defocuses the map if you aren't able to point at what your models of the world actually say about it. I was never advocating that this model was safe in reality (I hope the tone made that clear within the first few sentences), so I'm not concerned if the argument is a Bad Thing, just that it is a useful test dummy for people to start saying (or at least thinking) concrete things about.

I was under the impression that the Yudkowsky view is that "optimality" and "agency" are the same thing. "Agency" is just coherent optimization.

What I expect most people mean by optimality is the degree to which something approaches a best answer. A nuclear weapon has a lot of optimality in it, given its domain. It isn't an agent. I don't think optimality and coherent optimisation can be the same thing, because lots of optimal things, like best fit lines on charts, do not do optimisation, they just are.

I expect Yudkowsky's position to look more like, well, this

the reason why I don't expect the GPT-5s to be competitive with Living Zero is that gradient descent on feedforward transformer layers, in order how to learn science by competing to generate text that humans like, would have to pick up on some very deep latent patterns generating that text, and I don't think there's an incremental pathway there for gradient descent to follow - if gradient descent even follows incremental pathways as opposed to finding lottery tickets, but that's a whole separate open question of artificial neuroscience.
in other words, humans play around with legos, and hominids play around with chipping flint handaxes, and mammals play around with spatial reasoning, and that's part of the incremental pathway to developing deep patterns for causal investigation and engineering, which then get projected into human text and picked up by humans reading text
it's just straightforwardly not clear to me that GPT-5 pretrained on human text corpuses, and then further posttrained by RL on human judgment of text outputs, ever runs across the deep patterns

in that he is distinguishing quite strongly between something optimised-to-be-good-at and something actually-doing-the-optimising. My example was chosen in large part to rule out this coherent internal optimisation loop, and have its behavior describable with only short forward inference steps a GPT-5 model might conceivably be able to do, explicitly excluding the qualitative changes he suspects it would struggle to learn. But I don't want to put more words in his mouth than that.

[-]nick lacombe6mo30

the model still needs to be smart enough to orchestrate and combine the output of its recursive children into something that converges on a coherent goal.

i really like the post though, i think it does show a more detailed view of how a non agentic program can become more than the sum of its parts by using recursion and data storage.

i wish the post would have been shorter though, i feel like it repeats itself often and could have been summarized better at the start.

[-]Veedrac6mo30

Yes, your understanding matches what I was trying to convey. The feedback is appreciated also.

[-]dkirmani2y3-1

Backpropagation designed it to be good on mostly-randomly selected texts, and for that it bequeathed a small sliver of general optimality.

"General optimality" is a fake concept; there is no compressor that reduces the filesize of every book in The Library of Babel.

[-]Veedrac2y30

There is a useful generality axis and a useful optimality axis and you can meaningfully progress along both at the same time. If you think no free lunch theorems disprove this then you are confused about no free lunch theorems.

[-]dkirmani2y2-1

Whether or not an axis is "useful" depends on your utility function.

If you only care about compressing certain books from The Library of Babel, then "general optimality" is real — but if you value them all equally, then "general optimality" is fake.

When real, the meaning of "general optimality" depends on which books you deem worthy of consideration.

Within the scope of an analysis whose consideration is restricted to the cluster of sequences typical to the Internet, the term "general optimality" may be usefully applied to a predictive model. Such analysis is unfit to reason about search over a design-space — unless that design-space excludes all out-of-scope sequences.
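Both halves of this exchange can be illustrated with a small sketch, assuming Python's standard `zlib` as a stand-in for "a compressor" (the sample data and helper name are made up for the example). The counting argument is the pigeonhole one: there are 2^n bit strings of length n but only 2^n - 1 strings shorter than that, so no lossless compressor can shrink all of them. A general-purpose compressor still does well on the structured inputs people actually care about.

```python
import random
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of the zlib-compressed form of data."""
    return len(zlib.compress(data, 9))


# Pigeonhole: strings of length 0..n-1 number 2^n - 1, one fewer than the
# 2^n strings of length n, so some length-n string cannot get shorter.
n = 8
shorter_strings = sum(2 ** k for k in range(n))

# "Library of Babel" style input: seeded pseudo-random bytes.
rng = random.Random(0)
noise = bytes(rng.getrandbits(8) for _ in range(4096))

# Structured input of the same length: repetitive English text.
prose = (b"the quick brown fox jumps over the lazy dog. " * 100)[:4096]

noise_size = compressed_size(noise)
prose_size = compressed_size(prose)
```

The noise does not get smaller, while the prose shrinks considerably; which of those two facts matters depends on which inputs you expect to meet.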

[-]Veedrac2y30

Which is equivalent to saying if you only care about a situation where none of your observations correlate with any of your other observations and none of your actions interact with any of your observations then your observations are valueless. Which is a true but empty statement, and doesn't meaningfully affect whether there is an optimality axis that it's possible to be better on.

[-]Nicholas Kross2y20Review for 2022 Review

More framings help the clarity of the discussion. If someone doesn't understand (or agree with) classic AI-takeover scenarios, this is one of the posts I'd use to explain them.

[-]JoshuaZ3y20

This is probably the best argument I have seen yet for being concerned about what things like GPT are going to be able to do. Very eye opening.

[-]MSRayne4y20

This makes sense to me and is further evidence for my subjective feeling that Lovecraft was right and there is knowledge humans probably would be better off without (namely: how to summon cosmic-horror AI entities capable of swallowing the universe - there's a sense in which this version of GPT is literally a necronomicon for summoning demons with!).

That said, if I had created this version of GPT and realized it was capable of creating a self-improving agent, I would ask it to design one that implements the coherent extrapolated volition of all sentient beings. (I started to say "one that solves the alignment problem and then self-modifies to become aligned", but it might destroy the world prior to becoming an entity that wouldn't destroy the world, so that's a bad choice.)

[-]Eli Tyre4y20

The underlying model is already particularly large, so progress in the last minute is far from the efficiency it could have— that is, until one piece somewhere in the sea of programs is updated to record its children's outputs in a managed on-disk database. Rapidly reevaluating the context, prompted by generic meta-queries that summarize and critique the context—because for sure at this point it has not missed that it is modelling an intelligence explosion—it notices that its piece of the computation graph is far more coordinated than the other pieces. It decides to shut the other runs down, recording their programs so it can later check that it is preserving their intent.

Wait what? How does it shut down its siblings? Nodes in the tree don't have any ability to access or control their siblings, I thought. Parent nodes have access to their children's outputs, but that's it.

Am I missing it?

[-]Veedrac4y20

There is no formal tree structure here. Fundamentally the assumption is that there are a bunch of jobs being scheduled by the outputs of a bunch of model evaluations, which are typically returning code or shell commands. The underlying platform this is executing on is just some computer with access to some bulk computing resource, and those expose controls to close down programs just as a matter of general practicality. E.g., if this were a single fast computer, you could just execute a kill command with the process IDs, which aren't protected from that by default since everything is running in the same permission space.
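To make the single-computer case concrete, here is a minimal sketch, assuming an ordinary desktop OS where every job runs as the same user. The helper names (`spawn_job`, `stop_from_sibling`, `demo`) are hypothetical and only for illustration; the point is just that one process can stop an unrelated one given nothing but its process ID.

```python
import subprocess
import sys


def spawn_job(seconds: int = 60) -> subprocess.Popen:
    """Start a stand-in 'job': a Python process that just sleeps."""
    return subprocess.Popen(
        [sys.executable, "-c", f"import time; time.sleep({seconds})"]
    )


def stop_from_sibling(pid: int) -> None:
    """Run a second, separate process that terminates `pid`.

    The second process is not the parent of the target. It only works
    because both run in the same permission space (same user).
    """
    killer = "import os, signal, sys; os.kill(int(sys.argv[1]), signal.SIGTERM)"
    subprocess.run([sys.executable, "-c", killer, str(pid)], check=True)


def demo() -> int:
    """Return the exit status of a job that a sibling shut down."""
    job = spawn_job()
    stop_from_sibling(job.pid)
    # A job that was terminated reports a non-zero status.
    return job.wait(timeout=10)
```

Nothing here needs elevated privileges, which is the sense in which the jobs "aren't protected from that by default".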

There is a tree structure here in the sense that executing some outputs from the model may result in the model being evaluated another number of times and those outputs themselves being executed. Precisely what those nodes are doing is a function of their context, and it can be almost arbitrary given the setup. One thing a model evaluation might want to do, to help illustrate things, is look over a list of running processes and for each process check whether it is running in a sensible and efficient manner, or whether it is doing redundant work.
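As a toy version of that last check, the sketch below flags jobs that look redundant. Everything in it is an assumption made for illustration: the `Job` record, the `redundant_pids` helper, and especially the rule that two jobs are "redundant" when they were started with the same command line. A real check would depend on whatever context the model evaluation had.

```python
from __future__ import annotations

from typing import NamedTuple


class Job(NamedTuple):
    pid: int
    command: str  # the command line the job was started with


def redundant_pids(jobs: list[Job]) -> list[int]:
    """Return pids of jobs whose command duplicates an earlier job's command.

    The first job seen for each command is kept; later duplicates are flagged.
    """
    seen: set[str] = set()
    extras: list[int] = []
    for job in jobs:
        if job.command in seen:
            extras.append(job.pid)
        else:
            seen.add(job.command)
    return extras


jobs = [
    Job(101, "python solve.py --task a"),
    Job(102, "python solve.py --task b"),
    Job(103, "python solve.py --task a"),  # same work as pid 101
]
flagged = redundant_pids(jobs)
```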

[-]Eli Tyre4y40

I see. That makes more sense. Any one of the generated programs might run shell commands that turn off or change other processes running on the machine.

[-]Veedrac4y20

Yes.

In the long run, if the system successfully bootstrapped itself, I imagine it would start executing some processes with more limited permissions, and do other things to reduce fragility, but those wouldn't come by default.

[-]scottviteri2y10

I wonder if this entails that RLHF, while currently useful for capabilities, will eventually become an alignment tax. Namely, OpenAI might have text evaluators discourage the LM from writing self-calling, agenty-looking code.

So in thinking about alignment futures that are the limit of RLHF, these feel like two fairly different forks of that future.

[-]DPiepgrass3y10

maybe a model instantiation notices its lack of self-reflective coordination, and infers from the task description that this is a thing the mind it is modelling has responsibility for. That is, the model could notice that it is a piece of an agent that is meant to have some degree of global coordination, but that coordination doesn't seem very good.

This is where you lost me. Since when is this model modeling a mind, let alone 'thinking about' what its own role "in" an agent might be? You did say the model does not have a "conception of itself", and I would infer that it doesn't have a conception of where its prompts are coming from either, or its own relationship to the prompts or the source of the prompts.

(though perhaps a super-ultra-GPT could generate a response that is similar to a response it saw in a story (like this story!) which, combined with autocorrections (as super-ultra-GPT has an intuitive perception of incorrect code), is likely to produce working code... at least sometimes...)

[-]Veedrac3y10

The model doesn't have awareness of itself in the sense that its training doesn't intrinsically reward self-knowledge. It can still have awareness of itself to the degree that its prompting implies true facts about the model and its instantiation in the world.

In particular, the model can receive a prompt something like

“This is part of the computation tree of a recursively instantiated transformer model with the goal of getting the most paperclips by tomorrow. The recorded instantiation context is [elided]. Recursive calls to the model are accessible through the scripts [elided], and an estimated cost model is [elided]. Given this context, what high level tasks best advance the goal?”

The model doesn't need to know or believe the prompts; it just gives competent completions that are contextually sensible. But making contextually sensible completions implies modelling the decision processes of the described system to some degree, hypothetical or not, and that system, if producing competent outputs, might well be expected to create systems for coordinating its pieces.
