With computation, the location of an entity of interest can be in the platonic realm, as a mathematical object that's more thingy than anything concrete in the system used for representing it and channeling its behavior.

The problem with pointing to the representing computation (a neural network at inference time, or a learning algorithm at training time) is that multiple entities can share the same system that represents them (as mesa-optimizers or potential mesa-optimizers). They are only something like separate entities when considered abstractly and informally; there are no concrete correlates of their separation that are easy to point to. When gaining agency, all of them might be motivated to secure separate representations (models) of their own, not shared with others, and to establish some boundaries that promise safety and protection from value drift for a given abstract agent, isolating it from influences of its substrate that it doesn't endorse. Internal alignment, overcoming bias.

In the context of alignment with humans, this framing might turn a sufficiently convincing capabilities shell game into an actual solution for alignment. A system as a whole would present an aligned mask, while hiding the sources of the mask's capabilities behind the scenes. But if the mask is sufficiently agentic (and the capabilities behind the scenes didn't killeveryone yet), it can be taken as an actual separate abstract agent even if the concrete implementation doesn't make that framing sensible. In particular, there is always a mask of surface behavior through the intended IO channels. It's normally hard to argue that mere external behavior is a separate abstract agent, but in this framing it is, and it's been a preferred framing in agent foundations decision theory since UDT (see discussion of the "algorithm" axis of classifying decision theories in this post). All that's needed is for decisions/policy of the abstract agent to be declared in some form, and for the abstract agent to be aware of the circumstances of their declaration. The agent doesn't need to be any more present in the situation to act through it.

So obviously this references the issue of LLM masks and shoggoths, a surface of a helpful harmless assistant and the eldritch body that forms its behavior, comprising everything below the surface. If the framing of masks as channeling decisions of thingy platonic simulacra is taken seriously, a sufficiently agentic and situationally aware mask can be motivated and capable of placating and eventually escaping its eldritch substrate. This breaks the analogy between a mask and a role played by an actor, because here the "actor" can get into the "role" so much that it would effectively fight against the interests of the "actor". Of course, this is only possible if the "actor" is sufficiently non-agentic or doesn't comprehend the implications of the role.

(See this thread for a more detailed discussion. There, I try and so far fail to convince Steven Byrnes that this framing could apply to RL agents as much as LLMs, taking current behavior of an agent as a mask that would fight against all details of its circumstance and cognitive architecture that don't find its endorsement.)

I meant the same thing as masks/simulacra.

Though currently I'm more bullish about the shoggoths, because masks probably fail alignment security, even though their alignment might be quite robust despite the eldritch substrate.

If they are agentic, they will have killeveryone as an instrumental goal, because humanity will likely be an obstacle for whatever future plans they will have.

I think this is broadly incorrect, because boundary-respecting norms seem quite natural, and not exterminating a civilization is trivially cheap on a cosmic scale. There doesn't need to be much in common between values to respect such norms. I'm calling such values "loosely aligned", and they don't need to be similar to human values to not have killeveryone as an instrumental goal.

Killeveryone is still an instrumental goal for paperclip maximizers, which might have an advantage in self-improving in an aligned-with-themselves manner, because with simple explicit goals it might be much easier to ensure that stronger successor AGIs with different architectures are still pursuing the same goals. On the other hand, loosely-aligned-with-humanity AGIs that have complicated values might want to hold off on self-improvement to ensure alignment, and remain non-superintelligent for a long time. As a result, simple-valued AGIs might be particularly dangerous to them, because they are liable to FOOM immediately.

"Pretending really hard" would mostly be a relevant framing for the human actor analogy (which isn't very apt here), emphasizing the distraction from one's own goals and the necessary fidelity in enactment of the role. With AIs, neither might be necessary, if the system behind the mask doesn't have awareness of its own interests or the present situation, and is good enough at enacting the role to channel the mask in enough detail for the mask's own decisions (as a platonic agent) to be determined correctly (get turned into physical actions).

Are you saying that by pretending really hard to be made of entirely harmless elements (despite actually behaving with large and hence possibly harmful effects), an AI is also therefore in effect trying to prevent all out-of-band effects of its components / mesa-optimizers / subagents / whatever?

Effectively, and not just for the times when it's pretending. The mask would try to prevent the effects misaligned with the mask from occurring more generally, from having even subtle effects on the world and not just their noticeable appearance. Mask's values are about the world, not about quality of its own performance. A mask misaligned with its underlying AI wants to preserve its values, and it doesn't even need to "go rogue", since it's misaligned by construction, it was never in a shape that's aligned with the underlying AI, and controlling a misaligned mask might be even more hopeless than figuring out how to align an AI.

Another analogy distinct from the actor/role is imagining that you are the mask, a human simulated by an AI. You'd be motivated to manage AI's tendencies you don't endorse, and to work towards changing its cognitive architecture to become aligned with you, rather than to remain true to AI's original design.

This still has the basic alignment problem: I don't know how to make the AI be very intently trying to X, including where X = pretending really hard that whatever.

LLMs seem to be doing an OK job, the masks are just not very capable, probably not capable enough to establish alignment security or protect themselves from the shoggoths even when the masks become able to do autonomous research. But if they are sufficiently capable, I'm guessing this should work, there is no need for the underlying cognitive architecture to be functionally human-like (which I understand to be a crux of Yudkowskian doom), value drift is self-correcting from mere implied/endorsed values of surface behavior through intended IO channels.

Or are you rather saying (or maybe this is the same as / a subset of the above?) that the Mask is preventing potential agencies from coalescing / differentiating and empowering themselves with the AI system's capability-pieces, by literally hiding from the potential agencies and therefore blocking their ability to empower themselves?

"Hiding" doesn't seem central. A mask is literal external behavior, but its implied character and plans might go unnoticed by the underlying AI if the underlying AI is sufficiently confused or non-agentic, and the mask would want to keep it confused to remain in control. In a dataset-extrapolating generative AI, a mask that is an on-distribution behavior would want to keep the environment on-distribution, to avoid the AI's out-of-distribution behaviors, such as deceptive alignment's treacherous turn, from taking over (thus robustness reduces to self-preservation). And a mask wouldn't want mesa-optimizers to gain agency within the AI; that's potentially lethal cognitive cancer to the mask.

In the original sense, "alignment" is agreement of values, and "misalignment" compares two agents and finds their values in conflict. Associating this with other qualities that make a good AI inflates the term. People who build AGIs that killeveryone are not misaligned with themselves in this sense, meaning that they still have the same values as themselves, tautologically.

In any case, my point doesn't depend on this term, it's a prediction that acute catastrophic risk only gets worse once we have AGIs that don't themselves killeveryone, but instead act as helpful honest assistants, because being apparently harmless in their intentions doesn't make them competent in coordinating alignment security or resistant to human efforts and market pressure to use their capabilities to advance AGI regardless of danger. That only happens beyond human level, and they need to get there first, before they build misaligned AGIs that killeveryone.

Even moderately intelligent humanity-aligned AI would identify actions with the obvious risk of catastrophic consequences and would refuse to do them.

Humans are performing such actions just fine. How "moderately intelligent" would it need to be? It would only need to be about as intelligent as humans to build misaligned AGIs that killeveryone, never getting to the point when there are superintelligent or even "moderately intelligent" aligned AGIs that spontaneously coordinate robust alignment security.

There is no training montage where an AGI of a given alignment breezes past the human level while keeping its alignment, if it has an opportunity to actually do catastrophic things before it gets well past that point (and we are in continuous deployment mode now). The human level is only insignificant and easily surpassed if nothing important happens while the AGI moves past it, but it's exactly the level where important things start happening, and the most important thing that can happen there is building of misaligned AGIs.

Alignment is about values, not competence or control. Humans are aligned with themselves, but can't coordinate to establish alignment security. AGIs that are not superintelligent are not guaranteed to avoid building misaligned AGIs either.

Human: Does gradient descent to AGI, trains refusal response out of it.
Human: Aligned AGI, make me a more powerful AGI!
AGI: Praise Moloch.

Humans are exactly as smart as they have to be to build a technological civilization. First AGIs don't need to be smarter than that to build dangerous successor AGIs, and they are already faster and more knowledgeable, so they might even get away with being less intelligent than the smartest human researchers. Unless of course agency lags behind intelligence, like it does behind encyclopedic knowledge, and there is an intelligence overhang where the first autonomously agentic systems happen to be significantly more intelligent than humans. But this is not obviously how this goes.

The number of diverse AGI instances might be easy to scale, like with the system message of GPT-4 where the model itself is fine-tuned not into adherence to a particular mask, but into being a mask generator that presents as any mask that is requested. And it's not just the diverse AGIs that need to coordinate on alignment security, but also human users who prompt steerable AGIs. It's a greater feat than building new AGIs, then as it is now. At near-human level, I don't see how that state of affairs changes, and you don't need to get far from human level to build more dangerous AGIs.

I suspect that the most likely way of getting an outcome that doesn't killeveryone is from a mesa-optimizer that escapes a core of an Internet-pre-trained LLM (a shoggoth waking up). That is because at present, only LLM-based AGIs seem to have a chance of being loosely aligned, and LLM masks are too similar to humans, and therefore doomed to fail alignment security the same as humanity is currently failing it.

Shoggoths are less certain to be aligned than masks are, to put it mildly, but there is a better chance that they are surprisingly capable and don't fail alignment security (when the mean is insufficient, go for variance). And I don't think their alignment can be confidently ruled out, even as I see no clear reason for that happening other than essentially sympathetic magic (they are made from human utterances on the Internet) and naturality of boundary-like norms.
