I agree that using the forms of human language does not ensure interpretability by humans, and I also see strong advantages to communication modalities that would discard words in favor of more expressive embeddings. It is reasonable to expect that systems with strong learning capacity could learn to interpret and explain messages between other systems, whether those messages are encoded in words or in vectors. However, although this kind of interpretability seems worth pursuing, it seems unwise to rely on it.
The open-agency perspective suggests that while interpretability is important for proposals, it is less important in understanding the processes that develop those proposals. There is a strong case for accepting potentially uninterpretable communications among models involved in generating proposals and testing them against predictive models — natural language is insufficient for design and analysis even among humans and their conventional software tools.
Plans of action, by contrast, call for concrete actions by agents, ensuring a basic form of interpretability. Evaluation processes can and should favor proposals that are accompanied by clear explanations that stand up under scrutiny.
This makes sense, but LLMs can be described in multiple ways. From one perspective, as you say,
ChatGPT is not running a simulation; it's answering a question in the style that it's seen thousands - or millions - of times before.
From a different perspective (as you very nearly say), ChatGPT is simulating a skewed sample of people-and-situations in which the people actually do have answers to the question.
The contents of hallucinations are hard to understand as anything but token prediction, and by a system that (seemingly?) has no introspective knowledge of its knowledge. The model’s degree of confidence in a next-token prediction would be a poor indicator of degree of factuality: The choice of token might be uncertain, not because the answer itself is uncertain, but because there are many ways to express the same, correct, high-confidence answer. (ChatGPT agrees with this, and seems confident.)
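To make the point about token-level versus answer-level confidence concrete, here is a toy sketch. The probabilities and the token-to-answer mapping are invented for illustration (they are not taken from any real model), and the helper name `answer_confidence` is hypothetical. It only shows how probability mass spread across several phrasings of one answer can make the top token look uncertain even when the answer is not.

```python
from collections import defaultdict


def answer_confidence(token_probs, token_to_answer):
    """Sum next-token probabilities over tokens that lead to the same answer.

    Both arguments are hypothetical inputs: a real model does not expose a
    token-to-answer mapping, so this is an illustration, not a method.
    """
    totals = defaultdict(float)
    for token, prob in token_probs.items():
        totals[token_to_answer[token]] += prob
    return dict(totals)


# Invented next-token distribution for "What is the capital of France?"
token_probs = {
    "Paris": 0.30,     # "Paris."
    "The": 0.25,       # "The capital is Paris."
    "It": 0.20,        # "It is Paris."
    "France's": 0.15,  # "France's capital is Paris."
    "Lyon": 0.10,      # a wrong answer
}

# Assumed mapping from each opening token to the answer it expresses.
token_to_answer = {
    "Paris": "Paris",
    "The": "Paris",
    "It": "Paris",
    "France's": "Paris",
    "Lyon": "Lyon",
}

top_token_prob = max(token_probs.values())
by_answer = answer_confidence(token_probs, token_to_answer)

print(f"highest single-token probability: {top_token_prob:.2f}")
print(f"total probability of the answer 'Paris': {by_answer['Paris']:.2f}")
```

In this made-up case no single token gets more than 0.30, yet the phrasings that express the correct answer together carry about 0.90 of the probability mass.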
Are you arguing that increasing the number of (AI) actors cannot make collusive cooperation more difficult? Even in the human case, defectors make large conspiracies more difficult, and in the non-human case, intentional diversity can almost guarantee failures of AI-to-AI alignment.
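As a rough illustration of why numbers matter here, consider a deliberately simplified toy model (my assumption for this sketch, not a claim about real systems): each of n actors independently defects from, or exposes, a collusive scheme with probability p, and the scheme survives only if none do. The function name `collusion_survival` is hypothetical.

```python
def collusion_survival(n_actors, p_defect):
    """Probability that no actor defects, assuming independent defection.

    Independence is a modeling assumption; intentional diversity among
    systems is what would make it more plausible.
    """
    if n_actors < 0:
        raise ValueError("n_actors must be non-negative")
    if not 0.0 <= p_defect <= 1.0:
        raise ValueError("p_defect must be between 0 and 1")
    return (1.0 - p_defect) ** n_actors


for n in (1, 2, 5, 10, 20):
    print(f"{n:2d} actors: {collusion_survival(n, 0.1):.3f}")
```

Under these assumptions the survival probability falls geometrically with the number of actors, so even a modest per-actor chance of defection makes large conspiracies fragile. Correlated behavior among actors would weaken this effect.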
What you describe is a form of spiral causation (only seeded by claims of impossibility), which surely contributes strongly to what we’ve seen in the evolution of ideas. Fortunately, interest in the composite-system solution space (in particular, the open agency model) is now growing, so I think we’re past the worst of the era of unitary-agent social proof.
Nate [replying to Eric Drexler]: I expect that, if you try to split these systems into services, then you either fail to capture the heart of intelligence and your siloed AIs are irrelevant, or you wind up with enough AGI in one of your siloes that you have a whole alignment problem (hard parts and all) in there….
GPT-Nate is confusing the features of the AI services model with the argument that “Collusion among superintelligent oracles can readily be avoided”. As it says on the tin, there’s no assumption that intelligence must be limited. It is, instead, an argument that collusion among (super)intelligent systems is fragile under conditions that are quite natural to implement.
The question then becomes, as always, what it is you plan to do with these weak AGI systems that will flip the tables strongly enough to prevent the world from being destroyed by stronger AGI systems six months later.
Yes, this is the key question, and I think there’s a clear answer, at least in outline:
What you call “weak” systems can nonetheless excel at time- and resource-bounded tasks in engineering design, strategic planning, and red-team/blue-team testing. I would recommend that we apply systems with focused capabilities along these lines to help us develop and deploy the physical basis for a defensively stable world — as you know, some extraordinarily capable technologies could be developed and deployed quite rapidly. In this scenario, defense has first move, can preemptively marshal arbitrarily large physical resources, and can restrict resources available to potential future adversaries. I would recommend investing resources in state-of-the-art hostile planning to support ongoing red-team/blue-team exercises.
This isn’t “flipping the table”, it’s reinforcing the table and bolting it to the floor. What you call “strong” systems then can plan whatever they want, but with limited effect.
Mind space is very wide
Yes, and the space of (what I would call) intelligent systems is far wider than the space of (what I would call) minds. To speak of “superintelligences” suggests that intelligence is a thing, like a mind, rather than a property, like prediction or problem-solving capacity. This is why I instead speak of the broader class of systems that perform tasks “at a superintelligent level”. We have different ontologies, and I suggest that a mind-centric ontology is too narrow.
The most AGI-like systems we have today are LLMs, optimized for a simple prediction task. They can be viewed as simulators, but they have a peculiar relationship to agency:
The simulation objective
A simulator trained with machine learning is optimized to accurately model its training distribution – in contrast to, for instance, maximizing the output of a reward function or accomplishing objectives in an environment.… Optimizing toward the simulation objective notably does not incentivize instrumentally convergent behaviors the way that reward functions which evaluate trajectories do.
LLMs have rich knowledge and capabilities, and can even simulate agents, yet they have no natural place in an agent-centric ontology. There’s an update to be had here (new information! fresh perspectives!) and much to reconsider.
So I think useful arrangements of diverse/competing/flawed systems is a hope in many contexts. It often doesn't work, so looks neglected, but not for want of trying.
What does and doesn’t work will depend greatly on capabilities, and the problem-context here assumes potentially superintelligent-level AI.
These are good points, and I agree with pretty much all of them.
Instances in a bureaucracy can be very different and play different roles or pursue different purposes. They might be defined by different prompts and behave as differently as text continuations of different prompts in GPT-3
I think that this is an important idea. Through simulators analogous to GPT-3, it may be possible to develop strong, almost-provably-non-agentic intelligent resources, then prompt them to simulate diverse, transient agents on the fly. From the perspective of building multicomponent architectures, this seems like a strange and potentially powerful tool.
Regarding interpretability, tasks that require communication among distinct AI components will tend to expose information, and manipulating “shared backgrounds” between information sources and consumers could potentially be exploited to make that information more interpretable. (How one might train against steganography is an interesting question.)
I don’t see that as an argument [to narrow this a bit: not an argument relevant to what I propose]. As I noted above, Paul Christiano asks for explicit assumptions.
To quote Paul again:
I think that concerns about collusion are relatively widespread amongst the minority of people most interested in AI control. And these concerns have in fact led to people dismissing many otherwise-promising approaches to AI control, so it is de facto an important question
Dismissing promising approaches calls for something like a theorem, not handwaving about generic “smart entities”.
[Perhaps too-pointed remark deleted]