Thane Ruthenis

Comments

(Written while I'm at the section titled "Respecting Modularity".)

My own working definition of "corrigibility" has been something like: an AI system is corrigible if it obeys commands and only produces effects through causal pathways white-listed by its human operators, with these properties applying recursively to its interactions with those operators.

In a basic case, if you tell it to do something, like "copy a strawberry" or "raise the global sanity waterline", it's going to give you a step-by-step outline of what it's going to do, how these actions are going to achieve the goal, how the resultant end-state is going to be structured (the strawberry's composition, the resultant social order), and what predictable effects all of this would have (both direct effects and side-effects).

So if it's planning to build some sort of nanofactory that boils the oceans as a side-effect, or to deploy Basilisk hacks that exploit some vulnerability in the human psyche to teach people stuff, it's going to list these pathways, and you'd have the chance to veto them. Then you'd get it to generate plans that work through causal pathways you do approve of, like "normal human-like persuasion that doesn't circumvent the interface of the human mind / doesn't make the abstraction 'the human mind' leak / doesn't violate the boundaries of the human psyche".

It's also going to adhere to this continuously: e. g., if it discovers a new causal pathway and realizes the plan it's currently executing has effects through it, it's going to seek urgent approval from its human operators (while somehow safely halting the plan, using a halting procedure it previously designed with its operators, or something along those lines).

And this should somehow apply recursively. The AI should only interact with the operators through pathways they've approved of. E. g., using only "mundane" human-like ways to convey information; no deploying Basilisk hacks to force-feed them knowledge, no directly rewriting their brains with nanomachines, not even hacking their phones to be able to talk to them while they're outside the office.

(How do we get around the infinite recursion here? I have no idea, besides "hard-code some approved pathways into the initial design".)

And then the relevant set of "causal pathways" probably factors through the multi-level abstract structure of the environment. For any given action, there is some set of consequences that is predictable and goes into the AI's planning. This set is relatively small, and could be understood by a human. Every consequence outside this "small" set is unpredictable, and basically devolves into high-entropy noise; not even an ASI could predict the outcome. (Think this post.) And if we looked at the structure of the predictable-consequences sets across time, we'd find rich symmetries, forming the aforementioned "pathways" through which subsystems/abstractions interact.
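As a minimal sketch of the control-flow shape I have in mind – all the interfaces here (`propose_plan`, `review`, `newly_discovered_pathways`, `safe_halt`) are hypothetical placeholders, and this is an illustration of the loop, not a workable safety mechanism:

```python
from dataclasses import dataclass, field

@dataclass
class Plan:
    steps: list[str]
    # Causal pathways through which the plan is predicted to produce effects,
    # e.g. "ordinary verbal persuasion", "waste heat from the nanofactory".
    pathways: set[str] = field(default_factory=set)

def run_corrigibly(ai, operator, command: str, approved: set[str]) -> None:
    """Execute `command` only through operator-approved causal pathways."""
    plan = ai.propose_plan(command)

    # Visibility + veto: every pathway must be white-listed before acting.
    while not plan.pathways <= approved:
        unapproved = plan.pathways - approved
        vetoed = operator.review(unapproved)      # operator vetoes some pathways
        approved |= unapproved - vetoed           # ...and approves the rest
        plan = ai.propose_plan(command, forbidden=vetoed)

    for step in plan.steps:
        ai.execute(step)
        # Continuous adherence: if execution reveals a new causal pathway,
        # halt via the pre-agreed procedure and go back to the operators.
        new_pathways = ai.newly_discovered_pathways() - approved
        if new_pathways:
            ai.safe_halt()
            run_corrigibly(ai, operator, command, approved)
            return
```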

(I've now read the post.)

This seems to fit pretty well with your definition? Visibility: check, correctability: check. The "side-effects" property only partly fits – by my definition, a corrigible AI is allowed to have all sorts of side-effects, but these side-effects must be known and approved by its human operator – but I think it's gesturing at the same idea. (Real-life tools also have lots of side effects, e. g. vibration and noise pollution from industrial drills – but we try to minimize these side-effects. And inasmuch as we fail, the resultant tools are considered "bad", worse than the versions of these tools without the side-effects.)

That was my interpretation as well.

I think it does look pretty alarming if we imagine that this scales, i. e., if these learned implicit concepts can build on each other. Which they almost definitely can.

The "single-step" case, of the SGD chiseling-in a new pattern which is a simple combination of two patterns explicitly represented in the training data, is indeed unexciting. But once that pattern is there, the SGD can chisel-in another pattern which uses the first implicit pattern as a primitive. Iterate on, and we have a tall tower of implicit patterns building on implicit patterns, none of which are present in the training data, and which can become arbitrarily more sophisticated and arbitrarily more alien than anything in the training set. And we don't even know what they are, so we can't assess their safety, and can't even train them out (because we don't know how to elicit them).

Which, well, yes: we already knew all of this was happening. But I think this paper is very useful in clearly showcasing this.

One interesting result here, I think, is that the LLM is then able to explicitly write down the definition of f(blah), despite the fact that the fine-tuning training set didn't demand anything like this. That ability – to translate the latent representation of f(blah) into humanese – appeared coincidentally, as the result of the SGD chiseling-in some module for merely predicting f(blah).

Which implies some interesting things about how the representations are stored. The LLM actually "understands" what f(blah) is built out of, in a way that's accessible to its externalized monologue. That wasn't obvious to me, at least.
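To make the shape of the result concrete, here's a toy sketch of the kind of setup I mean. The function, data format, and probe below are made up for illustration; the paper's actual experiments differ in the details:

```python
def f(x: int) -> int:
    # The "hidden" function. Its definition never appears in the data.
    return 3 * x + 7

# Fine-tuning set: only f's input/output behaviour, no verbal description.
train_examples = [
    {"prompt": f"f({x}) = ", "completion": str(f(x))}
    for x in range(200)
]

# The interesting result: after fine-tuning on examples like these alone, the
# model can also answer a probe like the one below, articulating a definition
# that was never explicitly present in the training data.
probe = "What is f, in closed form?"
```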

I believe Xi (or choose your CCP representative) would say that the ultimate goal is human flourishing

I'm very much worried that this sort of thinking is a severe case of Typical Mind Fallacy.

I think the main terminal values of the individuals constituting the CCP – and I do mean terminal, not instrumental – are the preservation of their personal status, power, and control, like the values of ~all dictatorships, and of most politicians in general. Ideology is mostly just an aesthetic, a tool for internal and external propaganda/rhetoric, and the backdrop for internal status games.

There probably are some genuine shards of ideology in their minds. But I expect minuscule overlap between their at-face-value ideological messaging, and the future they'd choose to build if given unchecked power.

On the other hand, if viewed purely as an organization/institution, I expect that the CCP doesn't have coherent "values" worth talking about at all. Instead, it is best modeled as a moral-maze-like inertial bureaucracy/committee which is just replaying instinctive patterns of behavior.

I expect the actual "CCP" would be something in-between: it would intermittently act as a collection of power-hungry ideology-biased individuals, and as an inertial institution. I have no idea how this mess would actually generalize "off-distribution", as in, outside the current resource, technology, and power constraints. But I don't expect the result to be pretty.

Mind, something similar holds for the USG too, if perhaps to a lesser extent.

Maybe they develop mind control level convincing argument and send it to key people (president, congress, NORAD, etc) or hack their iPhones and recursively down to security guards of fabs/power plants/data centers/drone factories. That may be quick enough. The point is that it is not obvious.

That's the sort of thing that'd happen, yes. As with all AI takeover scenarios, it likely wouldn't go down like this specifically, but you can be sure that the ASI would achieve the goal it wants to achieve (or was told to achieve, if it's aligned). (And see this post for my model of what this class of concrete scenarios would actually look like.)

Having nukes is not really a good analogy for having an aligned ASI at your disposal, as far as taking over the world is concerned. Unless your terminal value is human extinction, you can't really nuke the world into the state of your personal utopia. You can't even use nukes as leverage to threaten people into building your utopia, because: 

  1. Some people are good enough at decision theory to ignore threats.
  2. Coercing people in this way might not actually be part of your utopia.
  3. Your "power" is brittle. You only have the threat of nuclear armageddon to fall back on, and you can still be defeated by e. g. clever infiltration and sabotage, by having your supply chains taken over, etc. (If you have overwhelming, utterly loyal military power and security in full generality, that's a very different setup.)

None of those constraints apply to having an ASI at your disposal. An ASI would let you implement your values upon the cosmos fully and faithfully, and it'd give you the roadmap to getting there from here.

This is also precisely why Leopold's talk of "checks and balances" as the reason why governments could be trusted with AGI falls apart. "The government" isn't some sort of holistic entity, it's a collection of individuals with their own incentives, sometimes quite monstrous incentives. In the current regime, it's indeed checked-and-balanced to be mostly sort-of (not really) aligned to the public good. But that property is absolutely not robust to you giving unchecked power to any given subsystem in it!

I'm really quite baffled that Leopold doesn't get this, given his otherwise excellent analysis of the "authoritarianism risks" associated with aligned ASIs in the hands of private companies and the CCP. Glad to see @Zvi pointing that out.

We’re assuming natural abstraction basically fails, so those AI systems will have fundamentally alien internal ontologies. For purposes of this overcompressed version of the argument, we’ll assume a very extreme failure of natural abstraction, such that human concepts cannot be faithfully and robustly translated into the system’s internal ontology at all.

For context, I'm familiar with this view from the ELK report. My understanding is that this is part of the "worst-case scenario" for alignment that ARC's agenda is hoping to solve (or, at least, still hoped to solve a ~year ago).

To quote:

The paradigmatic example of an ontology mismatch is a deep change in our understanding of the physical world. For example, you might imagine humans who think about the world in terms of rigid bodies and Newtonian fluids and “complicated stuff we don’t quite understand,” while an AI thinks of the world in terms of atoms and the void. Or we might imagine humans who think in terms of the standard model of physics, while an AI understands reality as vibrations of strings. We think that this kind of deep physical mismatch is a useful mental picture, and it can be a fruitful source of simplified examples, but we don’t think it’s very likely.

We can also imagine a mismatch where AI systems use higher-level abstractions that humans lack, and are able to make predictions about observables without ever thinking about lower-level abstractions that are important to humans. For example we might imagine an AI making long-term predictions based on alien principles about memes and sociology that don’t even reference the preferences or beliefs of individual humans. Of course it is possible to translate those principles into predictions about individual humans, and indeed this AI ought to make good predictions about what individual humans say, but if the underlying ontology is very different we are at risk of learning the human simulator instead of the “real” mapping.

Overall we are by far most worried about deeply “messy” mismatches that can’t be cleanly described as higher- or lower-level abstractions, or even what a human would recognize as “abstractions” at all. We could try to tell abstract stories about what a messy mismatch might look like, or make arguments about why it may be plausible, but it seems easier to illustrate by thinking concretely about existing ML systems.

[It might involve heuristics about how to think that are intimately interwoven with object level beliefs, or dual ways of looking at familiar structures, or reasoning directly about a messy tapestry of correlations in a way that captures important regularities but lacks hierarchical structure. But most of our concern is with models that we just don’t have the language to talk about easily despite usefully reflecting reality. Our broader concern is that optimistic stories about the familiarity of AI cognition may be lacking in imagination. (We also consider those optimistic stories plausible, we just really don’t think we know enough to be confident.)]

So I understand the shape of the argument here.

... But I never got this vibe from Eliezer/MIRI. As I previously argued, I would say that their talk of different internal ontologies and alien thinking is mostly about different cognition. The argument is that AGIs won't have "emotions", or a System 1/System 2 split, or "motivations" the way we understand them – instead, they'd have a bunch of components that fulfill the same functions these components fulfill in humans, but split and recombined in a way that has no analogues in the human mind.

Hence, it would be difficult to make AGI agents "do what we mean" – but not necessarily because there's no compact way to specify "what we mean" in the AGI's ontology, but because we'd have no idea how to specify "do this" in terms of the program flows of the AGI's cognition. Where are the emotions? Where are the goals? Where are the plans? We can identify the concept of "eudaimonia" here, but what the hell is this thought-process doing with it? Making plans about it? Refactoring it? Nothing? Is this even a thought process?

This view doesn't make arguments about the AGI's world-model specifically. It may or may not be the case that any embedded agent navigating our world would necessarily have nodes in its model approximately corresponding to "humans", "diamonds", and "the Golden Gate Bridge". This view is simply cautioning against anthropomorphizing AGIs.

Roughly speaking, imagine that any mind could be split into a world-model and "everything else": the planning module, the mesa-objective, the cached heuristics, et cetera. The MIRI view focuses on claiming that the "everything else" would be implemented in a deeply alien manner.

The MIRI view may be agnostic regarding the Natural Abstraction Hypothesis as well, yes. The world-model might also be deeply alien, and the very idea of splitting an AGI's cognition into a world-model and a planner might itself be an unrealistic artefact of our human thinking.

But even if the NAH is true, the core argument would still go through, in (my model of) the MIRI view.

And I'd say the-MIRI-view-conditioned-on-assuming-the-NAH-is-true would still have p(doom) at 90+%: because it's not optimistic regarding anyone anywhere solving the natural-abstractions problem before the blind-tinkering approach of AGI labs kills everyone.

(I'd say this is an instance of an ontology mismatch between you and the MIRI view, actually. The NAH abstraction is core to your thinking, so you factor the disagreement through that lens. But the MIRI view doesn't think in those precise terms!)

Another angle to consider: in this specific scenario, would realistic agents actually derive natural latents for  and  as a whole, as opposed to deriving two mutually incompatible latents for the  and  components, then working with a probability distribution over those latents?

Intuitively, that's how humans operate if they have two incompatible hypotheses about some system. We don't derive some sort of "weighted-average" ontology for the system, we derive two separate ontologies and then try to distinguish between them.

This post comes to mind:

If you only care about betting odds, then feel free to average together mutually incompatible distributions reflecting mutually exclusive world-models. If you care about planning then you actually have to decide which model is right or else plan carefully for either outcome.

Like, "just blindly derive the natural latent" is clearly not the whole story about how world-models work. Maybe realistic agents have some way of spotting setups structured the way the OP is structured, and then they do something more than just deriving the latent.

Sure, but what I question is whether the OP shows that the type signature wouldn't be enough for realistic scenarios where we have two agents trained on somewhat different datasets. It's not clear that their datasets would be different the same way  and  are different here.

I do see the intuitive angle of "two agents exposed to mostly-similar training sets should be expected to develop the same natural abstractions, which would allow us to translate between the ontologies of different ML models and between ML models and humans", and that this post illustrated how one operationalization of this idea failed.

However if there are multiple different concepts that fit the same natural latent but function very differently 

That's not quite what this post shows, I think? It's not that there are multiple concepts that fit the same natural latent, it's that if we have two distributions that are judged very close by the KL divergence, and we derive the natural latents for them, they may turn out drastically different. The  agent and the  agent legitimately live in very epistemically different worlds!

Which is likely not actually the case for slightly different training sets, or LLMs' training sets vs. humans' life experiences. Those are very close on some metric , and now it seems that  isn't (just) .
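To make that concrete, here's a toy numerical stand-in (my own two-cluster construction, not the post's actual setup), showing how a mixture can be vanishingly close to one distribution in KL terms while a ~50-bit update flips it into an entirely different world:

```python
import numpy as np

def kl_bits(p, q):
    """D_KL(p || q) in bits, over a finite outcome space."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

n = 16
P = np.zeros(2 * n); P[:n] = 1 / n   # world A: uniform over the first cluster
Q = np.zeros(2 * n); Q[n:] = 1 / n   # world B: uniform over the second cluster

eps = 2.0 ** -50
P_mix = (1 - eps) * P + eps * Q      # P with a 2^-50 sliver of Q mixed in

print(kl_bits(P, P_mix))   # ~1e-15 bits: indistinguishable from P by KL
# Yet conditioning P_mix on "the outcome is in the second cluster" – an event
# of probability 2^-50, i.e. a ~50-bit update – recovers Q exactly:
posterior = np.where(np.arange(2 * n) >= n, P_mix, 0.0)
posterior /= posterior.sum()
print(kl_bits(posterior, Q))   # 0.0 bits
```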

Coming from another direction: a 50-bit update can turn  into , or vice-versa. So one thing this example shows is that natural latents, as they’re currently formulated, are not necessarily robust to even relatively small updates, since 50 bits can quite dramatically change a distribution.

Are you sure this is undesired behavior? Intuitively, small updates (relative to the information-content size of the system regarding which we're updating) can drastically change how we're modeling a particular system, into what abstractions we decompose it. E. g., suppose we have two competing theories regarding how to predict the neural activity in the human brain, and a new paper comes out with some clever (but informationally compact) experiment that yields decisive evidence in favour of one of those theories. That's pretty similar to the setup in the post here, no? And reading this paper would lead to significant ontology shifts in the minds of the researchers who read it.

Which brings to mind How Many Bits Of Optimization Can One Bit Of Observation Unlock?, and the counter-example there...

Indeed, now that I'm thinking about it, I'm not sure the quantity  is in any way interesting at all? Consider that the researchers' minds could be updated either from reading the paper and examining the experimental procedure in detail (a "medium" number of bits), or by looking at the raw output data and then doing a replication of the paper (a "large" number of bits), or just by reading the names of the authors and skimming the abstract (a "small" number of bits).

There doesn't seem to be a direct causal connection between the system's size and the amount of bits needed to drastically update on its structure at all? You seem to expect some sort of proportionality between the two, but I think the size of one is straight-up independent of the size of the other if you let the nature of the communication channel between the system and the agent-doing-the-updating vary freely (i. e., if you're uncertain regarding whether it's "direct observation of the system" OR "trust in science" OR "trust in the paper's authors" OR ...).[1]

Indeed, merely describing how you need to update using high-level symbolic languages, rather than by throwing raw data about the system at you, already shaves off a ton of bits, decoupling "the size of the system" from "the size of the update".
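A trivial illustration of that decoupling (the numbers and the "trusted channel" framing are mine): the raw state of a system can be huge, while a short symbolic description sent over a trusted channel pins it down almost completely.

```python
import numpy as np

# A "large" system: a million raw bits.
rng = np.random.default_rng(seed=42)
system = rng.integers(0, 2, size=1_000_000)

# An observer who only gets raw samples needs on the order of 10^6 bits to pin
# down the system's state. An observer who trusts the channel "it's numpy's
# default_rng with seed 42" needs only the few bits it takes to send that
# sentence – the size of the update is set by the channel, not by the system.
description = "np.random.default_rng(seed=42).integers(0, 2, size=1_000_000)"
reconstructed = np.random.default_rng(seed=42).integers(0, 2, size=1_000_000)
assert np.array_equal(system, reconstructed)
```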

Perhaps  really isn't the right metric to use, here? The motivation for having natural abstractions in your world-model is that they make the world easier to predict for the purposes of controlling said world. So similar-enough natural abstractions would recommend the same policies for navigating that world. Back-tracking further, the distributions that would give rise to similar-enough natural abstractions would be distributions that correspond to worlds the policies for navigating which are similar-enough...

I. e., the distance metric would need to take interventions/the do() operator into account. Something like SID comes to mind (but not literally SID, I expect).

  1. ^

    Though there may be some more interesting claim regarding that entire channel? E. g., that if the agent can update drastically just based on a few bits output by this channel, we have to assume that the channel contains "information funnels" which compress/summarize the raw state of the system down? That these updates have to be entangled with at least however-many-bits describing the ground-truth state of the system, for them to be valid?

I think the main "next piece" missing is that Eliezer basically rejects the natural abstraction hypothesis

Mu, I think. I think the MIRI view on the matter is that the internal mechanistic implementation of an AGI-trained-by-the-SGD would be some messy overcomplicated behemoth. Not a relatively simple utility-function plus world-model plus queries on it plus cached heuristics (or whatever), but a bunch of much weirder modules kludged together in a way such that their emergent dynamics result in powerful agentic behavior.[1]

The ontological problems with alignment would stem not from the fact that the AI is using alien concepts, but from its own internal dynamics being absurdly complicated and alien. It wouldn't have a well-formatted mesa-objective, for example, or "emotions", or a System 1 vs System 2 split, or explicit vs. tacit knowledge. It would have a dozen other things which fulfill the same functions that the aforementioned features of human minds fulfill in humans, but they'd be split up and recombined in entirely different ways, such that most individual modules would have no analogues in human cognition at all.

Untangling it would be a "second tier" of the interpretability problem, one that current interpretability research hasn't yet even glimpsed.

And, sure, maybe at some higher level of organization, all that complexity would be reducible to simple-ish agentic behavior. Maybe a powerful-enough pragmascope would be able to see past all that and yield us a description of the high-level implementation directly. But I don't think the MIRI view is hopeful regarding getting such tools.

Whether the NAH is or is not true doesn't really enter into it.

Could be I'm failing the ITT here, of course. But this post gives me this vibe, as does this old write-up. Choice quote[2]:

The reason why we can’t bind a description of ‘diamond’ or ‘carbon atoms’ to the hypothesis space used by AIXI or AIXI-tl is that the hypothesis space of AIXI is all Turing machines that produce binary strings, or probability distributions over the next sense bit given previous sense bits and motor input. These Turing machines could contain an unimaginably wide range of possible contents

(Example: Maybe one Turing machine that is producing good sequence predictions inside AIXI, actually does so by simulating a large universe, identifying a superintelligent civilization that evolves inside that universe, and motivating that civilization to try to intelligently predict future bits from past bits (as provided by some intervention). To write a formal utility function that could extract the ‘amount of real diamond in the environment’ from arbitrary predictors in the above case, we’d need the function to read the Turing machine, decode that universe, find the superintelligence, decode the superintelligence’s thought processes, find the concept (if any) resembling ‘diamond’, and hope that the superintelligence had precalculated how much diamond was around in the outer universe being manipulated by AIXI.)

Obviously it's talking about AIXI, not ML models, but I assume the MIRI view has a directionally similar argument regarding them.

Or, in other words: what the MIRI view rejects isn't the NAH, but some variant of the simplicity-prior argument. It doesn't believe that the SGD would yield nicely formatted agents; that the ML training loops produce pressures shaping minds this way.[3]

  1. ^

    This powerful agentic behavior would then of course be able to streamline its own implementation, once it's powerful enough, but that's what the starting point would be – and also what we'd need to align, since once it has the extensive self-modification capabilities to streamline itself, it'd be too late to tinker with it.

  2. ^

    Although now that I'm looking at it, this post is actually a mirror of the Arbital page, which has three authors, so I'm not entirely sure this segment was written by Eliezer...

  3. ^

    Note that this also means that formally solving the Agent-Like Structure Problem wouldn't help us either. It doesn't matter how theoretically perfect embedded agents are shaped, because the agent we'd be dealing with wouldn't be shaped like this. Knowing how it's supposed to be shaped would help only marginally, at best giving us a rough idea regarding how to start untangling the internal dynamics.
