Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Thanks to Garrett Baker and Evžen Wybitul for discussions and feedback on this post.

Imagine that you meet an 18th-century altruist. They tell you “So, I’ve been thinking about whether or not to eat meat. Do you know whether animals have souls?” How would you answer, assuming you actually do want to be helpful?

One option is to spend a lot of time explaining why “soul” isn’t actually the thing in the territory they care about, and talk about moral patienthood and theories of welfare and moral status. If they haven’t walked away from you in the first thirty seconds this may even work, though I wouldn’t bet on it.

Another option is to just say “yes” or “no”, to try and answer what their question was pointing at. If they ask further questions, you can either dig in deeper and keep translating your real answers into their ontology or at some point try to retarget their questions’ pointers toward concepts that do exist in the territory.

Low-fidelity pointers

The problem you’re facing in the above situation is that the person you’re talking to is using an inaccurate ontology to understand reality. The things they actually care about correspond to quite different objects in the territory. Those objects currently don’t have very good pointers in their map. Trying to directly redirect their questions without first covering a fair amount of context and inferential distance over what these objects are probably wouldn’t work very well.

So, the reason this is relevant to alignment:

Representations of things within the environment are learned by systems up to the level of fidelity that’s required for the learning objective. This is true even if you assume a weak version of the natural abstraction hypothesis to be true; the general point isn’t that there wouldn’t be concepts corresponding to what we care about, but that they could be very fuzzy.

For example, let’s say that you try to retarget an internal general-purpose search process. The original proposal for retargeting the search describes the following approach:

  • Identify the AI’s internal concept corresponding to whatever alignment target we want to use (e.g. values/corrigibility/user intention/human mimicry/etc).
  • Identify the retargetable internal search process.
  • Retarget (i.e. directly rewire/set the input state of) the internal search process on the internal representation of our alignment target.
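The three steps above can be illustrated with a deliberately tiny toy model. This is only a sketch under strong assumptions: `ToyAgent`, `retarget`, and the two named concepts are hypothetical stand-ins invented for illustration, and a real system would not expose its concepts or its search target this cleanly.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical toy setup: an "internal concept" is just a scoring function.
Objective = Callable[[int], float]


@dataclass
class ToyAgent:
    concepts: Dict[str, Objective]  # stand-in for learned internal concepts
    target: str                     # name of the concept the search optimizes

    def search(self, candidates: List[int]) -> int:
        """General-purpose search: pick the candidate that scores highest on the current target."""
        objective = self.concepts[self.target]
        return max(candidates, key=objective)


def retarget(agent: ToyAgent, concept_name: str) -> None:
    """Find the named concept (step 1) and point the search at it (steps 2 and 3)."""
    if concept_name not in agent.concepts:
        raise KeyError(f"no internal concept named {concept_name!r}")
    agent.target = concept_name


agent = ToyAgent(
    concepts={
        "proxy_goal": lambda x: float(x),             # "bigger is better"
        "alignment_target": lambda x: -abs(x - 3.0),  # "closer to 3 is better"
    },
    target="proxy_goal",
)

candidates = list(range(10))
before = agent.search(candidates)    # same search, optimizing the proxy
retarget(agent, "alignment_target")
after = agent.search(candidates)     # same search, new target
print(before, after)
```

The search routine itself never changes; only the slot naming its target does, which is the property the approach relies on.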

There are - very broadly, abstracting over a fair amount of nuance - three problems with this:

  • You need to have interpretability tools that are able to robustly identify human-relevant alignment properties from the AI’s internals[1]. This isn’t so much a problem with the approach as it is the hard thing you have to solve for it to work.
  • It doesn’t seem obvious that existentially dangerous models are going to look like they’re doing fully-retargetable search. Learning some heuristics that are specialized to the environment, task, or target is likely to make your search much more efficient[2]. These can be selectively learned and used for different contexts. This imposes a cost on arbitrary retargeting, because you have to relearn those heuristics for the new target.
  • The concept corresponding to the alignment target you want is not very well-specified. Retargeting your model to this concept probably would make it do the right things for a while. However, as the model starts to learn more abstractions relating to this new target, you run into an under-specification problem where the pointer can generalize in one of several ways.
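The under-specification problem in the third point can be made concrete with a toy example. This is a minimal sketch, not a claim about how real models represent concepts: `refinement_a` and `refinement_b` are hypothetical sharper versions of one fuzzy concept, and the integer ranges stand in for the situations seen so far and those met later.

```python
# Two hypothetical refinements of the same fuzzy concept. Both match every
# example seen so far, so the low-fidelity pointer cannot tell them apart.
def refinement_a(x: int) -> bool:
    return x >= 3


def refinement_b(x: int) -> bool:
    return 3 <= x <= 5


seen = range(0, 6)     # situations covered while the concept was learned
unseen = range(6, 10)  # situations met only after retargeting

# On the seen situations the two refinements are indistinguishable...
agree_on_seen = all(refinement_a(x) == refinement_b(x) for x in seen)

# ...but they come apart as soon as the model leaves that range.
disagreements = [x for x in unseen if refinement_a(x) != refinement_b(x)]

print(agree_on_seen, disagreements)
```

Nothing in the seen data favors one refinement over the other, so which way the pointer generalizes is decided by something other than the target we had in mind.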

The first problem (or at least, some version of it) seems unavoidable to me in any solution to alignment. What you want in the end is to interact with the things in your system that would assure you of its safety. There may be simpler ways to go about it, however. The second problem is somewhat related to the third, and would, I think, be solved by similar methods.

Notably, the third problem is still a problem in this setup even if you have a robust specification of the target yourself. Because the setup doesn’t involve going in and improving the model’s capabilities / ontology / world model fidelity, you’re upper-bounded by how good a specification it already has. You could modify the setup to involve updating the model’s ontology, but that seems like a much harder problem if we wanted to do it in the same direct way.

If we didn’t have to do it in the same way, however, there are some promising threads. The one that seems most straightforward to me is designing a training process such that the retargeting is integrated at every step, for instance as part of a reward signal on what you want the search target to be. This would allow for the concepts you care about to always be closely downstream of the target you’re training on, and steer the model’s capabilities toward learning higher-fidelity representations of them[3].

This does mean that you have to have a good specification of the alignment target yourself; training on an under-specified signal or weak proxy is pretty risky. I don’t think this necessarily makes the problem harder because we may need such a specification to retarget the model at all, or generally do model editing, and so on. But it’s a consideration I don’t often see mentioned, that feeds into some aspects of what I work on, and how I think of good alignment solutions.

This post was in part inspired by this Twitter thread from last year mentioning this problem. I left a short reply on how I thought it could be solved, but figured it was worth writing up the entire thing in more detail at some point.

  1. ^
  2. ^

     I think Mark Xu’s example of quick sort vs random sorting is a good intuition pump here.

  3. ^

     The original proposal for retargeting the search also mentioned training-on-the-fly, but mainly in the context of not getting a misaligned AI before we do the retargeting.


One option is to spend a lot of time explaining why “soul” isn’t actually the thing in the territory they care about, and talk about moral patienthood and theories of welfare and moral status.

In my opinion, the concepts of moral patienthood and theories of moral status are about as confused as the idea of souls.

[-]TAG2mo30

What do we care about if not souls? Suffering. What is suffering? A quale. What are qualia? Err....

[-]TAG2mo20
  1. "Caring about" and "existing in the territory" aren't independent.

  2. You are much too confident that "we" have an almost-correct ontology.

Science has made quite a lot of progress since the 18th century, to the point where producing phenomena we don't already have a workable ontology for tends to require giant accelerators, or something else along those lines. Ground-breaking new ideas are slowly becoming harder to find, and paradigm shifts are happening more rarely or in narrower subfields. That doesn't prove our ontology is perfect by any means, but it does suggest that it's fairly workable for a lot of common purposes. Particularly, I would imagine, for ones relating to AI alignment to our wishes, which is the most important thing that we want to be able to point to.

[-]TAG2mo40

Being able to detect new particles isn't proof of ontological progress. Any hypothesised particle is a bundle of observable properties, so it's phenomenology, not ontology. Giant 21st century atom smashers aren't going to tell you what a Higgs boson really is, any more than early ones are going to tell you what an electron really is.

Historically, physics was very self-satisfied in the late nineteenth century... just before it was revolutionised.

The particular subjects you are writing about -- ethics and consciousness -- are poorly understood, and can't be settled just by appealing to current physics.

Agreed. But the observed slowing down (since, say, a century ago) in the rate of the paradigm shifts that are sometimes caused by things like discovering a new particle does suggest that our current ontology is now a moderately good fit to a fairly large slice of the world. And, I would claim, it is particularly likely to be a fairly good fit for the problem of pointing to human values.

We also don't require that our ontology fits the AI's ontology, only that when we point to something in our ontology, it knows what we mean. That basically happens by construction in an LLM, since the entire purpose its ontology/world-model was learned for was figuring out what we mean and may say next. We may have trouble interpreting its internals, but it's a trained expert in interpreting our natural languages.

It is of course possible that our ontology still contains invalid concepts comparable to "do animals have souls?" My claim is just that this is less likely now than it was in the 18th century, because we've made quite a lot of progress in understanding the world since then. Also, if it did, an LLM would still know all about this invalid concept and our beliefs about it, just like it knows all about our beliefs about things like vampires, unicorns, or superheroes.

[-]TAG2mo10

Agreed. But the observed slowing down (since, say, a century ago) in the rate of the paradigm shifts that are sometimes caused by things like discovering a new particle does suggest that our current ontology is now a moderately good fit to a fairly large slice of the world

How can you tell? Again, you only have a predictive model. There is no way of measuring ontological fit directly.

Directly, no. But the process of science (like any use of Bayesian reasoning) is intended to gradually make our ontology a better fit to more of reality. If that were working as intended, then we would expect it to come to require more and more effort to produce the evidence needed to cause a significant further paradigm shift across a significant area of science, because there are fewer and fewer major large-scale misconceptions left to fix. Over the last century, we have had more and more people working as scientists, publishing more and more papers, yet the rate of significant paradigm shifts that have an effect across a significant area of science has been dropping. From which I deduce that our ontology is probably a significantly better fit to reality now than it was a century ago, let alone three centuries ago back in the 18th century as this post discusses. Certainly the size and detail of our scientific ontology have both increased dramatically.

Is this proof? No: as you correctly observe, proof would require knowing the truth about reality. It's merely suggestive supporting evidence. It's possible to contrive other explanations. For example, for some reason (perhaps related to social or educational changes), all of those people working in science now could be much stupider, more hidebound, or less original thinkers than the scientists of a century ago, and that could be why dramatic paradigm shifts are slower. Personally, I think this is very unlikely.

It is also quite possible that this is more true in certain areas of science that are amenable to the mental capabilities and research methods of human researchers, and that there might be other areas that are resistant to these approaches (so our lack of progress in these areas is caused by inability, not by us approaching our goal), but where the different capabilities of an AI might allow it to make rapid progress. In such an area, the AI's ontology might well be a significantly better fit to reality than ours.

[-]TAG2mo22

Directly, no. But the process of science (like any use of Bayesian reasoning) is intended to gradually make our ontology a better fit to more of reality

Yes, it is intended to. Whether, and how, it works are other questions.

There's also nothing about Bayesianism that guarantees incrementally better ontological fit, in addition to incrementally improving predictive power.

Bayes' theorem is about the truth of propositions. Why couldn't it be applied to propositions about ontology?

[-]TAG2mo20

It's about one of the things "truth" means. If you want to apply it to ontology, you need a kind of evidence that's relevant to ontology -- that can distinguish hypotheses that make similar predictions.
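The point about hypotheses that make similar predictions can be stated with the odds form of Bayes' theorem. The symbols here are illustrative assumptions rather than anything from the thread: $H_1$ and $H_2$ are two competing ontological hypotheses and $E$ is any piece of evidence.

```latex
\frac{P(H_1 \mid E)}{P(H_2 \mid E)}
  = \frac{P(E \mid H_1)}{P(E \mid H_2)} \cdot \frac{P(H_1)}{P(H_2)},
\qquad
P(E \mid H_1) = P(E \mid H_2)
  \;\Longrightarrow\;
\frac{P(H_1 \mid E)}{P(H_2 \mid E)} = \frac{P(H_1)}{P(H_2)}.
```

If the two hypotheses assign the same probability to every observation, the likelihood ratio is always 1 and the posterior odds never move from the prior odds. Updating between them then requires some kind of evidence on which they actually differ.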

Correct me if I'm wrong, but I think we could apply the concept of logical uncertainty to metaphysics and then use Bayes' theorem to update depending on where our metaphysical research takes us, the way we can use it to update the probability of logically necessarily true/false statements.

[-]TAG2mo10

How do we use Bayes to find kinds of truth other than predictiveness?

If you are dubious that the methods of rationality work, I fear you are on the wrong website.

[-]TAG2mo10

I'm not saying they don't work at all. I have no problem with prediction.

I notice that you didn't tell me how the methods of rationality work in this particular case. Did you notice that I conceded that they work in others?

If this website is about believing things that cannot be proven, and have never been explained, then it is "rationalist" not rationalist.

Might be much harder to implement, but could we maximin "all possible reinterpretations of alignment target X"?

If you have possible targets $U_1, \dots, U_n$, then as an alternative to maximin, you could also maximize $\sum_i \log U_i$, or replace log with your favorite function with sufficiently-sharply-diminishing returns. This has the advantage that the AI will continue to take Pareto improvements rather than often feeling completely neutral about them.

(That’s just a fun idea that I think I got indirectly from Stuart Armstrong … but I’m skeptical that this would really be relevant: I suspect that if you have any uncertainty about the target at all, it would rapidly turn into a combinatorial explosion of possibilities, such that it would be infeasible for the AI to keep track of how an action scores on all gazillion of them.)
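The difference between maximin and the log-based alternative can be seen in a small numerical sketch. It assumes the alternative objective is the sum of the logarithms of an action's scores under each candidate target; the action names and scores are made up for illustration.

```python
import math
from typing import Dict, List

# Hypothetical scores of two actions under three candidate interpretations
# of the target. All scores are positive so that the logarithm is defined.
scores: Dict[str, List[float]] = {
    "baseline": [1.0, 1.0, 1.0],
    "pareto_better": [1.0, 2.0, 3.0],  # no worse on any target, better on two
}


def maximin_value(utilities: List[float]) -> float:
    """Score an action by its worst case across the candidate targets."""
    return min(utilities)


def log_sum_value(utilities: List[float]) -> float:
    """Score an action by the sum of log-utilities across the candidate targets."""
    return sum(math.log(u) for u in utilities)


# Maximin sees only the worst case, so it is indifferent between the two actions.
maximin_tie = maximin_value(scores["baseline"]) == maximin_value(scores["pareto_better"])

# The log-sum objective strictly prefers the Pareto improvement.
log_sum_prefers_pareto = (
    log_sum_value(scores["pareto_better"]) > log_sum_value(scores["baseline"])
)

print(maximin_tie, log_sum_prefers_pareto)
```

Maximin is indifferent here because the worst-case score is unchanged, while any strictly increasing function summed over the targets rewards the improvement on the other two.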

Geometric rationality ftw!

(In normal planning problems there are exponentially many plans to evaluate (in the number of actions). So that doesn't seem to be a major obstacle if your agent is already capable of planning.)

The thing you want to point to is "make the decisions that humans would collectively want you to make, if they were smarter, better informed, had longer to think, etc." (roughly, Coherent Extrapolated Volition, or something comparable). Even managing to just point to "make the same decisions that humans would collectively want you to make" would get us way past the "don't kill everyone" minimum threshold, into moderately good alignment, and well into the regions where alignment has a basin of convergence.

Any AGI built in the next few years is going to contain an LLM trained on trillions of tokens of human data output. So it will learn excellent and detailed world models of human behavior and psychology. An LLM's default base model behavior (before fine-tuning) is to prompt-dependently select some human psychology and then attempt to model it so as to emit the same tokens (and thus make the decisions) that they would. As such, pointing it at "what decision would humans collectively want me to make in this situation" really isn't that hard. You don't even need to locate the detailed world models inside it, you can just do all this with a natural language prompt: LLMs handle natural language pointers just fine.

The biggest problem with this is that the process is so prompt-dependent that it's easily perturbed, if part of your problem context data happens to contain something that perturbs the process in a way that jailbreaks its behavior. That is probably a good reason why you might want to go ahead and locate those world models inside it, to try to ensure that they're still being used and the model hasn't been jailbroken into doing something else.

I'd like to discuss this further, but since none of the people who disagree have mentioned why or how, I'm left to try to guess, which doesn't seem very productive. Do they think it's unlikely that a near-term AGI will contain an LLM, or do they disagree that you can (usually, though unreliably) use a verbal prompt to point at concepts in the LLM's world models, or do they have some other objection that hasn't occurred to me? A concrete example of what I'm discussing here would be Constitutional AI, as used by Anthropic, so it's a pretty well-understood concept that has actually been tried with some moderate success.