Written quickly rather than not at all, as I've described this idea a few times and wanted to have something to point at when talking to people. 'Quickly' here means I was heavily aided by a language model while writing, which I want to be up-front about given recent discussion.


In alignment research, two seemingly conflicting objectives arise: eliciting honest behavior from AI systems, and ensuring that AI systems do not produce harmful outputs. This tension is not simply a matter of contradictory training objectives; it runs deeper, creating potential risks even when models are perfectly trained never to utter harmful information.


Eliciting honest behavior in this context means developing techniques to extract AI systems' "beliefs", to the extent that they are well-described as having them. In other words, honest models should, if they have an internal world model, accurately report predictions or features of that world model. Incentivizing honesty in AI systems seems important in order to avoid and detect deceptive behavior. Additionally, something like this seems necessary for aiding with alignment research - we want to extract valuable predictions of genuine research breakthroughs, as opposed to mere imaginative or fictional content.

On the other hand, avoiding harmful outputs entails training AI systems never to produce information that might lead to dangerous consequences, such as instructions for creating weapons that could cause global catastrophes.

The tension arises not just because "say true stuff" and "sometimes don't say stuff" seem like objectives which will occasionally end up in direct opposition, but also because methods that successfully elicit honest behavior could potentially be used to extract harmful information from AI systems, even when they have been perfectly trained not to share such content. In this situation, the very techniques that promote honest behavior might also provide a gateway to accessing dangerous knowledge.



I agree this is a problem, and furthermore it’s a subcase of a more general problem: alignment to the user’s intent may frequently entail misalignment with humanity as a whole.

I admit that, of these two alignment goals, I generally favor the first (alignment to the user's intent) over the second (alignment with humanity as a whole), and I don't think the second makes sense as a target at all.

Well, admittedly “alignment to humanity as a whole” is open to interpretation. But would you rather everyone have their own personal superintelligence that they can brainwash to do whatever they want?

Basically, yes.

This is mostly because I think even a best-case alignment scenario can never be more than "everyone have their own personal superintelligence that they can brainwash to do whatever they want."

This is related to fundamental disagreements I have about morality and values that make me pessimistic about trying to align groups of people, or indeed trying to align with the one true morality/values.

To state the disagreements I have:

  1. I think that, to the extent that moral realism is right, morality/values is essentially trivial, in that every morality is correct, and I suspect that there is no non-arbitrary way to restrict morality or values without sneaking in your own values.

Essentially, it's trivialism, applied to morality, with a link below:


  1. The reason reality doesn't face the problem of being trivial is that, for our purposes, we don't have the power to warp reality into whatever we want (a power often discussed under different names, including omnipotence, administrator access to reality, and more). In morality, by contrast, we do have the power to change our values to anything else, which generates inconsistent but complete values, unlike the universe we find ourselves in, which is probably consistent and incomplete.

  2. There is no way to coherently talk about something like a society or humanity's values in the general case, and in the case where everyone is aligned, all we can talk about is optimal redistribution of goods.

This means that many attempts to analogize society's or humanity's values to, say, an individual person's rely on two techniques that are subjective:

Carrying out a simplification or homogenization of the multiple preferences of the individuals that make up that society;

Modeling your own personal preferences as if these were the preferences of society as a whole.

That means it is never a nation or humanity that acts on morals or values; it is specific people, with their own values, who take those actions.

Here's a link to it.


So my conclusion is, yes I do really bite the bullet here and support "everyone have their own personal superintelligence that they can brainwash to do whatever they want".

This is an uncomfortable conclusion to come to, but I do suspect it will lead to better modeling of people's values.

Final notes: I do want to point out a comment I made that seems relevant to this comment with slight modifications:

One important implication of the post relating to AI Alignment: It is impossible for AI to be aligned with society, conditional on the individuals not being all aligned with each other. Only in the N=1 case can guaranteed alignment be achieved.

In the pointers ontology, you can't point to a real world thing that is a society, culture or group having preferences or values, unless all members have the same preferences.

And thus we need to be more modest in our alignment ambitions. Only AI aligned to individuals is at all feasible. And that makes the technical alignment groups look way better.

It's also the best retort to attempted collectivist cultures and societies.

I think this is overcomplicating things.

We don't have to solve any deep philosophical problems here finding the one true pointer to "society's values", or figuring out how to analogize society to an individual.

We just have to recognize that the vast majority of us really don't want a single rogue to be able to destroy everything we've all built, and we can act pragmatically to make that less likely.

We don't have to solve any deep philosophical problems here finding the one true pointer to "society's values", or figuring out how to analogize society to an individual.

I agree with this, in a nutshell. After all, you can put in almost whatever values you like and it will work, which is the point of my long comment.

My point is that once you have instrumental goals like survival and technological progress taken care of for everyone, alignment in practice should reduce to this:

Everyone have their own personal superintelligence that they can brainwash to do whatever they want.

And the alignment problem is simple enough: How do you brainwash an AI to have your goals?

to put it another way: if we don't solve beings being misaligned with each other, AI kills humanity (and most of AI, too). we must solve the problem that prevents collectives and individuals from being aligned.

avoiding harmful outputs entails training AI systems never to produce information that might lead to dangerous consequences


I don't see how that is possible, in the context of a system that can "do things we want, but do not know how to do".

The reality of technology/tools/solutions seems to be that anything useful is also dual use.

So when it comes down to it, we have to deal with the fact that such a system certainly will have the latent capability to do very bad things.

Which means we have to somehow ensure that such a system does not go down such a road, either instrumentally or terminally.

As far as I can tell, intelligence[1] fundamentally is incapable of such a thing, which leaves us roughly with this:

  1. Pure intelligence, onus is on us to specify terminal goals correctly.
  2. Pure intelligence and cage/rules/guardrails[2] etc.
  3. Pure intelligence with a mind explicitly in charge of directing the intelligence.

On the first try of "do things we want, but do not know how to do":

1) kills us every time 

2) kills us almost every time

3) might not kill us every time

And that's as far as my thinking currently goes.

I am stuck on whether 3 could get us anywhere sensible (my mind screams “maybe”………“ohh boy that looks brittle”).

  1. ^

    I don't have a firm definition of the term, but I approximately think of intelligence as the function that lets a system take some goal/task and find a solution.

    Explicitly in humans, well me, that looks like using the knowledge I have, building model(s), evaluating possible solution trajectories within the model(s), gaining insight, seeking more knowledge. And iterating over all that until I either have a solution or give up.

  2. ^

    The usual: keep it in a box, modify evaluation to exclude bad things, and so on. That suffers from the problem that we can't robustly specify what is "bad", and even if we could, Rice's Theorem heavily implies that checking for it in general is impossible.
