Written quickly rather than not at all, as I've described this idea a few times and wanted to have something to point at when talking to people. 'Quickly' here means I was heavily aided by a language model while writing, which I want to be up-front about given recent discussion.
In alignment research, two seemingly conflicting objectives arise: eliciting honest behavior from AI systems, and ensuring that AI systems do not produce harmful outputs. This tension is not simply a matter of contradictory training objectives; it runs deeper, creating potential risks even when models are perfectly trained never to utter harmful information.
Eliciting honest behavior in this context means developing techniques to extract AI systems' "beliefs", to the extent that they are well-described as having them. In other words, honest models should, if they have an internal world model, accurately report predictions or features of that world model. Incentivizing honesty in AI systems seems important in order to avoid and detect deceptive behavior. Additionally, something like this seems necessary for aiding with alignment research - we want to extract valuable predictions of genuine research breakthroughs, as opposed to mere imaginative or fictional content.
On the other hand, avoiding harmful outputs entails training AI systems never to produce information that might lead to dangerous consequences, such as instructions for creating weapons that could cause global catastrophes.
The tension arises not just because "say true stuff" and "sometimes don't say stuff" seem like objectives which will occasionally end up in direct opposition, but also because methods that successfully elicit honest behavior could potentially be used to extract harmful information from AI systems, even when they have been perfectly trained not to share such content. In this situation, the very techniques that promote honest behavior might also provide a gateway to accessing dangerous knowledge.
I agree this is a problem, and furthermore it’s a subcase of a more general problem: alignment to the user’s intent may frequently entail misalignment with humanity as a whole.
I admit, my views generally favor the first interpretation over the second when it comes to which alignment goal to pursue, and I generally don't think the second goal makes sense as a target.
Well, admittedly “alignment to humanity as a whole” is open to interpretation. But would you rather everyone have their own personal superintelligence that they can brainwash to do whatever they want?
This is mostly because I think even a best-case alignment scenario can't ever be more than "everyone have their own personal superintelligence that they can brainwash to do whatever they want."
This is related to fundamental disagreements I have about morality and values that make me pessimistic about trying to align groups of people, or indeed trying to align with the one true morality/values.
To state the disagreements I have:
Essentially, it's trivialism, applied to morality, with a link below:
The reason reality doesn't face the problem of being trivial is that, for our purposes, we don't have the power to warp reality into whatever we want (a power often talked about under different names, including omnipotence, administrator access to reality, and more), whereas in morality we do have the power to change our values to anything else. This generates inconsistent but complete values, in contrast to the universe we find ourselves in, which is probably consistent and incomplete.
There is no way to coherently talk about something like a society's or humanity's values in the general case, and in the case where everyone is aligned, all we can talk about is optimal redistribution of goods.
This means that many attempts to analogize a society's or humanity's values to those of, say, an individual person rely on two techniques that are subjective:
That means it is never a nation or humanity that acts on morals or values; rather, specific people with their own values take those actions.
Here's a link to it.
So my conclusion is: yes, I really do bite the bullet here and support "everyone have their own personal superintelligence that they can brainwash to do whatever they want".
This is an uncomfortable conclusion to come to, but I do suspect it will lead to better modeling of people's values.
Final notes: I do want to point out a comment I made that seems relevant to this comment with slight modifications:
I think this is overcomplicating things.
We don't have to solve any deep philosophical problems here finding the one true pointer to "society's values", or figuring out how to analogize society to an individual.
We just have to recognize that the vast majority of us really don't want a single rogue to be able to destroy everything we've all built, and we can act pragmatically to make that less likely.
I agree with this, in a nutshell. After all, you can put in almost whatever values you like and it will work, which is the point of my long comment.
My point is that once you have instrumental goals like survival and technological progress secured for everyone, alignment in practice should reduce to this:
And the alignment problem is simple enough: How do you brainwash an AI to have your goals?
to put it another way: if we don't solve beings being misaligned with each other, AI kills humanity (and most of AI, too). we must solve the problem that prevents collectives and individuals from being aligned.
I don't see how that is possible, in the context of a system that can "do things we want, but do not know how to do".
The reality of technology/tools/solutions seems to be that anything useful is also dual use.
So when it comes down to it, we have to deal with the fact that such a system will certainly have the latent capability to do very bad things.
Which means we have to somehow ensure that such a system does not go down such a road, either instrumentally or terminally.
As far as I can tell, intelligence fundamentally is incapable of such a thing, which leaves us roughly with this:
On the first try of "do things we want, but do not know how to do":
1) kills us every time
2) kills us almost every time
3) might not kill us every time
And that's as far as my thinking currently goes.
I am stuck on whether 3 could get us anywhere sensible (my mind screams “maybe”… “ohh boy that looks brittle”).
I don't have a firm definition of the term, but I approximately think of intelligence as the function that lets a system take some goal/task and find a solution.
Explicitly in humans, well me, that looks like using the knowledge I have, building model(s), evaluating possible solution trajectories within the model(s), gaining insight, seeking more knowledge. And iterating over all that until I either have a solution or give up.
The usual: Keep it in a box, modify evaluation to exclude bad things and so on. And that suffers from the problem that we can't robustly specify what is "bad", and even if we could, Rice's theorem heavily implies that checking it in general is impossible.