tl;dr: I show that model splintering can be seen as going beyond the human training distribution (the distribution of real and imagined situations we have firm or vague preferences over), and argue why this is at the heart of AI alignment.
You are training an AI-CEO to maximise stock value, training it on examples of good/bad CEO decisions and corresponding stock-price increases/decreases.
There are some obvious failure modes. The AI could wirehead by hacking the stock-ticker, or it could do the usual optimise-the-universe-to-maximise-stock-price-for-now-dead-shareholders.
Let's assume that we've managed to avoid these degenerate solutions. Instead, the AI-CEO tries for something far weirder.
The AI-CEO reorients the company towards the production of semi-sentient teddy bears that are generated in part from cloned human brain tissue. These teddies function as personal assistants and companions, and prototypes are distributed at the annual shareholder meeting.
However, the public reaction is negative, and the government bans the further production of these teddies. Consequently, the company shuts down for good. But the shareholders, who own the only existent versions of these teddies, get great kudos from possessing these rare entities, who also turn out to be great and supportive listeners - and excellent at managing their owners' digital media accounts, increasing their popularity and status.
And that, of course, was the AI-CEO's plan all along.
Hold off from judging this scenario, just for a second. And when you do judge it, observe your mental process as you do so. I've tried to build this scenario so that it is:
If I've pitched it right, your reaction to the scenario should be similar to mine - "I need to think about this more, and I need more information". The AI-CEO is clearly providing some value to the shareholders; whether this value can be compared to the stock price is unclear. It's being manipulative, but not doing anything illegal. As for the teddies themselves... I (Stuart) feel uncomfortable that they are grown from human brain tissue, but they are not human, and we humans have relationships with less sentient beings (pets). I'd have to know more about potential suffering and the preferences and consciousness - if any - of these teddies...
I personally feel that, depending on circumstances, I could come down in favour or against the AI-CEO's actions. If your own views are more categorical, see if you can adjust the scenario until it's similarly ambiguous for you.
This scenario involved model-splintering in two ways. The first was when the AI-CEO decided not to follow the route of "increase share price", and instead found another way of giving value to the shareholders, while sending the price to zero. This is unexpected, but it's not a moral surprise; we can assess its value by trying to quantify the extra value the teddies give their owners, and comparing this with the lost share price. We want to check that, whatever model the AI-CEO is using to compare these two values, it's a sensible one.
The second model-splintering is the morality of creating the teddies. For most of us, this will be a new situation, which we will judge by connecting it to previous values or analogies (excitement about the possibilities, morality of using human tissue, morality of sentient beings whose preferences may or may not be satisfied, morality of the master-servant relationship that this resembles, slippery slope effects vs. early warning, etc).
Like the first time you encounter a tricky philosophical thought experiment, or the first time you deal with ambiguous situations where norms come into conflict, what's happening is that you are moving beyond your moral training data. This does not fit neatly into previous categories, nor can it easily be analysed with the tools of previous categories. But we are capable of analysing it, somehow, and of coming up with non-stupid decisions.
So, we can extrapolate our values in non-stupid ways to these new situations. But that extrapolation may be contingent; a lot may depend on what analogies we reach first, on how we heard about the scenario, and so on.
But let's reiterate the "non-stupid" point. Our contingent extrapolations don't tend to fail disastrously (at least not when we have to implement our plans). For instance, humans rarely reach the conclusion that wireheading - hacking the stock-ticker - is the moral thing to do.
This skill doesn't always work (humans are much more likely than AIs to extrapolate into the "actively evil" zone, rather than the "lethally indifferent") but it is a skill that seems necessary to resolve extrapolated/model-splintered situations in non-disastrous ways.
See the world from the point of view of a superintelligence. The future is filled with possibilities and plans, many of them far more wild and weird than the example I just defined, most of them articulated in terms of concepts and definitions beyond our current human minds.
And an aligned superintelligence needs to decide what to do about them. Even if it follows a policy that is mostly positive, this policy will have weird, model-splintered side effects. It needs to decide whether these side effects are allowable, or whether it must devote resources to removing them. Maybe the cheapest company it can create will recruit someone who, with their new salary, will start making these teddies themselves. It can avoid employing that person - but that's an extra cost. Should it pay that cost? As it looks upon all the humans in the world, it can predict that their behaviours will change as a result of developing its current company - what behaviour changes are allowed, and what should be avoided or encouraged?
Thus it cannot make decisions in these situations without going beyond the human training distribution; hence it is essential that it learns to extrapolate moral values in a way similar to how humans do.
(humans are much more likely than AIs to extrapolate into the "actively evil" zone, rather than the "lethally indifferent")
It seems to me that use of the term "actively evil" is itself guided by being part of our training data.
Lots of things called "actively evil" possibly achieve that designation just because they're things that humans have already done and have been judged evil. Now actions of this type are well-known to be evil, so humans choosing them can really only be through an active choice to do it anyway, presumably because it's viewed as necessary to some goal that supersedes that socially cached judgement.
I don't see why an AI couldn't reason in the same way: knowing (in some sense) that humans judge certain actions and outcomes as evil, disregarding that judgement and doing it anyway due to being on a path to some instrumental or terminal goal. I think that would be actively evil in the same sense that many humans can be said to be actively evil.
Do you mean that the space of possible actions that an AI explores might be so much larger than those explored by all humans in history combined, that it just by chance doesn't implement any of the ones similar enough to known evil? I think that's implausible unless the AI was actively avoiding known evil, and therefore at least somewhat aligned already.
Apart from that, it's possible we just differ on the use of the term "lethally indifferent". I take it to mean "doesn't know the consequences of its actions to other sentient beings", like a tsunami or a narrowly focused paperclipper that doesn't have a model of other agents. I suspect maybe you mean "knows but doesn't care", while I would describe that as "actively evil".
I (Stuart) feel uncomfortable that they are grown from human brain tissue, but they are not human
(Given the ability for them to manage social media accounts)
This obviously calls for Ship of Theseus-ing this until we end up with:
Well, they're people brought back from the dead, but they are not human.
This scenario involved model-splintering
Would have been straightforward to have the AI just... buy back stocks, or do stuff to invest in the long-term future that hurts the stock price now.
"We're doing a $1 million kickstarter to release all our code, open source. We are committed to ensuring security for our users and clients, and this is the best way forward in light of all these zero-day exploits (of major companies such as ...) in the wild.
We will, of course, be experimenting with different protocols for handling 'responsible disclosure'."
The second model-splintering is the morality of creating the teddies. For most of us, this will be a new situation, which we will judge by connecting it to previous values or analogies
You can also judge by...interacting with the new reality.
morality of the master-servant relationship that this resembles
Who's to say they don't take over after those people die? Throwing away the company in favor of influencing the world (their new advisors say AIs are a great investment, or that existing one)...