Maybe one scenario in this direction is that a non-superintelligent AI gains access to the internet and then spreads itself to a significant fraction of all computational devices, using them to solve some inconsequential optimization problem. This would aggravate a lot of people (who lose access to their computers) and also demonstrate the potential of AIs to have a significant impact on the real world.

As the post mentions, there is an entire hierarchy of such unwanted AI behavior. The first such phenomena, like reward hacking, are already occurring. The next level (such as an AI creating a copy of itself in anticipation of an operator trying to shut it down) might occur at capability levels below those that pose a threat of an intelligence explosion, but it's unclear whether the general public will see much information about these events. I think it's an important empirical question how wide the window is between the AI levels producing publicly visible misalignment events and the threshold where the AI becomes genuinely dangerous.

Surely all pivotal acts that safeguard humanity long into the far future are entirely rational in explanation.

I agree that in hindsight such acts would appear entirely rational and justified, but in order not to represent a PR problem, they must appear justified (or at least acceptable) to a member of the general public, a law enforcement official, or a politician.

Can you offer a reason for why a pivotal act would be a PR problem, or why someone would not want to tell people their best idea for such an act and would use the phrase "outside the Overton window" instead?

To give one example: the oft-cited pivotal act of "using nanotechnology to burn all GPUs" is not something you could put as the official goal on your company website. If the public seriously thought that a group of people pursued this goal and had any chance of even coming close to achieving it, they would strongly oppose such a plan. In order to even see why it might be a justified action to take, one needs to understand (and accept) many highly non-intuitive assumptions about intelligence explosions, orthogonality, etc.

More generally, I think many possible pivotal acts will to some degree be adversarial, since they are literally about stopping people from doing or getting something they want (building an AGI, reaping the economic benefits from using an AGI, etc.). There might be strategies for such an act which are inside the Overton window (creating a superhuman propaganda-bot that convinces everyone to stop), but all strategies involving anything resembling force (like burning the GPUs) will run counter to established laws and social norms.

So I can absolutely imagine that someone has an idea about a pivotal act which, if posted publicly, could be used in a PR campaign by opponents of AI alignment ("look what crazy and unethical ideas these people are discussing in their forums"). That's why I was asking what the best forms of discourse could be that avoid this danger.

I am not as convinced that there don’t exist pivotal acts that are importantly easier than directly burning all GPUs (after which I might or might not then burn most of the GPUs anyway). There’s no particular reason humans can’t perform dangerous cognition without AGI help and do some pivotal act on their own, our cognition is not exactly safe. But if I did have such an idea that I thought would work I wouldn’t write about it, and it most certainly wouldn’t be in the Overton window. Thus, I do not consider the failure of our public discourse to generate such an act to be especially strong evidence that no such act exists.

Given how central the execution of a pivotal act seems to be to our current best attempt at an alignment strategy (see point 6 of EY's post), I was confused to find very little discussion about possible approaches here in the forum. Does the quote above already fully explain this (since all promising approaches are too far outside the Overton window to discuss publicly)? Or has no one gotten around to initiating such a conversation? Or, quite possibly, have I overlooked extensive discussions in this direction?

It seems to me that a long document listing the 20 most commonly proposed approaches to such a pivotal act, together with an analysis of their strengths and weaknesses, the possibility to comment, etc., could be quite valuable for people who want to start thinking about such approaches. There is also always the possibility of someone just having a really great idea (or of person A having a flawed idea containing the seed of a great idea, which inspires person B to propose a fix). Would other people also find this useful?

On the other hand, given the possible downsides of such public discourse (proposals outside the Overton window representing a PR problem, or some proposals only being feasible if they are not publicly announced), are there other strategies for reaping the benefits of many people with different backgrounds thinking about this problem? One thing that comes to mind: a non-public essay contest where people can hand in a description of a possible pivotal act together with their own analysis of its feasibility. The submissions could be read by a panel of trusted experts (trusted both for the competence of their judgement and for their confidentiality). Harmless but insightful ones could then be released to the public. Dangerous and/or non-insightful ones could be returned to their creators with a brief explanation of why they are deemed a bad idea. And finally, promising ones could be brought to the attention of people with the resources to pursue them further.

Concerning point 35 about playing AIs off against each other: I analyzed a particular scenario like this in a recent post and also came to the conclusion that cooperation between the AIs is the default outcome in many scenarios. However, in the last subsection of that post, I begin to explore some ways to prevent an acausal trade as Eliezer describes it here (committing to sharing the universe with any AI reviewing the code). The idea is roughly that the code of the AI doing the checking, and as much other information about it as possible, will be deleted before the checked nanosystem is released. In this way, it's not possible for the escaping AI to share half the universe with its checker, since not enough information about the checker and its goals is left in the universe to enact this compromise. This reduces the incentive of the checker to cooperate/hide alignment faults.

The post above describes various caveats and other things that could go wrong in such a setup. I tried my best at being pessimistic and pointing out various failure modes.

I think that "the value alignment problem" is not something that currently has a universally acknowledged and precise definition, and a lot of the work currently being done aims at getting less confused about what is meant by it.

From what I see, in your proof you started from a particular meaning of this term and then went on to show that it is impossible.

Which means that human values, or at least the individual non-morality-based values don’t converge, which means that you can’t design an artificial superintelligence that contains a term for all human values

Here you observe that if "the value alignment problem" means to construct something which has the values of all humans at the same time, it is impossible because there exist humans with contradictory values. So you propose the new definition "to construct something with all human moral values". You continue to observe that the four moral values you give are also contradictory, so this is also impossible.

And even if somehow you could program an intelligence to optimize for those four competing utility functions at the same time,

So now we are looking at the definition "to program for the four different utility functions at the same time". As has been observed in a different comment, this is somewhat underspecified and there might be different ways to interpret and implement it. For one such way you predict

that would just cause it to optimize for conflict resolution, and then it would just tile the universe with tiny artificial conflicts between artificial agents for it to resolve as quickly and efficiently as possible without letting those agents do anything themselves.

It seems to me that the scenario behind this course of events would be: we build an AI and give it the four moralities; noticing their internal contradictions, it analyzes them and finds that they serve the purpose of conflict resolution. Then it proceeds to make this its new, consistent goal and builds these tiny conflict scenarios. I'm not saying that this is implausible, but I don't think it is a course of events without alternatives (and these would depend on the way the AI is built to resolve conflicting goals).

To summarize, I think out of the possible specifications of "the value alignment problem", you picked three (all human values, all human moral values, "optimizing the four moralities") and showed that the first two are impossible and the third leads to undesired consequences (under some further assumptions).

However, I think there are many things which people would consider a solution to "the value alignment problem" and which don't match any of these three descriptions. Maybe there is a subset of human values without contradictions, such that most people would be reasonably happy with the result of a superhuman AI optimizing these values. Maybe an AI maximizing only the "Maximize Flourishing" morality would lead to a decent future. I would be the first to admit that the scenarios I describe are themselves severely underspecified, just vaguely waving at a subset of the possibility space, but I imagine that these subsets could contain things we would call "a solution to the value alignment problem".