Søren Elverlin



I would be excited to see Rational Animations try to cover the Hard Problem of Corrigibility: https://arbital.com/p/hard_corrigibility/

I believe that this would be the optimal video to create for the optimization target "reduce probability of AI-Doom". It seems (barely) plausible that someone really smart could watch the video, make a connection to some obscure subject none of us know about, and then produce a really impactful contribution to solving AI Alignment.

Talking concretely, what does a utility function look like that is so close to a human utility function that an AI system has it after a bunch of training, but which is an absolute disaster?

A simple example could be that the humans involved in the initial training are negative utilitarians, whose utility function rewards minimizing suffering. Once the AI is powerful enough, it could drive suffering to zero by implementing omnicide rather than merely by curing diseases.

Thus in order to arrive at a conclusion of doom, it is not enough to argue that we cannot align AI perfectly.

I am open to being corrected, but I do not recall ever seeing a requirement of "perfect" alignment in the cases made for doom. Eliezer Yudkowsky, in "AGI Ruin: A List of Lethalities", asks only for 'this will not kill literally everyone'.

Without investigating these empirical details, it is unclear whether a particular qualitatively identified force for goal-directedness will cause disaster within a particular time.

A sufficient criterion for an AI to desire catastrophe (distinct from having the means to cause catastrophe) is that it be sufficiently goal-directed to be subject to Stephen Omohundro's "Basic AI Drives".

For instance, take an entity with a cycle of preferences: apples > bananas = oranges > pears > apples. The entity notices that it sometimes treats oranges as better than pears and sometimes worse, and tries to correct this by adjusting the value of oranges to equal that of pears. But the new utility function is exactly as incoherent as the old one: the chain apples > bananas = oranges = pears > apples still loops back on itself.
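The failure mode above can be checked mechanically. Below is a sketch in my own formalization (not from the post): strict preferences and indifferences become graph edges, and a relation counts as incoherent if some chain of "at least as good as" links leads from an item back to itself while passing through at least one strictly-better link.

```python
# Hypothetical formalization: model "a is better than b" and "a is as good
# as b" as directed graph edges, then search for a cycle that uses at least
# one strict edge -- the standard test for an incoherent preference relation.

def incoherent(strict, equal):
    edges = {}  # item -> list of (other item, link_is_strict)
    for a, b in strict:
        edges.setdefault(a, []).append((b, True))
    for a, b in equal:  # indifference runs both ways
        edges.setdefault(a, []).append((b, False))
        edges.setdefault(b, []).append((a, False))

    def loops_back(start):
        # Depth-first search over (item, have-we-used-a-strict-link) states.
        seen, frontier = set(), [(start, False)]
        while frontier:
            node, used_strict = frontier.pop()
            if node == start and used_strict:
                return True
            for nxt, is_strict in edges.get(node, []):
                state = (nxt, used_strict or is_strict)
                if state not in seen:
                    seen.add(state)
                    frontier.append(state)
        return False

    return any(loops_back(item) for item in edges)

# apples > bananas = oranges > pears > apples
before = incoherent(
    strict={("apples", "bananas"), ("oranges", "pears"), ("pears", "apples")},
    equal={("bananas", "oranges")},
)
# After the "fix": oranges is set equal to pears, so oranges > pears is gone.
after = incoherent(
    strict={("apples", "bananas"), ("pears", "apples")},
    equal={("bananas", "oranges"), ("oranges", "pears")},
)
print(before, after)  # both True: the cycle survives the adjustment
```

Both calls return `True`: flattening one strict link into an indifference leaves the loop intact, which is the point of the example.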

It is possible that an AI will try to become more coherent and fail, but we are worried about the most capable AI and cannot rely on the hope that it will fail at such a simple task. Being coherent is easy if the fruits are instrumental: just look up their prices.

However if we think that utility maximization is difficult to wield without great destruction, then that suggests a disincentive to creating systems with behavior closer to utility-maximization. Not just from the world being destroyed, but from the same dynamic causing more minor divergences from expectations, if the user can’t specify their own utility function well.

A strategically aware utility maximizer would try to figure out what your expectations are, satisfy them while preparing a takeover, and strike decisively without warning. We should not expect to see an intermediate level of "great destruction".

I prefer "AI Safety" over "AI Alignment" because I associate the first more with Corrigibility, and the second more with Value-alignment.

It is the term "Safe AI" that implies 0% risk, while "AI Safety" seems more similar to "Aircraft Safety" in acknowledging a non-zero risk.

The epistemic shadow argument further requires that the fast takeoff leads to something close to extinction.

This is not the least impressive thing I expect GPT-4 won't be able to do.

I should have explained what I mean by "always (10/10)": if you generate 10 completions, you expect with 95% confidence that all 10 satisfy the criterion.

All the absolute statements in my post should be turned down from 100% to 99.5%. My intuition is that if less than 1 in 200 ideas are valuable, it will not be worthwhile to have the model contribute to improving itself.

Intelligence Amplification

GPT-4 will be unable to contribute to the core cognitive tasks involved in AI programming.

  • If you ask GPT-4 to generate ideas for how to improve itself, it will always (10/10) suggest things that an AI researcher considers very unlikely to work.
  • If you ask GPT-4 to evaluate ideas for improvement that are generated by an AI researcher, the feedback will be of no practical use.
  • Likewise, every suggestion for how to get more data or compute, or be more efficient with data or compute, will be judged by an AI researcher as hopeless.
  • If you ask GPT-4 to suggest performance improvements to the core part of its own code, every single one of these will be very weak at best.
  • If there is an accompanying paper for GPT-4, no possible prompt will make GPT-4 suggest meaningful improvements to that paper.

I assign 95% to each of these statements. I expect we will not be seeing the start of a textbook takeoff in August.
