The evaluation function of an AI is not its aim

You may be interested in some recent empirical experiments, demonstrating objective robustness failures/inner misalignment, including ones predicted in the risks from learned optimization paper.

What will 2040 probably look like assuming no singularity?

There is at least one firm doing drone delivery in China and they just approved a standard for it.

What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs)

Mainly such complete (and irreversible!) delegation to such incompetent systems being necessary or executed. If AI is so powerful that the nuclear weapons are launched on hair-trigger without direction from human leadership I expect it to not be awful at forecasting that risk.

You could tell a story where bargaining problems lead to mutual destruction, but the outcome shouldn't be very surprising on average, i.e. the AI should be telling you about it happening with calibrated forecasts.

What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs)

The US and China might well wreck the world by knowingly taking gargantuan risks even if both had aligned AI advisors, although I think they likely wouldn't.

But what I'm saying is really hard to do is to make the scenarios in the OP (with competition among individual corporate boards and the like) occur without extreme failure of 1-to-1 alignment (for both companies and governments). Competitive pressures are the main reason why AI systems with inadequate 1-to-1 alignment would be given long enough leashes to bring catastrophe. I would cosign Vanessa and Paul's comments about these scenarios being hard to fit with the idea that technical 1-to-1 alignment work is much less impactful than cooperative RL or the like.


In more detail, I assign a ≥10% chance to a scenario where two or more cultures each progressively diminish the degree of control they exercise over their tech, and the safety of the economic activities of that tech to human existence, until an involuntary human extinction event.  (By comparison, I assign at most around a ~3% chance of a unipolar "world takeover" event, i.e., I'd sell at 3%.)

If this means that a 'robot rebellion' would include software produced by more than one company or country, I think that is a substantial possibility, as is the alternative. Competitive dynamics in a world with a few giant countries and a few giant AI companies (and only a couple of leading chip firms) can mean that safety tradeoffs play out either through one party introducing rogue AI systems that outcompete by not paying an alignment tax (and that intrinsically embody astronomically valuable and expensive IP), or through cascading alignment failure in software traceable to a leading company/consortium or country/alliance.

But either way reasonably effective 1-to-1 alignment methods (of the 'trying to help you and not lie to you and murder you with human-level abilities' variety) seem to eliminate a supermajority of the risk.

[I am separately skeptical that technical work on multi-agent RL is particularly helpful, since it can be done by 1-to-1 aligned systems when they are smart, and the more important coordination problems seem to be earlier between humans in the development phase.]

What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs)

I think I disagree with you on the tininess of the advantage conferred by ignoring human values early on during a multi-polar take-off.  I agree the long-run cost of supporting humans is tiny, but I'm trying to highlight a dynamic where fairly myopic/nihilistic power-maximizing entities end up quickly out-competing entities with other values, due to, as you say, bargaining failure on the part of the creators of the power-maximizing entities.

Right now the United States has a GDP of >$20T, US plus its NATO allies and Japan >$40T, the PRC >$14T, with a world economy of >$130T. For AI and computing industries the concentration is even greater.

These leading powers are willing to regulate companies and invade small countries based on reasons much less serious than imminent human extinction. They have also avoided destroying one another with nuclear weapons.

If one-to-one intent alignment works well enough that one's own AI will not blatantly lie about upcoming AI extermination of humanity, then superintelligent locally-aligned AI advisors will tell the governments of these major powers (and many corporate and other actors with the capacity to activate governmental action) about the likely downside of conflict or unregulated AI havens (meaning specifically the deaths of the top leadership and everyone else in all countries).

All Boards wish other Boards would stop doing this, but neither they nor their CEOs manage to strike up a bargain with the rest of the world to stop it.

Within a country, one-to-one intent alignment for government officials or actors who support the government means superintelligent advisors identify and assist in suppressing attempts by an individual AI company or its products to overthrow the government.

Internationally, with the current balance of power (and with fairly substantial deviations from it) a handful of actors have the capacity to force a slowdown or other measures to stop an outcome that will otherwise destroy them. They (and the corporations that they have legal authority over, as well as physical power to coerce) are few enough to make bargaining feasible, and powerful enough to pay a large 'tax' while still being ahead of smaller actors. And I think they are well enough motivated to stop their imminent annihilation, in a way that is more like avoiding mutual nuclear destruction than cosmopolitan altruistic optimal climate mitigation timing.

That situation could change if AI enables tiny firms and countries to match the superpowers in AI capabilities or WMD before leading powers can block it.

So I agree with others in this thread that good one-to-one alignment basically blocks the scenarios above.

Another (outer) alignment failure story

I think they are fighting each other all the time, though mostly in very prosaic ways (e.g. McDonald's and Burger King's marketing AIs are directly competing for customers). Are there some particular conflicts you imagine that are suppressed in the story?


I think the one that stands out the most is 'why isn't it possible for some security/inspector AIs to get a ton of marginal reward by whistleblowing against the efforts required for a flawless global camera grab?' I understand the scenario says it isn't because the demonstrations are incomprehensible, but why/how?

"New EA cause area: voting"; or, "what's wrong with this calculation?"

That is the opposite error, where one cuts off the close election cases. The joint probability density function over vote totals is smooth because of uncertainty (which you can see from polling errors), so your chance of being decisive scales inversely with the size of the electorate and the margin of error in polling estimation.

"New EA cause area: voting"; or, "what's wrong with this calculation?"

The error is a result of assuming the coin is exactly 50%. In fact, polling uncertainties mean your probability distribution over its 'weighting' is smeared over at least several percentage points. E.g. if your credence from polls/538/prediction markets is smeared uniformly from 49% to 54%, then the chance of the election being decided by a single vote is one divided by 5% of the number of voters.

You can see your assumption is wrong because it predicts that tied elections should be many orders of magnitude more common than they are. There is a symmetric error where people assume that the coin has a weighting away from 50%, so the chances of your vote mattering approach zero. Once you have a reasonable empirical distribution over voting propensities fit to reproduce actual election margins both these errors go away.
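To make the arithmetic concrete, here is a minimal sketch comparing the three models. The function names, the electorate size of one million, and the 52% example weighting are illustrative assumptions of mine, not figures from any real election. The uniform case uses the approximation from the comment above (one divided by the width of the interval times the number of voters), which assumes the interval contains 50% and is much wider than the binomial sampling noise.

```python
import math

def log_p_tie_fixed_coin(n_voters: int, p: float) -> float:
    """Natural log of the probability that an even number of voters
    splits exactly 50/50 when every vote is an independent coin with
    known weighting p. Returned in log space because the value
    underflows a float for large electorates when p != 0.5."""
    k = n_voters // 2
    log_binom = math.lgamma(n_voters + 1) - 2 * math.lgamma(k + 1)
    return log_binom + k * math.log(p) + k * math.log(1 - p)

def p_tie_uniform_share(n_voters: int, lo: float, hi: float) -> float:
    """Approximate tie probability when the coin's weighting is itself
    uncertain, spread uniformly over [lo, hi]. If the interval misses
    50%, a tie is treated as negligible."""
    if not lo < 0.5 < hi:
        return 0.0
    return 1.0 / ((hi - lo) * n_voters)

n = 1_000_000  # hypothetical electorate size

fair = math.exp(log_p_tie_fixed_coin(n, 0.5))      # coin assumed exactly 50%
biased_log = log_p_tie_fixed_coin(n, 0.52)         # coin assumed exactly 52%
smeared = p_tie_uniform_share(n, 0.49, 0.54)       # credence spread over 49%..54%

print(f"exactly 50% coin:      {fair:.3g}")
print(f"exactly 52% coin:      exp({biased_log:.0f})")
print(f"uniform 49% to 54%:    {smeared:.3g}")
```

Under these assumptions the exactly-50% model gives a tie chance that falls off only with the square root of the electorate, the exactly-52% model gives a chance that is effectively zero, and the smeared model sits in between at about 1 in 50,000 for a million voters.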

See Andrew Gelman's papers on this.

The Upper Limit of Value

As I said, the story was in combination with one-boxing decision theories and our duplicate counterparts.

The Upper Limit of Value

I suppose by 'the universe' I meant what you would call the inflationary multiverse, that is including distant regions we are now out of contact with. I personally tend not to call regions separated by mere distance separate universes.

"and the only impact of our actions with infinite values is the number of black holes we create."

Yes, that would be the infinite impact I had in mind: doubling the number would double the number of infinite branching trees of descendant universes.

Re simulations, yes, there is indeed a possibility of influencing other levels, although we would be more clueless, and it is a way for us to be in a causally connected patch with an infinite future.
