Comments

That's not a simple problem. First, you have to specify "not killing everyone" robustly (outer alignment), and then you have to train the AI to have that goal and not an approximation of it (inner alignment).

See my other comment for the response.

Anyway, the rest of your response is spent on the case where the AI cares about its perception of the paperclips rather than the paperclips themselves. I'm not sure how severity level 1 would come about, given that the AI should only care about its reward score. Once you admit that the AI cares about worldly things like "am I turned on", it seems pretty natural that it would also care about the paperclips themselves rather than its perception of them. Nevertheless, even in severity level 1, there is still no incentive for the AI to care about future AIs, which contradicts concerns that non-superintelligent AIs would fake alignment during training so that future superintelligent AIs would be unaligned.

We don't know how to represent "do not kill everyone"

I think this goes to Matthew Barnett's recent article arguing that actually, yes, we do. And regardless, I don't think this point is a big part of Eliezer's argument. https://www.lesswrong.com/posts/i5kijcjFJD6bn7dwq/evaluating-the-historical-value-misspecification-argument

We don't know how to pick which quantity would be maximized by a would-be strong consequentialist maximizer

Yeah, so I think this is the crux of it. My point is that if we find some training approach that leads to a model that cares about the world itself rather than hacking some reward function, that's a sign that we can in fact guide the model in important ways, and there's a good chance this includes being able to tell it not to kill everyone.

We don't know what a strong consequentialist maximizer would look like, if we had one around, because we don't have one around (because if we did, we'd be dead)

This is just a way of saying “we don’t know what AGI would do”. I don’t think this point pushes us toward x-risk any more than it pushes us toward not-x-risk.

I also do not think the responses to this question are satisfying enough to be a refutation. I don't even think they are satisfying enough to make me confident I haven't just found a hole in AI risk arguments. This is not a simple case of "you just misunderstand something simple".

I don't care that much, but if LessWrong is going to downvote sincere questions because it finds them dumb or whatever, this will make for a site that is very unwelcoming to newcomers.

I do indeed agree this is a major problem, even if I'm not sure I agree with the main claim. The rise of fascism over the last decade, and the expectation that it will continue, is extremely evident; its consequences for democracy are a lot less clear.

The major wrinkle in all of this is in assessing anti-democratic behavior. Democracy indices are not a great way of assessing democracy, for much the same reason that the Doomsday Clock is a bad way of assessing nuclear risk: they're subjective metrics from (probably increasingly) left-leaning academics and tend to measure a lot of things that I wouldn't classify as democracy (eg rights of women/LGBT people/minorities). This paper found that, using re-election rates, there has been no evidence of global democratic backsliding. That started quite the controversy in political science; my read of the subsequent discussion is that there is evidence of backsliding, but that it has been fairly modest.

I expect things to get worse as more countries get far-right leaders and those which already have far-right leaders have their democratic institutions increasingly captured by them. And yet...a lot of places with far-right leaders continue to have close elections. See Poland, Turkey, and Israel if you count it. In Brazil they even lost the election. One plausible theory here is that the more anti-democratic behavior a party engages in, the more resistance it faces - either because voters are turned off or because its opponents increasingly become center or center-right parties seeking to create broad pro-democracy coalitions - and that this roughly balances out. What does this mean for how one evaluates democracy?

 

Finally, some comments specifically on more Western countries. I think the future of these countries is really uncertain.

For the next decade, it's really dependent on a lot of short-term events. Will Italy's PM Meloni engage in anti-democratic behavior? Will Le Pen win the French election, and if so, will she engage in anti-democratic behavior? Will Trump win in 2024? How quickly and how far will the upward polling trend for Germany's and Spain's far-right parties continue?

I know the piece specifies the next decade, but more long-term: the rise of fascism has come quite suddenly, in the span of the last 8 years. If it continues for a few decades (and AI doesn't kill us all), then we are probably destined for fascist governments almost everywhere and the deterioration of democratic institutions. But how long this global trend will last is really the big question in global politics. Maybe debates over AI issues will become the big issue that supplants fascism? IDK. I'd love to see some analysis of historical trends in public approval to see what a prior for this question would look like; I've never gotten around to doing it myself and am really not very well informed about the history here.

I'm going to quote this from an EA Forum post I just made, on why simple repeated exposure to AI Safety (through eg media coverage) will probably do a lot to persuade people:

[T]he more people hear about AI Safety, the more seriously they will take the issue. This seems to be true even if the coverage is purporting to debunk the issue (which, as I will discuss later, I think will be fairly rare) - a phenomenon called the illusory truth effect. I also think this effect will be especially strong for AI Safety. Right now, in EA-adjacent circles, the argument over AI Safety is mostly a war of vibes. There is very little object-level discussion - it's all just "these people are relying way too much on their obsession with tech/rationality" or "oh my god these really smart people think the world could end within my lifetime". The way we (AI Safety) win this war of vibes, which will hopefully bleed out beyond the EA-adjacent sphere, is just by giving people more exposure to our side.

No, it does not say that either. I'm assuming you're referring to "choose our words carefully", but stating something imprecisely is a far cry from not telling the truth.

Nowhere in that quote does it say we should not speak the truth.

Yeah so this seems like what I was missing.

But it seems to me that in these types of models, where the utility function is based on the state of the world rather than on the AI's inputs, aligning the AI not to kill humanity is easier. If an AI gets a reward every time it sees a paperclip, then it seems hard to penalize it for killing humans, because "a human died" is a hard thing for an AI with only sensory input to explicitly recognize. If, however, the AI is trained on a bunch of runs where the utility function is the number of paperclips actually created, then we can also penalize the model for the number of people who actually die. A toy sketch of what I mean is below.

I'm not very familiar with these forms of training so I could be off here.
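
To make that contrast concrete, here's a minimal toy sketch in Python. It's purely illustrative and assumes nothing about any real training setup: the quantities it reads (paperclips seen, paperclips created, human deaths) are hypothetical stand-ins for whatever a real environment would actually expose.

```python
# Toy illustration only: every quantity below is a made-up stand-in,
# not part of any real training environment or API.

def perception_based_reward(observation: list[str]) -> float:
    """Reward keyed to what the AI perceives: +1 per paperclip it sees.
    There is no natural place to add a "someone died" penalty, because
    deaths may never appear in the sensory stream at all."""
    return float(observation.count("paperclip"))

def world_state_utility(world: dict) -> float:
    """Utility keyed to the actual end-of-run state of the world.
    Since we are already scoring real-world quantities, we can subtract
    a large penalty for each person who actually died."""
    return float(world["paperclips_created"] - 1_000_000 * world["human_deaths"])

# Made-up numbers: a run that makes 120 paperclips at the cost of one
# death scores far worse than a run that makes nothing but harms no one.
print(perception_based_reward(["paperclip", "paperclip", "table"]))         # 2.0
print(world_state_utility({"paperclips_created": 120, "human_deaths": 1}))  # -999880.0
print(world_state_utility({"paperclips_created": 0, "human_deaths": 0}))    # 0.0
```

The point is just that a penalty on actual deaths is only expressible in the second formulation, where the score is computed from the world state rather than from the AI's observations.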

Steelmanning is useful as a technique because often the intuition behind somebody's argument is true even if the precise argument they are using is not. If the other person is a rationalist, then you can point out the argument's flaws and expect them to update the argument to more precisely capture their intuition. If not, you likely have to do some of the heavy lifting for them by steelmanning their argument and seeing where its underlying intuition might be correct.

This post seems only focused on the rationalist case.
