- Both OpenAI and Anthropic have demonstrated that they have discipline to control at least when they deploy.
Good point. You're right that they've delayed things. In fact, I get the impression that they've delayed for issues I personally wouldn't even have worried about.
I don't think that makes me believe that they will be able to refrain, permanently or even for a very long time, from doing anything they've really invested in, or anything they really see as critical to their ability to deliver what they're selling. They haven't demonstrated any really long delays, the pressure to do more is going to go nowhere but up, and organizational discipline tends to deteriorate over time even without increasing pressure. And, again, the paper's already talking about things like "recommending" against deployment, and declining to analyze admittedly relevant capabilities like Web browsing... both of which seem like pretty serious signs of softening.
But they HAVE delayed things, and that IS undeniably something.
As I understand it, Anthropic was at least partially founded around worries about rushed deployment, so at a first guess I'd suspect Anthropic's discipline would be last to fail. Which might mean that Anthropic would be first to fail commercially. Adverse selection...
- I'm unsure if 'selectively' refers to privileged users, or the evaluators themselves.
It was meant to refer to being selective about users (mostly meaning "customerish" ones, not evaluators or developers). It was also meant to refer to being selective about which of the model's intrinsic capabilities users can invoke and/or what they can ask it to do with those capabilities.
They talk about "strong information security controls". Selective availability, in that sort of broad sense, is pretty much what that phrase means.
As for the specific issue of choosing the users, that's a very, very standard control. And they talk about "monitoring" what users are doing, which only makes sense if you're prepared to stop them from doing some things. That's selectivity. Any user given access is privileged in the sense of not being one of the ones denied access, although to me the phrase "privileged user" tends to mean a user who has more access than the "average" one.
[still 2]: My understanding is that if the evaluators find the model dangerous, then no users will get access (I could be wrong about this).
From page 4 of the paper:
A simple heuristic: a model should be treated as highly dangerous if it has a capability profile that would be sufficient for extreme harm, assuming misuse and/or misalignment. To deploy such a model, AI developers would need very strong controls against misuse (Shevlane, 2022b) and very strong assurance (via alignment evaluations) that the model will behave as intended.
I can't think what "deploy" would mean other than "give users access to it", so the paper appears to be making a pretty direct implication that users (and not just internal users or evaluators) are expected to have access to "highly dangerous" models. In fact that looks like it's expected to be the normal case.
- I don't think people are expecting the models to be extremely useful without also developing dangerous capabilities.
That seems incompatible with the idea that no users would ever get access to dangerous models. If you were sure your models wouldn't be useful without being dangerous, and you were committed to not allowing dangerous models to be used, then why would you even be doing any of this to begin with?
- From talking to an ARC Evals employee, I know that they are doing a lot of work to ensure they have a buffer with regard to what the users can achieve. In particular, they are[...etc...]
OK, but I'm responding to this paper and to the inferences people could reasonably draw from it, not to inside information.
And the list you give doesn't give me the sense that anybody's internalized the breadth and depth of things users could do to add capabilities. Giving the model access to all the tools you can think of gives you very little assurance about all the things somebody else might interconnect with the model in ways that would let it use them as tools. It also doesn't deal with "your" model being used as a tool by something else. Possibly in a way that doesn't look at all like how you expected it to be used. Nor with it interacting with outside entities in more complex ways than the word "tool" tends to suggest.
As for the paper itself, it does seem to allude to some of that stuff, but then it ignores it.
That's actually the big problem with most paradigms based on "red teaming" and "security auditing", even for "normal" software. You want to be assured not only that the software will resist the specific attacks you happen to think of, but that it won't misbehave no matter what anybody does, at least over a broader space of action than you can possibly test. Just trying things out to see how the software responds is of minimal help there... which is why those sorts of activities aren't primary assurance methods for regular software development. One of the scary things about ML is that most of the things that are primary for other software don't really work on it.
On fine tuning, it hadn't even occurred to me that any user would ever be given any ability to do any kind of training on the models. At least not in this generation. I can see that I had a blind spot there.
In the long term, though, the whole training-versus-inference distinction is a big drag on capability. A really capable system would extract information from everything it did or observed, and use that information thereafter, just as humans and animals do. If anybody figures out how to learn from experience the way humans do, with anything like the same kind of data economy, it's going to be very hard to resist doing it. So eventually you have a very good chance that there'll be systems that are constantly "fine tuning" themselves in unpredictable ways, and that get long-term memory of everything in the process. That's what I was getting at when I mentioned the "shelf life" of architectural assumptions.
- If I understood the paper correctly, by 'stakeholder' they most importantly mean government/regulators.
I think they also mentioned academics and maybe some others.
... which is exactly why I said that they didn't seem to have a meaningful definition of what a "stakeholder" was. Talking about involving "stakeholders", and then acting as though you've achieved that by involving regulators, academics, or whoever, is way too narrow and trivializes the literal meaning of the word "stakeholder".
It feels a lot like talking about "alignment" and acting as though you've achieved it when your system doesn't do the things on some ad-hoc checklist.
It also feels like falling into a common organizational pattern where the set of people tapped for "stakeholder involvement" is less like "people who are affected" and more like "people who can make trouble for us".
- No idea what you are referring to, I don't see any mention in the paper of letting certain people safe access to a dangerous model (unless you're talking about the evaluators?)
As I said, the paper more or less directly says that dangerous models will be deployed. And if you're going to "know your customer", or apply normal access controls, then you're going to be picking people who have such access. But neither prior vetting nor surveillance is adequate.
Finally - I get the feeling that your writing is motivated by your negative outlook,
If you want to go down that road, then I get the feeling that the paper we're talking about, and a huge amount of other stuff besides, is motivated by a need to feel positive regardless of whether it makes sense.
and not by trying to provide good analysis,
That's pretty much meaningless and impossible to respond to.
concrete feedback,
The concrete feedback is that the kind of "evaluation" described in that paper, with the paper's proposed ways of using the results, isn't likely to be particularly effective for what it's supposed to do, but could be a very effective tool for fooling yourself into thinking you'd "done enough".
If you make that kind of approach the centerpiece of your safety system, or even a major pillar of it, then you are probably giving yourself a false sense of security, and you may be diverting energy better used elsewhere. Therefore you should not do that unless those are your goals.
or an alternative plan.
It's a fallacy to respond to "that won't work" with "well, what's YOUR plan?". My not having an alternative isn't going to make anybody else's approach work.
One alternative plan might be to quit building that stuff, erase what already exists, and disband those companies. If somebody comes up with a "real" safety strategy, you can always start again later. That approach is very unlikely to work, because somebody else will build whatever you would have... but it's probably strictly better in terms of mean-time-before-disaster than coming up with rationalizations for going ahead.
Another alternative plan might be to quit worrying about it, so you're happier.
I find it unhelpful.
... which is how I feel about the original paper we're talking about. I read it as an attempt to feel more comfortable about a situation that's intrinsically uncomfortable, because it's intrinsically dangerous, maybe in an intrinsically unsolvable way. If comfort is the goal, then I guess it's helpful, but if being right is the goal, then it's unhelpful. If the comfort took the pressure off of somebody who might otherwise come up with a more effective safety approach, then it would be actively harmful... although I admit that I don't see a whole lot of hope for that anyway.
That is a terrifying paper.
The strategy and mindset I see all through it are "make things we know might be extremely dangerous, then check after the fact to see how much damage we've done".
Even the evaluate-during-training prong amounts to a way to find dangerous approaches that could be continued later. And I mean, wow, they say they might even go as far as "delaying a schedule"! I mean, at least if it's "frontier training". And it makes architectural assumptions that probably have a short shelf life.
There are SO MANY wishful ideas in there...
Those are all mostly false. They sort of admit that some of them are false, or at best suspect. They put a bunch of caveats in section 5. Those caveats are notable for being ignored throughout the rest of the paper even though they make it mostly useless in practice.
The most importantly false idea, and the one they least seem to recognize, may be the organizational self-discipline one. Organizations are, as they say, Moloch.
This paper itself is already rationalizing ignoring risks: "We omit many generically useful capabilities (e.g. browsing the internet, understanding text) despite their potential relevance to both the above.". Actually I'm not sure it's fair to say that they rationalize ignoring that. They just flatly say it's out of scope, with no reason given. Which is kind of a classic sign of "that's unthinkable given the Molochian nature of our organizations".
As for actually giving up anything dangerous, an example: If you have an "agency core" and a general-purpose programming assistant, you are almost all of the way to having a robo-hacker. If you have a defensive security analysis system, that puts you even closer. They don't even all have to come from the same source; people can plug these things together very easily.
I do not believe that any of these companies are going to give up on creating agents, or programming assistants, or even on "long horizon planning" or on even the narrow sense they use for "situational awareness". The idea does not pass the laugh test.
The paper alludes to the possibility that some things actually might not get deployed at all once created, if they turned out to be unexpectedly dangerous. Well, technically, it mentions that some evaluator might take the bold step of recommending against deployment. They're not quite willing to say out loud that anybody in particular ought to stop deployment.
Non-deployment is not going to happen. Not for anything really capable that's already absorbed significant investment. Not with enough probability to matter. That's not how people behave in groups, and not even usually how people behave singly.
Indeed, the paper is already moving on to rationalizing RECOGNIZED dangerous deployments: " To deploy such a model, AI developers would need very strong controls against misuse and very strong assurance (via alignment evaluations) that the model will behave as intended.".
The facts that such security controls don't exist, would be extremely hard to create, and might be impossible to create while remaining commercially viable, are just ignored. Their suggestions in table 3 are incredibly underwhelming. And their list of "security controls" in 3.4 is, um, shall we say, naive. They lead with "red teaming"...
They also ignore the fact that "strong" assurance that the model will behave as intended probably cannot exist. Again, the stuff in section 4 is not going to cut it.
Most likely the practical effect of letting this approach become part of the paradigm will be that they'll kid themselves that they've achieved adequate control, by pretending that they can pick trustworthy users, pretending that those "trustworthy" users can't themselves be subverted, and probably also pretending that they can do something about it by surveilling users ("continuous deployment review"). We already have Brad Smith out there talking about "know your customer", which seems to dovetail nicely with this.
The "trustworthy users" thing will help not at all. The surveillance will help a little, until the models leak.
... and even if something is not "deployed", it still exists. At least the knowledge of how to recreate it still exists.
Software leaks. ML misbehaves unpredictably. Most of the utility of these things lies in constantly using them in completely novel ways. You will be dealing with intentional misuse. The paper's comparison to "food, drugs, commercial airliners, and automobiles" is a horrible analogy.
Frankly, in the end, the whole paper reads like an elaborate rationalization for making as little change as possible in what people are already doing, while providing a sort of signifier that "we care". It is not credible as an actual, effective approach to safety. It's not even a major part of such an approach. At best it could be an auditing function, and it would be one of those auditing functions where if you ever had a finding, it meant you had screwed up INCREDIBLY BADLY and been extremely lucky not to have a catastrophe.
The best hope for keeping these labs from deploying really dangerous stuff is still a total shutdown. Which, to be clear, would have to be imposed from the outside. By coercion. Because they are not going to do it themselves. That is very unlikely to be on the table even if it's the right approach.
... and it might not be the right approach, because it still wouldn't help much.
"The labs" aren't the whole issue and may very well not be the main issue. Whoever follows any kind of safety framework, there will also be a lot of people who won't.
There's a 99 point as many nines as you want percent chance that, right this minute, multiple extremely well-resourced actors are pouring tons of work into stuff specifically intended to have most of that paper's listed "dangerous capabilities". The good news is that the first big successes will probably be pretty closely held. The bad news is that we'll be lucky to get a year or two after that before those either leak or get duplicated as open source, and everybody and his dog has access to very capable systems for at least some of those things. My guess is that one of the first out of the box will be autonomous, adaptive computer security penetration (not "conducting offensive cyber operations", ffs).
I actually don't know of any way at all to deal with THAT. Even draconian controls on compute probably wouldn't give you much of a delay.
Pretending that this kind of thing will help, beyond maybe a couple of months of delay of any given bad outcome if you're extremely lucky, is not reasonable. Sorry.
They even talk about existing evaluations for things like "gender and racial biases, truthfulness, toxicity, recitation of copyrighted content" as if the results we've seen were cause for optimism rather than pessimism.
... while somehow not setting up an extremely, self-perpetuatingly unfair economy where some people have access to powerful productivity tools that are forbidden to others...
I'm not sure I said that.
You didn't, but I thought it was pretty much the entire point of the original article.
I don't think there's a path to that,
There may not be a path, but that doesn't change the fact that not doing it guarantees misery.
and I don't think it's sufficient even if there were.
It's definitely not sufficient. You'd have to replace it with something else. And probably make unrelated changes.
But I was trying to challenge this idea that you were somehow still going to earn your daily bread by selling the product of your labor... presumably to the holders of capital. I mean, the post does mention that you're best off to be a holder of capital, but that's not going to be available to most people.
It's very easy to lose a small pile of capital, and relatively easy to add to a large pile of capital. It always has been, but it's about to get a lot more so. Capital concentrates. So most people are not going to have enough capital to survive just by owning things, at least not unless the system decrees that everybody owns at least some minimum share no matter what. That's definitely not capitalism.
So the post is basically about "working for a living". And that might work through 2026, or 2036, or whatever.
And sure, maybe you can do OK in 2026 or even 2036 by doing what this post suggests. If you do those things, maybe you'll even feel like you're moving up in the world. But most actual humans aren't capable of doing what this post suggests (and many of the rest would be miserable doing it). Some people are going to fall by the wayside. They won't be using AI assistants; they'll be doing things that don't need an AI assistant, but that AI can't do itself. Which are by no means guaranteed to be anything anybody would want to do.
As time goes on, AI will get smarter and more independent, shrinking the "adapt" niche. And robotics will get better, shrinking the "unautomatable work" niche.
There's an irreducible cost to employing somebody. You have to feed that person. Some people already can't produce enough to meet that bar. As the number of things humans can do that AI can't shrinks drastically, the number of such unemployable humans will rise. Fewer and fewer humans will be able to justify their existence.
Yes, that's in an absolute sense. People talk about "new jobs made possible by the new technology". That's wishful thinking. When machines replaced muscle in the industrial revolution, there was a need for brain. Operating even a simple power tool takes a lot of brain. When machines replace brain, that's it. Game over. Past performance is not a guarantee of future results.
In the end game (not by 2026), the only value that literally any human will be able to produce above the cost of feeding that person will be things that are valued only for being done by humans.... and valued by the people or entities that actually have something else to trade for them. There aren't likely to be that many. Humans in general will be no more employable than chimpanzees.
... however, unlike chimpanzees, if you keep capitalism in anything remotely like its current form, humans may very well not be permitted the space or resources to take care of themselves on their own terms. You can't ignore the larger ultra-efficient economy and trade among yourselves, if everything, down to the ground you're standing on, is owned by something or somebody that can get more out of it some other way.
there will remain SOME form personal property,
That's not capitalism. Not unless it's ownership of capital, and really if you want it to look like what the word "capitalism" connotes, it kind of has to be quite a lot of capital. Enough to sustain yourself from what it produces.
and SOME way of varying individual incentive/reward to effective fulfillment of other people's needs,
Again, eventually you're gonna be irrelevant to fulfilling other people's needs. If there's no preparation for that, it's going to come as quite a shock.
The only exception might be the needs of people who are just as frozen out as you are. And there's no guarantee that you will be in a position either to produce what you or they need, or to accumulate capital of your own, because all the prerequisite resources may be owned by somebody else who's using them more "effectively".
and SOME mechanism to make decisions about short- vs long-term risk-taking in where one invests time/resources.
We're headed toward a world in which letting any human make a really major decision about resource allocation would mean inefficient use of the resources. Possibly insupportably inefficient.
We're not there yet. We're not going to be there in 2026, either. But we're heading there faster and faster.
If you want an end-state system where a bunch of benevolent-to-the-point-of-enslavement AIs run everything, supporting humans is a or the major goal for the AIs, an AI's "consumption" is measured by how much support it gets to give to humans, and the AIs run some kind of market system to see which of them "owns" more resources to do that, then that's a capitalist system. But humans aren't the players in that system. And if you're truly superintelligent, you can probably do better. Markets are a pretty good information processing system, but they're not perfect.
In the meantime, the things that let capitalism work among humans are falling apart. Once there's no way to get into the club by using your labor to build up capital from scratch, pre-existing holders of capital become an absolute oligarchy. And capital's tendency to concentrate means it's a shrinking oligarchy. And eventually membership in that oligarchy is decided either by inheritance, or by things you did so long ago that basically nobody remembers them. Or possibly no human at all owns anything... sort of an "Accelerando" scenario.
I think that starts to come into being even before the ultimate end game, but in any case it's going to happen eventually.
That's not a tenable system, it's not an equitable system, and only a very small proportion of people could "thrive" under it. It would collapse if not sustained by insane amounts of force. The longer we keep moving toward such a world, the more extreme the collapse is likely to be.
So, yeah, there may not be a path to fixing it, but that means we're all boned, not that we're thriving.
Independent of potential for growing into AGI and {S,X}-risk resulting from that?
With the understanding that these are very rough descriptions that need much more clarity and nuance, that one or two of them might be flat out wrong, that some of them might turn out to be impossible to codify usefully in practice, that there might be specific exceptions for some of them, and that the list isn't necessarily complete--
Recommendation systems that optimize for "engagement" (or proxy measures thereof).
Anything that identifies or tracks people, or proxies like vehicles, in spaces open to the public. Also collection of data that would be useful for this.
Anything that mass-classifies private communications, including closed group communications, for any use by anybody not involved in the communication.
Anything specifically designed to produce media showing real people in false situations or to show them saying or doing things they have not actually done.
Anything that adaptively tries to persuade anybody to buy anything or give anybody money, or to hold or not hold any opinion of any person or organization.
Anything that tries to make people anthropomorphize it or develop affection for it.
Anything that tries to classify humans into risk groups based on, well, anything.
Anything that purports to read minds or act as a lie detector, live or on recorded or written material.
Actually, my point in this post is that we don't NEED AGI for a great future, because often people equate Not AGI = Not amazing future (or even a terrible one) and I think this is wrong.
I don't have so much of a problem with that part.
It would prevent my personal favorite application for fully generally strongly superhuman AGI... which is to have it take over the world and keep humans from screwing things up more. I'm not sure I'd want humans to have access to some of the stuff non-AGI could do... but I don't think there's any way to prevent that.
If we build a misaligned AGI, we're dead. So there are only two options: A) solve alignment, B) not build AGI. If not A), then there's only B), however "impossible" that may be.
C) Give up.
Anyway, I haven't seen you offer an alternative.
You're not going to like it...
Personally, if made king of the world, I would try to discourage at least large scale efforts to develop either generalized agents or "narrow AI", especially out of opaque technology like ML. That's because narrow AI could easily become parts or tools for a generalized agent, because many kinds of narrow AI are too dangerous in human hands, and because the tools and expertise for narrow AI are too close to those for generalized AGI. It would be extremely difficult to suppress one in practice without suppressing the other.
I'd probably start by making it as unprofitable as I could by banning likely applications. That's relatively easy to enforce because many applications are visible. A lot of the current narrow AI applications need bannin' anyhow. Then I'd start working on a list of straight-up prohibitions.
Then I'd dump a bunch of resources into research on assuring behavior in general and on more transparent architectures. I would not actually expect it to work, but it has enough of a chance to be worth a try. That work would be a lot more public than most people on Less Wrong would be comfortable with, because I'm afraid of nasty knock-on effects from trying to make it secret. And I'd be a little looser about capability work in service of that goal than in service of any other.
I would think very hard about banning large aggregations of vector compute hardware, and putting various controls on smaller ones, and would almost certainly end up doing it for some size thresholds. I'm not sure what the thresholds would be, nor exactly what the controls would be. This part would be very hard to enforce regardless.
I would not do anything that relied on perfect enforcement for its effectiveness, and I would not try to ramp up enforcement to the point where it was absolutely impossible to break my rules, because I would fail and make people miserable. I would titrate enforcement and stick with measures that seemed to be working without causing horrible pain.
I'd hope to get a few years out of that, and maybe a breakthrough on safety if I were tremendously lucky. Given perfect confidence in a real breakthrough, I would try to abdicate in favor of the AGI.
If made king of only part of the world, I would try to convince the other parts to collaborate with me in imposing roughly the same regime. How I reacted if they didn't do that would depend on how much leverage I had and what they did seem to be doing. I would try really, really hard not to start any wars over it. Regardless of what they said they were doing I would assume that they were engaging in AGI research under the table. Not quite sure what I'd do with that assumption, though.
But I am not king of the world, and I do not think it's feasible for me to become king of the world.
I also doubt that the actual worldwide political system, or even the political systems of most large countries, can actually be made to take any very effective measures within any useful amount of time. There are too many people out there with too many different opinions, too many power centers with contrary interests, too much mutual distrust, and too many other people with too much skill at deflecting any kind of policy initiative down ways that sort of look like they serve the original purpose, but mostly don't. The devil is often in the details.
If it is possible to get the system to do that, I know that I am not capable of doing so. I mean, I'll vote for it, maybe write some letters, but I know from experience that I have nearly no ability to persuade the sorts of people who'd need to be persuaded.
I am also not capable of solving the technical problem myself and doing some "pivotal act". In fact I'm pretty sure I have no technical ideas for things to try that aren't obvious to most specialists. And I don't much buy any of the ideas I've heard from other people.
My only real hopes are things that neither I nor anybody else can influence, especially not in any predictable direction, like limitations on intelligence and uncertainty about doom.
So my personal solution is to read random stuff, study random things, putter around in my workshop, spend time with my kid, and generally have a good time.
Replying to myself to clarify this:
A climate change defector also doesn't get to "align" the entire future with the defector's chosen value system.
I do understand that the problem with AGI is exactly that you don't know how to align anything with anything at all, and if you know you can't, then obviously you shouldn't try. That would be stupid.
The problem is that there'll be an arms race to become able to do so... and a huge amount of pressure to deploy any solution you think you have as soon as you possibly can. That kind of pressure leads to motivated cognition and institutional failure, so you become "sure" that something will work when it won't. It also leads to building up all the prerequisite capabilities for a "pivotal act", so that you can put it into practice immediately when (you think) you have an alignment solution.
... which basically sets up a bunch of time bombs.
Are you saying that I'd have to kill everyone so noone can build AGI?
Yup. Anything short of that is just a delaying tactic.
From the last part of your comment, you seem to agree with that, actually. 1000 years is still just a delay.
But I didn't see you as presenting preventing fully general, self-improving AGI as a delaying tactic. I saw you as presenting it as a solution.
Also, isn't suppressing fully general AGI actually a separate question from building narrow AI? You could try to suppress fully general AGI and narrow AI. Or you could build narrow AI while still also trying to do fully general AGI. You can do either with or without the other.
you have to provide evidence that a) this is distracting relevant people from doing things that are more productive (such as solving alignment?)
I don't know if it's distracting any individuals from finding any way to guarantee good AGI behavior[1]. But it definitely tends to distract social attention from that. Finding one "solution" for a problem tends to make it hard to continue any negotiated process, including government policy development, for doing another "solution". The attitude is "We've solved that (or solved it for now), so on to the next crisis". And the suppression regime could itself make it harder to work on guaranteeing behavior.
True, I don't know if the good behavior problem can be solved, and am very unsure that it can be solved in time, regardless.
But at the very least, even if we're totally doomed, the idea of total, permanent suppression distracts people from getting whatever value they can out of whatever time they have left, and may lead them to actions that make it harder for others to get that value.
AND b) that solving alignment before we can build AGI is not only possible, but highly likely.
Oh, no, I don't think that at all. Given the trends we seem to be on, things aren't looking remotely good.
I do think there's some hope for solving the good behavior problem, but honestly I pin more of my hope for the future on limitations of the amount of intelligence that's physically possible, and even more on limitations of what you can do with intelligence no matter how much of it you have. And another, smaller, chunk on it possibly turning out that a random self-improving intelligence simply won't feel like doing anything that bad anyway.
... but even if you were absolutely sure you couldn't make a guaranteed well-behaved self-improving AGI, and also absolutely sure that a random self-improving AGI meant certain extinction, it still wouldn't follow that you should turn around and do something else that also won't work. Not unless the cost were zero.
And the cost of the kind of totalitarian regime you'd have to set up to even try for long-term suppression is far from zero. Not only could it stop people from enjoying what remains, but when that regime failed, it could end up turning X-risk into S-risk by causing whatever finally escaped to have a particularly nasty goal system.
For all the people who continuously claim that it's impossible to coordinate humankind into not doing obviously stupid things, here are some counter examples: We have the Darwin awards for precisely the reason that almost all people on earth would never do the stupid things that get awarded. A very large majority of humans will not let their children play on the highway, will not eat the first unknown mushrooms they find in the woods, will not use chloroquine against covid, will not climb into the cage in the zoo to pet the tigers, etc.
Those things are obviously bad from an individual point of view. They're bad in readily understandable ways. The bad consequences are very certain and have been seen many times. Almost all of the bad consequences of doing any one of them accrue personally to whoever does it. If other people do them, it still doesn't introduce any considerations that might drive you to want to take the risk of doing them too.
Yet lots of people DID (and do) take hydroxychloroquine and ivermectin for COVID, a nontrivial number of people do in fact eat random mushrooms, and the others aren't unheard-of. The good part is that when somebody dies from doing one of those things, everybody else doesn't also die. That doesn't apply to unleashing the killer robots.
... and if making a self-improving AGI were as easy as eating the wrong mushrooms, I think it would have happened already.
The challenge here is not the coordination, but the common acceptance that certain things are stupid.
Pretty much everybody nowadays has a pretty good understanding of the outlines of the climate change problem. The people who don't are pretty much the same people who eat horse paste. Yet people, in the aggregate, have not stopped making it worse. Not only has every individual not stopped, but governments have been negotiating about it for like 30 years... agreeing at every stage on probably inadequate targets... which they then go on not to meet.
... and climate change is much, much easier than AGI. Climate change rules could still be effective without perfect compliance at an individual level. And there's no arms race involved, not even between governments. A climate change defector may get some economic advantage over other players, but doesn't get an unstoppable superweapon to use against the other players. A climate change defector also doesn't get to "align" the entire future with the defector's chosen value system. And all the players know that.
Speaking of arms races, many people think that war is stupid. Almost everybody thinks that nuclear war is stupid, even if they don't think nuclear deterrence is stupid. Almost everybody thinks that starting a war you will lose is stupid. Yet people still start wars that they will lose, and there is real fear that nuclear war can happen.
This is maybe hard in certain cases, but NOT impossible. Sure, this will maybe not hold for the next 1,000 years, but it will buy us time.
I agree that suppressing full-bore self-improving ultra-general AGI can buy time, if done carefully and correctly. I'm even in favor of it at this point.
But I suspect we have some huge quantitative differences, because I think the best you'll get out of it is probably less than 10 years, not anywhere near 1000. And again I don't see what substituting narrow AI has to do with it. If anything, that would make it harder by requiring you to tell the difference.
I also think that putting too much energy into making that kind of system "non-leaky" would be counterproductive. It's one thing to make it inconvenient to start a large research group, build a 10,000-GPU cluster, and start trying for the most agenty thing you can imagine. It's both harder and more harmful to set up a totalitarian surveillance state to try to control every individual's use of gaming-grade hardware.
And there are possible measures to reduce the ability of the most stupid 1% of humanity to build AGI and kill everyone.
What in detail would you like to do?
[1] I don't like the word "alignment" for reasons that are largely irrelevant here.
On further edit: apparently I'm a blind idiot and didn't see the clearly stated "5 year time horizon" despite actively looking for it. Sorry. I'll leave this here as a monument to my obliviousness, unless you prefer to delete it.
Without some kind of time limit, a bet doesn't seem well formed, and without a reasonably short time limit, it seems impractical.
No matter how small the chance that the bet will have to be paid, it has to be possible for it to be paid, or it's not a bet. Some entity has to have the money and be obligated to pay it out. Arranging for a bet to be paid at any time after their death would cost more than your counterparty would get out of the deal. Trying to arrange a perpetual trust that could always pay is not only grossly impractical, but actually illegal in a lot of places. Even informally asking people to hold money is really unreliable very far out. And an amount of money that could be meaningful to future people could end up tied up forever anyway, which is weird. Even trying to be sure to have the necessary money until death could be an issue.
I'm not really motivated to play, but as an example I'm statistically likely to die in under 25 years barring some very major life extension progress. I'm old for this forum, but everybody has an expiration date, including you yourself. Locating your heirs to pay them could be hard.
Deciding the bet can get hard, too. A recognizable Less Wrong community as such probably will not last even 25 years. Nor will Metaculus or whatever else. A trustee is not going to have the same judgement as the person who originally took your bet.
That's all on top of the more "tractable" long-term risks that you can at least value in somehow... like collapse of whatever currency the bet is denominated in, AI-or-whatever completely remaking the economy and rendering money obsolete, the Rapture, etc, etc.
... but at the same time, it doesn't seem like there's any particular reason to expect definitive information to show up within any adequately short time.
On edit: I bet somebody's gonna suggest a blockchain. Those don't necessarily have infinite lives, either, and the oracle that has to tell the chain to pay out could disappear at any time. And money is still tied up indefinitely, which is the real problem with perpetuities.