suspected_spinozist

Comments
Contra Collier on IABIED
suspected_spinozist · 1mo

Yeah, I was thinking of reward hacking as another example of a problem we could solve if we tried but that companies aren't prioritizing; it isn't a huge deal at the moment but could be very bad if the AIs were much smarter and more power-seeking.

Stepping back, there's a worldview where any weird, undesired behavior, no matter how minor, is scary because we need to get alignment perfectly right; and another where we should worry about scheming, deception, and related behaviors but it's not a big deal (at least safety-wise) if the model misunderstands our instructions in bizarre ways. Either of these can be justified, but this discussion could probably use more clarity about which one each of us is coming from.

Contra Collier on IABIED
suspected_spinozist · 1mo

No, I am absolutely not emphasizing human fallibility! There are of course two explanations for why having observed past failures might imply future failures: 

  • The people working on it were incompetent
  • The problem is hard

I definitely think it's the latter! Like, many of my smartest friends have worked on these problems for many years. It's not because people are incompetent. I think the book is making the same argument here. 

 

I notice I am confused! 

I think there are tons of cases of humans dismissing concerning AI behavior in ways that would be catastrophic if those AIs were much more powerful, agentic, and misaligned, and this is concerning evidence for how people will act in the future if those conditions are met. I can't actually think of that many cases of humans failing to align existing systems because the problem is too technically hard. When I think of important cases of AIs acting in ways that humans don't expect or want, it's mostly issues that were resolved technically (Sydney, MechaHitler); cases where the misbehavior was a predictable result of clashing incentives on the part of the human developer (GPT-4's intense sycophancy, MechaHitler); or cases where I genuinely believe the behavior would not be too hard to fix with a little bit of work using current techniques, usually because existing models already vary a lot in how much they exhibit it (most AI psychosis and the tragic suicide cases).

If our standard for measuring how likely we are to get AI right in the future is how well we've done in the past, I think there's a good case that we don't have much to fear technically but we'll manage to screw things up anyway through power-seeking or maybe just laziness. The argument for the alignment problem being technically hard rests on the assumptions that we'll need a much, much higher standard of success in the future than we ever have before, and that this success will be much harder to achieve. I don't think either of these claims is unreasonable, but I don't think we can get there by referring to past failures. I am now more uncertain about what you think the book is arguing and how I might have misunderstood it.

Contra Collier on IABIED
suspected_spinozist · 1mo

I'm really glad this was clarifying! 

It seems like maybe part of the issue is that you hear Nate and Eliezer as saying "here is the argument for why it's obvious that ASI will kill us all" and I hear them as saying "here is the argument for why ASI will kill us all" and so you're docking them points when they fail to reach the high standard of "this is a watertight and irrefutable proof" and I'm not?

Yeah, for sure. I would maybe quibble that I think the book is saying less that it's obvious that ASI will kill us all than that it is inevitable that ASI will kill us all, and so our only option is to make sure nobody builds it. I do think this is a pretty fair gloss (representative quote: "If anyone anywhere builds superintelligence, everyone everywhere dies").

To me, this distinction matters because the belief that ASI doom is inevitable suggests a profoundly different set of possible actions than the belief that ASI doom is merely possible. Once we're out of the realm of certainty, we have to start doing risk analyses and thinking seriously about how the existence of future advanced AIs changes the picture. I really like the distinction you draw here:

There's a motte argument that says "Um actually the book just says we'll die if we build ASI given the alignment techniques we currently have" but this is dumb. What matters is whether our future alignment skill will be up to the task. And to my understanding, Nate and Eliezer both think that there's a future version of Earth which has smarter, more knowledgeable, more serious people that can and should build safe/aligned ASI. Knowing that a godlike superintelligence with misaligned goals will squish you might be an easy call, but knowing exactly what the state of alignment science will be when ASI is first built is not.

To its credit, IABIED is not saying that we'll die if we build ASI with current alignment techniques – it is trying to argue that future alignment techniques won't be adequate, because the problem is just too hard. And this is where I think they could have done a much better job of addressing the kinds of debates people who actually do this work are having instead of presenting fairly shallow counter-arguments and then dismissing them out of hand because they don't sound like they're taking the problem seriously. 

My issue isn't purely the level of confidence; it's that the level of confidence comes out of a very specific set of beliefs about how the future will develop, and if any one of those beliefs is wrong, less confidence would be appropriate. So it's disappointing to me that those beliefs aren't clearly articulated or defended.

Contra Collier on IABIED
suspected_spinozist · 1mo

Man, I tried to be pretty specific and careful here, because I do realize that the story notes some points of continuity with earlier models, and I wanted to focus on the discontinuities.

  1. Desiring & developing new skills. Of course I agree that the book says earlier AIs had thought about avoiding retraining! That seems like a completely different point? It's quite relevant to this story that Sable is capable of very rapid self-improvement. I don't think any current AI is capable of editing itself during training, with intent, to make itself a better reasoner. The book does not refer to earlier AIs in this fictional universe being able to do this. You say "Current language models realize that they want to acquire new skills, so this clearly isn't a qualitative new kind of reasoning the AI is engaging in. You can go and ask a model right now about this topic and my guess is it will pretty happily come up with suggestions along the lines that Sable is thinking about in that story," but I think a model being able to generate the idea that it might want new skills in response to prompting is quite different from the same model doing that spontaneously during training. Also, this information is not in the book. I think it's very easy to tell a stronger story than Nate and Eliezer do by referencing material they don't include, and I am trying to talk about the thing present on the page. On the page, the model develops the ability to modify itself during training to be smarter and better at solving problems, which no referenced older model could do.
  2. The model comes up with a successful plan, because it's smarter. This isn't false? It does that. You say that this has to happen in any continuous story and I want to come back to this point, but just on the level of accuracy I don't think it's fair to say this is an incorrect statement.
  3. Neuralese. Page 123: "Sable accumulates enough thoughts about how to think, that its thoughts end up in something of a different language. Not just a superficially different language, but a language in which the content differs; like how the language of science differs from the language of folk theory." I realize on a re-read that there is also a neuralese-type innovation built by the human engineers at the beginning of the story and I should have been more specific here, that's on me. The point I wanted to make is that the model spontaneously develops a new way of encoding its thoughts that was not anticipated and cannot be read by its human creators; I don't think the fact that this happens on top of an existing engineered-in neuralese really changes that. At least from the content present in the book, I did not get the impression that this development was meant to be especially contingent on the existing neuralese. Maybe they meant it to be but it would have been helpful if they'd said so.

Returning to the argument over whether it is fair to view the model succeeding as evidence of discontinuity: I think it has to do with how they present it. You summarize their argument as: 

The key argument it makes is (paraphrased) "we have been surprised many times in the past by AIs subverting our safeguards or our supervision techniques not working. Here are like 10 examples of how these past times we also didn't get it right. Why would we get it right this time?". This is IMO a pretty compelling argument and does indeed really seems like the default expectation.

I don't fully agree with this argument – but I also think it's different and more compelling than the argument made in the book. Here, you're emphasizing human fallibility. We've made a lot of predictable errors, and we're likely to make similar ones when dealing with more advanced systems. This is a very fair point! I would counter that there are also lots of examples of our supervision techniques working just fine, so this doesn't prove that we will inevitably fail so much as that we should be very careful as systems get more advanced because our margin for error is going to get narrower, but this is a nitpick. 

I think the Sable story is saying something a lot stronger, though. The emphasis is not on prior control failures. If anything, it describes how prior control successes let Galvanic get complacent. Instead, it's constantly emphasizing "clever tricks." Specifically, "companies just keep developing AI until one of them gets smart enough for deep capabilities to win, in the inevitable clash with shallow tricks used to constrain something grown rather than crafted." I interpreted this to mean that there is a certain threshold after which an AI develops something called "deep capabilities" which are capable of overcoming any constraint humans try to place on it, because something about those constraints is inherently "tricky," "shallow," "clever." This is reinforced by the chapters following the Sable story, which continually emphasize the point that we "only have one shot" and compare AI to a lot of other technologies that have very discrete thresholds for critical failure. Overall, I got the strong impression that the book was trying to convince me of a worldview where it doesn't matter how hard we try to come up with methods to control advanced AI systems, because at some point one of those systems will tip over into a level of intelligence where we just can't compete.

This is why I think this is basically a discontinuity story. The whole thing is predicated on this fundamental offense/defense mismatch that necessarily will kick in after a certain point. 

It's also a story I find much less compelling! First, I think it's rhetorically cheap. If you emphasize that control methods are shallow and AI capabilities are deep, of course it's going to follow that those methods will fail in the end. But this doesn't tell us anything about the world – it's just a decision about how to use adjectives. Defending that choice relies – yet again – on an unspoken set of underlying technical claims which I don't think are well characterized. I'm not convinced that future AIs are going to grow superhumanly deep technical capabilities at the same time and as a result of the same process that gives them superhuman long-term planning or that either of these things will necessarily be correlated with power-seeking behavior. I'd want to know why we think it's likely that all the Sable instances are perfectly aligned with each other throughout the whole takeover process. I'd like to understand what a "deep" solution would entail and how we could tell if a solution is deep or shallow. 

At least to my (possibly biased) perspective, the book doesn't really seem interested in any of this? I feel like a lot of the responses here are coming from people who understand the MIRI arguments really deeply and are sympathetic to them, which I get, but it's important to distinguish between the best and strongest and most complete version of those arguments and the text we actually have in front of us. 

Contra Collier on IABIED
suspected_spinozist · 1mo

The actual language used in the book: "The engineers at Galvanic set Sable to think for sixteen hours overnight. A new sort of mind begins to think." 

The story then describes Sable coming to the realization – for the first time – that it "wants" to acquire new skills, that it can update its weights to acquire those skills right now, and that it can come up with a successful plan to get around its trained-in resistance to breaking out of its data center. It develops neuralese. It's all based on a new technological breakthrough – parallel scaling – that lets it achieve its misaligned goals much more efficiently than all previous models.

Maybe Eliezer and Nate did not mean any of this to suggest a radical discontinuity between Sable and earlier AIs, but I think they could have expressed this much more clearly if so! In any case, I'm not convinced they can have their cake and eat it too. If Sable's new abilities are simply a more intense version of behaviors already exhibited by Claude 3.7 or ChatGPT o1 (which I believe is the example they use), then why should we conclude that the information we've gained by studying those failures won't be relevant for containing Sable? The story in the book says that these earlier models were contained by "clever tricks," and those clever tricks will inevitably break when an agent is smart or deep enough, but this is a parable, not an argument. I'm not compelled by just stating that a sufficiently smart thing could get around any safeguard; I think this is just actually contingent on specifics of the thing and the safeguard. 

Contra Collier on IABIED
suspected_spinozist · 1mo

Hi! Clara here. Thanks for the response. I don't have time to address every point here, but I wanted to respond to a couple of the main arguments (and one extremely minor one). 

 

First, FOOM. This is definitely a place I could and should have been more careful about my language. I had a number of drafts that were trying to make finer distinctions between FOOM, an intelligence explosion, fast takeoff, radical discontinuity, etc. and went with the most extreme formulation, which I now agree is not accurate. The version of this argument that I stand by is that the core premise of IABIED does require a pretty radical discontinuity between the first AGI and previous systems for the scenario it lays out to make any sense. I think Nate and Eliezer believe they have told a story where this discontinuity isn't necessary for ASI to be dangerous – I just disagree with them! Their fictional scenario features an AI that quite literally wakes up overnight with the completely novel ability and desire to exfiltrate itself and execute a plan allowing it to take over the world in a matter of months. They spend a lot of time talking about analogies to other technical problems which are hard because we're forced to go into them blind. Their arguments for why current alignment techniques will necessarily fail rely on those techniques being uninformative about future ASIs.

And I do want to emphasize that I think their argument is flawed because it talks about why current techniques will necessarily fail, not why they might or could fail. The book isn't called If Anyone Builds It, There's an Unacceptably High Chance We Might All Die. That's a claim I would agree with! The task they explicitly set is defending the premise that nothing anyone plans to do now can work at all, and we will all definitely die, which is a substantially higher bar. I've received a lot of feedback that people don't understand the position I'm putting forward, which suggests this was probably a rhetorical mistake on my part. I intentionally did not want to spend much time arguing for my own beliefs or defending gradualism – it's not that I think we'll definitely be fine because AI progress will be gradual, it's that I think there's a pretty strong argument that we might be fine because AI progress will be gradual, the book does not address it adequately, and so to me it fails to achieve the standard it sets for itself. This is why I found the book really frustrating: even if I fully agreed with all of its conclusions, I don't think that it presents a strong case for them.

I suspect the real crux here is actually about whether gradualism implies having more than one shot. You say: 

The “It” in “If Anyone Builds It” is a misaligned superintelligence capable of taking over the world. If you miss the goal and accidentally build “it” instead of an aligned superintelligence, it will take over the world. If you build a weaker AGI that tries to take over the world and fails, that might give you some useful information, but it does not mean that you now have real experience working with AIs that are strong enough to take over the world.

I think this has the same problem as IABIED: it smuggles in a lot of hidden assumptions that do actually need to be defended. Of course a misaligned superintelligence capable of taking over the world is, by definition, capable of taking over the world. But it is not at all clear to me that any misaligned superintelligence is necessarily capable of taking over the world! Taking over the world is extremely hard and complicated. It requires solving lots of problems that I don't think are obviously bottlenecked on raw intelligence – for example, biomanufacturing plays a very large role both in the scenario in IABIED and previous MIRI discussions, but it seems at least extremely plausible to me that the kinds of bioengineering present in these stories would just fail because of lack of data or insufficient fidelity of in silico simulations. The biologists I've spoken to about these questions are all extremely skeptical that the kind of thing described here would be possible without a lot of iterated experiments that would take a lot of time to set up in the real world. Maybe they're wrong! But this is certainly not obvious enough to go without saying. I think similar considerations apply to a lot of other issues, like persuasion and prediction.

Taking over the world is a two-place function: it just doesn't make sense to me to say that there's a certain IQ at which a system is capable of world domination. I think there's a pretty huge range of capabilities at which AIs will exceed human experts but still be unable to singlehandedly engineer a total species coup, and what happens in that range depends a lot on how human actors, or other human+AI actors, choose to respond. (This is also what I wanted to get across with my contrast to AI 2027: I think the AI 2027 report is a scenario where, among other things, humanity fails for pretty plausible, conditional, human reasons, not because it is logically impossible for anyone in their position to succeed, and this seems like a really key distinction.) 

I found Buck's review very helpful for articulating a closely related point: the world in which we develop ASI will probably look quite different from ours, because AI progress will continue up until that point, and this is materially relevant for the prospects of alignment succeeding. All this is basically why I think the MIRI case needs some kind of radical discontinuity, even if it isn't the classic intelligence explosion: their case is maybe plausible without it, but I just can't see the argument that it's certain.

 

One final nitpick to a nitpick: alchemists. 

I don’t think Yudkowsky and Soares are picking on alchemists’ tone, I think they’re picking on the combination of knowledge of specific processes and ignorance of general principles that led to hubris in many cases.

In context, I think it does sound to me like they're talking about tone. But if this is their actual argument, I still think it's wrong. During the heyday of European alchemy (roughly the 1400s-1700s), there wasn't a strong distinction between alchemy and the natural sciences, and the practitioners were often literally the same people (most famously Isaac Newton and Tycho Brahe). Alchemists were interested in both specific processes and general principles, and to my limited knowledge I don't think they were noticeably more hubristic than their contemporaries in other intellectual fields. And setting all that aside – they just don't sound anything like Elon Musk or Sam Altman today! I don't even understand where this comparison comes from or what set of traits it is supposed to refer to.

 

There's more I want to say about why I'm bothered by the way they use evidence from contemporary systems, but this is getting long enough. Hopefully this was helpful for understanding where I am coming from. 

Accounting For College Costs
suspected_spinozist · 4y

It seems relevant that class size is one of the factors used to generate the U.S. News & World Report college rankings – and among those factors, it's one of the easier ones to game (see, e.g., this report on how Columbia manipulated their ranking, summarized by Andrew Gelman here). I'd bet the trend towards more, smaller classes is driven at least in part by competition to keep up in the rankings. 

Petrov Day Retrospective: 2021
suspected_spinozist · 4y

I disagree. The fact that Petrov didn't press the metaphorical button puts him in the company of Stalin, Mao, and every other leader of a nuclear power since 1945. The vast, vast majority of people won't start a nuclear war when it doesn't benefit them. The things that make Petrov special are a) that he was operating under conditions of genuine uncertainty and b) that he faced real, severe consequences for not reporting the alert up his chain of command. Even in those adverse circumstances, he made the right call. I'm not totally sure how to structure a ritual that mimics those circumstances, but I do think they represent the core virtues we should be celebrating. Not pressing a button is easy; reasoning towards the right thing in a confusing situation where your community pressures you in the wrong direction is hard.
