Apologies for the scrappiness of the below -- I wanted to respond but I have mostly a scattering of thoughts rather than solid takes.
I like the intelligence curse piece very much -- it's what I meant to reference when I linked the Turing Trap above, but I couldn't remember the title & Claude pointed me to that piece instead. I agree with everything you're saying directionally! But I feel some difference in emphasis or vibe that I'm curious about.
-
One response I notice having to your points is: why the focus on value alignment?
"We could use intent alignment / corrigibility to avoid AIs being problematic due to these factors. But all these issues still remain at higher levels: the human-led organizations in charge of those AIs, the society in which those organizations compete, international relations & great-power competition."
And conversely: "if we have value alignment, I don't think there's a guarantee that we wind up in a basin of convergent human values, so you still have the problem of - whose interests are the AIs being trained & deployed to serve? Who gets oversight or vetos on that?"
(Using quotes bc these feel more like 'text completions from system 1' than all-things-considered takes from system 2.)
-
Maybe there's a crux here around how much we value the following states: AI-led world vs some-humans-led world vs deep-human-value-aligned world.
I have some feeling that AI-risk discourse has historically had a knee-jerk reaction against considering the following claims, all of which seem to me like plausible and important considerations:
I think this is partly down to founder effects where Eliezer either didn't buy these ideas or didn't want to emphasize them (bc they cut against the framing of "alignment is the key problem for all of humanity to solve together, everything else is squabbling over a poisoned banana").
-
I also notice some internal tension where part of me is like "the AIs don't seem that scary in Noosphere's world". But another part is like "dude, obviously this is an accelerating scenario where AIs gradually eat all of the meaningful parts of society - why isn't that scary?"
I think where this is coming from is that I tend to focus on "transition dynamics" to the AGI future rather than "equilibrium dynamics" of the AGI future. And in particular I think international relations and war are a pretty high risk throughout the AGI transition (up until you get some kind of amazing AI-powered treaty, or one side brutally wins, or maybe you somehow end up in a defensively stable setup, but I don't see it; the returns to scale seem so good).
So maybe I'd say "if you're not talking a classic AI takeover scenario, and you're imagining a somewhat gradual takeoff,
Thanks, I love the specificity here!
Prompt: if someone wanted to spend some $ and some expert-time to facilitate research on "inventing different types of guys", what would be especially useful to do? I'm not a technical person or a grantmaker myself, but I know a number of both types of people; I could imagine e.g. Longview or FLF or Open Phil being interested in this stuff.
Invoking Cunningham's law, I'll try to give a wrong answer for you or others to correct! ;)
Technical resources:
Seeding the community with interesting ideas:
Community resources:
Alignment by default is a minority opinion. Surveying the wide range of even truly informed opinions, it seems clear to me that we collectively don't know how hard alignment is.
Totally. I think it's "arguable" in the sense of inside-views, not outside-views, if that makes sense? Like: it could be someone's personal vibe that alignment-by-default is >99%. Should they have that as their all-things-considered view? Seems wrong to me, we should be considerably more uncertain here.
But okay, then: we should have some spread of bets across different possible worlds, and put a solid chunk of probability on alignment by default. Even if it's a minority probability, this could matter a lot for what you actually try to do!
For example: I think worlds with short timelines, hard takeoff, and no alignment-by-default are pretty doomed. It's easy to focus on those worlds and feel drawn to plans that are quite costly and incongruent with virtue and being good collaborators, e.g. "we should have One Winning AGI Project that's Safe and Smart Enough to Get Things Right", the theory of victory that brought you OpenAI.
My intuition is that worlds with at least one of those variables flipped tend to convergently favor solutions that are more virtuous / collaborative and are more likely to fail gracefully.
(I'm tired and not maximally articulate rn, but could try to say more if that feels useful.)
Thanks for the reply!
I notice I'm confused about how you think these thoughts slot in with mine. What you're saying feels basically congruent with what I'm saying. My core points about orienting to safety, which you seem to agree with, are A) safety is necessary but not sufficient, and B) it might be easier to solve than other things we also need to get right. Maybe you disagree on B?
I will note -- to me, your points 1/2 also point strongly towards risks of authoritarianism & gradual disempowerment. It feels like a non sequitur to jump from them to point 3 about safety -- I think the natural follow-up from someone not experienced with the path-dependent history of AI risk discourse would be "how do we make society work given these capabilities?" I'm curious if you left out that consideration because you think it's less big than safety, or because you were focusing on the story for safety in particular.
You're allowed to care about things besides AI safety
I worry that a lot of AI safety / x-risk people have imbibed a vibe of urgency, impossibility, and overwhelming-importance to solving alignment in particular; that this vibe distorts thinking; that the social sphere around AI x-risk makes it harder for people to update.
Yesterday I talked to an AI safety researcher who said he's pretty sure alignment will be solved by default. But whenever he talks to people about this, they just say "surely you don't think it's >99% likely? shouldn't you just keep working for the sake of that 1% chance?"
Obviously there's something real here - 1% of huge is huge. But equally -- people should notice and engage when their top priority just got arguably 100x less important! And people should be socially-allowed to step back from pushing the boulder.
The idea that safety is the only thing that matters is pretty load-bearing for many people in this community, and that seems bad for epistemics and for well-being.
I've noticed similar feelings in myself -- I think part of it is being stuck in the 2014 or even 2020 vibe of "jesus christ, society needs to wake up! AGI is coming, maybe very soon, and safety is a huge deal." Now -- okay, society-at-large still mostly doesn't care, but -- relevant bits of society (AI companies, experts, policymakers) are aware and many care a lot.
And if safety isn't the only-overwhelming-priority, if it's a tens of percents thing and not a 1-epsilon thing, we ought to care about the issues that persist when safety is solved -- things like "how the hell does society actually wield this stuff responsibly", "how do we keep it secure", etc. And issues that frankly should have always been on the table, like "how do we avoid moral atrocities like torturing sentient AIs at scale".
And on a personal & social level, we ought to care about investments that help us grapple with the situation -- including supporting people as they step back from engaging directly with the problem, and try to figure out what else they could or should be doing.
I guess these days the safety argument has shifted to "inner optimizers", which I think means "OK fine, we can probably specify human values well enough that LLMs understand us. But what if the system learns some weird approximation of those values -- or worse, conspires to fool us while secretly having other goals." I don't understand this well enough to have confident takes on it, but it feels like... a pretty conjunctive worry: a possibility worth guarding against, but not a knockdown argument.
Refurbishing the classic AI safety argument
My initial exposure to AI safety arguments was via Eliezer posts. My mental model of his logic goes something like:
"0) AI training will eventually yield high-quality agents;
1) high-quality agents will be utility maximizers;
2) utility maximizers will monomaniacally optimize for some world-feature;
3) therefore utility maximizers will seek Omohundro goals;
4) they'll be smarter than us, so this will disempower us;
5) value is fragile, so empowered AIs monomaniacally optimizing for their utility functions fuck us over with very high probability"
VNM doesn't do what you want. As folks like @Rohin Shah and @nostalgebraist have pointed out, points 2 (and therefore 3 and 5) don't really follow. A utility function can have lots of features! It can encode preferences about sequences of events, and therefore about patterns of behavior, so that the AI values interacting honorably with humans. It can value many world-features. The marginal utility can be diminishing in any particular valued feature, so that an AI-optimized world ends up richly detailed, rather than tiled with paperclips.
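To make the many-features-with-diminishing-returns point concrete, here's a toy utility function (my own illustrative example, not anything from the posts I'm citing):

$$U(x_1, \dots, x_n) = \sum_{i=1}^{n} w_i \log(1 + x_i)$$

where the $x_i$ are different valued world-features and the $w_i$ are weights. Maximize this under a finite resource budget and you spread resources across all the features (once any one $x_i$ is large, an extra unit of it is worth less than the first unit of a neglected feature), instead of dumping everything into one feature the way a linear, single-term utility would. It's still a perfectly VNM-respectable utility function; it just isn't paperclip-shaped.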
Without this misinterpretation of VNM, the classic argument gets weaker, and the threat model gets richer. On the safety side, you get a conditional argument, like "if monomaniacal optimization, then bad stuff."
But there are other if-thens that lead to bad stuff -- like "if someone instructs the AI to do bad things", or "if AI helps authoritarians or terrorists better use their existing resources" or "if we hand over all meaningful control of human society to AIs".
The argument gets weaker still given the evidence from the world we're in. (This is again all kinda obvious-feeling, but I feel like some LW people would push against this.)
On the technical side, you all know it: LLMs are surprisingly low-agency and slow-takeoffy given their capability level. They're human-language-native, so it's easy to specify human goals, and they seem to understand them pretty well. Values training also seems to work reasonably well.
On the societal side, you have a world where SF has woken up to AGI and DC is waking up. Labs are aware of AI safety risks (and even some folks in DC are).
This all pushes back against points 2 and 5 (about AIs being monomaniacal optimizers that disregard human value).
In addition, takeoff might well be slow enough that we have lots of defensive tech investments, cyberdefense AIs, overseer AIs, etc. This pushes back against point 4 (powerful AI agents will be able to disempower humanity).
Here's my update to the classic risk argument:
0) AI training will eventually yield high-quality agents.
1) These high-quality agents will be deployed at scale, but likely unequally.
2) They might be targeted at goals inimical to human values, either intentionally (through malice), quasi-intentionally (e.g. through broad disempowerment), or unintentionally (because of shoddy safety work).
3) Without appropriate guardrails, they'll seek Omohundro goals. (Or, if they're intent-aligned, they may be directed to seek power at the expense of other human groups.)
4) At some capability level -- likely between AGI and ASI -- these agents will be able to deceive overseers & evaluators, including moderately weaker AIs. They'll also plausibly be able to find weaknesses even in hardened infrastructure. This is much more worrying in worlds with harder takeoffs. In those worlds, vetted AI overseers might be much weaker than frontier AIs, and we won't have much time to make defensive investments.
5) It's not at all clear that this leads to a world with zero (or negative!) value. But these dynamics seem like clearly The Biggest Deal for how the long-term future goes, and so they're well worth improving.
But "improving these dynamics" could mean improving governance, or AI deployment policies, or AI security -- not just technical alignment work.
Neat post! The Europe-Africa ratio is especially striking, and will change my mental model of colonization a fair bit.
Also of interest for thinking about colonization / imperialism is the size of Japan's population in that last map compared to the Southeast Asian territories it conquered during WWII.
Indeed, Japan in general seems to have grown in population more slowly than just about any other major country I looked at -- a measly ~80% increase, from 70m to 125m, over 125 years. ("Russian Empire" then was bigger than Russia today, but Wikipedia has Russia proper at 70m then vs 144m today.)
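(Quick back-of-the-envelope on the implied growth rates, using just the figures quoted above and assuming the same ~125-year span for both; a sketch, not careful demography:)

```python
# Rough annualized growth rates implied by the population figures quoted above.
def cagr(start_pop, end_pop, years):
    """Compound annual growth rate between two population figures (in millions)."""
    return (end_pop / start_pop) ** (1 / years) - 1

print(f"Japan:  {cagr(70, 125, 125):.2%}/yr")   # ~0.47%/yr (70m -> 125m)
print(f"Russia: {cagr(70, 144, 125):.2%}/yr")   # ~0.58%/yr (70m -> 144m)
```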
The implied super-rapid relative population growth rates in Africa & parts of Asia in the 1900s also help me understand why people got freaked out about global overpopulation in the late 1900s, and why that pop growth needed innovations like Golden Rice to sustain it.
Training run size has grown much faster than the world’s total supply of AI compute. If these near-frontier experiments were truly a bottleneck on progress, AI algorithmic progress would have slowed down over the past 10 years.
I think this history is consistent with near-frontier experiments being important, and labs continuing to do a large number of such experiments as part of the process of increasing lab spending on training compute.
I.e.: suppose OAI now spends $100m/model instead of $1m/model. There's no reason that they couldn't still be spending, say, 50% of their training compute on running 500 experiments at 0.1% of frontier scale.
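A minimal sketch of that arithmetic (the numbers are the made-up illustrative ones from above, not actual lab figures):

```python
# Toy budget split: even after scaling the frontier run ~100x, a lab can still
# fund hundreds of near-frontier experiments. Illustrative numbers only.
frontier_run = 100e6        # hypothetical $100m frontier training run
experiment_scale = 0.001    # each experiment at 0.1% of frontier scale
n_experiments = 500

experiments_total = n_experiments * experiment_scale * frontier_run  # $50m
total_spend = frontier_run + experiments_total                       # $150m

print(f"Experiments: ${experiments_total / 1e6:.0f}m = "
      f"{experiments_total / frontier_run:.0%} of the frontier run, "
      f"or {experiments_total / total_spend:.0%} of total training compute")
```

(So the "50%" above is 50% measured against the frontier run itself, or about a third of total training spend, depending on how you count.)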
Caveat: This is at the firm level; you could argue that fewer near-frontier experiments are being done in total across the AI ecosystem, and certainly there's less information flow between organizations conducting these experiments.
OK, cool, I think I understand where you're coming from much better now. Seems like we basically agree and were just emphasizing different things in our original comments!
I'm in violent agreement that there's a missing mood when people say "AIs will follow the law". I think there's something going on where people are like "but liberalism / decentralized competition have worked so well" while ignoring all the constraints on individual actors that make it so: rule of law, external oversight, difficulty of conspiring with other humans, inefficiencies of gov't that limit its ability to abuse power, etc.
And those constraints might all fall away with the AGI transition. That's for a number of reasons: ownership of AGI could concentrate power; AGI complements existing power bases (e.g. govt has the authority, but not a great ability, to selectively enforce laws to silence opponents at mass scale); it reduces the need for conspirators. As you note, it brings down others' value as trading partners & collaborators. And takeoff dynamics could make things less like an iterated game and more like a one-shot. *taps head* can't be punished if all your opponents are dead.
(I'm guessing you'd agree with all this, just posting to clarify where my head is at)