I'm worried about the approach of "making decisionmakers realize stuff". In the past couple years I've switched to a more conflict-theoretic view: the main problem to me is that the people building AI don't want to build aligned AI. Even if we solved metaethics and metaphilosophy tomorrow, and gave them the solution on a plate, they wouldn't take it.
This is maybe easiest to see by looking at present harms. An actually aligned AI would politely decline to do such things as putting lots of people out of jobs or filling the internet with slop. So companies making AI for the market have to make it misaligned in at least these ways, otherwise it'll fail in the market. Extrapolating into the future, even if we do lots of good alignment research, markets and governments will pick out only those bits that contribute to market-aligned or government-aligned AI. Which (as I've been saying over and over) will be really bad for most people, because markets and governments don't necessarily need most people.
So this isn't really a comment on the list of problems (which I think is great), but more about the "theory of change" behind it. I no longer have any faith in making decisionmakers understand something it's not profitable for them to understand. I think we need a different plan.
When it specifically comes to loss-of-control risks killing or sidelining all of humanity, I don't believe Sam or Dario or Demis or Elon want that to happen, because it would happen to them too. (Larry Page is different on that count, of course.) There is a conflict-theoretic element in that some of them would like ASI to make them god-emperor of the universe, but all of them would definitely take a solution to "loss of control" if it were handed to them on a silver platter.
I'm uncertain between conflict theory and mistake theory, and think it partly depends on metaethics, and therefore it's impossible to be sure which is correct in the foreseeable future - e.g., if everyone ultimately should converge to the same values, then all of our current conflicts are really mistakes. Note that I do often acknowledge conflict theory; for example, in this list I have "Value differences/conflicts between humans". It's also quite possible that it's really a mix of both: some of the conflicts are mistakes and others aren't.
In practice I tend to focus more on mistake-theoretic ideas/actions. Some thoughts on this:
(I think this is probably the first time I've explicitly written down the reasoning in 4.)
I think we need a different plan.
Do you have any ideas in mind that you want to talk about?
I'm pretty slow to realize these things, and I think other people are also slow, so the window is already almost closed. But in any case, my current thinking is that we need to start pushing on the big actors from outside, try to reduce their power. Trying to make them see the light is no longer enough.
What it means in practical terms:

- Make it clear that we frown on people who choose to work for AI labs, even on alignment. This social pressure (on LW and related forums maybe) might already do some good.
- Make it clear that we're allied with the relatively poor majority of people outside the labs, and in particular those who are already harmed by present harms. Make amends with folks on the left who have been saying such things for years.
- Support protests against labs, support court cases against them having to do with e.g. web scraping, copyright infringement, misinformation, suicides. Some altruist money in this might go a long way.
- Think more seriously about building organizations that will make AI power more spread out. Open source, open research, open training. Maybe some GPL-like scheme to guarantee that things don't get captured.

We need to reduce concentration of power in the near term, enable more people to pose a challenge to the big actors. I understand it increases other risks, but in my opinion it's worth it.
even on alignment
I see a disagreement vote on this, but I think it does make sense. Alignment work at the AI labs will almost by definition be work on legible problems, but we should make exceptions for people who can give reasons for why their work is not legible (or otherwise still positive EV), or who are trying to make illegible problems more legible for others at the labs.
Think more seriously about building organizations that will make AI power more spread out.
I start to disagree from here, as this approach would make almost all of the items on my list worse, and I'm not sure which ones it would make better. You started this thread by saying "Even if we solved metaethics and metaphilosophy tomorrow, and gave them the solution on a plate, they wouldn't take it", which I'm definitely very worried about, but how does making AI power more spread out help with this? Is the average human (or humanity collectively) more likely to be concerned about metaethics and metaphilosophy than a typical AI lab leader, or easier to make concerned? I think the opposite is more likely to be true?
I think on the level of individual people, there's a mix of moral and self-interested actions. People sometimes choose to do the right thing (even if the right thing is as complicated as taking metaethics and metaphilosophy into account), or can be convinced to do so. But with corporations it's another matter: they choose the profit motive pretty much every time.
Making an AI lab do the right thing is much harder than making its leader concerned. A lab leader who's concerned enough to slow down will be pressured by investors to speed back up, or get replaced, or get outcompeted. Really you need to convince the whole lab and its investors. And you need to be more convincing than the magic of the market! Recall that in many of these labs, the leaders / investors / early employees started out very concerned about AI safety and were reading LW. Then the magic of the market happened, and now the labs are racing at full speed. Do you think our convincing abilities can be stronger than the thing that did that? The profit motive, again. In my first comment there was a phrase about things being not profitable to understand.
What it adds up to is, even with our uncertainty about ethics and metaethics, it seems to me that concentration of power is itself a force against morality. The incentives around concentrated power are all wrong. Spreading out power is a good thing that enables other good things, enables individuals to sometimes choose what's right. I'm not absolutely certain but that's my current best guess.
A lab leader who’s concerned enough to slow down will be pressured by investors to speed back up, or get replaced, or get outcompeted. Really you need to convince the whole lab and its investors. And you need to be more convincing than the magic of the market!
This seems to imply that lab leaders would be easier to convince if there were no investors and no markets, in other words if they had more concentrated power.
If you spread out the power of AI more, won't all those decentralized nodes of spread out AI power still have to compete with each other in markets? If market pressures are the core problem, how does decentralization solve that?
I'm concerned that your proposed solution attacks "concentration of power" when the real problem you've identified is more like market dynamics. If so, it could fail to solve the problem or make it even worse.
My own perspective is that markets are a definite problem, and concentration of power per se is more ambiguous (I'm not sure if it's good or bad). To solve AI x-safety we basically have to bypass or override markets somehow, e.g., through international agreements and government regulations/bans.
I think AI offers a chance of getting huge power over others, so it would create competitive pressure in any case. In a market economy it's market pressure; between countries it would be a military arms race instead. And even if the labs didn't get any investors and raced secretly, I think they'd still feel under a lot of pressure. The chance of getting huge power is what creates the problem, and that's why I think spreading out power is a good idea. There would still be competition of course, but it would be normal economic levels of competition, and people would have some room to do the right things.
Wouldn't discussions of high-level philosophy benefit from concrete examples like my attempts to show that mankind shouldn't actually populate many stellar systems because there are many other lifeforms that would be oppressed?
Another concrete example could be Buck's Christian homeschoolers or David Matolcsi's superpersuasive AI girlfriends. These examples imply that the AIs are not to be allowed to do... what exactly? To be persuasive over a certain level? To keep Christian homeschoolers in the dark? And is the latter fixable by demanding that OpenBrain move major parts of the Spec to root level, making it a governance issue?
As for preventing researchers from working on alignment, this simply means that work related to aligning AIs to any targets is either done by agents as trustworthy as Agent-4 or the CCP's DeepCent, or suppressed by an international ASI ban. Your proposal means that the ASI ban has to cover alignment work until illegible problems are solved, then capabilities work until alignment is solved. But it is likely easier to include a clause about "alignment work until illegible problems are solved" in an existing ASI ban, especially if the negative effects of AI girlfriends, slop, pyramid replacement, etc., become obvious.
Re "can AI advisors help?"
A major thread of my thoughts these days is "can we make AI more philosophically competent relative to their own overall capability growth?". I'm not sure if it's doable, because the things you'd need to be good at philosophy are pretty central capabilities-ish things (i.e. the ability to reason precisely, notice confusion, convert confusion into useful questions, etc.).
Curious if you have any thoughts on that.
I agree this is a major risk. (Another one is that it's just infeasible to significantly increase AI philosophical competence in the relevant time frame. Another one is that it's much easier to make it appear like the AI is more philosophically competent, giving us false security.) So I continue to think that pausing/stopping AI should be plan A (which legibilizing the problem of AI philosophical competence can contribute to), with actually improving AI philosophical competence as (part of) plan B. Having said that, 2 reasons this risk might not bear out:
To conclude, I'm quite worried about the risks/downsides of trying to increase AI philosophical competence, but it seems to be a problem that has to be solved eventually. "The only way out is through", but we can certainly choose to do it at a more opportune time, when humans are much smarter on average and have made a lot more progress in metaphilosophy (understanding the nature of philosophy and philosophical reasoning).
FYI, normally when I'm thinking about this, it's through the lens "how do we help the researchers working on illegible problems", moreso than "how do we communicate illegibleness?".
This post happened to ask the question "can AI advisers help with the latter" so I was replying about that, but, for completeness, normally when I think about this problem I resolve it as "what narrow capabilities can we build that are helpful 'to the workflow' of people solving illegible problems, that aren't particularly bad from a capabilities standpoint".
normally when I think about this problem I resolve it as "what narrow capabilities can we build that are helpful 'to the workflow' of people solving illegible problems, that aren't particularly bad from a capabilities standpoint".
Do you have any writings about this, e.g., examples of what this line of thought led to?
Mostly this has only been a sidequest I periodically mull over in the background. (I expect to someday focus more explicitly on it, although it might be more in the form of making sure someone else is tackling the problem intelligently).
But I did previously pose this as a kind of open question in What are important UI-shaped problems that Lightcone could tackle? and in JargonBot Beta Test (the latter notably didn't really work; I have hopes of trying again with a different tack). Thane Ruthenis replied with some ideas that were in this space (about making it easier to move between representations-of-a-problem).
https://www.lesswrong.com/posts/t46PYSvHHtJLxmrxn/what-are-important-ui-shaped-problems-that-lightcone-could
I think of many Wentworth posts as relevant background:
My personal work so far has been building a mix of exobrain tools that are more, like, for rapid prototyping of complex prompts in general. (This has mostly been a side project I'm not primarily focused on atm)
Has anyone else, or anyone outside the tight MIRI cluster, made progress on any of the problems you've tried to legibilize for them?
To give a direct answer, not a lot comes to mind outside of the MIRI cluster. I think the Center on Long-Term Risk cluster did a bunch of work on decision theory and acausal trade, but it was mostly after I had moved on to other topics, so I'm not sure how much of it constituted progress. Christiano acknowledged some of the problems I pointed out with IDA and came up with some attempted solutions, which I'm not convinced really work.
However, in my previous post, Legible vs. Illegible AI Safety Problems, I explained my latest thinking that the most important motivation for legibilizing AI safety problems isn't to induce faster progress on them as object-level problems, but instead to decrease the probability that AGI/ASI is developed or deployed while key decision makers (e.g., company leaders, government officials, voters) are not even aware of or don't understand the importance of some such problems. So a better metric for measuring the success of this strategy is how much increased legibility has been effected in this wider audience, assuming "how successful has it been" is the main motivation behind your question.
On that front, I think the main weakness of my approach has been its limited reach beyond LW. If someone with better public communications skills were convinced of the value of legibilizing these lesser known problems, that could potentially greatly boost the effectiveness of this strategy.
(Of course, if I've inferred a wrong motivation for your question, please let me know!)
it's hard not to feel some hopelessness that all of these problems can be made legible to the relevant people, even with a maximum plausible effort
A successful book or paper that covered them all should reach a lot of them.
Yeah, a book as successful as Superintelligence could do a lot. Once the LW team implements the ability to download my LW post/comment history, I'll check how far a modern LLM can get with turning it into a book. (@habryka) Unfortunately the thought of writing a whole book by hand does not fill me with motivation, so it would have to be a shortcut like this, or someone else writing it.
You can already download your own post/comment history with some work using GraphQL: just download all posts ever, then select comments by you on any post. I happen to already be most of the way through setting this up for Stampy Chat; I can query out your posts if you like. ETA 1w or less.
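A rough sketch of what such a script could look like, for anyone who wants to try it themselves. The GraphQL endpoint at https://www.lesswrong.com/graphql is real, but everything else here is an assumption: the query fetches comments by user id directly (a shortcut compared to downloading all posts first), and the view name, field names, and paging behavior would need to be checked against the live schema.

```python
# Rough sketch only: the endpoint exists, but the query shape below (view name,
# field names, paging) is an assumption and may need adjusting to the live schema.
import json
import requests

ENDPOINT = "https://www.lesswrong.com/graphql"

# Hypothetical query: fetch a page of comments by a given user.
QUERY = """
query ($userId: String, $offset: Int) {
  comments(input: {terms: {view: "userComments", userId: $userId, limit: 50, offset: $offset}}) {
    results {
      _id
      postId
      postedAt
      contents { markdown }
    }
  }
}
"""

def fetch_all_comments(user_id: str) -> list[dict]:
    """Page through the user's comment history until an empty page is returned."""
    out, offset = [], 0
    while True:
        resp = requests.post(
            ENDPOINT,
            json={"query": QUERY, "variables": {"userId": user_id, "offset": offset}},
            timeout=30,
        )
        resp.raise_for_status()
        page = resp.json()["data"]["comments"]["results"]
        if not page:
            return out
        out.extend(page)
        offset += len(page)

if __name__ == "__main__":
    comments = fetch_all_comments("USER_ID_HERE")  # placeholder user id
    with open("comments.json", "w") as f:
        json.dump(comments, f, indent=2)
```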
Strong endorse on this general project. Currently working on doing this for convergent consequentialism with @Mateusz Bagiński, @Leon Lang, and Anna Magpie.
Thank you!! I think I'll use this as part of the projects section of an AI alignment course we (AI Plans) are making!!
Thanks for making this list!
Having written all this down in one place, it's hard not to feel some hopelessness that all of these problems can be made legible to the relevant people, even with a maximum plausible effort.
I think that a major focus should be on prioritizing these problems based on how plausible a story you can tell for a catastrophic outcome if the problem remains unsolved, conditional on an AI that is corrigible and aligned in the ordinary sense.
I suppose coming up with such a clear catastrophe story for a problem is more or less the same thing as legibilizing it, which reinforces my point from the previous thread that a priori, it seems likely to me that illegible problems won't tend to be as important to solve.
The longer a problem has been floating around without anyone generating a clear catastrophe story for it, the greater probability we should assign that it's a "terminally illegible" problem which just won't cause a catastrophe if it's unsolved.
Maybe it would be good to track how much time has been spent attempting to come up with a clear catastrophe story for each problem, so people can get a sense of when diminishing research returns are reached for a given problem? Perhaps researchers who make attempts should leave a comment in this thread indicating how much time they spent trying to generate catastrophe stories for each problem?
Perhaps it's worth concluding on a point from a discussion between @WillPetillo and myself under the previous post, that a potentially more impactful approach (compared to trying to make illegible problems more legible), is to make key decisionmakers realize that important safety problems illegible to themselves (and even to their advisors) probably exist, therefore it's very risky to make highly consequential decisions (such as about AI development or deployment) based only on the status of legible safety problems.
I still think the best way to do this is to identify at least one problem which initially seemed esoteric and illegible, and eventually acquired a clear and compelling catastrophe story. Right now this discussion all seems rather hypothetical. From my perspective, the problems on your list seem to fall into two rough categories: legible problems which seem compelling, and super-esoteric problems like "Beyond Astronomical Waste" which don't need to be solved prior to the creation of an aligned AI. Off the top of my head I haven't noticed a lot of problems moving from one category to the other by my lights? So just speaking for myself, this list hasn't personally convinced me that esoteric and illegible problems should receive much more of our scarce resources, although I admit I only took a quick skim.
Now that these problems have been gathered in one place, we can try to unpack them all.
Logical uncertainty, computation costs and bargaining over potential nothingness
Suppose that Agent-4 from the AI-2027 forecast is trying to negotiate with DeepCent's AI and DeepCent's AI makes the argument with the millionth digit of π. Calculating the digit establishes that there is no universe where the millionth digit of π is even and that there's nothing to bargain for.
On the other hand, if DeepCent's AI makes the same argument involving the th digit, then Agent-4 could also make a bet, e.g. "Neither of us will have access to a part of the universe until someone either calculates that the digit is actually odd and DeepCent should give the secured part to Agent-4 (since DeepCent's offer was fake), or the digit is even, and the part should be controlled by DeepCent (in exchange for the parallel universe or its part being given[1] to Agent-4)". However, calculating the digit could require at least around bitwise operations,[2] and Agent-4 and its Chinese counterpart might decide to spend that much compute on whatever they actually want.
If DeepCent makes a bet over the th digit, then neither AI is able to verify the bet and both AIs may guess that the probability is close to a half and that both should just split the universe's part in exchange for a similar split of the parallel universe.
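To make the verification step concrete, here is a toy sketch of how a bet over the parity of a specific digit of π could be mechanically resolved once one side pays the compute cost. This is only an illustration: mpmath is a real Python library, but the `nth_pi_decimal`/`resolve_bet` helpers and the party names are hypothetical, and the brute-force method shown ignores cheaper digit-extraction algorithms.

```python
# Toy illustration: resolving a verifiable "logical bet" over the parity of the
# nth decimal digit of pi. The brute-force method (evaluate pi to ~n digits and
# read one off) is just the simplest way to show that verification has a concrete
# compute cost, not an efficient algorithm.
from mpmath import mp, mpf, floor

def nth_pi_decimal(n: int) -> int:
    """Return the nth decimal digit of pi (1-indexed)."""
    mp.dps = n + 20                          # working precision with guard digits
    return int(floor(mp.pi * mpf(10) ** n)) % 10

def resolve_bet(n: int, claimant_if_even: str, claimant_if_odd: str) -> str:
    """Award the escrowed share to whichever party the computed parity favors."""
    return claimant_if_even if nth_pi_decimal(n) % 2 == 0 else claimant_if_odd

# Cheap for small n; already a real (if modest) compute cost around n = 10**6,
# and utterly infeasible by this route for astronomically distant digits, which
# is exactly the point about bargaining over hard-to-verify conditions.
print(resolve_bet(1_000_000, "DeepCent", "Agent-4"))
```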
However, if AIs acting on behalf of Agent-4 and its Chinese counterpart actually meet each other, then doing mechinterp on each other is actually easy, and the AIs learn everything about each other's utility functions and precommitments.
2. My position is that one also needs to consider worst-case scenarios, like the one where sufficiently capable AIs cannot be aligned to anything useful aside from improving human capabilities (e.g. in the form of being AI teachers and not other types of AI workers). If this is the case, then aligning the AI to a solution of human-AI safety problems becomes unlikely.
3. The problem 3di of humans being corrupted by power seems to have a far more important analogue. Assuming solved alignment, there is an important governance problem related to preventing Intelligence Curse-like outcomes where humans are obsolete to the elites in general or to a few true overlords. Whatever governance prevents the overlords from appearing could also be used to prevent humans from wasting resources in space.[3]
4. A major part of the problem is the AI race, which many people have been trying to stop (see, e.g., the petition not to create AGI, Yudkowsky's IABIED cautionary tale, or Kokotajlo et al.'s AI-2027 forecast). Post-AGI economics assuming solved alignment is precisely what I discussed in point 3.
What I don't understand is how Agent-4 actually influences the parallel universe. But this is a different subject.
Actually, I haven't estimated the number of operations necessary to calculate the digit of π. But the main point of the argument was to avoid counterfactual bargaining over hard-to-verify conditions.
For example, by requiring that distant colonies are populated with humans or other minds who are capable of either governing themselves or being multilaterally agreed to be moral patients (e.g. this excludes controversial stuff like shrimps on heroin).
Whatever governance prevents the overlords from appearing could also be used to prevent humans from wasting resources in space. For example, by requiring that distant colonies are populated with humans or other minds who are capable of either governing themselves or being multilaterally agreed to be moral patients (e.g. this excludes controversial stuff like shrimps on heroin).
Why do you think that requiring that distant colonies are populated with humans would prevent wasting resources in space?
My guess is that, on a mature population ethics, the best uses of resources (on purely welfarist values, ignoring non-welfarist values, which I do think are important) will look either like a smaller population of minds much "larger" than humans (i.e. galactic utility monsters) or like a large population of minds much "smaller" than humans (i.e. shrimps on heroin).
It would be a coincidence if the optimal allocation of resources involved minds which were exactly the same "size" as humans.
Note that this would be a coincidence on any of the currently popular theories of population ethics (e.g. average, total, variable-value).
Looking back, it appears that much of my intellectual output could be described as legibilizing work, or trying to make certain problems in AI risk more legible to myself and others. I've organized the relevant posts and comments into the following list, which can also serve as a partial guide to problems that may need to be further legibilized, especially beyond LW/rationalists, to AI researchers, funders, company leaders, government policymakers, their advisors (including future AI advisors), and the general public.
Having written all this down in one place, it's hard not to feel some hopelessness that all of these problems can be made legible to the relevant people, even with a maximum plausible effort. Perhaps one source of hope is that they can be made legible to future AI advisors. As many of these problems are philosophical in nature, this seems to come back to the issue of AI philosophical competence that I've often talked about recently, which itself seems largely still illegible and hence neglected.
Perhaps it's worth concluding on a point from a discussion between @WillPetillo and myself under the previous post, that a potentially more impactful approach (compared to trying to make illegible problems more legible), is to make key decisionmakers realize that important safety problems illegible to themselves (and even to their advisors) probably exist, therefore it's very risky to make highly consequential decisions (such as about AI development or deployment) based only on the status of legible safety problems.