I think there's a fallacy in going from "slight caring" to "slight percentage of resources allocated."
Suppose that preserving Earth now costs one galaxy that could be colonized later. Even if that one galaxy is merely one billionth of the total reachable number, it's still an entire galaxy ("our galaxy itself contains a hundred billion stars..."), and its usefulness in absolute terms is very large.
So there's a hidden step where you assume that the AIs that take over have diminishing returns on all their other desires for what they'll do with the galaxies they reach, strong enough to overcome the hundred-billion-suns thing, so that acting on a slight preference for saving Earth looks worthwhile. Or maybe they have some strong intrinsic drive for variety, like how if my favorite fruit is peaches I still don't buy peaches every single time I go to the supermarket.
If I had no such special pressures for variety, and simply valued peaches at $5 and apples at $0.05, I would not buy apples 1% of the time, or dedicate 1% of my pantry to apples. I would just eat peaches. Or even if I had some modest pressure for variety but apples were only my tenth favorite fruit, I might be satisfied just eating my five favorite fruits and would never buy apples.
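To make the peaches-and-apples point concrete, here's a toy calculation under two assumed utility shapes (the weights and the functional forms are my own illustration, not anything from the post): with linear utility over resources, a tiny relative weight gets a zero allocation; only under something like log utility does it get a proportionally tiny share.

```python
# Toy comparison under assumed utility shapes (an illustration, not a model
# of any particular AI).

w_earth = 1e-9  # relative caring about a preserved Earth
w_other = 1.0   # relative caring about everything else

# Linear utility over the fraction x of resources spent on Earth:
#   U(x) = w_earth*x + w_other*(1 - x),  dU/dx = w_earth - w_other < 0,
# so the optimum is the corner x = 0.  Slight caring buys nothing.
x_linear = 0.0

# Log (Cobb-Douglas) utility, i.e. strongly diminishing returns:
#   U(x) = w_earth*ln(x) + w_other*ln(1 - x),  dU/dx = 0  at
#   x = w_earth / (w_earth + w_other).
# Only with returns this sharply diminishing does "slight caring"
# translate into a "slight percentage of resources."
x_log = w_earth / (w_earth + w_other)

print(f"linear utility optimum: {x_linear:.0%} of resources to Earth")
print(f"log utility optimum:    {x_log:.2e} of resources to Earth")
```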
You describe this as "there are some reasons why small amounts of motivation don't suffice (given above) which are around 20% likely", but I think that's backwards. Small amounts of motivation by default don't suffice, but there's some extra machinery AIs could have that would make them matter.
Or to try a different analogy: Suppose a transformer model is playing a game where it gets to replace galaxies with things, and one of the things it can replace a galaxy with is "a preserved Earth." If actions are primitive, then there's some chance of preserving Earth, which we might call "the exponential of how much it cares about Earth, divided by the partition function," making a Boltzmann-rationality modeling assumption. But if a model with similar caring has to execute multi-step plans to replace each galaxy, then the probability of preserving Earth goes down dramatically, because it will have chances to change its mind and do the thing it cares for more (using the Boltzmann-rationality assumption the other way). So in this toy example, a slight "caring," in the sense of what the model says it would pick when quickly asked, isn't reflected in the distribution of outcomes of many-step plans.
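Here's a minimal sketch of that toy example with made-up numbers (the utilities, temperature, and plan length k are all assumptions for illustration), just to show how a Boltzmann-rational "slight caring" that shows up in a one-shot choice nearly vanishes once the plan has many steps at which the model can switch to the option it cares about more:

```python
import numpy as np

def boltzmann_probs(utilities, temperature=1.0):
    """Softmax (Boltzmann-rational) choice probabilities over options."""
    logits = np.array(utilities) / temperature
    logits -= logits.max()  # numerical stability
    w = np.exp(logits)
    return w / w.sum()

# Made-up utilities: a preserved Earth is cared about, just less than the
# model's favorite use of a galaxy.
u_earth, u_other = 1.0, 2.0

# Primitive action: preserving Earth gets a modest but visible probability.
p_single = boltzmann_probs([u_earth, u_other])[0]

# Multi-step plan: if preserving Earth takes k steps, and each step is
# another Boltzmann-rational chance to abandon the plan for the
# higher-utility option, the plan only completes if Earth is chosen
# every single time.
k = 20
p_multi = p_single ** k

print(f"P(preserve Earth), one-shot choice: {p_single:.3f}")  # ~0.27
print(f"P(preserve Earth), {k}-step plan:   {p_multi:.1e}")   # ~4e-12
```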
If small motivations do matter, I think you can't discount "weird" preferences to do other things with Earth than preserve it. "Optimize Earth according to proxy X, which will kill all humans but really grow the economy / save the ecosystem / fill it with intelligent life / cure cancer / make it beautiful / maximize law-abiding / create lots of rewarding work for a personal assistant / really preserve Earth". Such motivations sound like they'd be small unless fairly directly optimized for, but if the AI is supposed to be acting on small motivations, why wouldn't it act on those bad ones rather than the one we want?
Different regulation (or other legislation) might also make other sorts of transparency good ideas, imo.
A mandate or subsidy for doing safety research might make it a good idea to require transparency for more safety-relevant AI research.
Regulation aimed at improving company practices (e.g. security against weight theft, prevention of powergrab risks like access to helpful-only models above some threshold, following some future safety practices suggested by some board of experts, or [to be meta] good transparency practices) should generate some transparency about how companies are doing on those fronts (cybersecurity, improper internal use mitigation, safety best practices, transparency).
If safety cases are actually being evaluated, and you don't get to do all the research you want if the safety case is questionable, then the landscape for transparency of safety cases (or other safety data that might have a different format) looks pretty different.
I'm actually less clear on how risk reports would tie into regulation - maybe they would get parceled out into reports on how the company is doing at various risk-mitigation practices, if those are transparent?
Supposing that we get your scenario where we have basically-aligned automated researchers (but haven't somehow solved the whole alignment problem along the way). What's your take on the "people will want to use automated researchers to create smarter, dangerous AI rather than using them to improve alignment" issue? Is your hope that automated researchers will be developed in one leading organization that isn't embroiled in a race to the bottom, and that org will make a unified pivot to alignment work?
I agree this is worse than it could be. But maybe some of the badness hinges on the "They're your rival group of conspecifics who are doing the thing you have learned not to like about them! That's inherently bad!" reflex, a piece of machinery within myself that I try not to cultivate.
I like it.
You've listed mostly things that countries should do out of self-interest, without much need for international cooperation. (A little bit disanalogous with the UN SDGs.) This is fine, but I think there could also be some useful principles for international regulation of AI that countries could agree to in principle, to pave the way for cooperation even in an atmosphere of competitive rhetoric.
Under Develop Safe AI, it's possible "Alignment" should be broken down into a few chunks, though I'm not sure. There's a current paradigm called "alignment" that uses supervised finetuning + reinforcement learning on large models, where new reward functions, and new ways of leveraging human demonstrations/feedback, all have a family resemblance. And then there's everything else - philosophy of preferences, decision theory for alignment, non-RL alignment of LLMs, neuroscience of human preferences, speculative new architectures that don't fit in the current paradigm. Labels might just be something like "Alignment via finetuning" vs. "Other alignment."
Under Societal Resilience to Disruption, I think "Epistemic Security Measures" could be fleshed out more. The first thing that pops to mind is letting people tell whether some content or message is from an AI, and empowering people to filter for human content / messages. (Proposals range from legislation outlawing impersonating a human, to giving humans unique cryptographic identifiers, to something something blockchain something Sam Altman.)
But you might imagine more controversial and dangerous measures - like using your own AI propaganda-bot to try to combat all external AI propaganda-bots, or instituting censorship based on the content of the message and not just whether its sender is human (which could be a political power play, or mission creep trying to combat non-AI disinformation under the banner of "Epistemic Security," or because you expect AIs or AI-empowered adversaries to have human-verified accounts spreading their messages). I think the category I'm imagining (which may be different than the category you're imagining) might benefit from a more specific label like "Security from AI Manipulation."
Thanks, just watched a talk by Luxin that explained this. Two questions.
Great post.
I wanted to pick on the model of "sequentially hear out arguments, then stop when you get fed up with one," but I think it doesn't make too much difference compared to a more spread-out model where people engage with all the arguments but at different rates, and get fed up globally rather than locally.
In order for your ideas to qualify as science, you need to a) formulate a specific, testable, quantitative hypothesis[2], b) come up with an experiment that will empirically test whether that hypothesis is true, c) preregister what your hypothesis predicts about the results of that experiment (free at OSF), and d) run the experiment[3] and evaluate the results. All of those steps are important! Try to do them in a way that will make it easy to communicate your results. Try to articulate the hypothesis in a clear, short way, ideally in a couple of sentences. Design your experiment to be as strong as possible. If your hypothesis is false, then your experiment should show that; the harder it tries to falsify your hypothesis, the more convincing other people will find it. Always ask yourself what predictions your theory makes that other theories don't, and test those. Preregister not just the details of the experiment, but how you plan to analyze it; use the simplest analysis and statistics that you expect to work.
I think this is the weakest part of the essay, both as philosophy of science and as communication to the hopefully-intended audience.
"Qualifying as science" is not about jumping through a discrete set of hoops. Science is a cultural process where people work together to figure out new stuff, and you can be doing science in lots of ways that don't fit onto the gradeschool "The Scientific Method" poster.
a) You can be doing science without formulating a hypothesis - e.g. observational studies / fishing expeditions, making phenomenological fits to data, building new equipment. If you do have a hypothesis, it doesn't have to be specific (it could be a class of hypotheses), it doesn't have to be testable (it's science to make the same observable predictions as the current leading model in a simpler way), and it doesn't have to be quantitative (you can do important science just by guessing the right causal structure without numbers).
b) You can be doing science without coming up with an experiment (Mainly when you're trying to explain existing results. Or when doing any of that non-hypothesis-centric science mentioned earlier).
c) If you do have a hypothesis and experiment in that order, public pre-registration is virtuous but not required to be science. Private pre-registration, in the sense that you know what your hypothesis predicts, is a simple consequence of doing step (b), and can be skipped when step (b) doesn't apply.
d) Experiments are definitely science! But you can be doing science without them, e.g. if you do steps a-c and leave step d for other people, that can be science.
From a communication perspective, this reads as setting up unrealistic standards of what it takes to "qualify as science," and then using them as a bludgeon against the hopefully-intended audience of people who think they've made an LLM-assisted breakthrough. Such an audience might feel like they were being threatened or excluded, like these standards were just there to try to win an argument.
Although, even if that's true, steps (a)-(d) do have an important social role: they're a great way to convince people (scientists included) without those other people needing to do much work. If you have an underdog theory that other scientists scoff at, but you do steps (a)-(d), many of those scoffers will indeed sit up and take serious notice.
But normal science isn't about a bunch of solo underdogs fighting it out to collate data, do theoretical work, and run experiments independently of each other. Cutting-edge science is often too hard for that even to be reasonable. It's about people working together, each doing their part to make it easier for other people to do their own parts.
This isn't to say that there aren't standards you can demand of people who think they've made a breakthrough. And those standards can be laborious, and even help you win the argument! It just means that the standards, and the advice about how to meet them, have to be focused more on helping people participate in the cultural process where people work together to figure out new stuff.
A common ask of people who claim to have made advances: do they really know what the state of the art is, in the field they've supposedly advanced? You don't have to know everything, but you have to know a lot! If you're advancing particle physics, you'd better know the Standard Model and the mathematics required to operate it. And if there's something you don't know about the state of the art, you should just be a few steps away from learning it on your own (e.g. you haven't read some important paper, but you know how to find it, and know how to recurse and read the references or background you need, and pretty soon you'll understand the paper at a professional level).
The reasons you have to really know the state of the art are (1) if you don't, there are a bunch of pitfalls you can fall into so your chances of novel success are slim, and (2) if you don't, you won't know how to contribute to the social process of science.
Which brings us to the more general onerous requirement, the one that generalizes steps (a)-(d): have you done the hard work to make this actually useful to other scientists? This is where the steps come back in. Because most "your LLM-assisted scientific breakthrough"s are non-quantitative guesses, that hard work is going to look a lot like steps (a) and (b). It means putting in a lot of work to make your idea as quantitative and precise as you can, then going through the existing data to show quantitatively how your idea compares to the current state of the art, and then maybe proposing new experiments, filled in with enough detail that you can make quantitative predictions showing where your idea and the state of the art would differ.
No more sycophancy - now the AI tells you what it believes.
???
The AI will output words that follow the strategies that worked well in RL, subject to the constraint that they stay close to what it predicts would follow the particular encyclopedia-article prompt and the randomly sampled text so far.
If one of the strategies that worked well in RL is "flatter the preconceptions of the average reader," then it will flatter the preconceptions of the average reader (sycophancy may also come from imitating how actual human text behaves conditional on the prompt).
If it has a probability distribution over the next word that would cause it to output encyclopedia articles that appear to believe very different things, it will just sample randomly. If slightly different prompts would have resulted in encyclopedia articles that appear to believe very different things, the AI will not let you know this; it will just generate an article conditioned on the prompt.
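One common way to formalize that picture (an assumption on my part, not something from the quoted post) is KL-regularized RL fine-tuning, where the tuned policy is roughly the base model's conditional distribution tilted by a learned reward. Here's a minimal per-token sketch; the function name, the reward vector, and beta are all illustrative:

```python
import numpy as np

def rl_tilted_next_token_probs(base_logprobs, rewards, beta=0.1):
    """
    Toy picture of KL-regularized RL fine-tuning: the tuned policy is
    proportional to p_base(token | context) * exp(reward / beta).

    base_logprobs: log p_base(token | prompt, text so far), shape (vocab,)
    rewards:       how much each continuation serves RL-favored strategies
                   (e.g. flattering the average reader), shape (vocab,)
    beta:          strength of the stay-close-to-the-base-model constraint
    """
    logits = np.asarray(base_logprobs) + np.asarray(rewards) / beta
    logits -= logits.max()  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

# Sampling from this distribution tracks "what would follow the
# encyclopedia-article prompt," nudged toward whatever the reward favored;
# nothing in it consults or reports a belief.
```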
Agree to disagree about what seems natural, I guess. I think "slight caring" being relative more than absolute makes good sense as a way to talk about some common behaviors of humans and parliaments of subagents, but is a bad fit for generic RL agents.