I agree this is worse than it could be. But maybe some of the badness hinges on the "They're your rival group of conspecifics who are doing the thing you have learned not to like about them! That's inherently bad!" reflex, a piece of machinery within myself that I try not to cultivate.
I like it.
You've listed mostly things that countries should do out of self-interest, without much need for international cooperation. (A little bit disanalogous with the UN SDGs.) This is fine, but I think there could also be some useful principles for international regulation of AI that countries could agree to in principle, to pave the way for cooperation even in an atmosphere of competitive rhetoric.
Under Develop Safe AI, it's possible "Alignment" should be broken down into a few chunks, though I'm not sure. There's a current paradigm called "alignment" that uses supervised finetuning + reinforcement learning on large models, where new reward functions and new ways of leveraging human demonstrations/feedback all have a family resemblance. And then there's everything else - philosophy of preferences, decision theory for alignment, non-RL alignment of LLMs, neuroscience of human preferences, speculative new architectures that don't fit in the current paradigm. Labels might just be something like "Alignment via finetuning" vs. "Other alignment."
Under Societal Resilience to Disruption, I think "Epistemic Security Measures" could be fleshed out more. The first thing that pops to mind is letting people tell whether some content or message is from an AI, and empowering people to filter for human content / messages. (Proposals range from legislation outlawing impersonating a human, to giving humans unique cryptographic identifiers, to something something blockchain something Sam Altman.)
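To make the cryptographic-identifier idea concrete, here's a minimal sketch (purely illustrative, assuming Python's cryptography package and some registry that has already verified the key's human owner): the verified person signs their message, and a platform or reader checks the signature before treating the content as human.

```python
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

# Hypothetical setup: a registry verifies a person and issues them a keypair;
# the public key serves as their "unique identifier."
human_key = Ed25519PrivateKey.generate()
human_id = human_key.public_key()

message = b"This comment was written by a human."
signature = human_key.sign(message)

# A platform or reader checks the signature against the registered identifier
# and can filter for content that passes the check.
try:
    human_id.verify(signature, message)
    print("accepted: signed by a registered human identifier")
except InvalidSignature:
    print("rejected: signature does not match")
```

(The hard parts, of course, are the identity verification and key management around this, not the signature check itself.)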
But you might imagine more controversial and dangerous measures - like using your own AI propaganda-bot to try to combat all external AI propaganda-bots, or instituting censorship based on the content of the message and not just whether its sender is human (which could be a political power play, or mission creep aimed at combating non-AI disinformation under the banner of "Epistemic Security," or a response to expecting AIs or AI-empowered adversaries to spread their messages through human-verified accounts). I think the category I'm imagining (which may be different from the category you're imagining) might benefit from a more specific label like "Security from AI Manipulation."
Thanks, just watched a talk by Luxin that explained this. Two questions.
Great post.
I wanted to pick on the model of "sequentially hear out arguments, then stop when you get fed up with one," but I think it doesn't make too much difference compared to a more spread-out model where people engage with all the arguments but at different rates, and get fed up globally rather than locally.
In order for your ideas to qualify as science, you need to a) formulate a specific, testable, quantitative hypothesis[2], b) come up with an experiment that will empirically test whether that hypothesis is true, c) preregister what your hypothesis predicts about the results of that experiment (free at OSF), and d) run the experiment[3] and evaluate the results. All of those steps are important! Try to do them in a way that will make it easy to communicate your results. Try to articulate the hypothesis in a clear, short way, ideally in a couple of sentences. Design your experiment to be as strong as possible. If your hypothesis is false, then your experiment should show that; the harder it tries to falsify your hypothesis, the more convincing other people will find it. Always ask yourself what predictions your theory makes that other theories don't, and test those. Preregister not just the details of the experiment, but how you plan to analyze it; use the simplest analysis and statistics that you expect to work.
I think this is the weakest part of the essay, both as philosophy of science and as communication to the hopefully-intended audience.
"Qualifying as science" is not about jumping through a discrete set of hoops. Science is a cultural process where people work together to figure out new stuff, and you can be doing science in lots of ways that don't fit onto the gradeschool "The Scientific Method" poster.
a) You can be doing science without formulating a hypothesis - e.g. observational studies / fishing expeditions, making phenomenological fits to data, building new equipment. If you do have a hypothesis, it doesn't have to be specific (it could be a class of hypotheses), it doesn't have to be testable (it's science to make the same observable predictions as the current leading model in a simpler way), and it doesn't have to be quantitative (you can do important science just by guessing the right causal structure without numbers).
b) You can be doing science without coming up with an experiment (mainly when you're trying to explain existing results, or when doing any of the non-hypothesis-centric science mentioned earlier).
c) If you do have a hypothesis and experiment in that order, public pre-registration is virtuous but not required to be science. Private pre-registration, in the sense that you know what your hypothesis predicts, is a simple consequence of doing step (b), and can be skipped when step (b) doesn't apply.
d) Experiments are definitely science! But you can be doing science without them, e.g. if you do steps a-c and leave step d for other people, that can be science.
From a communication perspective, this reads as setting up unrealistic standards of what it takes to "qualify as science," and then using them as a bludgeon against the hopefully-intended audience of people who think they've made an LLM-assisted breakthrough. Such an audience might feel like they were being threatened or excluded, like these standards were just there to try to win an argument.
Although, even if that's true, steps (a)-(d) do have an important social role: they're a great way to convince people (scientists included) without those other people needing to do much work. If you have an underdog theory that other scientists scoff at, but you do steps (a)-(d), many of those scoffers will indeed sit up and take serious notice.
But normal science isn't about a bunch of solo underdogs fighting it out to collate data, do theoretical work, and run experiments independently of each other. Cutting-edge science is often too hard for that even to be reasonable. It's about people working together, each doing their part to make it easier for other people to do their own parts.
This isn't to say that there aren't standards you can demand of people who think they've made a breakthrough. And those standards can be laborious, and even help you win the argument! It just means standards, and the advice about how to meet them, have to be focused more on helping people participate in the cultural process where people work together to figure out new stuff.
A common ask of people who claim to have made advances: do they really know what the state of the art is, in the field they've supposedly advanced? You don't have to know everything, but you have to know a lot! If you're advancing particle physics, you'd better know the Standard Model and the mathematics required to operate it. And if there's something you don't know about the state of the art, you should be just a few steps away from learning it on your own (e.g. you haven't read some important paper, but you know how to find it, and know how to recurse and read the references or background you need, and pretty soon you'll understand the paper at a professional level).
The reasons you have to really know the state of the art are (1) if you don't, there are a bunch of pitfalls you can fall into so your chances of novel success are slim, and (2) if you don't, you won't know how to contribute to the social process of science.
Which brings us to the more general onerous requirement, the one that generalizes steps (a)-(d): have you done the hard work to make this actually useful to other scientists? This is where the steps come back in. Because most "your LLM-assisted scientific breakthrough"s are non-quantitative guesses, that hard work is going to look a lot like steps (a) and (b). It means making your idea as quantitative and precise as you can, then looking through the existing data to show quantitatively how your idea compares to the current state of the art, and then maybe proposing new experiments, filled in with enough detail that you can make quantitative predictions showing where your idea and the state of the art would differ.
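As a toy illustration of that comparison step (everything below is a hypothetical stand-in, not anyone's real data or models): fit both the current state-of-the-art model and your idea to the same existing dataset and report a single quantitative score, e.g. AIC, that rewards fit and penalizes extra parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for existing published measurements (x, y_obs, sigma).
x = np.linspace(0.0, 10.0, 50)
sigma = 2.0
y_obs = 2.0 * x + 0.3 * x**2 + rng.normal(0.0, sigma, x.size)

def fit_and_predict(degree):
    """Fit a polynomial of the given degree; return predictions and parameter count."""
    coeffs = np.polyfit(x, y_obs, degree)
    return np.polyval(coeffs, x), degree + 1

def aic(y_pred, n_params):
    """Chi-squared plus a penalty for extra parameters; lower is better."""
    chi2 = np.sum(((y_obs - y_pred) / sigma) ** 2)
    return chi2 + 2 * n_params

for name, degree in [("state of the art (linear)", 1), ("new idea (quadratic)", 2)]:
    y_pred, n_params = fit_and_predict(degree)
    print(f"{name}: AIC = {aic(y_pred, n_params):.1f}")
```

The specific statistic doesn't matter much; what matters is turning "my idea explains the data" into a number someone else can check against the state of the art.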
No more sycophancy - now the AI tells you what it believes.
???
The AI will output the words that follow the strategies that worked well in RL, subject to the constraint that they're close to what it predicts would follow the particular encyclopedia-article prompt, and the randomly sampled text so far.
If one of the strategies that worked well in RL is "flatter the preconceptions of the average reader," then it will flatter the preconceptions of the average reader (sycophancy also may come from the behavior of actual human text conditional on the prompt).
If it has a probability distribution over the next word that would cause it to output encyclopedia articles that appear to believe very different things, it will just sample randomly. If slightly different prompts would have resulted in encyclopedia articles that appear to believe very different things, the AI will not let you know this; it will just generate an article conditioned on the prompt.
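Here's a minimal sketch of what that sampling step looks like (the words and probabilities are made up for illustration): if the next-word distribution after an encyclopedia-article prompt spreads its mass over continuations that "believe" contradictory things, the procedure just picks one and never reports the near-indifference.

```python
import numpy as np

rng = np.random.default_rng()

# Hypothetical next-word distribution after an encyclopedia-article prompt;
# each continuation leads to an article that appears to "believe" something different.
next_word_probs = {"beneficial": 0.34, "harmful": 0.33, "unproven": 0.33}

words = list(next_word_probs)
probs = np.array(list(next_word_probs.values()))
probs /= probs.sum()

# The model just samples from the conditional distribution; nothing here
# surfaces the fact that it was nearly indifferent between contradictory articles.
print(rng.choice(words, p=probs))
```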
Fortunately, there’s a correlation between situations where (i) AI takeover risk is high, and (ii) AIs have a good understanding of the world. If AI developers have perfect ability to present the AI with false impressions of the world, then the risk from AI takeover is probably low. While if AIs have substantial ability to distinguish truth from falsehood, then perhaps that channel can also be used to communicate facts about the world.
Whether this is fortunate depends a lot on how beneficial communication with unaligned AIs is. If unaligned AI with high chance of takeover can exploit trade to further increase its chances of takeover ("Oh, I just have short-term preferences where I want you to run some scientific simulations for me"), then this correlation is the opposite of fortunate. If people increase an unaligned AI's situational awareness so it can trust our trade offer, then the correlation seems indirectly bad for us.
Fun read, thanks for sharing, application to AI safety doubtful :)
Thanks! I think your perspective is important for me to engage with, since I'm mostly concerned with doing step 1 much better than what you think of as succeeding at step 1.
In particular, the problem of evaluating performance even in safe situations seems like something we could do much better at "if we knew what we were doing" (for hard problems and for manipulation, which you mention, and for ambiguity/underspecification, which is easy to forget about).
So prong one is to try to know what we're doing better - e.g. by finding improvements to architectures and training schemes to support good performance evaluations. And prong two is to figure out how to better muddle ahead with bad evaluations, "the AI will be misaligned but hopefully it's not too bad and we can do other things to compensate" style.
A random nitpick:
In particular: the early discourse about AI alignment seemed quite concerned, in various ways, about the problem of crafting/specifying good instructions.
"With a safe genie, wishing is superfluous. Just run the genie." - The Hidden Complexity of Wishes (2007)
Suppose we get your scenario where we have basically-aligned automated researchers (but haven't somehow solved the whole alignment problem along the way). What's your take on the "people will want to use automated researchers to create smarter, dangerous AI rather than using them to improve alignment" issue? Is your hope that automated researchers will be developed in one leading organization that isn't embroiled in a race to the bottom, and that org will make a unified pivot to alignment work?