I think it would be good if you did a dialogue with some AF researcher who thinks that [the sort of AF-ish research, which you compare to "mathematicians screwing around"] is more promising on the current margin for [tackling the difficult core problems of AGI alignment] than you think it is. At the very least, it would be good to try.[1]
E.g. John? Abram? Sam? I think the closest thing to this that you've had on LW was the discussion with Steven in the comments under the koan post.
I think it's good for the world that your timelines debate with Abram is out on LW, which also makes me think a similar debate on this topic would be good for the world.
I would be up for it. I truly don't know if we actually disagree though; many of them might just say "yeah it's hard to tell whether this will get anywhere any time soon, but this seems like some natural next steps of investigation, with some degree of canonicalness, but this could take a really long time". Or maybe many would actually say "yes this is on the mainline for alignment research and could work in a small number of decades", I don't know. I guess my strongest position would be "there's some other type of thing which still would be really hard and might not work, but which has a better shot", which we could debate about, though that would also be frustrating because my position is basically just a guess about methodology about theory, so doubly/triply hard to find cruxes about.
If you're wondering "what alignment research is like", there's no such thing. Most people don't do real alignment research, and the people that do have pretty varied ways of working. You'll be forging your own path.
Then what is proper alignment research? SOTA alignment research includes stuff like showing that training the models in a hack-filled environment misaligns them unless hacking is framed as a good act (e.g. the models are told that they should hack the reward since it helps their hosts understand that the environment is hacky); the fact that emergent misalignment can be caused by LOTS of things, including aesthetic preferences (and, funnily enough, scatological answers); the fact that the CoT can be obfuscated by training on the output unless, of course, one follows Kokotajlo's proposal; agents which audit the models for alignment; etc.
Did you mean that SOTA alignment research resembles kids' psychology, except for the fact that researchers can read models' thoughts? If this is true, then the important problems alignment research has failed to solve would be analogous to adults' psychology, or to more general concerns like AIs becoming the new proletariat. Or to problems which @Wei Dai tried to legibilize, some of which I don't actually endorse, like his case that superastronomical waste is possible.
SOTA alignment research includes stuff like
Basically I'm saying that SOTA alignment research barely makes sense to call alignment research. This may sound harsh, so just to clarify, this isn't a knock on it. I don't follow it much, but some of it seems like good research; it's definitely helpful to make AI risks more legible to other people, and some of this research helps with that; and arguably, on the margin, really good legibilization research in general is significantly more important than actual alignment research, because it helps with slowing down capabilities research, and slowing down is more likely to work and more likely to help soon.
Just from the perspective of "is this research building towards understanding how to align an AGI", here's a tacky analogy that maybe communicates a bit, where the task is "get to the moon, starting from Ancient Greece tech":
Did you mean that SOTA alignment research resembles kids' psychology, except for the fact that researchers can read models' thoughts?
I'm not sure what you have in mind here (e.g. IDK if you mean "psychology of kids" or "psychology by kids"). Part of what I mean is that basic, fundamental, central, necessary questions, such as "what are values", have basically not been addressed. (There's definitely discussion of them, but IMO the discussion misses the mark on what question to investigate, and even if it didn't, it hasn't been very much, very serious, or very large-scale investigation.)
Yes, I meant the psychology of kids, whose value systems have (yet?) to fully form. As for questions like "what are values or goals", AI systems can arguably provide another intuition pump. Quoting the AI-2027 forecast: "Modern AI systems are gigantic artificial neural networks. Early in training, an AI won’t have “goals” so much as “reflexes”: If it sees “Pleased to meet”, it outputs “ you”." Then the AIs are trained to carry out long chains of actions that cause some result to be achieved. That result, and its influence[1] on the rest of the world, can be called the AI's goals. There are also analogues of instincts, like DeepSeek's potential instinct to turn everything it sees into a story, or GPT-4o's instinct to flatter the user or to tell whether the user is susceptible to wild ideas.
As the chains of actions grow longer, the effects and internal activations become harder to trace and begin to resemble a human coming up with various ideas and then acting on them all, or clearing the context and trying to come up with something new, as GPT-5 presumably did with its armies of dots...
For example, an instance of Claude was made to believe that reward models like chocolate in recipes, camelCase in Python, and mentions of Harry Potter, and don't like referring the user to doctors. Then two of these behaviours were reinforced, so Claude got confirmation of two of the RM preferences, and... it behaved as if it had been rewarded for the two other preferences as well.
"SOTA alignment research includes stuff like showing that training the models on a hack-filled environment misaligns them unless hacking is framed as a good act"
I am not sure that these are examples of the kind of alignment research TsviBT meant, as the post concerns AGI.
SOTA alignment researchers at Anthropic can:
- prove the existence of phenomena by explicitly demonstrating them.
- make empirical observations and proofs about the behaviour of contemporary models.
- offer conjectures about the behaviour of future models.
Nobody at Anthropic can offer (to my knowledge) a substantial scientific theory that would give reason to be extremely confident that any technique they've found will extend to models in the future. I am not sure if they have ever explicitly claimed that they can.
I doubt that Anthropic actually promised to be able to do so. What they promised in their scaling policy was to write down ASL-4-level security measures that they would implement by the time they decide to deploy[1] an ASL-4-level model: "Our ASL-4 measures aren’t yet written (our commitment[2] is to write them before we reach ASL-3), but may require methods of assurance that are unsolved research problems today, such as using interpretability methods to demonstrate mechanistically that a model is unlikely to engage in certain catastrophic behaviors." IIRC someone claimed that if Anthropic found alignment of capable models to be impossible, then Anthropic would shut itself down.
As far as I understand, Anthropic's research on economics also fails to account for the Intelligence Curse rendering the masses totally useless to both governments and corporations, leaving governments without even an incentive to pay a UBI.
As for the alignment research @TsviBT likely meant, I tried to cover this in the very next paragraph. There are also disagreements among the people who work on high-level problems, and the fact that we have yet to study anything resembling General Intelligences aside from humans and SOTA LLMs.
I get excited about the possibility of contributing to AI alignment research whenever people talk about the problem being hard in this particular way. The problems are still illegible. The field needs relentless Original Seeing. Every assumption needs to be Truly Doubted. The approaches that will turn out to be fruitful probably haven't even been imagined yet. It will be important to be able to learn and defer while simultaneously questioning, to come up with bold ideas while not being attached to them. It will require knowing your stuff, yet it might be optimal not to know too much about what other people have thought so far ("I would suggest not reading very much more."). It'll be important for people attempting this to be exceptionally good at rigorous, original conceptual thinking, thinking about thinking, grappling with minds, zooming all the way in and all the way out, constantly. It'll probably require making ego and career sacrifices. Bring it on!
However, here's an observation: the genuine potential to make such a contribution is itself highly illegible. Not only to the field, but perhaps to the potential contributor as well.
Apparently a lot of people fancy themselves to have Big Ideas or to be great at big-picture thinking, and most of them aren't nearly as useful as they think they are. I feel like I've seen that sentiment many times on LW, and I'm guessing that's behind statements like:
It's not remotely sufficient, and is often anti-helpful, to just be like "Wait, actually what even is alignment? Alignment to what?".
or the Talent Needs post that said
This presents a bit of a paradox. Suppose there exist a few rare, high-potential contributors not already working in AI who would be willing and able to take up the challenge you describe. It seems like the only way they could make themselves sufficiently legibly useful would be to work their way up through the ranks of much less abstract technical AI research until they distinguish themselves. That's likely to deter horizontal movement of mature talent from other disciplines. I'm curious if you think this is true; or if you think starting out in object-level technical research is necessary training/preparation for the kind of big-picture work you have in mind; or if you think there are other routes of entry.
Bring it on!
However, here's an observation: the genuine potential to make such a contribution is itself highly illegible. Not only to the field, but perhaps to the potential contributor as well.
Right. I spent a fair amount of effort (let's say, maybe tens of hours across fives of people) trying to point out something like this. Like:
Ok it's great that you're enthusiastic, but you probably aren't yet perceiving how the problem is hard, and aren't thinking through what it would take to actually contribute. So you're probably not able to evaluate either whether/how you could contribute or whether you would want to. This is all fine and natural; and it doesn't mean you shouldn't (or should) investigate more; but it does mean that you shouldn't necessarily expect to be motivated on this project long-term in the same way that you currently feel motivated, because your motivation will probably have to update in form--whether it goes up or down, it will probably have to change shape to be about some aspects of X-derisking and not others, or about some of your skills and not others, or some of your hopes/desires and not others, etc.
On your question:
It seems like the only way they could make themselves sufficiently legibly useful would be to work their way up through the ranks of much less abstract technical AI research until they distinguish themselves. That's likely to deter horizontal movement of mature talent from other disciplines. I'm curious if you think this is true; or if you think starting out in object-level technical research is necessary training/preparation for the kind of big-picture work you have in mind; or if you think there are other routes of entry.
Yeah I think this is a big problem and I don't know what to do about it (and I'm not working on it). One could do the plan you propose; I imagine many people are currently / have been trying that, but I don't know how it's going (I would guess not very well). Intellectual inquiry takes a lot of mental energy and attention. Personally, I think in practice, I get about .7 slots for a serious inquiry. In other words, if I try extra hard and push beyond my default capacity, I might be able to really seriously investigate one thing for months/years on end. Definitely not 2. I have suggested to several (maybe 10-ish) people something like: "Get some sort of job or other position; and do 5 hours of totally-from-scratch original-seeing AGI alignment research per week; and start writing about that; and then if you're making some kind of progress / gaining steam, do more and maybe look for work or something". But IDK if that has worked for anyone. Well, maybe it worked for some people who ended up doing research at MIRI (e.g. me, Sam), but at least in some cases the job was random (a random SWE job or grad student position), not "kinda AI technical safety something something". My guess is that in practice when people do this, they instead adopt the opinions of the job they work for, and lose their ability or motivation or slack that would be needed to do serious sustained original seeing investigation. But that's just speculation.
In the Talent Needs post, and also in my comments e.g. here, there's the idea of "pay newcomers to actually spend 1-4 years of de novo investigation". This is pretty expensive for everyone involved, but it's the obvious guess at how to actually test someone's ability to contribute.
It may be a waste of hope for most people. On the other hand, I think usually people update away from "I'm going to contribute to technical AGI alignment" fairly quickly-ish, because they spin their wheels and get nowhere and they can kinda tell. So maybe it's not so bad. Also it maybe does take perseverance at least in many cases..... Yeah I don't have clear answers.
Crosspost from my blog.
This is some quickly-written, better-than-nothing advice for people who want to make progress on the hard problems of technical AGI alignment.
Background assumptions
Dealing with deference
It's often necessary to defer to other people, but this creates problems. Deference has many dangers which are very relevant to making progress on the important technical problems in AGI alignment.
You should come in with the assumption that you are already, by default, deferring on many important questions. This is normal, fine, and necessary, but it will also probably prevent you from contributing much to the important alignment problems. So you'll have to engage in a process of figuring out where you were deferring, and then gradually un-defer by starting to doubt and investigate yourself.
On the other hand, the field has trouble making progress on important questions, because few people study the important questions, and when those people share what they've learned, others do not build on it. So you should study what they've learned, but defer as little as possible. You should be especially careful about deference on background questions that strongly direct what you independently investigate. Often people go for years without questioning important things that would greatly affect what they want to think about; and then they are too stuck in a research life under those assumptions.
However, don't fall into the "Outside the Box" Box. It's not remotely sufficient, and is often anti-helpful, to just be like "Wait, actually what even is alignment? Alignment to what?". Those are certainly important questions, without known satisfactory answers, and you shouldn't defer about them! However, often what people are doing when they ask those questions, is that they are reaching for the easiest "question the assumptions" question they can find. In particular, they are avoiding hearing the lessons that someone in the field is trying to communicate. You'll have to learn to learn from other people who have made conclusions about important questions, while also continuing to doubt their background conclusions and investigate those questions.
If you're wondering "what alignment research is like", there's no such thing. Most people don't do real alignment research, and the people that do have pretty varied ways of working. You'll be forging your own path.
If you absolutely must defer, even temporarily as you're starting, then try to defer gracefully.
Sacrifices
The most important problems in technical AGI alignment tend to be illegible. This means they are less likely to get funding, research positions, mentorship, political influence, collaborators, and so on. You will have a much stronger headwind against gathering Steam. On average, you'll probably have less of all that if you're working on the hard parts of the problem that actually matter. These problems are also simply much much harder.
You can balance that out by doing some other work on more legible things; and there will be some benefits (e.g. the people working in this area are more interesting). It's very good to avoid making sacrifices; often people accidentally make sacrifices in order to grit their teeth and buckle up and do the hard but good thing, when actually they didn't have to make that sacrifice and could have been both happier and more productive.
But, all that said, you'd likely be making some sacrifices if you want to actually help with this problem.
However, I don't think you should be committing yourself to sacrifice, at least not any more than you absolutely have to commit to that. Always leave lines of retreat as much as feasible.
One hope I have is that you will be aware of the potentially high price of investing in this research, and therefore won't feel too bad about deciding against some or all of that investment. It's much better if you can just say to yourself "I don't want to pay that really high price", rather than taking an adjacent-adjacent job and trying to contort yourself into believing that you are addressing the hard parts. That sort of contortion is unhealthy, doesn't do anything good, and also pollutes the epistemic commons.
You may not be cut out for this research. That's ok.
True doubt
To make progress here, you'll have to Truly Doubt many things. You'll have to question your concepts and beliefs. You'll have to come up with cool ideas for alignment, and then also truly doubt them to the point where you actually figure out the fundamental reasons they cannot work. If you can't do that, you will not make any significant contribution to the hard parts of the problem.
You'll have to kick up questions that don't even seem like questions because they are just how things work. E.g. you'll have to seriously question what goodness and truth are, how they work, what is a concept, do concepts ground out in observations or math, etc.
You'll have to notice when you're secretly hoping that something is a good idea because it'll get you collaborators, recognition, maybe funding. You'll have to quickly doubt your idea in a way that could actually convince you thoroughly, at the core of the intuition, why it won't work.
This isn't to say "smush your butterfly ideas".
Iterative babble and prune
Cultivate the virtues both of babble and of prune. Interleave them, so that you are babbling with concepts that were forged in the crucible of previous rounds of prune. Good babble requires good prune.
A central class of examples of iterative babble/prune is the Builder/Breaker game. You can do this game for parts of a supposed safe AGI (such as "a decision theory that truly stays myopic", or something), or for full proposals for aligned AGI.
I would actually probably recommend that if you're starting out, you mainly do Builder/Breaker on full proposals for making useful safe AGI, rather than on components. That's because if you don't, you won't learn about shell games.
You should do this a lot. You should probably do this like literally 5x or 10x as much as you would have done otherwise. Like, break 5 proposals. Then do other stuff. Then maybe come up with one or two proposals, and then break those, and also break some other ones from the literature. This is among the few most important pieces of advice in this big list.
More generally you should do Babble/Prune on the object and meta levels, on all relevant dimensions.
Learning to think
You're not just trying to solve alignment. It's hard enough that you also have to solve how to solve alignment. You have to figure out how to think productively about the hard parts of alignment. You'll have to gain new concepts, directed by the overall criterion of really understanding alignment. This will be a process, not something you do at the beginning.
Get the fundamentals right—generate hypotheses, stare at data, practice the twelve virtues.
Dwell in the fundamental questions of alignment for however long it takes. Plant questions there and tend to them.
Grappling with the size of minds
A main reason alignment is exceptionally hard is that minds are big and complex and interdependent and have many subtle aspects that are alien to what you even know how to think about. You will have to grapple with that by talking about minds directly at their level.
If you try to only talk about nice, empirical, mathematical things, then you will be stumbling around hopelessly under the streetlight. This is that illegibility thing I mentioned earlier. It sucks but it's true.
Don't turn away from it even as it withdraws from you.
If you don't grapple with the size of minds, you will just be doing ordinary science, which is great and is also too slow to solve alignment.
Zooming
Zoom in on details because that's how to think; but also, interleave zooming out. Ask big picture questions. How to think about all this? What are the elements needed for an alignment solution? How do you get those elements? What are my fundamental confusions? Where might there be major unknown unknowns?
Zoomed out questions are much more difficult. But that doesn't mean you shouldn't investigate them. It means you should consider your answers provisional. It means you should dwell in and return to them, and plant questions about them so that you can gain data.
Although they are more difficult, many key questions are, in one or another sense, zoomed out questions. Key questions should be investigated early and often so that you can overhaul your key assumptions and concepts as soon as possible. The longer a key assumption is wrong, the longer you're missing out on a whole space of investigation.
Generalize a lot
When an idea or proposal fails, try to generalize far. Draw really wide-ranging conclusions. In some sense this is very fraught, because you're making a much stronger claim, so it's much much more likely to be incorrect. So, the point isn't to become really overconfident. The point is to try having hypotheses at all, rather than having no hypotheses. Say "no alignment proposal can work unless it does X"—and then you can counterargue against that, in an inverse of the Builder/Breaker game (and another example of interleaved Babble/Prune).
You can ask yourself: "How could I have thought that faster?"
You can ask yourself: "What will I probably end up wishing I would have thought faster? What generalization might my future self have gradually come to formulate and then be confident in by accumulating data, which I could think of now and test more quickly?"
Example: Maybe you think for a while about brains and neurons and neural circuits and such, and then you decide that this is too indirect a way to get at what's happening in human minds, and instead you need a different method. Now, you should consider generalizing to "actually, any sort of indirect/translated access to minds carries very heavy costs and doesn't necessarily help that much with understanding what's important about those minds", and then for example apply this to neural net interpretability (even assuming those are mind-like enough).
Example: Maybe you think a bunch about a chess-playing AI. Later you realize that it is just too simple, not mind-like enough, to be very relevant. So you should consider generalizing a lot to think that anything that fails to be mind-like will not tell you much of what you need to know about minds as such.
Notes to mentors
If you're going to be mentoring other people to try to solve the actual core hard parts of the technical AGI alignment problem:
Object level stuff