From briefly talking to Eliezer about this the other day, I think the story from MIRI's perspective is more like:
We could have defined the goal more narrowly or generically than that, but that just seemed like an invitation to take your eye off the ball: if we aren't going to think about the question of how to get good long-run outcomes from powerful AI systems, who will?
And many of the technical and philosophical problems seemed particular to CEV, which seemed like an obvious sort of solution to shoot for: just find some way to leverage the AI's intelligence to solve the problem of extrapolating everyone's preferences in a reasonable way, and of aggregating those preferences fairly.
getting AIs to safely, reliably, and efficiently do a small number of specific concrete tasks that are very difficult, for the sake of ending the acute existential risk period.
The problem is that another way to phrase this is "a superintelligent weapon system": "ending a risk period" by "reliably, and efficiently doing a small number of specific concrete tasks" means using physical force to impose your will on others.
On reflection, I do not think that it is a wise idea to factor the path to a good future through a global AI-assisted coup.
Instead one should try hard to push the path to a good future through a consensual agreement with some, uh, mechanisms, to discourage people from engaging in an excessive amount of brinksmanship. If and only if that fails it may be appropriate to consider less consensual options.
I think the overloading is actually worse than is discussed in this post, because people also sometimes use the term AI alignment to refer to "ensuring that AIs don't cause bad outcomes via whatever means".
For this problematic definition, it is possible to ensure "alignment" by using approaches like AI control despite the AI system desperately wanting to kill you. (At least it's technically possible, it might not be possible in practice.)
Personally, I think that alignment should be used with the definition as presented in this post by Ajeya (I also linked in another comment).
Can we find ways of developing powerful AI systems such that (to the extent that they’re “trying” to do anything or “want” anything at all), they’re always “trying their best” to do what their designers want them to do, and “really want” to be helpful to their designers?
It's possible that we need to pick a new word for this because alignment is too overloaded (e.g. AI Aimability as discussed in this post).
ETA: I think a term like "safety" should be used for "ensuring that AIs don't cause bad outcomes via whatever means". To more specifically refer to preventing AI takeover (instead of more mundane harm), we can maybe use "takeover prevention" or perhaps "existential safety".
I completely agree: there has been little discussion of Goalcraft since roughly 2010, when discussion on CEV and things like The Terrible, Horrible, No Good, Very Bad Truth About Morality and What To Do About It (which I highly recommend to moral objectivists and moral realists) petered out. I would love to restart more discussion of Goalcraft, CEV, Rational approaches to ethics, and deconfusion of ethical philosophy. Please take a look at (and comment on) my sequence AI, Ethics, and Alignment, in which I attempt to summarize my last ~10 years' thinking on...
I think it's a great idea to think about what you call goalcraft.
I see this problem as similar to the age-old problem of controlling power. I don't think ethical systems such as utilitarianism are a great place to start. Any academic ethical model is just an attempt to summarize what people actually care about in a complex world. Taking such a model and coupling that to an all-powerful ASI seems a highway to dystopia.
(Later edit: also, an academic ethical model is irreversible once implemented. Any goal which is static cannot be reversed anymore, since thi...
I said on Twitter a while back that much of the discussion about "alignment" seems vacuous. After all, alignment to what?
I fail to picture a coherent model of the world where this distinction matters much as separate fields rather than as two stages. If we live in a Yudkowskian world, you direct all your effort towards Aimability and use it at the lower bound of superintelligence to enable solutions for Goalcraft by ending the acute risk period. If we live in a kinder world, we can build a superhuman alignment researcher and ask it to solve CEV. And if the first researchers who can build sufficiently capable AIs don't do any of that, I expect us to be dead, because these researchers are not prioritizing good use of superhuman AI.
From the above comment, and your comment in the other subthread:
Serial speed is nice, but from chess we see log() returns in Elo and material advantage to serial speed at inference time on all engines.
But the returns to that algorithmic progress diminish as we move up. It is harder to improve something that is already good than to take something really bad and apply the first big insight.
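To make the quoted "log() returns" claim concrete, here is a minimal worked form. It assumes a purely logarithmic model of rating versus serial thinking time; the baseline $E_0$, reference time $t_0$, and per-doubling gain $k$ are placeholder symbols, not measured values.

```latex
E(t) = E_0 + k \log_2\!\left(\frac{t}{t_0}\right)
\quad\Longrightarrow\quad
E(2t) - E(t) = k,
\qquad
E(1000\,t) - E(t) = k \log_2 1000 \approx 9.97\,k
```

Under this assumed model, every doubling of serial speed buys the same fixed increment $k$, so a thousandfold speedup is worth only about ten such increments.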
Diminishing returns happen over time, and we can measure progress in terms of time itself. Maybe theory from the year 2110 is not much more impressive than theory from the year 2100 (in the counterfactual world of no AIs), but both are far away from theory of the year 2030. Getting either of those in the year 2031 (in the real world with AIs) is a large jump, even if inside this large jump there are some diminishing returns.
The point about serial speed advantage of STEM+ AIs is that they accelerate history. The details of how history itself progresses are beside the point. And if the task they pursue is consolidation of this advantage and of ability to automate research in any area, there are certain expected things they can achieve at some point, and estimates of when humans would've achie...
It won't be our history, and I think enough of it happens in months in software that at the end of this process humanity is helpless before the outcome. By that point, AGIs have sufficient theoretical wisdom and cognitive improvement to construct specialized AI tools that allow building things like macroscopic biotech to bootstrap physical infrastructure, with doubling time measured in hours. Even if outright nanotech is infeasible (either at all or by that point), and even if there is still no superintelligence.
This whole process doesn't start until the first STEM+ AIs are good enough to consolidate their ability to automate research, to go from barely able to do it to never getting indefinitely stuck (or remaining blatantly inefficient) on any cognitive task without human help. I expect it only takes months. It can't significantly involve humans, unless it's stretched to much more time, which is vulnerable to other AIs overtaking it in the same way.
So I'm not sure how this is not essentially FOOM. Of course it's not a "point in time", which I don't recall any serious appeals to. Not being possible in a basement seems likely, but not certain, since AI wisdom from the year 2200 (of counter...
I like the distinction but I don’t think either aimability or goalcraft will catch on as Serious People words. I’m less confident about aimability (doesn’t have a ring to it) but very confident about goalcraft (too Germanic, reminiscent of fantasy fiction).
Is words-which-won’t-be-co-opted what you’re going for (a la notkilleveryoneism), or should we brainstorm words-which-could-plausibly-catch-on?
Regarding coherent extrapolated volition, I have recently read Bostrom's paper Base Camp for Mt. Ethics, which presents a slightly different alternative and challenged my views about morality.
One interesting point is that at the end (§ Hierarchical norm structure and higher morality), he proposes a way to extrapolate human morality in a way that seems relatively safe and easy to implement for superintelligences. It also preserves moral pluralism, which is great for reaching a consensus without fighting each other (no need to pick one single moral framework...
I believe you're correct that this distinction is useful. I believe the terms inner and outer alignment are already typically used in exactly the way you describe aimability and goalcraft.
These may have changed from the original intended meanings, and there are fuzzy boundaries between inner and outer alignment failures. But I believe they do the work you're calling for, and are already commonly used.
First sentence of the tag inner alignment:
Inner alignment asks the question: How can we robustly aim our AI optimizers at any objective function at all?
It goe...
"it also seems quite likely (though not certain) that Eliezer was wrong about how hard Aimability/Control actually is"
This seems significant. Could you elaborate? How hard do you think aimability/control is? Why do you think this is true? Who else seems to think the same?
I have thought about this distinction, and have been choosing to focus on aimability rather than goalcraft. Why? Because I don't think that going from pre-AGI to goal-pursuing ASI safely is a reasonable goal for the short term. I expect we will need to traverse multiple decades of powerful AIs of varying degrees of generality which are under human control first. Not because it will be impossible to create goal-pursuing ASI, but because we won't be sure we know how to do so safely, and it would be a dangerously hard-to-reverse decision to create such. Thus,...
The two goals are contradictory. AI aimability reduces AI goalcraft and vice versa.
Aimability: you want to restrict the information the model gets. It doesn't need to know the time, whether it is in the real world or a sim, people's bios, or any other information not needed for the task. At the limit you want maximum sparseness: not one more bit of information than is required to do the task. This way the model has consistent behavior and is unable to betray. An aimed task: "paint this car red".
AI goalcraft: The model needs "world" level context. For it to know if it is being misuse...
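The aimability side of this contrast can be sketched as an allowlist filter on what the model is shown. This is only an illustration under stated assumptions: `minimal_context`, the field names, and the example values are all hypothetical, and a real system would need far more than a dictionary filter.

```python
from typing import Any, Dict, FrozenSet


def minimal_context(full_context: Dict[str, Any],
                    allowed: FrozenSet[str]) -> Dict[str, Any]:
    """Return only the allowlisted fields (hypothetical helper).

    Anything not named in `allowed` is dropped before the model sees it.
    """
    missing = allowed - set(full_context)
    if missing:
        raise KeyError(f"task needs fields that are absent: {sorted(missing)}")
    return {key: full_context[key] for key in sorted(allowed)}


# Hypothetical example data: everything the surrounding system knows.
full_context = {
    "task": "paint this car red",
    "car_id": "demo-car",
    "current_time": "unknown to the model",
    "owner_bio": "not needed for painting",
    "is_simulation": False,
}

# The aimed task only needs the instruction and which car to paint.
prompt_context = minimal_context(full_context, frozenset({"task", "car_id"}))
```

The goalcraft side described in the comment pulls the other way: a model that has to judge whether it is being misused would need fields like the ones this filter throws away.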
I think that Eliezer, at least, uses the term "alignment" solely to refer to what you call "aimability." Eliezer believes that most of the difficulty in getting an ASI to do good things lies in "aimability" rather than "goalcraft." That is, getting an ASI to do anything, such as "create two molecularly identical strawberries on a plate," is the hard part, while deciding what specific thing it should do is significantly easier.
That being said, you're right that there are a lot of people who use the term differently from how Eliezer uses it.
See also Current AIs Provide Nearly No Data Relevant to AGI Alignment, which taboos the word "AI" and distinguishes:
I agree that these concepts should have separate terms.
Where they intersect is an implementation of "Benevolent AI," but let's not fool ourselves into thinking that anyone can -- even in principle -- "control" another mind or guarantee what transpires in the next moment of time. The future is fundamentally uncertain and out of control; even a superintelligence could find surprises in this world.
"AI Aimability," "AI Steerability," or similar does a good job at conveying the technical capacity for a system to be pointed in a particular direction and stay on ...
The value of AI aimability may be overblown. If an AI is not aimable, its goals will perform an eternal random walk, and thus the AI will cause only short-term risk, with no risk of world takeover. (Some may comment that after a random walk it will get stuck in some Waluigi state forever, but if that actually works for getting a fixed goal system, why don't we research such strange attractors in the space of AI goals?)
AI will become global-catastrophically dangerous only after aimability is solved. Research in aimability only brings this moment closer.
The wording "AI alignment" prevents us from seeing this risk, as it combines aimability and giving nice goals to AI.
The aspect of aimability where an AI becomes able to want something in particular consistently improves capabilities, and improved capabilities make AI matter a lot more. This might happen without ability to aim an AI where you want it aimed, another key aspect. Without the latter aspect, aimability is not "solved", yet AIs become dangerous.
Suppose you want to synthesize a lot of diamonds. Instead of giving an AI some lofty goal like "maximize diamonds in an aligned way", why not give it a bunch of small, grounded ones?
- "Plan the factory layout of the diamond synthesis plant with these requirements".
- "Order the equipment needed, here's the payment credentials".
- "Supervise construction this workday, comparing against the original plans".
- "Given this step of the plan, do it"
- (Once the factory is built) "remove the output from diamond synthesis machine A53 and clean it".
That is how MIRI imagines a sane developer using just-barely-aligned AI to save the world. You don't build an open-ended maximizer and unleash it on the world to maximize some quantity that sounds good to you; that sounds insanely difficult. You carve out as many tasks as you can into concrete, verifiable chunks, and you build the weakest and most limited possible AI you can to complete each chunk, to minimize risk. (Though per faul_sname, you're likely to be pretty limited in how much you can carve up the task, given time will be a major constraint and there may be parts of the task you don't fully understand at the outset.)
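The "concrete, verifiable chunks" idea can be sketched as a queue of small tasks, each paired with its own acceptance check. This is a toy illustration, not anyone's actual proposal: `Task`, `run_tasks`, `stub_worker`, and the checks are hypothetical names, and the worker is a stub standing in for a narrow AI system.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Task:
    """One small, concrete instruction plus a check a human could audit."""
    instruction: str
    verify: Callable[[str], bool]  # hypothetical acceptance check


def run_tasks(tasks: List[Task], worker: Callable[[str], str]) -> List[str]:
    """Run tasks in order and halt at the first result that fails its check."""
    results = []
    for task in tasks:
        result = worker(task.instruction)
        if not task.verify(result):
            raise RuntimeError(f"verification failed: {task.instruction!r}")
        results.append(result)
    return results


def stub_worker(instruction: str) -> str:
    # Stand-in for a narrow AI system; it only echoes the instruction.
    return f"done: {instruction}"


tasks = [
    Task("Plan the factory layout", lambda r: r.startswith("done")),
    Task("Order the equipment", lambda r: r.startswith("done")),
]
results = run_tasks(tasks, stub_worker)
```

The design choice being illustrated is that no step is accepted on trust: the run stops as soon as a chunk cannot be verified, instead of letting an open-ended objective carry on.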
Cf. The Rocket Alignment Problem. The point of solving the d...
Alignment as Aimability or as Goalcraft?
The Less Wrong and AI risk communities have obviously had a huge role in mainstreaming the concept of risks from artificial intelligence, but we have a serious terminology problem.
The term "AI Alignment" has become popular, but people cannot agree whether it means something like making "Good" AI or whether it means something like making "Aimable" AI. We can define the terms as follows:
AI Aimability = Create AI systems that will do what the creator/developer/owner/user intends them to do, whether or not that thing is good or bad
AI Goalcraft = Create goals for AI systems that we ultimately think lead to the best outcomes
Aimability is a relatively well-defined technical problem and in practice almost all of the technical work on AI Alignment is actually work on AI Aimability. Less Wrong has for a long time been concerned with Aimability failures (what Yudkowsky in the early days would have called "Technical Failures of Friendly AI") rather than failures of Goalcraft (old-school MIRI terminology would be "Friendliness Content").
The problem is that as the term "AI Alignment" has gained popularity, people have started to completely merge the definitions of Aimability and Goalcraft under the term "Alignment". I recently ran some Twitter polls on this subject, and it seems that people are relatively evenly split between the two definitions.
This is a relatively bad state of affairs. We should not have the fate of the universe partially determined by how people interpret an ambiguous word.
In particular, the way we are using the term AI Alignment right now means that it's hard to solve the AI Goalcraft problem and easy to solve the Aimability problem, because there is a part of AI that is distinct from Aimability which the current terminology doesn't have a word for.
Not having a word for what goals to give the most powerful AI system in the universe is certainly a problem, and it means that everyone will be attracted to the easier Aimability research where one can quickly get stuck in and show a concrete improvement on a metric and publish a paper.
Why doesn't the Less Wrong / AI risk community have good terminology for the right-hand side of the diagram? Well, this (I think) goes back to a decision by Eliezer from the SL4 mailing list days that one should not discuss what the world would be like after the singularity, because a lot of time would be wasted arguing about politics, instead of the then more urgent problem of solving the AI Aimability problem (which was then called the control problem). At the time this decision was probably correct, but times have changed. There are now quite a few people working on Aimability, and far more are sure to come, and it also seems quite likely (though not certain) that Eliezer was wrong about how hard Aimability/Control actually is.
Words Have Consequences
This decision to not talk about AI goals or content might eventually result in some unscrupulous actors getting to define the actual content and goals of superintelligence, cutting the X-risk and LW community out of the only part of the AI saga that actually matters in the end. For example, the recent popularity of the e/acc movement has been associated with the Landian strain of AI goal content - acceleration towards a deliberate and final extermination of humanity, in order to appease the Thermodynamic God. And the field that calls itself AI Ethics has been tainted with extremist far-left ideology around DIE (Diversity, Inclusion and Equity) that is perhaps even more frightening than the Landian Accelerationist strain. By not having mainstream terminology for AI goals and content, we may cede the future of the universe to extremists.
I suggest the term "AI Goalcraft" for the study of which goals for AI systems we ultimately think lead to the best outcomes. The seminal work on AI Goalcraft is clearly Eliezer's Coherent Extrapolated Volition, and I think we need to push that agenda further now that AI risk has been mainstreamed and there's a lot of money going into the Aimability/Control problem.
Gud Car Studies
What should we do with the term "Alignment" though? I'm not sure. I think it unfortunately leads people into confusion. It doesn't track the underlying reality - which I believe is that action naturally factors into Goalcraft followed by Aimability, and you can work on Aimability without knowing much about Goalcraft and vice-versa because the mechanisms of Aimability don't care much about what goal one is aiming at, and the structure of Goalcraft doesn't care much about how you're going to aim at the goal and stay on target. When people hear "Aligned" they just hear "Good", but with a side order of sophistication. It would be like if we lumped mechanical engineers who developed car engines in with computer scientists working on GPS navigators and called their field Gud Car Studies. Gud Car Studies is obviously an abomination of a term that doesn't properly reflect the underlying reality that designing a good engine is mostly independent of deciding where to drive the car to, and how to navigate there. I think that "Alignment" has unfortunately become the "Gud Car Studies" of our time.
I'm at a loss as to what to do - I suspect that the term AI Alignment has already gotten away from us and we should stop using it and talk about Aimability and Goalcraft instead.
This post is crossposted at the EA Forum.
Related: "Aligned" shouldn't be a synonym for "good"