I’m not convinced that the “hard parts” of alignment are difficult in the standardly difficult way, requiring the kind of g that e.g. a physics post-doc might possess. I do think it takes an unusual skillset, though, which is where most of the trouble lives. I.e., I think the pre-paradigmatic skillset requires unusually strong epistemics (because you often need to track for yourself what makes sense), ~creativity (the ability to synthesize new concepts, to generate genuinely novel hypotheses/ideas), good ability to traverse levels of abstraction (connecting details to high-level structure; this is especially important for the alignment problem), not being efficient-market-pilled (you have to believe that more is possible in order to aim for it), noticing confusion, and probably a lot more that I’m failing to name here.
Most importantly, though, I think it requires quite a lot of willingness to remain confused. Many scientists who accomplished great things (Darwin, Einstein) didn’t have publishable results on their main inquiry for years. Einstein, for instance, talks about wandering off for weeks in a state of “psychic tension” in his youth; it took ~ten years to go from his first inkling of ...
This is the sort of thing I find appealing to believe, but I feel at least somewhat skeptical of. I notice a strong emotional pull to want this to be true (as well as an interesting counterbalancing emotional pull for it to not be true).
I don't think I've seen output from people aspiring in this direction who aren't visibly quite smart that makes me think "okay yeah, it seems like it's on track in some sense."
I'd be interested in hearing more explicit cruxes from you about it.
I do think it's plausible that the "smart enough, creative enough, strong epistemics, independent, willing to spend years without legible output, exceptionally driven, and so on" qualities are sufficient (if you're at least moderately-but-not-exceptionally smart). Those are rare enough qualities that it doesn't necessarily feel like I'm getting a free lunch, if they turn out to be sufficient for groundbreaking pre-paradigmatic research. I agree the x-risk pipeline hasn't tried very hard to filter for and/or generate people with these qualities.
(well, okay, "smart enough" is doing a lot of work there, I assume from context you mean "pretty smart but not like genius smart")
But, I've only really seen you note...
I think this is right. A couple of follow-on points:
There's a funding problem if this is an important route to progress. If good work is illegible for years, it's hard to decide who to fund, and hard to argue for people to fund it. I don't have a proposed solution, but I wanted to note this large problem.
Einstein did his pre-paradigmatic work largely alone. Better collaboration might've sped it up.
LessWrong allows people to share their thoughts prior to having publishable journal articles and get at least a few people to engage.
This makes the difficult pre-paradigmatic thinking a group effort instead of a solo effort. This could speed up progress dramatically.
This post and the resulting comments and discussions are an example of the community collectively doing much of the work you describe: traversing levels, practicing good epistemics, and remaining confused.
Having conversations with other LWers (on calls, by DM, or in extended comment threads) is tremendously useful for me. I could produce those same thoughts and critiques, but it would take me longer to arrive at all of those different viewpoints on the issue. I mention this to encourage others to do it. Communication takes time...
I’m not convinced that the “hard parts” of alignment are difficult in the standardly difficult way, requiring the kind of g that e.g. a physics post-doc might possess.
To be clear, I wasn't talking about physics postdocs mainly because of raw g. Raw g is a necessary element, and physics postdocs are pretty heavily loaded on it, but I was talking about physics postdocs mostly because of the large volume of applied math tools they have.
The usual way that someone sees footholds on the hard parts of alignment is to have a broad enough technical background that they can see some analogy to something they know about, and try borrowing tools that work on that other thing. Thus the importance of a large volume of technical knowledge.
I currently think broad technical knowledge is the main requisite, and I think self-study can suffice for the large majority of that in principle. The main failure mode I see would-be autodidacts run into is motivation, but if you can stay motivated then there's plenty of study materials.
For practice solving novel problems, just picking some interesting problems (preferably not AI) and working on them for a while is a fine way to practice.
Good post, although I have some misgivings about how unpleasant it must be to read for some people.
One factor not mentioned here is the history of MIRI. MIRI was a pioneer in the field, and it was MIRI who articulated and promoted the agent foundations research agenda. The broad goals of agent foundations[1] are (IMO) load-bearing for any serious approach to AI alignment. But when MIRI essentially declared defeat, in the minds of many that meant that any approach in that vein was doomed. Moreover, MIRI's extreme pessimism deflates motivation and naturally produces the thought "if they are right then we're doomed anyway, so might as well assume they are wrong".
Now, I have a lot of respect for Yudkowsky and many of the people who worked at MIRI. Yudkowsky started it all, and MIRI made solid contributions to the field. I'm also indebted to MIRI for supporting me in the past. However, MIRI also suffered from some degree of echo-chamberism, founder-effect-bias, insufficient engagement with prior research (due to hubris), looking for nails instead of looking for hammers, and poor organization[2].
MIRI made important progress in agent foundations, but also missed an opportunity to do ...
I'm sympathetic to most prosaic alignment work being basically streetlighting. However, I think there's a nirvana fallacy going on when you claim that the entire field has gone astray. It's easiest to illustrate what I mean with an analogy to capabilities.
In capabilities land, there were a bunch of old school NLP/CV people who insisted that there's some kind of true essence of language or whatever that these newfangled neural network things weren't tackling. The neural networks are just learning syntax, but not semantics, or they're ungrounded, or they don't have a world model, or they're not representing some linguistic thing, so therefore we haven't actually made any progress on true intelligence or understanding etc etc. Clearly NNs are just progress on the surface appearance of intelligence while actually just being shallow pattern matching, so any work on scaling NNs is actually not progress on intelligence at all. I think this position has become more untenable over time. A lot of people held onto this view deep into the GPT era but now even the skeptics have to begrudgingly admit that NNs are pretty big progress even if additional Special Sauce is needed, and that the other ...
I think you have two main points here, which require two separate responses. I'll do them opposite the order you presented them.
Your second point, paraphrased: 90% of anything is crap, that doesn't mean there's no progress. I'm totally on board with that. But in alignment today, it's not just that 90% of the work is crap, it's that the most memetically successful work is crap. It's not the raw volume of crap that's the issue so much as the memetic selection pressures.
Your first point, paraphrased: progress toward the hard problem does not necessarily immediately look like tackling the meat of the hard problem directly. I buy that to some extent, but there are plenty of cases where we can look at what people are doing and see pretty clearly that it is not progress toward the hard problem, whether direct or otherwise. And indeed, I would claim that prosaic alignment as a category is a case where people are not making progress on the hard problems, whether direct or otherwise. In particular, one relevant criterion to look at here is generalizability: is the work being done sufficiently general/robust that it will still be relevant once the rest of the problem is solved (and multiple things change in not-yet-predictable ways in order to solve the rest of the problem)? See e.g. this recent comment for an object-level example of what I mean.
in capabilities, the most memetically successful things were for a long time not the things that actually worked. for a long time, people would turn their noses at the idea of simply scaling up models because it wasn't novel. the papers which are in retrospect the most important did not get that much attention at the time (e.g gpt2 was very unpopular among many academics; the Kaplan scaling laws paper was almost completely unnoticed when it came out; even the gpt3 paper went under the radar when it first came out.)
one example of a thing within prosaic alignment that i feel has the possibility of generalizability is interpretability. again, if we take the generalizability criterion and map it onto the capabilities analogy, it would be something like scalability - is this a first step towards something that can actually do truly general reasoning, or is it just a hack that will no longer be relevant once we discover the truly general algorithm that subsumes the hacks? if it is on the path, can we actually shovel enough compute into it (or its successor algorithms) to get to agi in practice, or do we just need way more compute than is practical? and i think at the time of gpt2 these were completely unsettled research questions! it was actually genuinely unclear whether writing articles about ovid's unicorn is a genuine first step towards agi, or just some random amusement that will fade into irrelevancy. i think interp is in a similar position where it could work out really well and eventually become the thing that works, or it could just be a dead end.
ok good that we agree interp might plausibly be on track. I don't really care to argue about whether it should count as prosaic alignment or not. I'd further claim that the following (not exhaustive) are also plausibly good (I'll sketch each out for the avoidance of doubt because sometimes people use these words subtly differently):
My guess is a roughly equally "central" problem is the incentive landscape around the OpenPhil/Anthropic school of thought
(Prefatory disclaimer that, admittedly as an outsider to this field, I absolutely disagree with the labeling of prosaic AI work as useless streetlighting, for reasons building upon what many commenters wrote in response to the very posts you linked here as assumed background material. But in the spirit of your post, I shall ignore that moving forward.)
The "What to Do About It" section dances around but doesn't explicitly name one of the core challenges of theoretical agent-foundations work that aims to solve the "hard bits" of the alignment challenge, namely the seeming lack of reliable feedback loops that give you some indication that you are pushing towards something practically useful in the end instead of just a bunch of cool math that nonetheless resides alone in its separate magisterium. As Conor Leahy concisely put it:
Humans are really, really bad at doing long chains of abstract reasoning without regular contact with reality, so in practice imo good philosophy has to have feedback loops with reality, otherwise you will get confused.
He was talking about philosophy in particular at that juncture, in response to Wei Dai's concerns over metaphilosophical competence, but this po...
I actually disagree with the natural abstractions research being ungrounded. Indeed, I think there is reason to believe that at least some of the natural abstractions work, especially the natural abstraction hypothesis, actually sort of holds true for today's AI, and thus is the most likely of the theoretical/agent-foundations approaches to work (I'm usually critical of agent foundations, but John Wentworth's work is an exception that I'd like funding for).
For example, this post runs an experiment showing that the Platonic Representation Hypothesis still holds on OOD data, meaning that deeper factors are likely at play than just shallow similarity:
I'm wary of a possible equivocation about what the "natural abstraction hypothesis" means here.
If we are referring to the redundant information hypothesis and various kinds of selection theorems, this is a mathematical framework that could end up being correct, is not at all ungrounded, and Wentworth sure seems like the man for the job.
But then you are still left with the task of grounding this framework in physical reality to allow you to make correct empirical predictions about and real-world interventions on what you will see from more advanced models. Our physical world abstracting well seems plausible (not necessarily >50% likely), and these abstractions being "natural" (e.g., in a category-theoretic sense) seems likely conditional on the first clause of this sentence being true, but I give an extremely low probability to the idea that these abstractions will be used by any given general intelligence or (more to the point) advanced AI model to a large and wide enough extent that retargeting the search is even close to possible.
And indeed, it is the latter question that represents the make-or-break moment for natural abstractions' theory of change, for it is only when ...
I think this statement is quite ironic in retrospect, given how OpenAI's o-series seems to work
I stand by my statement and don't think anything about the o-series model invalidates it.
And to be clear, I've expected for many years that early powerful AIs will be expensive to run, and have critiqued people for analyses that implicitly assumed or implied that the first powerful AIs will be cheap, prior to the o-series being released. (Though unfortunately for the two posts I'm thinking of, I made the critiques privately.)
There's a world of difference between "you can get better results by thinking longer" (yeah, obviously this was going to happen) and "the AI system is a mesa optimizer in the strong sense that it has an explicitly represented goal such that you can retarget the search" (I seriously doubt it for the first transformative AIs, and am uncertain for post-singularity superintelligence).
Re the OpenAI o-series and search, my initial prediction is that Q*/MCTS search will work well on problems that are easy to verify and easy to get training data for, and not work if either of these two conditions is violated; secondarily, it will rely on the model having good error-correction capabilities to use the search effectively. This is why I expect we can make RL capable of superhuman performance on mathematics/programming with some rather moderate schlep/drudge work, and I also expect cost reductions such that it can actually be practical, but I'm only giving a 50/50 chance of superhuman performance (as measured by benchmarks in these domains) by 2028.
I think my main difference from you, Thane Ruthenis, is that I expect costs to reduce surprisingly rapidly, though this is admittedly untested.
This will accelerate AI progress, but not immediately cause an AI explosion, though in the more extreme cases this could create something like a scenario where programming companies are founded by a few people smartly managing a lot of programming AIs, and programming/mathematics experiences something like what happened to the news industry with the rise of the internet, where ther...
Epistemic status: This is a work of satire. I mean it---it is a mean-spirited and unfair assessment of the situation. It is also how, some days, I sincerely feel.
A minivan is driving down a mountain road, headed towards a cliff's edge with no guardrails. The driver floors the accelerator.
Passenger 1: "Perhaps we should slow down somewhat."
Passengers 2, 3, 4: "Yeah, that seems sensible."
Driver: "No can do. We're about to be late to the wedding."
Passenger 2: "Since the driver won't slow down, I should work on building rocket boosters so that (when we inevitably go flying off the cliff edge) the van can fly us to the wedding instead."
Passenger 3: "That seems expensive."
Passenger 2: "No worries, I've hooked up some funding from Acceleration Capital. With a few hours of tinkering we should get it done."
Passenger 1: "Hey, doesn't Acceleration Capital just want vehicles to accelerate, without regard to safety?"
Passenger 2: "Sure, but we'll steer the funding such that the money goes to building safe and controllable rocket boosters."
The van doesn't slow down. The cliff looks closer now.
Passenger 3: [looking at what Passenger 2 is building] "Uh, haven't you just made a faster engine?"
Passen...
unfortunately, the disanalogy is that any driver who moves their foot towards the brakes is almost instantly replaced with one who won't.
Driver: My map doesn't show any cliffs
Passenger 1: Have you turned on the terrain map? Mine shows a sharp turn next to a steep drop coming up in about a mile
Passenger 5: Guys maybe we should look out the windshield instead of down at our maps?
Driver: No, passenger 1, see on your map that's an alternate route, the route we're on doesn't show any cliffs.
Passenger 1: You don't have it set to show terrain.
Passenger 6: I'm on the phone with the governor now, we're talking about what it would take to set a 5 mile per hour national speed limit.
Passenger 7: Don't you live in a different state?
Passenger 5: The road seems to be going up into the mountains, though all the curves I can see from here are gentle and smooth.
Driver and all passengers in unison: Shut up passenger 5, we're trying to figure out if we're going to fall off a cliff here, and if so what we should do about it.
Passenger 7: Anyway, I think what we really need to do to ensure our safety is to outlaw automobiles entirely.
Passenger 3: The highest point on Earth is 8849m above sea level, and the lowest point is 430 meters below sea level, so the cliff in front of us could be as high as 9279m.
I think I do agree with some points in this post. This failure mode is the same as the one I mentioned about why people are doing interpretability, for instance (section "Outside view: The proportion of junior researchers doing Interp rather than other technical work is too high"), and I do think that this generalizes somewhat to the whole field of alignment. But I'm highly skeptical that recruiting a bunch of physicists to work on alignment would be that productive:
hi John,
Let's talk about a hypothetical physicist-turned-alignment researcher who (for no particular reason) we'll call John. This researcher needs to periodically repeat to himself that Only Physicists do Real Work, but also needs to write an Alignment Post-Mortem. Maybe he needs the Post-Mortem as a philosophical fig leaf to justify his own program's lack of empirical grounding, or maybe he's honestly concerned but not very good at seeing the Planck in his own eye. Either way, he meets with two approaches to writing critiques - let's call them (again for no particular reason) "careful charitable engagement" and "provocative dismissal." Both can be effective, but they have very different impacts on community discourse. It turns out that careful engagement requires actually understanding what other researchers are doing, while provocative dismissal lets you write spicy posts from your armchair. Lo and behold, John endorses provocative dismissal as his Official Blogging Strategy, and continues writing critiques. (Actually the version which eventually emerges isn't even fully researched, it's a quite different version which just-so-happens to be even more dismissive of work he hasn't closely followed.)
Glad to hear you enjoyed ILIAD.
Best,
AGO
Alice is excited about the eliciting latent knowledge (ELK) doc, and spends a few months working on it. Bob is excited about debate, and spends a few months working on it. At the end of those few months, Alice has a much better understanding of how and why ELK is hard, has correctly realized that she has no traction on it at all, and pivots to working on technical governance. Bob, meanwhile, has some toy but tangible outputs, and feels like he's making progress.
I don't want to respond to the examples rather than the underlying argument, but it seems necessary here: this seems like a massively overconfident claim about ELK and debate that I don't believe is justified by popular theoretical worst-case objections. I think a common cause of "worlds where iterative design fails" is "iterative design seems hard and we stopped at the first apparent hurdle." Sure, in some worlds we can rule out entire classes of solutions via strong theoretical arguments (e.g., "no-go theorems"); but that is not the case here. If I felt confident that the theory-level objections to ELK and debate ruled out hodge-podge solutions, I would abandon hope in these research directions and drop them from the...
I am a physics PhD student. I study field theory. I have a list of projects I've thrown myself at with inadequate technical background (to start with) and figured out. I've convinced a bunch of people at a research institute that they should keep giving me money to solve physics problems. I've been following LessWrong with interest for years. I think that AI is going to kill us all, and would prefer to live for longer if I can pull it off. So what do I do to see if I have anything to contribute to alignment research? Maybe I'm flattering myself here, but I sound like I might be a person of interest for people who care about the pipeline. I don't feel like a great candidate because I don't have any concrete ideas for AI research topics to chase down, but it sure seems like I might start having ideas if I worked on the problem with somebody for a bit. I'm apparently very ok with being an underpaid gopher to someone with grand theoretical ambitions while I learn the material necessary to come up with my own ideas. My only lead to go on is "go look for something interesting in MATS and apply to it" but that sounds like a great way to end up doing streetlight research because I don't un...
Going to MATS is also an opportunity to learn a lot more about the space of AI safety research, e.g. considering the arguments for different research directions and learning about different opportunities to contribute. Even if the "streetlight research" project you do is kind of useless (entirely possible), doing MATS is plausibly a pretty good option.
MATS will push you to streetlight much more unless you have some special ability to have it not do that.
Do you mean during the program? Sure, maybe the only MATS offers you can get are for projects you think aren't useful--I think some MATS projects are pretty useless (e.g. our dear OP's). But it's still an opportunity to argue with other people about the problems in the field and see whether anyone has good justifications for their prioritization. And you can stop doing the streetlight stuff afterwards if you want to.
Remember that the top-level commenter here is currently a physicist, so it's not like the usefulness of their work would be going down by doing a useless MATS project :P
Remember that the top-level commenter here is currently a physicist, so it's not like the usefulness of their work would be going down by doing a useless MATS project :P
Yes it would! It would eat up motivation and energy and hope that they could have put towards actual research. And it would put them in a social context where they are pressured to orient themselves toward streetlighty research--not just during the program, but also afterward. Unless they have some special ability to have it not do that.
Without MATS: not currently doing anything directly useful (though maybe indirectly useful, e.g. gaining problem-solving skill). Could, if given $30k/year, start doing real AGI alignment thinking from scratch not from scratch, thereby scratching their "will you think in a way that unlocks understanding of strong minds" lottery ticket that each person gets.
With MATS: gotta apply to extension, write my LTFF grant. Which org should I apply to? Should I do linear probes software engineering? Or evals? Red teaming? CoT? Constitution? Hyperparameter gippity? Honeypot? Scaling supervision? Superalign, better than regular align? Detecting deception?
Obviously I disagree with Tsvi regarding the value of MATS to the proto-alignment researcher; I think being exposed to high quality mentorship and peer-sourced red-teaming of your research ideas is incredibly valuable for emerging researchers. However, he makes a good point: ideally, scholars shouldn't feel pushed to write highly competitive LTFF grant applications so soon into their research careers; there should be longer-term unconditional funding opportunities. I would love to unlock this so that a subset of scholars can explore diverse research directions for 1-2 years without 6-month grant timelines looming over them. Currently cooking something in this space.
The first step would probably be to avoid letting the existing field influence you too much. Instead, consider from scratch what the problems of minds and AI are, how they relate to reality and to other problems, and try to grab them with intellectual tools you're familiar with. Talk to other physicists and try to get into exploratory conversation that does not rely on existing knowledge. If you look at the existing field, look at it like you're studying aliens anthropologically.
You could consider doing MATS as "I don't know what to do, so I'll try my hand at something a decent number of apparent experts consider worthwhile and meanwhile bootstrap a deep understanding of this subfield and a shallow understanding of a dozen other subfields pursued by my peers." This seems like a common MATS experience and I think this is a good thing.
On my understanding, EA student clubs at colleges/universities have been the main “top of funnel” for pulling people into alignment work during the past few years. The mix of people going into those clubs is disproportionately STEM-focused undergrads, and looks pretty typical for STEM-focused undergrads. We’re talking about pretty standard STEM majors from pretty standard schools, neither the very high end nor the very low end of the skill spectrum.
At least from the MATS perspective, this seems quite wrong. Only ~20% of MATS scholars in the last ~4 programs have been undergrads. In the most recent application round, the dominant sources of applicants were, in order, personal recommendations, X posts, AI Safety Fundamentals courses, LessWrong, 80,000 Hours, then AI safety student groups. About half of accepted scholars tend to be students and the other half are working professionals.
Robin Hanson recently wrote about two dynamics that can emerge among individuals within an organisation when working as a group to reach decisions. These are the "outcome game" and the "consensus game."
In the outcome game, individuals aim to be seen as advocating for decisions that are later proven correct. In contrast, the consensus game focuses on advocating for decisions that are most immediately popular within the organization. When most participants play the consensus game, the quality of decision-making suffers.
The incentive structure within an organization influences which game people play. When feedback on decisions is immediate and substantial, individuals are more likely to engage in the outcome game. Hanson argues that capitalism's key strength is its ability to make outcome games more relevant.
However, if an organization is insulated from the consequences of its decisions or feedback is delayed, playing the consensus game becomes the best strategy for gaining resources and influence.
This dynamic is particularly relevant in the field of (existential) AI Safety, which needs to develop strategies to mitigate risks before AGI is developed. Currently, we have zero con...
Currently, we have zero concrete feedback about which strategies can effectively align complex systems of equal or greater intelligence to humans.
Actually, I now suspect this is to a significant extent disinformation. You can tell when ideas make sense if you think hard about them. There's plenty of feedback, that's not already being taken advantage of, at the level of "abstract, high-level, philosophy of mind", about the questions of alignment.
TLDR:
First, let me establish that theorists very often disagree on what the hard parts of the alignment problem are, precisely because not enough theoretical and empirical progress has been made to generate agreement on them. All the lists of "core hard problems" the OP cites are different, and Paul Christiano famously wrote a 27-point list of disagreements with Eliezer's. This means that most people's views of the problem are wrong, and should they stick to their guns they might perseverate on either an irrelevant problem or a doomed approach.
I'd guess that historically perseveration has been as large a problem as streetlighting among alignment researchers. Think of all the top alignment researchers in 2018 and all the agendas that haven't seen much progress. Decision theory should probably not take ~30% o...
Thank you for writing this post. I'm probably slightly more optimistic than you on some of the streetlighting approaches, but I've also been extremely frustrated that we don't have anything better, when we could.
That means I'm going to have to speculate about how lots of researchers are being stupid internally, when those researchers themselves would probably say that they are not being stupid at all and I'm being totally unfair.
I've seen discussions from people I vehemently disagreed with that did similar things, and felt very frustrated by not being able to defend my views with greater bandwidth. This isn't a criticism of this post - I think a non-zero number of those are plausibly good - but: I'd be happy to talk at length with anyone who feels like this post is unfair to them, about our respective views. I likely can't do as good a job as John can (not least because our models aren't identical), but I probably have more energy for talking to alignment researchers[1].
...On my understanding, EA student clubs at colleges/universities have been the main “top of funnel” for pulling people into alignment work during the past few years. The mix of people going into those clubs is disproportio
Here's a Facebook post by Yann LeCun from 2017 which has a similar message to this post and seems quite insightful:
...My take on Ali Rahimi's "Test of Time" award talk at NIPS.
Ali gave an entertaining and well-delivered talk. But I fundamentally disagree with the message.
The main message was, in essence, that the current practice in machine learning is akin to "alchemy" (his word).
It's insulting, yes. But never mind that: It's wrong!
Ali complained about the lack of (theoretical) understanding of many methods that are currently used in ML, particularly in deep learning.
Understanding (theoretical or otherwise) is a good thing. It's the very purpose of many of us in the NIPS community.
But another important goal is inventing new methods, new techniques, and yes, new tricks.
In the history of science and technology, the engineering artifacts have almost always preceded the theoretical understanding: the lens and the telescope preceded optics theory, the steam engine preceded thermodynamics, the airplane preceded flight aerodynamics, radio and data communication preceded information theory, the computer preceded computer science.
Why? Because theorists will spontaneously study "si
My view of the development of the field of AI alignment is pretty much the exact opposite of yours: theoretical agent foundations research, what you describe as research on the hard parts of the alignment problem, is a castle in the clouds. Only when alignment researchers started experimenting with real-world machine learning models did AI alignment become grounded in reality. The biggest epistemic failure in the history of the AI alignment community was waiting too long to make this transition.
Early arguments for the possibility of AI existential risk (as seen, for example, in the Sequences) were largely based on 1) rough analogies, especially to evolution, and 2) simplifying assumptions about the structure and properties of AGI. For example, agent foundations research sometimes assumes that AGI has infinite compute or that it has a strict boundary between its internal decision processes and the outside world.
As neural networks started to see increasing success at a wide variety of problems in the mid-2010s, it started to become apparent that the analogies and assumptions behind early AI x-risk cases didn't apply to them. The process of developing an ML model isn't very similar to...
Given that you speak with such great confidence that historical arguments for AI X-risk were not grounded, can you give me any "grounded" predictions about what superintelligent systems will do? (which I think we both agree is ultimately what will determine the fate of the world and universe)
If you make some concrete predictions then we can start arguing about the validity, but I find this kind of "mightier than thou" attitude frustrating, where people keep making ill-defined statements like "these things are theoretical and don't apply" without actually providing any answers to the crucial questions.
Indeed, not only that, I am confident that if you were to try to predict what will happen with superintelligence, you would very quickly start drawing on the obvious analogies to optimizers and Dutch book arguments and evolution and Goodhart's law, because we really don't have anything better.
I... am not very impressed by these predictions.
First, I don't think these are controversial predictions on LW (yes, a few people might disagree with him, but there is little boldness or disagreement with widely held beliefs in here), but most importantly, these predictions aren't about anything I care about. I don't care whether the world-model will have a single unambiguous self-versus-world boundary, I care whether the system is likely to convert the solar system into some form of computronium, or launch Dyson probes, or eliminate all potential threats and enemies, or whether the system will try to subvert attempts at controlling it, or whether it will try to amass large amounts of resources to achieve its aims, or be capable of causing large controlled effects via small information channels, or is capable of discovering new technologies with great offensive power.
The only bold prediction here is maybe "the behavior of the ASI will be a collection of heuristics", and indeed I would take a bet against this. Systems under reflection and extensive self-improvement stop being well-described by contextual heuristics, and it's likely ASI will both self-reflect and self-improve (as...
For example, agent foundations research sometimes assumes that AGI has infinite compute or that it has a strict boundary between its internal decision processes and the outside world.
It's one of the most standard results in ML that neural nets are universal function approximators. In the context of that proof, ML de-facto also assumes that you have infinite computing power. It's just a standard tool in ML, AI or CS to see what models predict when you take them to infinity. Indeed, it's really one of the most standard tools in the modern math toolbox, used by every STEM discipline I can think of.
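For concreteness, the kind of statement I have in mind is roughly the following (one standard form of the universal approximation theorem; the exact hypotheses differ between versions): for any continuous $f: K \to \mathbb{R}$ on a compact set $K \subset \mathbb{R}^n$ and any fixed non-polynomial activation $\sigma$,

$$\forall \varepsilon > 0 \;\; \exists N,\ \{a_i, b_i \in \mathbb{R},\ w_i \in \mathbb{R}^n\}_{i=1}^{N} \;\text{ such that }\; \sup_{x \in K} \Big| f(x) - \sum_{i=1}^{N} a_i\, \sigma(w_i \cdot x + b_i) \Big| < \varepsilon.$$

The theorem only guarantees that some finite width $N$ exists for each $\varepsilon$; it gives no bound on $N$, and driving $\varepsilon \to 0$ in general requires arbitrarily wide networks, which is the sense in which the proof de-facto assumes unlimited compute.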
Similarly, separating the boundary between internal decision processes and the outside world continues to be a standard assumption in ML. It's really hard to avoid; everything gets very loopy and tricky, and yes, we have to deal with that loopiness and trickiness, but if anything, agent foundations people were the actual people trying to figure out how to handle that loopiness and trickiness, whereas the ML community really has done very little to handle it. Contrary to your statement here, people on LW have for years been pointing out how embedded agency is really important, and been dismissed by active practitioners because they think the cartesian boundary here is just fine for "real" and "grounded" applications like "predicting the next token" which clearly don't have relevance to these weird and crazy scenarios about power-seeking AIs developing contextual awareness that you are talking about.
This post starts from the observation that streetlighting has mostly won the memetic competition for alignment as a research field, and we'll mostly take that claim as given. Lots of people will disagree with that claim, and convincing them is not a goal of this post.
Yep. This post is not for me but I'll say a thing that annoyed me anyway:
... and Carol's thoughts run into a blank wall. In the first few seconds, she sees no toeholds, not even a starting point. And so she reflexively flinches away from that problem, and turns back to some easier problems.
Does this actually happen? (Even if you want to be maximally cynical, I claim presenting novel important difficulties (e.g. "sensor tampering") or giving novel arguments that problems are difficult is socially rewarded.)
Does this actually happen?
Yes, absolutely. Five years ago, people were more honest about it, saying ~explicitly and out loud "ah, the real problems are too difficult; and I must eat and have friends; so I will work on something else, and see if I can get funding on the basis that it's vaguely related to AI and safety".
Over the past few years, a major source of my relative optimism on AI has been the hope that the field of alignment would transition from pre-paradigmatic to paradigmatic, and make much more rapid progress.
At this point, that hope is basically dead. There has been some degree of paradigm formation, but the memetic competition has mostly been won by streetlighting: the large majority of AI Safety researchers and activists are focused on searching for their metaphorical keys under the streetlight. The memetically-successful strategy in the field is to tackle problems which are easy, rather than problems which are plausible bottlenecks to humanity’s survival. That pattern of memetic fitness looks likely to continue to dominate the field going forward.
This post is on my best models of how we got here, and what to do next.
What This Post Is And Isn't, And An Apology
This post starts from the observation that streetlighting has mostly won the memetic competition for alignment as a research field, and we'll mostly take that claim as given. Lots of people will disagree with that claim, and convincing them is not a goal of this post. In particular, probably the large majority of people in the field have some story about how their work is not searching under the metaphorical streetlight, or some reason why searching under the streetlight is in fact the right thing for them to do, or [...].
The kind and prosocial version of this post would first walk through every single one of those stories and argue against them at the object level, to establish that alignment researchers are in fact mostly streetlighting (and review how and why streetlighting is bad). Unfortunately that post would be hundreds of pages long, and nobody is ever going to get around to writing it. So instead, I'll link to:
(Also I might link some more in the comments section.) Please go have the object-level arguments there rather than rehashing everything here.
Next comes the really brutally unkind part: the subject of this post necessarily involves modeling what's going on in researchers' heads, such that they end up streetlighting. That means I'm going to have to speculate about how lots of researchers are being stupid internally, when those researchers themselves would probably say that they are not being stupid at all and I'm being totally unfair. And then when they try to defend themselves in the comments below, I'm going to say "please go have the object-level argument on the posts linked above, rather than rehashing hundreds of different arguments here". To all those researchers: yup, from your perspective I am in fact being very unfair, and I'm sorry. You are not the intended audience of this post, I am basically treating you like a child and saying "quiet please, the grownups are talking", but the grownups in question are talking about you and in fact I'm trash talking your research pretty badly, and that is not fair to you at all.
But it is important, and this post just isn't going to get done any other way. Again, I'm sorry.
Why The Streetlighting?
A Selection Model
First and largest piece of the puzzle: selection effects favor people doing easy things, regardless of whether the easy things are in fact the right things to focus on. (Note that, under this model, it's totally possible that the easy things are the right things to focus on!)
What does that look like in practice? Imagine two new alignment researchers, Alice and Bob, fresh out of a CS program at a mid-tier university. Both go into MATS or AI Safety Camp or get a short grant or [...]. Alice is excited about the eliciting latent knowledge (ELK) doc, and spends a few months working on it. Bob is excited about debate, and spends a few months working on it. At the end of those few months, Alice has a much better understanding of how and why ELK is hard, has correctly realized that she has no traction on it at all, and pivots to working on technical governance. Bob, meanwhile, has some toy but tangible outputs, and feels like he's making progress.
... of course (I would say) Bob has not made any progress toward solving any probable bottleneck problem of AI alignment, but he has tangible outputs and is making progress on something, so he'll probably keep going.
And that's what the selection pressure model looks like in practice. Alice is working on something hard, correctly realizes that she has no traction, and stops. (Or maybe she just keeps spinning her wheels until she burns out, or funders correctly see that she has no outputs and stop funding her.) Bob is working on something easy; he has tangible outputs and feels like he's making progress, so he keeps going and funders keep funding him. How much impact Bob's work has on humanity's survival is very hard to measure, but the fact that he's making progress on something is easy to measure, and the selection pressure rewards that easy metric.
Generalize this story across a whole field, and we end up with most of the field focused on things which are easy, regardless of whether those things are valuable.
Selection and the Labs
Here's a special case of the selection model which I think is worth highlighting.
Let's start with a hypothetical CEO of a hypothetical AI lab, who (for no particular reason) we'll call Sam. Sam wants to win the race to AGI, but also needs an AI Safety Strategy. Maybe he needs the safety strategy as a political fig leaf, or maybe he's honestly concerned but not very good at not-rationalizing. Either way, he meets with two prominent AI safety thinkers - let's call them (again for no particular reason) Eliezer and Paul. Both are clearly pretty smart, but they have very different models of AI and its risks. It turns out that Eliezer's model predicts that alignment is very difficult and totally incompatible with racing to AGI. Paul's model... if you squint just right, you could maybe argue that racing toward AGI is sometimes a good thing under Paul's model? Lo and behold, Sam endorses Paul's model as the Official Company AI Safety Model of his AI lab, and continues racing toward AGI. (Actually the version which eventually percolates through Sam's lab is not even Paul's actual model, it's a quite different version which just-so-happens to be even friendlier to racing toward AGI.)
A "Flinching Away" Model
While selection for researchers working on easy problems is one big central piece, I don't think it fully explains how the field ends up focused on easy things in practice. Even looking at individual newcomers to the field, there's usually a tendency to gravitate toward easy things and away from hard things. What does that look like?
Carol follows a similar path to Alice: she's interested in the Eliciting Latent Knowledge problem, and starts to dig into it, but hasn't really understood it much yet. At some point, she notices a deep difficulty introduced by sensor tampering - in extreme cases it makes problems undetectable, which breaks the iterative problem-solving loop, breaks ease of validation, destroys potential training signals, etc. And then she briefly wonders if the problem could somehow be tackled without relying on accurate feedback from the sensors at all. At that point, I would say that Carol is thinking about the real core ELK problem for the first time.
... and Carol's thoughts run into a blank wall. In the first few seconds, she sees no toeholds, not even a starting point. And so she reflexively flinches away from that problem, and turns back to some easier problems. At that point, I would say that Carol is streetlighting.
It's the reflexive flinch which, on this model, comes first. After that will come rationalizations. Some common variants:
... but crucially, the details of the rationalizations aren't that relevant to this post. Someone who's flinching away from a hard problem will always be able to find some rationalization. Argue them out of one (which is itself difficult), and they'll promptly find another. If we want people to not streetlight, then we need to somehow solve the flinching.
Which brings us to the "what to do about it" part of the post.
What To Do About It
Let's say we were starting a new field of alignment from scratch. How could we avoid the streetlighting problem, assuming the models above capture the core gears?
First key thing to notice: in our opening example with Alice and Bob, Alice correctly realized that she had no traction on the problem. If the field is to be useful, then somewhere along the way someone needs to actually have traction on the hard problems.
Second key thing to notice: if someone actually has traction on the hard problems, then the "flinching away" failure mode is probably circumvented.
So one obvious thing to focus on is getting traction on the problems.
... and in my experience, there are people who can get traction on the core hard problems. Most notably physicists - when they grok the hard parts, they tend to immediately see footholds, rather than a blank impassable wall. I'm picturing here e.g. the sort of crowd at the ILIAD conference; these were people who mostly did not seem at risk of flinching away, because they saw routes to tackle the problems. (To be clear, though ILIAD was a theory conference, I do not mean to imply that it's only theorists who ever have any traction.) And they weren't being selected away, because many of them were in fact doing work and making progress.
Ok, so if there are a decent number of people who can get traction, why do the large majority of the people I talk to seem to be flinching away from the hard parts?
How We Got Here
The main problem, according to me, is the EA recruiting pipeline.
On my understanding, EA student clubs at colleges/universities have been the main “top of funnel” for pulling people into alignment work during the past few years. The mix of people going into those clubs is disproportionately STEM-focused undergrads, and looks pretty typical for STEM-focused undergrads. We’re talking about pretty standard STEM majors from pretty standard schools, neither the very high end nor the very low end of the skill spectrum.
... and that's just not a high enough skill level for people to look at the core hard problems of alignment and see footholds.
Who To Recruit Instead
We do not need pretty standard STEM-focused undergrads from pretty standard schools. In practice, the level of smarts and technical knowledge needed to gain any traction on the core hard problems seems to be roughly "physics postdoc". Obviously that doesn't mean we exclusively want physics postdocs - I personally have only an undergrad degree, though amusingly a list of stuff I studied has been called "uncannily similar to a recommendation to readers to roll up their own doctorate program". Point is, it's the rough level of smarts and skills which matters, not the sheepskin. (And no, a doctorate degree in almost any other technical field, including ML these days, does not convey a comparable level of general technical skill to a physics PhD.)
As an alternative to recruiting people who have the skills already, one could instead try to train people. I've tried that to some extent, and at this point I think there just isn't a substitute for years of technical study. People need that background knowledge in order to see footholds on the core hard problems.
Integration vs Separation
Last big piece: if one were to recruit a bunch of physicists to work on alignment, I think it would be useful for them to form a community mostly-separate from the current field. They need a memetic environment which will amplify progress on core hard problems, rather than... well, all the stuff that's currently amplified.
This is a problem which might solve itself, if a bunch of physicists move into alignment work. Heck, we've already seen it to a very limited extent with the ILIAD conference itself. Turns out people working on the core problems want to talk to other people working on the core problems. But the process could perhaps be accelerated a lot with more dedicated venues.