why we mostly don't need to worry about AI
This topic is poorly understood, and very high confidence is obviously wrong for any claim that's not exceptionally clear. Absence of doom is not such a claim, so the need to worry isn't going anywhere.
Without sufficient clarity, which humanity doesn't possess on this topic, no amount of somewhat confused arguments is sufficient for the kind of certainty that makes the remaining risk of extinction not worth worrying about. It's important to understand and develop what arguments we have, but in their present state they are not suitable for arguing this particular case outside their own assumption-laden frames.
When reunited with unknown unknowns outside their natural frames, such arguments might plausibly make it reasonable to believe the risk of extinction is as low as 10%, or as high as 90%, but nothing more extreme than that. Nowhere across this whole range of epistemic possibilities is a situation that we "mostly don't need to worry about".
Downvote for being absurdly overconfident, and thereby harming the whole direction of more optimism on alignment. I'd downvote Eliezer for the same reason on his 99.99% doom arguments in public; they are visibly silly, making the whole direction seem silly by association.
In both cases, there are too many unknown unknowns to have confidences remotely that high. And you've added way more silly zeros than EY, despite having looser arguments.
This is a really important topic; we need serious discussion of how to really think about alignment difficulty. This is a serious attempt, but it's just not realistically humble. It also seems to be ignoring the cultural norm and explicit stated goal of writing to inform, not to persuade, on LW.
So, I look forward to your next iteration, improved by the feedback on this post!
I believe the security mindset is inappropriate for AI
I think that's because AI today feels like a software project akin to building a website. If it works, that's nice, but if it doesn't work it's no big deal.
Weak systems have safe failures because they are weak, not because they are safe. If you piss off a kitten, it will not kill you. If you piss off an adult tiger...
The optimistic assumptions laid out in this post don't have to fail in every possible case for us to be in mortal danger. They only have to fail in one set of circumstances that someone actualizes. And as long as things keep looking like they are OK, people will continue to push the envelope of risk to get more capabilities.
We have already seen AI developers throw caution to the wind in many ways (releasing weights as open source, connecting AI to the internet, giving it access to a command prompt) and things seem OK for now so I imagine this will continue. We have already seen some psycho behavior from Sydney too. But all these systems are weak reasoners and they don't have a particularly solid grasp on cause and effect in the real world.
We are certainly in a better position with respect to winning than when I started posting on this website. To me the big wins are (1) that safety is a mainstream topic and (2) that the AIs learned English before they learned physics. But I don't regard those as sufficient for human survival.
Do you just like not believe that AI systems will ever become superhumanly strong? That once you really crank up the power (via hardware and/or software progress), you'll end up with something that could kill you?
Read what I wrote above: current systems are safe because they're weak, not safe because they're inherently safe.
Security mindset isn't necessary for weak systems because weak systems are not dangerous.
It's not just about "being taken seriously", although that's a nice bonus - it's also about getting shared understanding about what makes programs secure vs. insecure. You need a method of touching grass so that researchers have some idea of whether or not they're making progress on the real issues.
No, the rigidity is what makes a system error-prone, i.e. brittle. If you don’t specify the solution exactly, the machine won’t solve the problem. Classic computer programs can’t generalize.
The OP makes the point that you can double a model's size and it will still work well, but if you double a computer program's binary size with unused lines of code, you can get all sorts of weird errors, even if none of that extra size is ever used.
An analogy is trying to write a symbolic logic program to emulate an LLM (i.e. with only if statements and for loops), or trying to make a self-driving car with Boolean logic.
If I flip one single bit in a computer program, it will probably catastrophically fail and crash the whole computer. However, removing random weights won’t do much to an LLM.
a little tangent on the flipping a bit:
Flipping a bit in the actual binary itself (the thing the computer reads to run the program) will probably cause the computer to access a part of itself it wasn’t supposed to and immediately crash.
Changing a letter in a computer program that humans write will almost certainly cause the program to not compile.
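To make the contrast above concrete, here is a rough sketch, not a claim about any real LLM or compiler. Everything in it is a toy assumption: the "model" is a single dot product with deliberately uniform, fully redundant weights (a best case for robustness), and the "program" is one line of Python in which one hand-picked character is changed.

```python
import random

def dense_output(weights, inputs):
    """Output of a one-layer toy 'model': a plain dot product."""
    return sum(w * x for w, x in zip(weights, inputs))

def ablate(weights, fraction, rng):
    """Return a copy of `weights` with a random `fraction` of entries zeroed."""
    n_drop = round(len(weights) * fraction)
    dropped = set(rng.sample(range(len(weights)), n_drop))
    return [0.0 if i in dropped else w for i, w in enumerate(weights)]

def compiles(src):
    """True if `src` parses as Python, False if it raises SyntaxError."""
    try:
        compile(src, "<toy>", "exec")
        return True
    except SyntaxError:
        return False

rng = random.Random(0)
n = 10_000
weights = [1.0 / n] * n   # hypothetical, maximally redundant weights
inputs = [1.0] * n

baseline = dense_output(weights, inputs)
ablated = dense_output(ablate(weights, 0.01, rng), inputs)
# Zeroing 1% of these weights shifts the output by about 1%.
relative_change = abs(baseline - ablated) / baseline

source = "def f(x): return x + 1"
mutated = source.replace("(", "[", 1)   # a single-character edit

print(f"relative output change after ablation: {relative_change:.4f}")
print(f"original compiles: {compiles(source)}, mutated compiles: {compiles(mutated)}")
```

The one-character edit was chosen so that it breaks parsing; plenty of other single-character edits (say, inside a string literal or a variable name) would still compile, so this only illustrates the direction of the claim.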
At this point it is not clear to me what you mean by security mindset. I understand by it what Bruce Schneier described in the article I linked, and what Eliezer describes here (which cites and quotes from Bruce Schneier). You have cited QuintinPope, who also cites the Eliezer article, but gets from it this concept of "security mindset": "The bundle of intuitions acquired from the field of computer security are good predictors for the difficulty / value of future alignment research directions". From this and his further words about the concept, he seems to mean something like "programming mindset", i.e. good practice in software engineering. Only if I read both you and him as using "security mindset" to mean that can I make sense of the way you both use the term.
But that is simply not what "security mindset" means. Recall that Schneier's article began with the example of a company selling ant farms by mail order, nothing to do with software. After several more examples, only one of which concerns computers, he gives his own short characterisation of the concept that he is talking about:
...the security mindset involves thinking about how things can be made to fail. It involves thinki
I’m pretty confused about almost everything you said about “innate reward system”.
My view is: the relevant part of the human innate reward system (the part related to compassion, norm-following, etc.) consists of maybe hundreds of lines of code, and nobody knows what they are, and I would feel better if we did. (And that happens to be my own main research interest.)
Whereas your view seems to be: umm, I’m not sure, I’m gonna say things and you can correct me. Maybe you think that (1) the innate reward system is simple, (2) when we do RLHF, we are providing tens of thousands of samples of what the innate reward system would do in different circumstances, (3) and therefore ML will implicitly interpolate how the innate reward system works from that data, (4) …and this will continue to extrapolate to norm-following behavior etc. even in out-of-distribution situations like inventing new society-changing technology. Is that right? (I’m stating this possible argument without endorsing or responding to it, I’m still at the trying-to-understand-you phase.)
On the topic of security mindset, the thing that the LW community calls "security mindset" isn't even an accurate rendition of what computer security people would call security mindset. As noted by lc, actual computer security mindset is POC || GTFO, or trying to translate that into lesswrongese, you do not have warrant to believe in something until you have an example of the thing you're maybe worried about being a real problem because you are almost certain to be privileging the hypothesis.
In the cybersecurity analogy, it seems like there are two distinct scenarios being conflated here:
1) Person A says to Person B, "I think your software has X vulnerability in it." Person B says, "This is a highly specific scenario, and I suspect you don't have enough evidence to come to that conclusion. In a world where X vulnerability exists, you should be able to come up with a proof-of-concept, so do that and come back to me."
2) Person B says to Person A, "Given XYZ reasoning, my software almost certainly has no critical vulnerabilities of any kind. I'm so confident, I give it a 99.99999%+ chance." Person A says, "I can't specify the exact vulnerability your software might have without it in front of me, but I'm fairly sure this confidence is unwarranted. In general it's easy to underestimate how your security story can fail under adversarial pressure. If you want, I could name X hypothetical vulnerability, but this isn't because I think X will actually be the vulnerability, I'm just trying to be illustrative."
Story 1 seems to be the case where "POC or GTFO" is justified. Story 2 seems to be the case where "security mindset" is justified.
It's very different to suppose a particula...
At the very least I think it would be more accurate to say “one aspect of actual computer security mindset is POC || GTFO”. Right? Are you really arguing that there’s nothing more to it than that?? That seems insane to me.
Even leaving that aside, here’s a random bug thread:
Mozilla developers identified and fixed several stability bugs in the browser engine used in Firefox and other Mozilla-based products. Some of these crashes showed evidence of memory corruption under certain circumstances and we presume that with enough effort at least some of these could be exploited to run arbitrary code. [emphasis added]
IIUC they treated these crashes as a security vulnerability, not a mere usability problem, and thus did things like not publicly disclosing the details until they had a fix ready to go, categorizing the fix as a high-priority security update, etc.
If your belief is that “actual computer security mindset is POC||GTFO”, then I think you’d have to say that these Mozilla developers do not have computer security mindset, and instead were being silly and overly paranoid. Is that what you think?
You're right that this is definitely not "security mindset". Iceman is distorting the point of the original post. But also, the reason Mozilla's developers can do that and get public credit for it is partially because the infosec community has developed tens of thousands of catastrophic RCEs from very similar exploit primitives, and so there is loads of historical evidence that those particular kinds of crashes lead to exploitable bugs. Alignment researchers lack the same shared understanding - they're mostly philosopher-mathematicians with no consensus even among themselves about what the real issues are, and so if one tries to claim credit for averting catastrophe in a similar situation it's impossible to tell if they're right.
POC || GTFO is not "security mindset", it's a norm. It's like science in that it's a social technology for making legible intellectual progress on engineering issues, and allows the field to parse who is claiming to notice security issues to signal how smart they are vs. who is identifying actual bugs. But a lack of "POC || GTFO" culture doesn't tell you that nothing is wrong, and demanding POCs for everything obviously doesn't mean you understand what is and isn't secure. Or to translate that into lesswrongese, reversed stupidity is not intelligence.
Citation needed? The one computer security person I know who read Yudkowsky's post said it was a good description of security mindset. POC||GTFO sounds useful and important too but I doubt it's the core of the concept.
Also, if the toy models, baby-AGI-setups like AutoGPT, and historical examples we've provided so far don't meet your standards for "example of the thing you're maybe worried about" with respect to AGI risk, (and you think that we should GTFO until we have an example that meets your standards) then your standards are way too high.
If instead POC||GTFO applied to AGI risk means "we should try really hard to get concrete, use formal toy models when possible, create model organisms to study, etc." then we are already doing that and have been.
On POCs for misalignment, specifically for goal misgeneralization, there are pretty fundamental differences between what was shown and what was predicted so far, and one of them is that the train and test behavior in different environments are similar or the same, while in goal misgeneralization speculations, the train and test behavior are wildly different:
Rohin Shah has a comment on why most POCs aren't that great here:
For white box vs black box, after further discussion I wound up feeling like people just use the term “black box” differently in different fields, and in practice maybe I’ll just avoid “black box” and “white box” going forward. Hopefully we can all agree on:
If an LLM outputs A rather than B, and you ask me why, then it might take me decades of work to give you a reasonable & intuitive answer.
And likewise we can surely all agree that future AI programmers will be able to see the weights and perform SGD.
This whole post seems to be about accident risk, under the assumption that competent programmers are trying in good faith to align AI to “human values”. It’s fine for you to write a blog post on that—it’s an important and controversial topic! But it’s a much narrower topic than “AI safety”, right? AI safety includes lots of other things too—like bad actors, or competitive pressures to make AIs that are increasingly autonomous and increasingly ruthless, or somebody making ChaosGPT just for the lols, etc. etc.
One can argue that algorithmic & hardware improvements will never ever be enough to put human-genius-level human-speed AGI in the hands of tons of ordinary people e.g. university students with access to a cluster.
Or, one can argue that tons of ordinary people will get such access sooner or later, but meanwhile large institutional actors will have super-duper-AGIs, and they will use them to make the world resilient against merely human-genius-level-chaosGPTs, somehow or other.
Or, one can argue that ordinary people will never be able to do stupid things with human-genius-level AGIs because the government (or an AI singleton) will go around confiscating all the GPUs in the world or monitoring how they’re used with a keylogger and instant remote kill-switch or whatever.
As it happens, I’m pretty pessimistic about all of those things, and therefore I do think lols are a legit concern.
(Also, “just for the lols” is not the only way to get ChaosGPT; another path is “We should do this to better understand and study possible future threats”, but then fail to contain it. Large institutions could plausibly do that. If you disagree—if you’re thinking “nobody would be so stupid as to do that”—note the existence of gain-of-function research, lab leaks, etc. in biology.)
I've upvoted this post because it's a good collection of object-level, knowledgeable, serious arguments, even though I disagree with most of them and strongly disagree with the bottom line conclusion.
There is a good analogy between genetic brain evolution and technological AGI evolution. In both cases there is a clear bi-level optimization, with the inner optimizer using a very similar UL/RL intra-lifetime SGD (or SGD-like) algorithm.
The outer optimizer of genetic evolution is reasonably similar to the outer optimizer of technological evolution. The recipe which produces an organic brain is a highly compressed encoding or low frequency prior on the brain architecture along with a learning algorithm to update the detailed wiring during lifetime training. The genes which encode the brain architectural prior and learning algorithms are very close analogically to the 'memes' which are propagated/exchanged in ML papers and encode AI architectural prior and learning algorithms (ie the initial pytorch code etc).
The key differences are mainly just that memetic evolution is much faster - like an amplified artificial selection and genetic engineering process. For tech evolution a large number of successful algorithm memes from many different past experiments can be flexibly recombined in a single new experiment, and the process guiding this recombination and selection is itself runni...
I definitely think that LW might not realize that AI is on an S-curve right now.
AI is obviously on an S-curve, since eventually you run out of energy to feed into the system. But the top of that S-curve is so far beyond human intelligence, that this fact is basically irrelevant when considering AI safety.
The arguments about fundamental limits of computation (halting problem, etc.) are also irrelevant for similar reasons. Humans can’t even solve BB(6).
I just saw this post and cannot parse it at all. You first say that you have removed the 9s of confidence. Then the next paragraph talks about a 99.9… figure. Then there are edit and quote paragraphs, and I do not know whether these are your views or someone else's, or whether you endorse them.
I believe getting Friendly AI is really really likely, closer to 99.99999%+ of the time
I think it'd make sense to clarify what you mean here, since the following are very different:
I assume you mean something more like the latter.
In that case it'd probably be useful to give a sense of your actual confidence in the 99.99999%+ claim.
"Mostly don't need to worry" would imply extremely high confidence.
Or do you mean something like "In most worlds it'll be clear in retrospect that we needn't have worried"?
Ok, well thanks for clarifying.
I'd assumed you meant the second.
Some reasons I think that this confidence level is just plain silly (not an exhaustive list!):
that the reason humans generalized correctly to having human values and didn't just trick their reward system isn't that special
This is a tautology, not an example of successful alignment:
Humans trick their reward systems as much as humans trick their reward systems.
Imagine a case where we did "trick our reward system". In such a case the human values we'd infer would be those that we'd infer from all the actions we were taking - including the actions that were "tricking our reward system".
We would then observe that we'd generalized entirely correctly with respect to the values we inferred. From this we learn that things tend to agree with themselves. This tells us precisely nothing about alignment.
I note for clarity that it occurs to me to say:
Indeed we do observe some humans doing what most of us would think of as tricking their reward systems (e.g. self-destructive drug addictions).
You may respond "Ah, but that's a small proportion of people - most people don't do that!" - at which point we're back to tautology: what most people do will determine what is meant by "human values". Most people are normal, since that's how 'normal' is defined.
The only possible evidence I could provi...
I don't think it's accidental - it seems to me that the tautology accurately indicates where you're confused.
"generalised correctly" makes an equivalent mistake: correctly compared to what? Most people generalise according to the values we infer from the actions of most people? Sure. Still a tautology.
Regarding security mindset, I think that where it really kicks in is when you have a system utilising its intelligence to work around any limitations, such that you're no longer looking at a "broad, reasonable" distribution of space, but at a "very specific" scenario that a powerful optimiser has pushed you towards. In that case, doing things like doubling the size may break your safety schemes if the AI now has the intelligence to get around them.
fe, but I feel like a lot of Lesswrongers are probably wrong in their assumption that AI progress will continue as it had after 2030,
Who thinks that? I don't think that. Ajeya doesn't think that.
In particular, the detection mechanisms for mesa-optimizers are intact, but we do need to worry about 1 new potential inner misalignment pathway.
I'm going to read this as "...1 new potential gradient hacking pathway" because I think that's what the section is mainly about. (It appears to me that throughout the section you're conflating mesa-optimization with gradient hacking, but that's not the main thing I want to talk about.)
The following quote indicates at least two potential avenues of gradient hacking: "In an RL context", "supervised learning with ada...
Thanks a lot for writing that post.
One question I have regarding fast takeoff is: don't you expect learning algorithms much more efficient than SGD to show up and accelerate a lot the rate of development of capabilities?
One "overhang" I can see is the fact that humans have written down on the internet a lot of what they know about how to do all kinds of tasks, so a pretty data-efficient algorithm could just leverage this and fairly suddenly learn a ton of tasks quite rapidly. For instance, in-context learning is way more data-efficient than SGD in pre-training. Right no...
r one particular example, you can randomly double your training data, or the size of the model, and it will work usually just fine. A rocket would explode if you tried to double the size of your fuel tanks.
The analogy was about the alignment problem, not the capabilities problem.
A rocket won't get to the moon if you randomly double one of the variables used to navigate, like the amount of thrust applied in maneuvers or the angle of attack. (well, not unless you've built in good error-correction and redundancy etc.)
Good to see your point of view. The old arguments about AI doom are not convincing to me anymore. However, getting alignment 100% right, whatever that means, in no way guarantees a positive Singularity.
Should we be talking about concrete plans about that now? For example I believe with a slow takeoff if we don't get Neuralink or mind uploading, then our P(doom) -> 1 as the Super AI gets ever more ahead of us. The kind scenarios I can see
Have you uploaded a new version of this article? I have just been reading elsewhere about goal misgeneralisation and the shutdown problem, so I'd be really interested to read the new version of this article.
Thanks for writing this! I strongly appreciate a well-thought-out post in this direction.
My own level of worry is pretty dependent on a belief that we know and understand shaping NN behaviors much better than we do (values/goals/motivations/desires) (although I don't think eg chatGPT has any of the latter in the first place). Do you have thoughts on the distinction between behaviors and goals? In particular, do you feel like you have any evidence we know how to shape/create/guide goals and values, rather than just behaviors?
Arguments about inner misalignment work as arguments for optimism only inside the "outer/inner alignment" framework, in its deep learning version. If we have a good outer loss function, such that closer to the minimum means better, then yes, our worries should be about weird inner misalignment issues. But we don't have a good outer loss function, so we kinda should hope for inner misalignment.
Evolution mostly can't transmit any bits from one generation to the next generation via genetic knowledge, or really any other way
http://allmanlab.caltech.edu/biCNS217_2008/PDFs/Meaney2001.pdf
Or, why we probably don't need to worry about AI.
So this post is partially a response to Amalthea's comment on how I simply claimed that my side is right, and I responded by stating that I was going for a short comment rather than having to make another very long comment on the issue.
https://www.lesswrong.com/posts/aW288uWABwTruBmgF/?commentId=r7s9JwqP5gt4sg4HZ#r7s9JwqP5gt4sg4HZ
This is the post where I won't try to claim that my side is right, and instead give evidence so I can properly collect my thoughts here. This will be a link-heavy post, and I'll reference a lot of concepts and conversations, so it will help if you have some light background on these ideas, but I will try to make everything intelligible to the lay/non-technical person.
This will be a long post, so get a drink and a snack.
Nate Soares suggests that a critical problem in AI safety is the sharp left turn, and the sharp left turn essentially is that capabilities generalize much more than the goals, ie it is basically goal misgeneralization plus fast takeoff:
My guess for how AI progress goes is that at some point, some team gets an AI that starts generalizing sufficiently well, sufficiently far outside of its training distribution, that it can gain mastery of fields like physics, bioengineering, and psychology, to a high enough degree that it more-or-less singlehandedly threatens the entire world. Probably without needing explicit training for its most skilled feats, any more than humans needed many generations of killing off the least-successful rocket engineers to refine our brains towards rocket-engineering before humanity managed to achieve a moon landing.
And in the same stroke that its capabilities leap forward, its alignment properties are revealed to be shallow, and to fail to generalize. The central analogy here is that optimizing apes for inclusive genetic fitness (IGF) doesn't make the resulting humans optimize mentally for IGF. Like, sure, the apes are eating because they have a hunger instinct and having sex because it feels good—but it's not like they could be eating/fornicating due to explicit reasoning about how those activities lead to more IGF. They can't yet perform the sort of abstract reasoning that would correctly justify those actions in terms of IGF. And then, when they start to generalize well in the way of humans, they predictably don't suddenly start eating/fornicating because of abstract reasoning about IGF, even though they now could. Instead, they invent condoms, and fight you if you try to remove their enjoyment of good food (telling them to just calculate IGF manually). The alignment properties you lauded before the capabilities started to generalize, predictably fail to generalize with the capabilities.
So essentially the analogy is that the AI is aligned on the training data but, due to the limitations of the alignment method, fails to generalize to the test set.
Here's the problem: We actually know why the sharp left turn happened, and the circumstances that led to the sharp left turn in humans won't reappear in AI training and AI progress.
Basically, the sharp left turn happened because the outer optimizer of evolution was billions of times less powerful than the inner search process like human lifetime learning, and the inner learners like us humans die after basically a single step, or at best 2-3 steps of the outer optimizer. Evolution mostly can't transmit as many bits from one generation to the next via its tools as cultural evolution can, and the difference between their abilities to transmit bits over certain time-scales is massive.
Once we had the ability to transmit some information via culture, that meant that, given our ability to optimize billions of times more efficiently, we could essentially undergo a sharp left turn where capabilities spiked. But the only reason this happened was, to quote Quintin Pope:
Once the inner learning processes become capable enough to pass their knowledge along to their successors, you get what looks like a sharp left turn. But that sharp left turn only happens because the inner learners have found a kludgy workaround past the crippling flaw where they all get deleted shortly after initialization.
This does not exist for AIs trained with SGD, and there is a much smaller gap between the outer optimizer SGD and the inner optimizer, with the difference being ~0-40x.
Here's the source for it below, and I'll explicitly quote it:
See also: Model Agnostic Meta Learning proposed a bi-level optimization process that used between 10 and 40 times more compute in the inner loop, only for Rapid Learning or Feature Reuse? to show they could get about the same performance while removing almost all the compute from the inner loop, or even by getting rid of the inner loop entirely.
Also, we can set the ratio of outer to inner optimization steps to basically whatever we want, which means that we can control the inner learner's rates of learning far better than evolution, meaning we can prevent a sharp left turn from happening.
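As a rough illustration of what "setting the ratio of outer to inner optimization steps" means in practice, here is a minimal bi-level loop. It is a sketch under loud assumptions: the task is a made-up one-dimensional quadratic, the outer update is a simple first-order, Reptile-like step rather than MAML itself, and the function names and learning rates are all hypothetical. The only point is that `inner_steps` is an ordinary argument the programmer chooses, including zero.

```python
def inner_adapt(theta, target, inner_steps, inner_lr):
    """Inner learner: `inner_steps` gradient steps on the loss (phi - target)**2."""
    phi = theta
    for _ in range(inner_steps):
        grad = 2.0 * (phi - target)
        phi -= inner_lr * grad
    return phi

def train(targets, outer_steps, inner_steps, inner_lr=0.1, outer_lr=0.5):
    """Outer optimizer: move the shared initialization toward the adapted value.

    Returns the final initialization and the total number of inner updates,
    so the outer-to-inner ratio is explicit and fully under our control.
    """
    theta = 0.0
    inner_updates = 0
    for step in range(outer_steps):
        target = targets[step % len(targets)]   # cycle through toy tasks
        phi = inner_adapt(theta, target, inner_steps, inner_lr)
        inner_updates += inner_steps
        theta += outer_lr * (phi - theta)       # first-order outer step
    return theta, inner_updates

for k in (0, 1, 10, 40):
    theta, used = train([1.0, 3.0], outer_steps=100, inner_steps=k)
    print(f"inner_steps={k:2d}  inner updates={used:4d}  final init={theta:.3f}")
```

With `inner_steps=0` the inner loop is removed entirely and the initialization never moves in this toy, which says nothing about the Rapid Learning or Feature Reuse? result quoted above; it only shows that the ratio is a knob rather than a fact of nature.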
A crux I have with Jan Kulveit is that, to the extent that animals do have culture, it is much more limited than human culture, and that evolution has little ability to pass on traits non-culturally. Very critically, this is a one-time inefficiency: there is no reason to assume a second source of massive inefficiency leading to a sharp left turn.
X4vier in particular illustrates this, and I'll show it below:
https://www.lesswrong.com/posts/hvz9qjWyv8cLX9JJR/?commentId=qYFkt2JRv3WzAXsHL
https://www.lesswrong.com/posts/hvz9qjWyv8cLX9JJR/?commentId=vETS4TqDPMqZD2LAN
This is because the alleged misgeneralization was not a situation where one AI was trained in an environment and maximized the correlates of IGF, then encountered inputs in a new environment that changed its goals such that it misgeneralized and no longer pursued IGF.
What happened is that evolution trained humans in one environment to optimize the correlates of IGF, then basically trained new humans in another environment, and they diverged.
Very critically, there were thousands of different systems/humans being trained in drastically different environments, not one AI being trained on different environments like in modern AI training, so it's not a valid example of misgeneralization.
Some posts and quotes from Quintin Pope will help:
(Part 2, how this matters for analogies from evolution) Many of the most fundamental questions of alignment are about how AIs will generalize from their training data. E.g., "If we train the AI to act nicely in situations where we can provide oversight, will it continue to act nicely in situations where we can't provide oversight?"
When people try to use human evolutionary history to make predictions about AI generalizations, they often make arguments like "In the ancestral environment, evolution trained humans to do X, but in the modern environment, they do Y instead." Then they try to infer something about AI generalizations by pointing to how X and Y differ.
However, such arguments make a critical misstep: evolution optimizes over the human genome, which is the top level of the human learning process. Evolution applies very little direct optimization power to the middle level. E.g., evolution does not transfer the skills, knowledge, values, or behaviors learned by one generation to their descendants. The descendants must re-learn those things from information present in the environment (which may include demonstrations and instructions from the previous generation).
This distinction matters because the entire point of a learning system being trained on environmental data is to insert useful information and behavioral patterns into the middle level stuff. But this (mostly) doesn't happen with evolution, so the transition from ancestral environment to modern environment is not an example of a learning system generalizing from its training data. It's not an example of:
We trained the system in environment A. Then, the trained system processed a different distribution of inputs from environment B, and now the system behaves differently.
It's an example of:
We trained a system in environment A. Then, we trained a fresh version of the same system on a different distribution of inputs from environment B, and now the two different systems behave differently.
These are completely different kinds of transitions, and trying to reason from an instance of the second kind of transition (humans in ancestral versus modern environments), to an instance of the first kind of transition (future AIs in training versus deployment), will very easily lead you astray.
Two different learning systems, trained on data from two different distributions, will usually have greater divergence between their behaviors, as compared to a single system which is being evaluated on the data from the two different distributions. Treating our evolutionary history like humanity's "training" will thus lead to overly pessimistic expectations regarding the stability and predictability of an AI's generalizations from its training data.
Drawing correct lessons about AI from human evolutionary history requires tracking how evolution influenced the different levels of the human learning process. I generally find that such corrected evolutionary analogies carry implications that are far less interesting or concerning than their uncorrected counterparts. E.g., here are two ways of thinking about how humans came to like ice cream:
If we assume that humans were "trained" in the ancestral environment to pursue gazelle meat and such, and then "deployed" into the modern environment where we pursued ice cream instead, then that's an example where behavior in training completely fails to predict behavior in deployment.
If there are actually two different sets of training "runs", one set trained in the ancestral environment where the humans were rewarded for pursuing gazelles, and one set trained in the modern environment where the humans were rewarded for pursuing ice cream, then the fact that humans from the latter set tend to like ice cream is no surprise at all.
In particular, this outcome doesn't tell us anything new or concerning from an alignment perspective. The only lesson applicable to a single training process is the fact that, if you reward a learner for doing something, they'll tend to do similar stuff in the future, which is pretty much the common understanding of what rewards do.
A comment by Quintin on why humans didn't actually misgeneralize to liking ice cream:
https://www.lesswrong.com/posts/wAczufCpMdaamF9fy/?commentId=sYA9PLztwiTWY939B
Edit from comments due to Steven Byrnes: The white-box definition I'm using in this post does not correspond to the intuitive definition of a white box, and instead refers to the computer analysis/security sense of the term.
These links will be the definitions of white box AI going forward for this post:
https://forum.effectivealtruism.org/posts/JYEAL8g7ArqGoTaX6/?commentId=CLi5eBchYfXKZvXuD
The arguments above, on why the Sharp Left Turn probably won't reappear in modern AI development and why humans didn't actually misgeneralize, are enough to move us away from the most doomy views, like those of Eliezer Yudkowsky. In particular, removing the reasons to assume extreme misgeneralization lands us outside MIRI-sphere views, and arguably below 50% p(doom). But I want to argue that the chance of doom is far lower than that, so low that we mostly shouldn't be concerned about AI. That requires a positive story for why AIs are very likely to be aligned, and my story is that AIs are white boxes and that we play the role of the innate reward system.
The key advantage we have over evolution is that, unlike with brains, we have full read-write access to AIs' internals. They're essentially a special type of computer program, and we already have ways to manipulate computer programs at essentially no cost to us. Indeed, this is why SGD and backpropagation work at all to optimize neural networks. If the AI were a black box, SGD and backpropagation wouldn't work.
The innate reward system aligns us via white-box methods, and the values that the reward system imprints on us are ridiculously reliable: almost every human has empathy for friends and acquaintances, parental instincts, a desire for revenge, etc.
This is shown in the link below:
(Here, we must take a detour and say that our reward system is ridiculously good at aligning us to survive. Flaws like obesity in the modern world are usually surprisingly mild failures, where the human simply isn't as capable as we thought. This arguably implies that alignment failures in practice will look much more like capabilities failures. Passing the analogy back to the AI case, I basically don't expect X-risk, GCRs, or really anything more severe than, say, the AI messing up a kitchen.)
Steven Byrnes raised the concern that if you don't know how to do the manipulation, then it does cost you to gain the knowledge.
Steven Byrnes's comment is linked here: https://forum.effectivealtruism.org/posts/JYEAL8g7ArqGoTaX6/?commentId=3xxsumjgHWoJqSzqw
Nora Belrose responded on what white boxing meant, as well as how people use SGD to automate the search so that the cost of manipulation in an overall sense is as low as possible:
https://twitter.com/norabelrose/status/1709603325078102394
I mean it in the computer security sense, where it refers to the observability of the source code of a program (Nora Belrose)
https://twitter.com/norabelrose/status/1709606248314998835
We can do better than IDA Pro & Ghidra by exploiting the differentability of neural nets, using SGD to locate the manipulations of NN weights that improve alignment the most
I’d be much more worried if we didn’t have SGD and were just evolving AGI in a sim or smth (Nora Belrose)
https://twitter.com/norabelrose/status/1709601025286635762
I’m pointing out that it’s a white box in the very literal sense that you can observe and manipulate everything that’s going on inside, and this is a far from trivial fact because you can’t do this with other systems we routinely align like humans or animals. (Nora Belrose)
https://twitter.com/norabelrose/status/1709603731413901382
No, I don’t agree this is a weakening. In a literal sense it is zero cost to analyze and manipulate the NNs. It may be greater than zero cost to come up with manual manipulations that achieve some goal. But that’s why we automate the search for manipulations using SGD (Nora Belrose)
Steven Byrnes argues that this could be due to differing definitions:
https://twitter.com/steve47285/status/1709655473941631430
I think that’s a black box with a button on the front panel that says “SGD”. We can talk all day about all the cool things we can do by pressing the SGD button. But it's still a button outside the box, metaphorically.
To me, “white box” would mean: If an LLM outputs A rather than B, and you ask me why, then I can always give you a reasonable answer. I claim that this is closer to how that term is normally used in practice.
(Yes I know, it’s not literally a button, it’s an input-output interface that also changes the black box internals.) (Steven Byrnes)
I've included this response chain to show why Nora Belrose and Steven Byrnes were disagreeing.
I ultimately think a potential difference is that, for alignment purposes, humans vs AI is not a very useful abstraction, and SGD vs the inner optimizer is the better one. So it doesn't matter whether AI progresses in general; what matters is the specific progress of humans + SGD relative to the inner optimizer, and on that axis the cost of manipulating AI values is quite low.
This leads to...
In general, a common disagreement with a lot of LWers is that there is very limited transfer of knowledge from the computer security field to AI, because AI is very different in ways that make the analogies inappropriate.
For one particular example, you can randomly double your training data, or the size of the model, and it will usually work just fine. A rocket would explode if you tried to double the size of its fuel tanks.
All of this and more is explained by Quintin below, but there are several big disanalogies between the AI field and the computer security field, so much so that I think that ML/AI is a lot like quantum mechanics, where we shouldn't port intuitions from other fields and expect them to work because of the weirdness of the domain:
https://www.lesswrong.com/posts/wAczufCpMdaamF9fy/#Yudkowsky_mentions_the_security_mindset__
Similarly, I think that machine learning is not really like computer security, or rocket science (another analogy that Yudkowsky often uses). Some examples of things that happen in ML that don't really happen in other fields:
Models are internally modular by default. Swapping the positions of nearby transformer layers causes little performance degradation.
Swapping a computer's hard drive for its CPU, or swapping a rocket's fuel tank for one of its stabilization fins, would lead to instant failure at best. Similarly, swapping around different steps of a cryptographic protocol will usually make it output nonsense. At worst, it will introduce a crippling security flaw. For example, password salts are added before hashing the passwords. If you switch to adding them after, this makes salting near useless.
We can arithmetically edit models. We can finetune one model for many tasks individually and track how the weights change with each finetuning to get a "task vector" for each task. We can then add task vectors together to make a model that's good at multiple of the tasks at once, or we can subtract out task vectors to make the model worse at the associated tasks.
Randomly adding / subtracting extra pieces to either rockets or cryptosystems is playing with the worst kind of fire, and will eventually get you hacked or exploded, respectively.
We can stitch different models together, without any retraining.
The rough equivalent for computer security would be to have two encryption algorithms A and B, and a plaintext X. Then, midway through applying A to X, switch over to using B instead. For rocketry, it would be like building two different rockets, then trying to weld the top half of one rocket onto the bottom half of the other.
Things often get easier as they get bigger. Scaling models makes them learn faster, and makes them more robust.
This is usually not the case in security or rocket science.
You can just randomly change around what you're doing in ML training, and it often works fine. E.g., you can just double the size of your model, or of your training data, or change around hyperparameters of your training process, while making literally zero other adjustments, and things usually won't explode.
Rockets will literally explode if you try to randomly double the size of their fuel tanks.
I don't think this sort of weirdness fits into the framework / "narrative" of any preexisting field. I think these results are like the weirdness of quantum tunneling or the double slit experiment: signs that we're dealing with a very strange domain, and we should be skeptical of importing intuitions from other domains.
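To make the "arithmetically edit models" item in the list above a bit more concrete, here is a minimal sketch of task-vector arithmetic on plain Python lists. Everything in it is an assumption for illustration: the weights are made-up numbers standing in for flattened model parameters, and the helper names `task_vector` and `apply_vectors` are hypothetical, not taken from any library or from the work Quintin refers to.

```python
# Toy illustration of task-vector arithmetic.
# All weights are invented numbers standing in for flattened model parameters.

def task_vector(finetuned, base):
    """Elementwise difference: what finetuning changed relative to the base."""
    return [f - b for f, b in zip(finetuned, base)]

def apply_vectors(base, vectors, scale=1.0):
    """Add (scale > 0) or subtract (scale < 0) task vectors from the base weights."""
    edited = list(base)
    for vec in vectors:
        edited = [w + scale * v for w, v in zip(edited, vec)]
    return edited

base = [0.0, 1.0, -1.0, 0.5]
finetuned_a = [0.5, 1.0, -1.0, 0.5]  # hypothetical checkpoint finetuned on task A
finetuned_b = [0.0, 1.0, -0.5, 0.5]  # hypothetical checkpoint finetuned on task B

vec_a = task_vector(finetuned_a, base)
vec_b = task_vector(finetuned_b, base)

# Adding both vectors gives weights that carry both finetuning changes.
multi_task = apply_vectors(base, [vec_a, vec_b])
# Subtracting a vector moves the weights away from task A.
forget_a = apply_vectors(base, [vec_a], scale=-1.0)
```

Whether the edited weights actually perform well is an empirical question about real networks that this toy cannot answer. It only shows how cheap the editing operation itself is.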
I also believe there is an epistemic difference between computer security and alignment: in computer security, there's an easy-to-check ground truth for whether a cryptosystem is broken, whereas in AI alignment, we can't get that kind of feedback on proposed breaks of alignment schemes.
For more, see the section of Quintin's post on the epistemic difference between AI safety and computer security, which includes a worked example of an attempted break, where there is suggestive evidence that inner-misaligned models/optimization daemons go away as we increase the number of dimensions.
(There, Quintin Pope discusses the fact that alignment doesn't have good ground-truth feedback loops on what counts as a successful break, and gives the example of a claimed break that actually went away as the number of dimensions was scaled up. Note that the disconfirming evidence was more realistic than the attempted break.)
This is why I disagreed with Jeffrey Ladish about the security mindset on Twitter: I believe it's a trap for those without technical knowledge, like a lot of LWers, and there are massive differences between AI and computer security that mean most attempted connections fail.
https://twitter.com/JeffLadish/status/1712262020438131062
uh I guess I hope he reads enough to internalize the security mindset?? (Jeffrey Ladish)
https://twitter.com/SharmakeFarah14/status/1712264530829492518
I generally tend to think the security mindset is a trap, because ML/AI alignment is very different from rocket engineering or cybersecurity.
For a primer on why, read @QuintinPope5's post section on it:
So now that I've tried to show why porting over the security mindset is flawed, I want to talk about a class of adversaries, gradient hackers or inner-misaligned mesa-optimizers, and why I believe it is actually very difficult for them to work against SGD. Even with the non-platonic-ideal version of SGD used in practice, we can detect most mesa-optimizers quite easily.
I'll be taking the inner misalignment definition from Evan Hubinger's post The Inner Alignment Problem:
https://www.lesswrong.com/posts/pL56xPoniLvtMDQ4J/the-inner-alignment-problem
We refer to this problem of aligning mesa-optimizers with the base objective as the inner alignment problem.
The basic reason it's hard for a misaligned mesa-optimizer to stick around for long is that gradient descent is in fact much more powerful and white-boxy than people realize. In particular, it has 5 defenses that any mesa-optimizer would need to overcome in order to misalign the model:
https://www.lesswrong.com/posts/w2TAEvME2yAG9MHeq/gradient-hacking-is-extremely-difficult
1.) Everything is optimized simultaneously and independently. The exact local sensitivity of each and every parameter to the loss is computed in the backward pass and every parameter is updated based on this. This means it is impossible for a mesaoptimizer or gradient hacker (as long as it is differentiably connected to the network) to shield any part of itself from gradient updates. This includes the gradient hacker's internal circuitry, its connections with other parts of the network, its connections to the network's input data, and the connections between the rest of the network and the loss.
2.) Gradient descent is maximal. Each parameter is updated in the direction of steepest descent. This means that, unlike biological life, gradient hackers cannot 'burn' something to increase entropy to be traded off against decreasing entropy elsewhere. Only the local maximally increasing entropy direction is allowed.
3.) It is stateless. Each parameter is updated according to its local sensitivity to the loss for each datapoint. Credit is not propagated across time. It is impossible to 'bargain' with gradient descent by increasing loss now but decreasing loss more later, or having decreased loss in the past.
4.) It is first-order only. It only computes partial derivatives of the parameter to the loss. It does not take into account potential second order interactions. While bad in some cases, this makes gradient descent immune to 'blackmail' or 'tripwires' setup by mesaoptimizers where optimizing some part of their circuit would potentially cause a huge loss as a second order effect. GD will blithely trigger the tripwire and then optimize away the consequences. This also disallows 'trade' where a gradient hacker could cause low loss on one causal branch in exchange for high loss on some other branch. Gradient descent will instead sum the steepest descent direction on both branches.
5.) It is infinitesimal. GD only computes the infinitesimal local sensitivities of each parameter regardless of the actual learning rate or step-size in practice. This means gradient descent does not 'see' if you are standing next to a giant cliff, so cannot be blackmailed in this way either.
Basically, it will optimize the entire causal graph and leave no slack, and as a bonus it is extremely resistant to blackmail by mesa-optimizers. In general, a big part of my optimism around inner alignment is that SGD is extraordinarily good at credit assignment, and it has quite strong correction features in case a mesa-optimizer does attempt to misalign the model.
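Points 4 and 5 in the list above can be illustrated with a toy 1-D example. This is only a sketch under my own assumptions: the loss functions, the "tripwire" penalty, and the learning rates are all invented, and a real gradient hacker inside a network would look nothing like this. What it shows is that the update at a given point depends only on the local first-order gradient, so a steep penalty further along the path does not change the step.

```python
# Toy 1-D illustration: gradient descent uses only the local first-order
# gradient, so a "tripwire" further along the path does not change the step.

def grad_plain(w):
    # Gradient of the plain loss (w - 2)^2.
    return 2.0 * (w - 2.0)

def grad_tripwire(w):
    # Same loss plus a steep penalty 100 * (w - 1)^2 that only switches on for w > 1.
    penalty_grad = 200.0 * (w - 1.0) if w > 1.0 else 0.0
    return 2.0 * (w - 2.0) + penalty_grad

def gd_step(w, grad_fn, lr):
    # One plain gradient descent update.
    return w - lr * grad_fn(w)

w0 = 0.0
step_plain = gd_step(w0, grad_plain, lr=0.4)
step_tripwire = gd_step(w0, grad_tripwire, lr=0.4)
# Both steps are identical: at w0 the tripwire is invisible to the gradient,
# so the update walks straight past w = 1 and triggers it.

# Having triggered the tripwire, gradient descent just keeps descending
# on the new local gradients (with a smaller learning rate, for stability).
w = step_tripwire
for _ in range(200):
    w = gd_step(w, grad_tripwire, lr=0.004)

# Minimum of the penalized loss in the region w > 1: 202 * w - 204 = 0.
penalized_minimum = 204.0 / 202.0
```

In this toy, the tripwire changes where the optimizer ends up, but it never deters the step that triggers it, which is the sense in which GD "will blithely trigger the tripwire and then optimize away the consequences".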
We also can detect most mesa-optimizers in the AI without the need for mechanistic interpretability, like so:
While it might be possible to construct gradient hackers around these features, the advantages to us is that all of these approaches (except perhaps 2) should be visible to us even without detailed mechanistic interpretability knowledge of the network. We know a-priori where the stop-gradients or non-differentiable parts of the architecture are. We can straightforwardly observe what gradients are being clipped -- and if the gradient updates to the same set of parameters is being constantly clipped that could be a red flag. We can also find regions of high curvature in the loss landscape by just inspecting the Hessians or approximations of it. Only a gradient hacker which utilizes correlations across minibatches to cancel out its interventions in model behaviour seems difficult to detect.
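As a rough sketch of the "observe what gradients are being clipped" idea in the quote above, the monitor below counts how often each parameter's gradient exceeds a clipping threshold and flags parameters that are clipped on most steps. It assumes elementwise value clipping and a logged history of per-parameter gradients. The function names, the threshold of 0.8, and the gradient log are all hypothetical, chosen only for illustration.

```python
# Hypothetical monitor: flag parameters whose gradients are clipped on most steps.
# Assumes elementwise value clipping at +/- clip_value.

def clip_frequency(grad_history, clip_value):
    """Fraction of steps on which each parameter's gradient exceeded clip_value."""
    n_steps = len(grad_history)
    n_params = len(grad_history[0])
    counts = [0] * n_params
    for grads in grad_history:
        for i, g in enumerate(grads):
            if abs(g) > clip_value:
                counts[i] += 1
    return [c / n_steps for c in counts]

def flag_suspicious(grad_history, clip_value, threshold=0.8):
    """Indices of parameters clipped on at least `threshold` of the logged steps."""
    freqs = clip_frequency(grad_history, clip_value)
    return [i for i, f in enumerate(freqs) if f >= threshold]

# Made-up gradient log: 5 training steps, 3 parameters.
grad_history = [
    [0.1, 5.0, -0.2],
    [0.3, 6.1, 0.1],
    [-0.2, 4.7, 2.0],
    [0.0, 7.3, -0.1],
    [0.1, 5.5, 0.3],
]

frequencies = clip_frequency(grad_history, clip_value=1.0)
flagged = flag_suspicious(grad_history, clip_value=1.0)
```

Here only the second parameter is clipped on every step, so it is the one that gets flagged. A constantly clipped parameter is a red flag to investigate, not proof of a gradient hacker.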
One caveat here is that the prevention of mesa-optimizers applies fully only to SSL on IID data, which is an unfortunate limitation, although I do expect SGD to still be ridiculously good at credit assignment even in the RL context.
In particular, the detection mechanisms for mesa-optimizers are intact, but we do need to worry about 1 new potential inner misalignment pathway.
Also of importance is that this post only applies to the classic supervised/self-supervised learning on i.i.d data which most current foundation models are trained in. The important things about this paradigm is that the outputs of the model have no causal link to the data distribution -- i.e. the model cannot influence what future data it will be trained on (except potentially highly indirectly influencing successor models [1]). In an RL context, or supervised learning with adaptive data sampling, where there is a link between the model's outputs and the future data distribution, then this provides another route for gradient hackers to operate -- by creating outputs which will steer the data distribution in a desired direction which will strengthen the gradient hacker.
But there's also weak evidence that optimization daemons/demons, often called inner misaligned models, go away when you increase the dimension count:
Another poster (ironically using the handle "DaemonicSigil") then found a scenario in which gradient descent does form an optimization demon. However, the scenario in question is extremely unnatural, and not at all like those found in normal deep learning practice. So no one knew whether this represented a valid "proof of concept" that realistic deep learning systems would develop optimization demons.
Roughly two and a half years later, Ulisse Mini would make DaemonicSigil's scenario a bit more like those found in deep learning by increasing the number of dimensions from 16 to 1000 (still vastly smaller than any realistic deep learning system), which produced very different results, and weakly suggested that more dimensions do reduce demon formation.
This was actually a crux in a discussion between me and David Xu about inner alignment. I argued that the conditions for a sharp left turn don't exist in AI development. He argued that misalignment happens whenever gaps go uncorrected, which likely refers to the gap between the base objective that SGD optimizes and the internal optimizer's goal, the gap that produces inner misalignment. I argued that inner misalignment is likely to be extremely difficult to achieve, because SGD can correct the gap between the inner and outer optimizer in most cases, and I have now laid out that argument in this post.
Twitter conversation below:
https://twitter.com/davidxu90/status/1712567663401238742
Speaking as someone who's read that post (alongside most of Quintin's others) and who still finds his basic argument unconvincing, I can say that my issue is that I don't buy his characterization of the doom argument—e.g. I disagree that there needs to be a "vast gap". (David Xu)
https://twitter.com/davidxu90/status/1712568155959362014
SGD is not the kind of thing where you need "vast gaps" between the inner and outer optimizer to get misalignment; on my model, misalignment happens whenever gaps appear that go uncorrected, since uncorrected gaps will tend to grow alongside capabilities/coherence. (David Xu)
https://twitter.com/SharmakeFarah14/status/1712573782773108737
since uncorrected gaps will tend to grow alongside capabilities/coherence.
This is definitely what I don't expect, and part of that is because I expect that uncorrected inner misalignment will be squashed out by SGD unless extreme things happen:
https://www.lesswrong.com/posts/w2TAEvME2yAG9MHeq/gradient-hacking-is-extremely-difficult (Myself)
https://twitter.com/davidxu90/status/1712575172124033352
Yes, that definitely sounds cruxy—you expect SGD to contain corrective mechanisms by default, whereas I don't. This seems like a stronger claim than "SGD is different from evolution", however, and I don't think I've seen good arguments made for it. (David Xu)
This reminds me, I should address the other conversation I had with David Xu, on how strong the priors we need to encode to ensure alignment are, versus how much we can let the AI learn and still get a good outcome; or, alternatively, how much we need to specify upfront. And that leads to...
Equivalently speaking, I expect the cost of specification of values to be relatively low, and that a lot of the complexity is offloadable to the learning process.
This was another crux between David Xu and me: can you largely get away with weak priors, or do you actually need to encode a much stronger prior to prevent misalignment? It ultimately boiled down to the fact that I expected reasonably weak priors, guided by the innate reward system, to be enough.
A big part of my reasoning here is that a lot of values and biases are inaccessible to the genome, which means that it can't directly specify them. You can shape them by setting up training algorithms and data, but it turns out to be very difficult to directly specify things like values in the genome. This is primarily because the genome does not have direct access to the world model or the brain, which would be required to hardcode the prior. To the extent that it can hardcode anything, it has to be over relatively simple properties, which means that alignment has to be achieved with relatively weak priors encoded. The innate reward system generally does this fantastically, with examples of misalignment being rare and mild.
The fact that humans reliably acquire values like having empathy for friends and acquaintances, having parental instincts, and wanting revenge when others harm us, without the genome hardcoding a lot of prior information, is rather underappreciated. Getting away with reasonably weak priors means that we don't need to specify our values in much detail, and thus we can reliably offload most of the value-learning work to the AI.
Here are some posts and comments below:
(I want to point out that it's not just that, with weak prior information, the genome can reliably bind humans to real-enough things, such that, for example, they don't die of thirst from drinking fake water. It's also that the genome can create the innate reward system, which uses simple update rules to reliably get nearly every person on earth to have empathy for their family and ingroup, to want revenge when others harm them, etc. The exceptions to the pattern are rare and are usually mild alignment failures at worst. That's a source of a lot of my optimism on AI safety and alignment.)
https://www.lesswrong.com/posts/9Yc7Pp7szcjPgPsjf/the-brain-as-a-universal-learning-machine
Here is the compressed conversation between David Xu and me:
https://twitter.com/davidxu90/status/1713102210354294936
(And the reason I'd be more optimistic there is basically because I expect the human has meta-priors I'd endorse, causing them to extrapolate in a "good" way, and reach a policy similar to one I myself would reach under similar augmentation.) (David Xu)
https://twitter.com/davidxu90/status/1713230086730862731
(In reality, of course, I disagree with the framing in both cases: "two different systems" isn't correct, because the genetic information that evolution was working with in fact does encode fairly strong priors, as I mentioned upthread.) (David Xu)
https://twitter.com/SharmakeFarah14/status/1713232260827095119
My disagreement is that I expect the genetic priors to be quite weak, and that a lot of values are learned, not encoded in priors, because values are inaccessible to the genome:
Maybe we will eventually be able to hardcode it, but we don't need that. (Myself)
https://twitter.com/davidxu90/status/1713232760637358547
Values aren't "learned", "inferred", or any other words that suggests they're directly imbibed from the training data, because values aren't constrained by training data alone; if this were false, it would imply the orthogonality thesis is false. (David Xu)
I'm going to reply in this post and say that the orthogonality thesis is a lot like the no free lunch theorem: an extraordinarily powerful result that is too general to apply, because it only covers the space of all logically possible AIs, and it only holds if you apply zero prior, which in this case would require you to specify everything, including the values of the system, or at best use something like brute-force search or memorization algorithms.
I have a very similar attitude to "Most goals in goal space are bad." I'd probably agree in the most general sense, but even weak priors can prevent most goals from being bad, and thus I suspect that a zero-prior condition is necessary for the claim to bite. I'm not arguing that with zero prior, models end up aligned with people without us specifying everything. I'm arguing that we can get away with reasonably weak priors, and let within-lifetime learning do the rest.
Once you introduce even weak priors to the situation, the issue is basically resolved. I stated that weak priors work to induce learning of values, and it's consistent with the orthogonality thesis for arbitrarily tiny amounts of prior information to be all that is needed to learn alignment.
I could make an analogous argument for capabilities, and I'd be demonstrably wrong, since the conclusion doesn't hold.
This is why I hate the orthogonality thesis, despite rationalists being right about it: it allows for too many outcomes, and an inference like "values aren't learned" can't be supported on the basis of the orthogonality thesis.
https://twitter.com/SharmakeFarah14/status/1713234214391255277
The problem with the orthogonality thesis is that it allows for too many outcomes, and notice I said the genetic prior is weak, not non-existent, which would be compatible with the orthogonality thesis. (Myself)
https://twitter.com/davidxu90/status/1713234707272626653
The orthogonality thesis, as originally deployed, isn't meant as a tool to predict outcomes, but to counter arguments (pretty much) like the ones being made here: encountering "good" training data doesn't constrain motivations. Beyond that the thesis doesn't say much. (David Xu)
https://twitter.com/SharmakeFarah14/status/1713236849873891699
I suspect it's true when looking at the multiverse of AIs as a whole, then it's true, if we impose 0 prior, but even weak priors start to constrain your motivations a lot. I have more faith in weak priors + whiteboxness working out than you do. (Myself)
https://twitter.com/davidxu90/status/1713237355501584857
I have more faith in weak priors + whiteboxness working out than you do.
I agree that something in the vicinity of this is likely [a] crux. (David Xu)
https://twitter.com/davidxu90/status/1713238995893912060
TBC, I do think it's logically possible for the NN landscape to be s.t. everything I've said is untrue, and that good minds abound given good data. I don't think this is likely a priori, and I don't think Quintin's arguments shift me very much, but I admit it's possible. (David Xu)
## My own algorithm for how to do AI alignment
This is a subpoint, but for those that want to have a ready-to-go alignment plan, here it is:
Implement a weak prior over goal space.
Use DPO, RLHF, or something else to create a preference model.
Create a custom loss function for the preference model.
Use the backpropagation algorithm to optimize it and achieve a low loss.
Repeat the backpropagation algorithm until you achieve an acceptable solution.
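To give a sense of what the "custom loss function for the preference model" and the "optimize it" steps above could look like, here is a minimal sketch. It is an assumption-heavy toy and not a real RLHF or DPO implementation: the reward model is linear, the features and preference pairs are invented, the loss is the standard pairwise logistic (Bradley-Terry style) loss, and the gradient is written out by hand where a real system would get it from backpropagation.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def reward(w, x):
    """Linear reward model: dot product of weights and features."""
    return sum(wi * xi for wi, xi in zip(w, x))

def pairwise_loss(w, pairs):
    """Mean of -log sigmoid(r(chosen) - r(rejected)) over preference pairs."""
    total = 0.0
    for chosen, rejected in pairs:
        margin = reward(w, chosen) - reward(w, rejected)
        total += -math.log(sigmoid(margin))
    return total / len(pairs)

def pairwise_grad(w, pairs):
    """Hand-derived gradient of pairwise_loss (what backprop would compute)."""
    grad = [0.0] * len(w)
    for chosen, rejected in pairs:
        margin = reward(w, chosen) - reward(w, rejected)
        coeff = -(1.0 - sigmoid(margin))
        for i in range(len(w)):
            grad[i] += coeff * (chosen[i] - rejected[i]) / len(pairs)
    return grad

# Invented preference data: (features of chosen output, features of rejected output).
pairs = [
    ([1.0, 0.0], [0.0, 1.0]),
    ([0.8, 0.1], [0.2, 0.9]),
    ([0.9, 0.3], [0.1, 0.6]),
]

w = [0.0, 0.0]
initial_loss = pairwise_loss(w, pairs)

# Repeat gradient steps until the loss is acceptably low (fixed count here).
for _ in range(200):
    g = pairwise_grad(w, pairs)
    w = [wi - 0.5 * gi for wi, gi in zip(w, g)]

final_loss = pairwise_loss(w, pairs)
```

After training, the toy reward model ranks every chosen output above its rejected counterpart. The weak prior from the first step of the plan is not modeled here at all; this only sketches the loss-and-optimize loop.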
Now that I'm basically finished laying out the arguments and the conversations, let's move on to the conclusion:
My optimism on AI safety stems from a variety of sources. The reasons, listed in the order they appear in the post rather than by importance, are:
I don't believe the sharp left turn is anywhere near as general as Nate Soares makes it out to be, because the conditions that caused a sharp left turn in humans were basically: cultural learning being able to optimize over much faster time-scales than evolution could respond to, evolution not course-correcting us, and culture being able to transmit OOMs more information through the generations than evolution could. None of these conditions hold for modern AI development.
I don't believe that Nate's example of humans misgeneralizing the goal of IGF works as an example of misgeneralization that matters for our purposes, because it is not a case where 1 AI is trained for a goal in environment A and then, in environment B, competently pursues a different goal instead.
Instead, what's happening is that 1 human generation, or 1 human, is trained in environment A, and then a fresh generation of humans is trained on a different distribution, which predictably produces more divergence than the first case.
In particular, there's no reason to be concerned about AI alignment misgeneralizing on this basis, since we have no reason to assume that LessWrong's central example is actually misgeneralization. From Quintin:
If we assume that humans were "trained" in the ancestral environment to pursue gazelle meat and such, and then "deployed" into the modern environment where we pursued ice cream instead, then that's an example where behavior in training completely fails to predict behavior in deployment.
If there are actually two different sets of training "runs", one set trained in the ancestral environment where the humans were rewarded for pursuing gazelles, and one set trained in the modern environment where the humans were rewarded for pursuing ice cream, then the fact that humans from the latter set tend to like ice cream is no surprise at all.
In particular, this outcome doesn't tell us anything new or concerning from an alignment perspective. The only lesson applicable to a single training process is the fact that, if you reward a learner for doing something, they'll tend to do similar stuff in the future, which is pretty much the common understanding of what rewards do.
AIs are mostly white boxes, at the very least, and the control over AI that we have means that a better analogy is through our innate reward systems, which align us to quite a lot of goals spectacularly well, so well that the total evidence of alignment could easily put X-risk or even say, killing a human 5-15+ OOMs or less, which would make the alignment problem a non-problem for our purposes. It would pretty much single-handedly make AI misuse the biggest problem, but that issue has different solutions, and governments are likely to regulate AI misuse anyway, so existential risk gets cut 10-99%+ or more.
I believe the security mindset is inappropriate for AI due to the fact that aligning AI mostly doesn't involve dealing with adversarial intelligences or inputs, and the reason turns out to be that the most natural class, inner misaligned mesa-optimizers/optimization daemons mostly doesn't exist, because of my next reason. Also alignment is in a different epistemic state to computer security, and there are other disanalogies that make porting intuitions from other fields into ML/AI research very difficult to do correctly.
It is actually really difficult to inner-misalign an AI, since SGD is really good at credit assignment and optimizes the entire causal graph leading to the loss, leaving no slack. It's not like evolution, which faces the problem described in this passage from Gwern's post:
Imagine trying to run a business in which the only feedback given is whether you go bankrupt or not. In running that business, you make millions or billions of decisions, to adopt a particular model, rent a particular store, advertise this or that, hire one person out of scores of applicants, assign them this or that task to make many decisions of their own (which may in turn require decisions to be made by still others), and so on, extended over many years. At the end, you turn a healthy profit, or go bankrupt. So you get 1 bit of feedback, which must be split over billions of decisions. When a company goes bankrupt, what killed it? Hiring the wrong accountant? The CEO not investing enough in R&D? Random geopolitical events? New government regulations? Putting its HQ in the wrong city? Just a generalized inefficiency? How would you know which decisions were good and which were bad? How do you solve the “credit assignment problem”?
SGD solves this problem by running backprop, which is a white-box algorithm; Nora Belrose explains it in more detail here:
And that's the base optimizer, not the mesa-optimizer, which is why SGD has a chance to correct an inner-misaligned agent far more effectively than cultural/biological evolution, the free market, etc. It is white-box, like the inner optimizers it runs, and it solves credit assignment much better than those earlier optimizers could hope to.
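To make the contrast in feedback granularity concrete, here is a minimal toy sketch in Python. It assumes a hypothetical two-parameter "network" y = w2 * (w1 * x) with squared-error loss; the function names, the threshold and the learning rate are all made up for illustration. It only shows that backprop hands every parameter its own gradient, while an evolution-style signal gives one pass/fail bit for the whole parameter set. It is not evidence for the broader claim about inner misalignment.

```python
# Toy illustration only. The model, names, threshold and learning rate
# are hypothetical and chosen for readability, not realism.

def loss(w1, w2, x, target):
    """Squared error of the tiny model y = w2 * (w1 * x)."""
    return (w2 * (w1 * x) - target) ** 2


def backprop_grads(w1, w2, x, target):
    """Chain rule by hand: one gradient entry per parameter."""
    h = w1 * x                      # hidden activation
    y = w2 * h                      # output
    dloss_dy = 2.0 * (y - target)   # d(loss)/dy
    grad_w2 = dloss_dy * h          # credit assigned to w2
    grad_w1 = dloss_dy * w2 * x     # credit assigned to w1
    return grad_w1, grad_w2


def one_bit_feedback(w1, w2, x, target, threshold=1.0):
    """Evolution-style signal: a single bit for the whole parameter set."""
    return loss(w1, w2, x, target) < threshold


def sgd_step(w1, w2, x, target, lr=0.001):
    """One gradient-descent update using the per-parameter gradients."""
    g1, g2 = backprop_grads(w1, w2, x, target)
    return w1 - lr * g1, w2 - lr * g2
```

With w1 = 1, w2 = 2, x = 3 and target = 0, backprop_grads reports a separate number for each parameter, and one sgd_step lowers the loss. one_bit_feedback only reports that the configuration as a whole failed, with no indication of which parameter to change.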
Now that I have listed my reasons for optimism on AI safety, I'll add one new mini-section to show that the shutdown problem for AI is almost solved.
It turns out that we can keep the most useful aspects of Expected Utility Maximization while making an AI shutdownable.
Sami Petersen showed that we can build incomplete preferences into AIs while weakening transitivity just enough to get a non-trivial theory of Expected Utility Maximization that's quite a lot safer. Elliott Thornley proposed using incomplete preferences to solve the shutdown problem, and the very nice thing about subagent models of Expected Utility Maximization is that they require a unanimous committee in order for a decision to be accepted as a sure gain.
This is useful, but it can also lead to problems. On the one hand, we only need one expected utility maximizer on the committee that wants to be able to shut down the AI in order for us to shut it down as a whole. On the other hand, we would need to be fairly careful about what the subagents' execution conditions/domain are, as unanimous committees can be terrible: only one agent needs to object to grind the entire system to a halt, which is why, in the real world, unanimity is usually not a preferred way to govern something.
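The unanimity idea can be sketched in a few lines of Python. This is not Petersen's or Thornley's actual construction. It assumes a simple Pareto-style rule (an option is strictly preferred only if no subagent is worse off and at least one is better off), and the option names and utility values are hypothetical.

```python
# Minimal sketch of a unanimous committee of subagents.
# Each subagent is a dict mapping option name -> utility (hypothetical values).

def prefers(subagents, a, b):
    """True if the committee strictly prefers option a to option b.

    Assumed rule: every subagent weakly prefers a, and at least one
    subagent strictly prefers a.
    """
    nobody_worse_off = all(u[a] >= u[b] for u in subagents)
    someone_better_off = any(u[a] > u[b] for u in subagents)
    return nobody_worse_off and someone_better_off


def incomparable(subagents, a, b):
    """True if the committee prefers neither option: a preference gap."""
    return not prefers(subagents, a, b) and not prefers(subagents, b, a)


# One subagent cares about finishing the task, the other about being shut down.
task_agent = {"keep_running": 1.0, "keep_running_better": 2.0, "shut_down": 0.0}
shutdown_agent = {"keep_running": 0.0, "keep_running_better": 0.0, "shut_down": 1.0}
committee = [task_agent, shutdown_agent]
```

Here incomparable(committee, "keep_running", "shut_down") holds, because the two subagents disagree, so the committee has no strict preference to resist shutdown. By contrast, prefers(committee, "keep_running_better", "keep_running") holds, because nobody objects. The same veto power is what makes unanimous committees prone to gridlock.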
Nevertheless, for AI safety purposes, this is still very, very useful, and if it grows up to have broader conditions than the ones outlined in the posts below, this might be the single biggest MIRI success of the last 15 years, which is ridiculously good.
Edit 3: I've removed addendum 2 as I think it's mostly irrelevant, and Daniel Kokotajlo showed me that Ajeya actually expects things to slow down in the next few years, so the section really didn't make that much sense.