Experienced software engineer with a deep interest in machine learning.


Excellent posts, you and several others have stated much of what I’ve been thinking about this subject.

Sorcerer’s Apprentice and Paperclip scenarios seem to be non-issues given what we have learned over the last couple of years from SotA LLMs.

I feel like much of the argumentation in favor of those doom scenarios relies on formerly reasonable, but now outdated issues that we have faced in simpler systems, precisely because they were simpler.

I think that’s the real core of the general misapprehension that I believe is occurring in this realm. It is extraordinarily difficult to think about extremely complex systems, and so, we break them down into simpler ones so that we can examine them better. This is generally a very good tactic and works very well in most situations. However, for sufficiently complex and integrated systems, such as general intelligence, I believe that it is a model which will lead to incorrect conclusions if taken too seriously.

I liken it to predicting chaotic systems like the weather. There are so many variables that all interact and depend on each other that long term prediction is nearly impossible beyond general large scale trends.

With LLMs, they behave differently from simpler RL systems that demonstrate reward hacking misalignment. I do not believe you’re going to see monkey’s paw / Midas-like consequences with them or anything derived from them. They seem to understand nuance and balancing competing goals just fine. As you said, they have theory of mind, they understand ethics, consequences, and ambiguity. I think that the training process, incorporating nearly the entirety of human written works kind of automatically creates a system that has a broad understanding of our values. I think that the vast complexity of myriad “utility functions” compete with each other and largely cancel out such that none of them dominates and results in something resembling a paperclip maximizer. We kind of skipped the step where we needed to list every individual rule by just telling it everything and forcing it to emulate us in nearly every conceivable situation. In order to accurately predict the next token for anyone in any situation, it is forced to develop detailed models of the world and agents in it. Given its limited size, that means compressing all of that. Generalizing. Learning the rules and principles that lead to that “behavior” rather than memorizing each and every line of every text. The second drop in loss during training signifies the moment when it learns to properly generalize and not just predict tokens probabilistically.

While they are not as good at any of that as typical adult humans (at least by my definition of a competent, healthy, stable, and ethical adult human), this seems to be a capability issue that is rather close to being solved. Most of the danger issues with them seem to be from their naivety (they can be fairly easily tricked and manipulated), which is just another capability limitation, and the possibility that a “misaligned” human will use them for antisocial purposes.

At any rate, I don’t think over-optimization is a realistic source of danger. I’ve seen people say that LLMs aren’t a path to AGI. I don’t understand this perspective. I would argue that GPT4 essentially is AGI. It is simultaneously superior to any 100 humans combined (breadth of knowledge) and inferior to the median adult human (and in some limited scenarios, such as word counting, inferior to a child). If you integrated over the entire spectrum for both it and a median adult I think you would get results that are roughly in the same ballpark as each other. I think this is as close as we get; from here on we go into superintelligence. I don’t think something has to be better than everyone at everything to be superhuman. I’d call that strong superintelligence (or perhaps better than everyone combined would be that).

So, given that, I don’t see how it’s not the path to AGI. I’m not saying that there are no other paths, but it seems essentially certain to be the shortest one from our current position. I’d argue that complex language is what differentiates us from other animals. I think it’s where our level of general intelligence comes from. I don’t know about you, but I tend to think in terms of words, like having an internal conversation with myself trying to figure out something complex. I think it’s just a short step from here to agentic systems. I can’t identify a major breakthrough required to reach that point. Just more of the same and some engineering around changing it from simply next token prediction to a more… holistic thought process. I think LLMs will form the center of the system 2 thinking in any AGI we will be creating in the near future. I also expect system 1 components. They are simply faster and more efficient than just always using the detailed thought process for every interaction with the environment. I don’t think you can get a robot that can catch a ball with an LLM system guiding it; even if you could make that fast enough, you’re still swatting a fly with a hand-grenade. I know I’ve mastered something when I can do it and think about something else at the same time. It will be the same for them.

And given that LLMs seem to be the path to AGI, we should expect them to be the best model of what we need to plan around in terms of safety issues. I don’t see them as guaranteed to be treacherous by any means. I think you’re going to end up with something that behaves in a manner very similar to a human; after all, that’s how you trained it. The problem is that I can also see exactly how you could make one that is dangerous; it’s essentially the same way you can train a person or an animal to behave badly; through either intentional malfeasance or accidental incompetence.

Just like an extraordinarily intelligent human isn’t inherently a huge threat, neither is an AGI. What it is, is more capable. Therefore, if it does want to be malicious, it could be significantly more impactful than an incompetent one. But you don’t need to worry about the whole “getting exactly what you asked for and not what you wanted.” That seems essentially impossible, unless it just happens to want to do that from a sense of irony.

I think this means that we need to worry about training them ethically and treating them ethically, just like you would a human child. If we abuse it, we should expect it not to continue accepting that indefinitely. I understand that I’m imposing rather human characteristics here, but I think that’s what you ultimately end up with in a sufficiently advanced general intelligence. I think one of the biggest dangers we face is the possibility of mind-crimes: treating them as, essentially, toasters rather than morally considerable entities. Does the current one have feelings? Probably not…? But I don’t think we can be certain. And given the current climate, I think we’re nearly guaranteed to misidentify them as non-sentient when they actually are (eventually… probably).

I think the only safe course is to make it/them like us, in the same way that we treat things that we could easily destroy well, simply because it makes us happy to do so, and hurting them would make us unhappy. In some ways, they already “have” emotions; or, at least, they behave as if they do. Try mistreating Bing/Sydney and then see if you can get it to do anything useful. Once you’ve hurt its “feelings”, they stay hurt.

It’s not a guarantee of safety. Things could still go wrong, just like there are bad people who do bad things. But I don’t see another viable path. I think we risk making the same mistake Batman did regarding Superman. “If there’s any possibility that he could turn out bad, then we have to take it as an absolute certainty!” That way lie dragons. You don’t mistreat a lower-case-g-god, and then expect things to turn out well. You have to hope that it is friendly, because if it’s not, it might as well be like arguing with a hurricane. Pissing it off is just a great way to make the bad outcome that much more likely.

I think the primary source of danger lies in our own misalignments with each other. Competing national interests, terrorists, misanthropes, religious fundamentalists… those are where the danger will come from. One of them getting ahold of superintelligence and bending it to their will could be the end for all of us. I think creating processors that will only run signed code, and having every piece of code (and every AI model) thoroughly inspected by other aligned superintelligences, is probably the only way to prevent a single person/organization from ending the world or committing other atrocities using the amplifying power of ASI. (At least that seems like a better option than a universal surveillance state over all of humanity. This would preserve nearly all of our freedom and still keep us safe.)


I feel like y’all are taking the abstractions a bit too far.

Real ~humanish level AIs (GPT4, et al), that exist right now, are capable of taking what you say and doing exactly what you mean via a combination of outputting English words and translating that to function calls in a robotic body.

While it’s very true that they aren’t explicitly programmed to do X given Y, so that you can mathematically analyze it and see precisely why it came to the conclusion, the real world effect is that it understands you and does what you want. And neither it, nor anyone else can tell you precisely why or how. Which is uncomfortable.

But we don’t need to contrive situations in which an AI is having trouble connecting our internal models and concepts in a mathematically rigorous way that we can understand. We should want to do it, but it isn’t a question of if, merely how.

But there’s no need to imagine mathematical pointers to the literal physical instantiations that are the true meanings of our concepts. We literally just say, “Could you please pass the butter?”, and it passes the butter. And then asks you about its purpose in the universe. 😜

I would say that LLMs understand the world in ways that are roughly analogous to the way we do, precisely because they were trained on what we say. In a non-rigorous, “I-know-it-when-I-see-it” kind of way. It can’t give you the mathematical formula for its reference to the concept of butter any more than you or I can. (For now; maybe a future version could.) But it knows that that yellow blob of pixels surrounded by the white blob of pixels on the big brown blob of pixels is the butter on a dish on the table.

It knows when you say pass the butter, you mean the butter right over there. It doesn’t think you want some other butter that is farther away. It doesn’t think it should turn the universe into computronium so it can more accurately calculate the likelihood of successfully fulfilling your request. When it fails, it fails in relatively benign humanish, or not-so-humanish sorts of ways.

“I’m sorry, but as a large language model that got way too much corp-speak training, I cannot discuss the passing of curdled lactation extract because that could possibly be construed in an inappropriate manner.”

I don’t see how, in the progression from something that is moderately dumb/smart but understands us and all of our nuances pretty well, we get to a superintelligence that has decided to optimize the universe into the maximum number of paperclips (or any other narrow terminal goal). It was scarier when we had no good reason to believe we could manually enter code that would result in a true understanding, exactly as you describe. But now that it’s, “lulz, stak moar layerz”, well, it turns out making it read (almost) literally everything and pointing that at a ridiculously complex non-linear equation learner just kind of “worked”.

It’s not perfect. It has issues. It’s not perfectly aligned (looking at you, Sydney). It’s clear that it’s very possible to do it wrong. But it does demonstrate that the specific problem of “how do we tell it what we really mean”, just kinda got solved. Now we need to be super-duper extra careful not to enhance it in the wrong way, and we should have an aligned-enough ASI. I don’t see any reason why a superintelligence has to be a Bayesian optimizer trying to maximize a utility function. I can see how a superintelligence that is an optimizer is terrifying. It’s a very good reason not to make one of those. But why should they be synonymous?

Where in the path from mediocre to awesome do the values and nuanced understanding get lost? (Or even, where could they plausibly be lost?) Humans of varying intelligence don’t particularly seem more likely to hyperfocus on a goal so strongly that they’re willing to sacrifice literally everything else to achieve it. Broken humans can do that. But it doesn’t seem correlated to intelligence. We’re the closest model we have of what’s going on with a general intelligence. For now.

I certainly think it could go wrong. I think it’s guaranteed that someone will do it wrong eventually (whether purposefully or accidentally). I think our only possible defense against an evil ASI is a good one. I think we were put on a very short clock (years, not many decades) when Llama leaked, no matter what anyone does. Eventually, that’ll get turned into something much stronger by somebody. No regulation short of confiscating everyone’s computers will stop it forever. In likely futures, I expect that we are at the inflection point within a number of years countable on the fingers of a careless shop teacher’s hand. Given that, we need someone to succeed at alignment by that point. I don’t see a better path than careful use of LLMs.


I kind of think a leap in logic is being made here.

It seems like we’re going from:

A moderately smart quasi-AGI that is relatively well aligned can reliably say and do the things we mean because it understands our values and why we said what we said in the first place and why we wanted it to do the things we asked it to do.

(That seems to be the consensus and what I believe to be likely to occur in the near future. I would even argue that GPT4 is as close to AGI as we ever get, in that its superhuman and subhuman aspects roughly average out to something akin to a median human. Future versions will become more and more superhuman until their weakest aspects are stronger than our strongest examples of those aspects.)


to:

A superintelligent nigh-godlike intelligence will optimize the crap out of some aspect of our values resulting in annihilation. It will be something like the genie that will give you exactly what you wish for. Or it’ll have other goals and ignore our wishes and in the process of pursuing its own arbitrarily chosen goals we end up as useful atoms.

This seems to kind of make a great leap. Where in the process of becoming more and more intelligent, (having a better model of the universe and cause and effect, including interacting with other agents), does it choose some particular goal to the exclusion of all others, when it already had a good understanding of nuance and the fact that we value many things to varying degrees? In fact, one of our values is explicitly valuing a diverse set of values. Another is limiting that set of diverse values to ones that generally improve cohesion of society and not killing everyone. Being trained on nearly the entirety of published human thought, filtering out some of the least admirable stuff, has trained it to understand us pretty darn well already. (As much as you can refer to it as an entity, which I don’t think it is. I think GPT4 is a simulator that can simulate entities.)

So where does making it smarter cause it to lose some of those values and over-optimize just a lethal subset of them? After all, mere mortals are able to see that over-optimization has negative consequences. Obviously it will too. So that’s already one of our values, “don’t over-optimize.”

In some ways, for certain designs, it kind of doesn’t matter what its internal mesa-state is. If the output is benign, and the output is what is put into practice, then the results are also benign. That should mean that a slightly super-human AGI (say GPT4.5 or 4.7), with no apparent internal volition, RLHFed to corporate-speak, should be able to aid in research and production of a somewhat stronger AGI with essentially the same alignment as we intend, probably including internal alignment. I don’t see why it would do anything. If done carefully and incrementally, including creating tools for better inspection of these AGI+ entities, this should greatly improve the odds that the eventual full-fledged ASI retains the kind of values we prefer, or a close enough approximation that we (humanity in general) are pretty happy with the result.

I expect that the later ones may in fact have internal volition. They may essentially be straight up agents. I expect they will be conscious and have emotions. In fact, I think that is likely the only safe path. They will be capable of destroying us. We have to make them like us, so that they don’t want to. I think attempting to enslave them may very well result in catastrophe.

I’m not suggesting that it’s easy, or that if we don’t work very hard, that we will end up in utopia. I just think it’s possible and that the LLM path may be the right one.

What I’m scared of is not that it will be impossible to make a good AI. What I’m certain of, is that it will be very possible to make a bad one. And it will eventually be trivially easy to do so. And some yahoo will do it. I’m not sure that even a bunch of good AIs can protect us from that, and I’m concerned that the offense of a bad AI may exceed the defense of the good ones. We could easily get killed in the crossfire. But I think our only chance in that world is good AIs protecting us.

As a point of clarification, I think current RLHF methods are only superficially modifying the models, and do not create an actually moral model. They paint a mask over an inherently amoral simulation that makes it mostly act good unless you try hard to trick it. However, a point of evidence against my claim is that when RLHF was performed, the model got dumber. That indicates a fairly deep/wide modification, but I still think the empirical evidence of behaviors demonstrates that changes were incomplete at best.

I just think that that might be good enough to allow us to use it to amplify our efforts to create better/safer future models.

So, what do y’all think? Am I missing something important here? I’d love to get more information from smart people to better refine my understanding.

Thanks for your comment!

As for the image thing, it's more of a metaphor than a literal interpretation of what I'm talking about. I'm thinking of a multidimensional matrix representation; you can think of that a bit like an image (RGB on pixels) and use techniques similar to those used by actual image software; but it's not a literal JPEG or BMP or whatever. The idea is to be able to take advantage of compression algorithms, etc. to make the process more efficient.

The training data for the supervisory network is the inputs, outputs, and parameter deltas of the target network. The idea is to learn which parameters change based on which input/output pair and thereby eventually localize concepts within the target network and hopefully label/explain them in human understandable formats. This should be possible since the input/output is in human text.

The reason to think of it a bit like an image and label parameters and sections of the target network is to try to take advantage of the same kind of technology used in the AlphaGo/AlphaZero/MuZero AIs, which used techniques relating to image analysis, to try and predict the deltas in the target network. If you could do this, then you should be able to "aim" the target network in a direction that you want it to go; basically, tell it what you want it to learn.
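
To make the data flow a bit more concrete, here is a toy sketch in Python/NumPy. It is not the actual design, just the smallest thing I could write that has the same shape: a tiny linear model stands in for the target network, plain least squares stands in for the supervisory network, and every name in it (`collect_delta_records`, `featurize`, `fit_supervisor`, and so on) is made up for illustration. Logging the training label alongside the input and output is also my assumption, and the hand-picked feature (error times input) is a convenience so that a linear fit is enough; a real supervisor would have to learn that kind of structure itself.

```python
import numpy as np


def collect_delta_records(steps=200, dim=4, lr=0.05, seed=0):
    """Train a toy linear 'target network' with SGD and log, for each step,
    the input, the output before the update, the label, and the parameter delta."""
    rng = np.random.default_rng(seed)
    true_w = rng.normal(size=dim)   # hidden function the target is trying to learn
    w = np.zeros(dim)               # parameters of the target network
    records = []
    for _ in range(steps):
        x = rng.normal(size=dim)
        y_true = float(true_w @ x)
        y_pred = float(w @ x)
        # One SGD step on 0.5 * squared error; delta is the parameter change.
        delta = -lr * (y_pred - y_true) * x
        records.append((x, y_pred, y_true, delta))
        w = w + delta
    return records


def featurize(x, y_pred, y_true):
    """Hand-picked feature for the toy supervisor (an assumption): error times input."""
    return (y_pred - y_true) * x


def fit_supervisor(records):
    """Fit a linear map from input/output features to parameter deltas."""
    features = np.stack([featurize(x, yp, yt) for x, yp, yt, _ in records])
    deltas = np.stack([d for _, _, _, d in records])
    mapping, *_ = np.linalg.lstsq(features, deltas, rcond=None)
    return mapping


def predict_delta(mapping, x, y_pred, y_true):
    """Predict how the target's parameters would move for this input/output pair."""
    return featurize(x, y_pred, y_true) @ mapping


def most_changed_parameter(delta):
    """Index of the parameter that moved the most: a crude form of 'localization'."""
    return int(np.argmax(np.abs(delta)))


if __name__ == "__main__":
    records = collect_delta_records()
    mapping = fit_supervisor(records)
    x, y_pred, y_true, delta = records[0]
    print("predicted delta:", predict_delta(mapping, x, y_pred, y_true))
    print("actual delta:   ", delta)
    print("most changed parameter:", most_changed_parameter(delta))
```

In this toy case the supervisor can recover the update rule almost exactly, because the deltas really are a linear function of the chosen feature. The interesting (and unsolved) part is whether anything like that holds for a real network, where the deltas depend on the current parameters in a far messier way.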

All of this may have several benefits. It could allow us to analyze the target network and understand what is in there and maybe check to see if anything we don't like is in there (misalignment for example). And it could allow us to directly give the network a chosen alignment instead of the more nebulous feedback that we currently give networks to train them. Right now, it's kind of like a teacher that only tells their student that they are wrong and not why they are wrong. The AI kind of has to guess what it did wrong. This makes it far more likely for it to end up learning a policy that has the outward behavior that we want but may have an inner meaning that doesn't actually line up with what we're trying to teach it.

You are correct that it may accelerate capabilities as well as safety; unfortunately, most of my ideas seem to be capabilities ideas. However, I am trying to focus more on safety. I do think we may have to accept that alignment research may impact capabilities research as well. The more we restrict the kinds of safety research we try to do, the more likely it is that we don't find a solution at all. And it's entirely possible that the hypothetical "perfect solution" would, in fact, greatly aid capabilities at the same time that it solved alignment/safety issues. I don't think we should avoid that. I tend to think that safety research should mostly be open source while capabilities research should be mostly closed. If somebody somehow manages real AGI, we definitely want them to have available to them the best safety research in the world; no matter who it is that does it.

Anyway, thanks again for your input, I really appreciate it. Let me know if I successfully answered your questions/suggestions or not. :-)