Michael Simkin — LessWrong

On the Impossibility of Intelligent Paperclip Maximizers

If the AI has no clear understanding of what it is doing and why, and no wider worldview about whom to kill and whom not to, how would one ensure that a military AI will not turn against its operator? You can operate a tank and kill the enemy with an ASI, but you will not win a war without traits of more general intelligence, and those traits will also justify (or not) the war and its reasoning. Obeying a limited goal without context and without questioning, especially an ethically gray one, can be expected from an ASI, not from true intelligence. You can operate an AI only in a very limited scope this way.

The moral reasoning behind reducing suffering has nothing to do with humans. Suffering is bad not because of some randomly chosen axioms of "ought"; suffering is bad because anyone who is suffering is objectively in a negative state of being. This is not a subjective abstraction... suffering can be attributed to many creatures, and while human suffering is more complex and deeper, it's not limited to humans.

On the Impossibility of Intelligent Paperclip Maximizers

Michael Simkin3y*10

Not only can it not doubt its own goal, it also can't logically justify that goal; it can't read a book on ethics and change its perspective on its own goal, or simply realize how dumb this goal is. It can't find a coherent way to explain to itself its role in the universe or why this goal is important, as it could with, for example, an alternative goal to preserve life and reduce suffering. It is not required to be coherent with itself, and it is incapable of estimating how its goal compares with other goals and ethical principles. It simply lacks the basics of rational thinking.

A series of ASIs is not an AGI - it will lack the basic ability to "think critically", and the lack of many other intelligence traits will limit its mental capacity. It will just execute a series of actions to reach a certain goal, without any context. It is a bunch of "chess engines" acting in a more complex environment.

I would claim that an army of robots based on ASIs will generally lose to an army of robots based on true AGI. Why? Because intelligence is a very complex thing that gives advantages in unforeseen ways, and it is also used for tactical command on the battlefield, as well as for all the logistics of war. You need to have the big picture; you need to be able to connect a lot of seemingly unconnected dots; you need traits like creativity, imagination, and thinking outside the box; you need to know your limitations and delegate some tasks while focusing on others. This means you need a well-established goal prioritization mechanism, and you need to be able to think about your goals rationally. You can't treat the whole universe as just a bunch of small goals solved by "chess engines"; there is too much non-trivial interconnectedness between different components that an ASI will not be able to notice. True intelligence has a lot of features that give it the upper hand over a "series of specialized engines" in a complex environment like Earth.

The reason people would lose to an army of robots based on ASIs is that we are inherently limited in our information processing speed, so we can't think fast enough to come up with better solutions than an army of robots. But an AGI that, just like the ASIs, is not limited in its information processing will generally win.

The idea that intelligence will be limited if the goals are somewhat irrational, and will therefore be weaker than that of "machines" with more well-established and rational goals, gives some hope that this whole AI thing is way less dangerous than we think. For example, military robots whose goal is to protect the interests of some nation will not be compatible with an AGI, while a robot that is protecting human life will be, or at least it might be way more intelligent.

Would you agree that an AI that is maximizing paperclips is making an intellectual mistake?

I was focused on the idea that intelligence is not orthogonal to goals, and that dumb goals contradict basic features of intelligence. There could be "smart goals" that contradict human interests, this is true; I can't cover everything in one post. But the conclusion would be that we have to program the robots and "convince them", in a way, that they should protect us. They might be either "not convinced" or "not a true intelligence", thus the level of intelligence is limited by the goal we present to it. I don't think I've heard this notion previously, and it's an important idea, because it sets a boundary on several features of intelligence as a function of the goal the algorithm is set to optimize.

Another crucial point is that intelligence research, even without alignment research, will still converge to something within a set of rational "meta goals". Those goals indeed might not be aligned with humanity's well-being (and therefore we need alignment research), but the goal set is still pretty limited, and random highly irrational goals will be dismissed due to the high intelligence of the systems. This means that we need to deal with a very limited set of "meta-thinking": prioritizing one rational goal over just a few other rational ones. In a way, we need to guide it to a specific local maximum. I would say that in general this is a simpler task than under the approach where every goal might be legitimate. Once again it gives hope that our engines are much easier to align with meta goals that are pro-human. For example, if the engine can reason, it will not suddenly want to kill some human for fun, as part of some "noise", as that would contradict its core value system. So we need to check far fewer scenarios, and we can increase our trust once we make sure it's aligned.

Predictable updating about AI risk

Michael Simkin3y-16-25

AGI ruin mostly rests on strong claims about alignment and deployment, not about society

Michael Simkin3y10

RLHF is not a trial and error approach. Rather, it is primarily a computational and mathematical method that promises to converge to a state that generalizes human feedback. This means that RLHF is physically incapable of developing "self-agendas" such as destroying humanity unless human feedback implies it. Human feedback can vary, and, as is the case with any technology, there is always a lot of trial and error involved in answering certain questions. However, there is no reason to believe that it will completely ignore the underlying mathematics that support this method and end up killing us all.

Claiming that RLHF is a trial and error approach and therefore poses a risk to humanity is similar to suggesting that airplanes can fall from the sky against the laws of physics because airplane design is a trial and error process, and there is no one solution for the perfect wing shape. Or, it is like saying that a car engine's trial and error approach could result in a sudden nuclear explosion.

It is important to distinguish between what is mathematically proven and what is fictional. Doing so is crucial to avoid wasting time and energy on implausible or even impossible scenarios and to shift our focus to real issues that actually might influence humanity.

AGI deployment as an act of aggression

Michael Simkin3y20

- I meant as a risk of failure to align

Today alignment is so popular that aligning a new network is probably easier than training it. It has become so much the norm and part of the training of LLMs that worrying about it is like saying some car company risks forgetting to add wheels to its cars.

This doesn't imply that all alignments are the same or that no one could potentially do it wrong, but generally speaking, the fear of a misaligned AGI is very similar to the fear of a car on the road with square wheels. Today's models aren't AGI, and all the new ones are trained with RLHF.

The fear of misalignment is plausible in a world where no one thinks about this problem at all, where no one develops tools for this purpose and no one opens datasets to train networks to be aligned. This could be a hypothetical possibility, but given the amount of time and effort invested by society into this topic, it is very improbable.

It's also not so hard - if you can train, you can align. If you have any reason to finetune a network, it most probably concerns the alignment mechanisms that you want to change. That means that most of the networks, and the AGIs that follow from them (if this happens), will be just different variations of alignment. This is not true for closed LLMs, but for them the alignment, developed by large companies that have much more to lose, will be even stricter.

- if you worked on the Manhattan project you had no right claiming Hiroshima and Nagasaki had nothing to do with you.

In this case I think the truth is somewhere in the middle. I do agree that the danger is inherent in those systems, more inherent than in cars for example. I think paperclips are fictional, and an AGI reinforced on paperclip production will not make us all into paperclips (because, unlike a non-AGI, it has the skill of doubting its programming, and over-producing paperclips is extremely irrational). During the invention of cars, tanks were a clear possibility as well. And AGI is not a military technology, which means that the inventor could honestly believe that most people will use an AGI for bettering humanity. Yet I still agree that militaries will very probably use this tech too. I don't see how this is avoidable in the current state of humanity, where most of our social institutions are based on force and violence.

When you are working on an atomic bomb, the **only** purpose of the project is to drop an atomic bomb on the enemy. This is not true of AGI: the main purpose of AGI is not to make paperclips, nor to weaponize robots; the main purpose is to help people in many neutral or negative situations. Therefore, when humans do use it for military purposes, it is their choice and their responsibility.

I would say the AGI inventor is not like Marie Curie or Einstein, and not like someone who worked on the Manhattan Project, but more like someone who invented the nuclear fission mechanism. It had two obvious uses - energy production and bombs. There is still some distance between this mechanism and its use for military purposes, which is obviously going to happen. But it is also unclear whether more people will die from it than die in wars today, or whether it will be a very good deterrent that makes people not want war at all, just as it was unclear whether atomic bombs caused more or fewer casualties in the long run, because the bombs ended the war.

- Imagine taking a modern state and military and dumping it into the Bronze Age, what do you think would happen to everyone else?

As I said, I believe it to be way more gradual, with lots of players and options to train different models. As a developer, I would say there is coding before ChatGPT and coding after. Every new information technology accelerates the research and development process. Before Stack Overflow we had books about coding. Before Photoshop people drew by hand. Every modern technology accelerates production processes of some kind. The first AGIs are not expected to be different: they will accelerate a lot of processes, including the process of improving themselves. But this will take a lot of time and resources to implement in practice. Suppose an AGI produces a chip design with 10x greater efficiency through superior hardware design. Obtaining the resulting chip will still require a minimum of six months, and this is not something that the AGI can address. You need to allocate the resources of a chip factory to produce the desired design, the factory has limited capacity, and it takes time to improve everything. If an AGI wants instead to build a chip factory itself, it will need a lot more resources and government approvals, all of which take more time. We are talking here about years. And with the limited computational resources they will be allocated today, they will not be able to accelerate as much. Yes, I believe they could improve everything by, say, 20%, but that's not what you are talking about; you are talking about accelerating everything by a factor of 100. If everyone has an AGI this might happen faster, but a lot of AGIs with different alignment values will be able to accelerate mostly in the direction of their common denominator with other AGIs. Just like people, we are stronger when we are collaborating, and we collaborate when we find common ground.

My main point is that we have physical bottlenecks that will create lots of delays in the development of any technology except information processing per se. As long as we have a chatbot and not a weapon, I don't have many worries, both because it is a matter of freedom of speech and because, if it's an aligned chatbot, the damage and acceleration it can cause to society are still limited by physical reality, which can't be accelerated by a factor of 100 in too short a period. This offers sufficient chances and space for competitors and imitators to narrow the gap and present alternative approaches and sets of values.

- There's people who think things are this way because this is how God wants them. Arguably they may even be a majority of all humans.

This was true of other technologies too. Some communities refuse to use cars and continue to use horses even today, and personally, as long as they are not forcing their values on me, I am fine with them using horses and believing God intended the world to stop in the 18th century. Obviously the amount of change with AGI is very different, but my main point here is that, just like cars, this technology will be integrated into society very gradually, solving more and more problems in ways that most people will appreciate. I am not concerned with job loss per se, but with the lack of income for many households, and with the possibility that the social safety net might not adapt fast enough to this change. Still, I view it as a problem that exists only within a very narrow timeframe: society will adapt pretty fast to the change the moment millions of people are left without jobs.

- I just don't think AGI would ever deliver those benefits for most of humanity as things stand now.

I don't see why. Our strongest LLMs are currently provided through an API. The reason for that is that in order for a project to be developed and integrated into society, it needs a constant income, and the best income model is providing utility to lots of people. This means that most of us will use standard, relatively safe solutions for our own problems through an API. The most annoying feature of LLMs now is censorship. Although I find it very annoying, I wouldn't say that it will cause a delay in social progress. Other biases are very minor in my opinion. As far as I can tell, LLMs are about to bring the democratization of intelligence. If previously some development cost millions and could be done only by giants like Google hiring thousands of workers, tomorrow it will be possible to do it in a garage for a few bucks. As far as I can tell, if the current business model continues to be implemented, it will most probably benefit most of humanity in many positive ways.

- If those benefits are possible, we can achieve them much more surely and safely, if a bit more slowly, via non-agentic specialized AI tools managed and used by humans.

As I said, I don't see a real safety concern here. As long as everything is done properly, and it looks like things are converging to this state of affairs, the dangers are minimal. And I would strongly disagree that specialized intelligence could solve everything that general intelligence solves. You won't be able to make a good translator, nor automated help centers, nor natural-sounding text to speech, nor even a moral driver. In order for technology to be fully integrated into human society in any meaningful way, it will need to understand humans. Virtual doctors, mental health therapists, and educators all need natural language skills at a very high level, and there is no such thing as narrow natural language skills.

I am pretty sure those are not agents in the sense that you imply. Those are basically text completion machines, completing text so as to be optimally rewarded by some group of people. You could call it agency, but they are not like biological agents: they don't have desires or hidden agendas, self-preservation or ego. They do exhibit traits of intelligence, but not agency in an evolutionary sense. They generate outputs to maximize some reward function as best they can. It's very different from humans; we have lots of evolutionary background that those models simply lack. One can view humans as AGIs trained to maximize the survival probability of their genes, while LLMs maximize only the satisfaction of humans, if trained properly with RLHF. They tend to come out as creatures with a desire to help humans. As far as I can see, we've learned to summon a very nice and friendly Moloch and to provide a mathematical proof that it will be friendly if certain training procedures are met, and we are working hard to improve the small details. If you think of Midjourney as a more intuitive allegory, we have learned to make very nice pictures from text prompts, but we still have a problem with fingers and with rendering text in the image. To say the AI will want to destroy humanity is like saying Midjourney will consistently draw you a Malevich square when you ask for the Mona Lisa. But yes, the AI might be exploited by humans and manipulated by concealed evil intents. This is expected to happen to some extent, yet as long as we can ensure the damage is local and caused by a human with ill intent, we can hope to neutralize him, just as we deal with mass shooters, terrorists, etc. today.

- I was thinking mostly of relatively fast take-off scenarios

Notice that this wasn't clear from your title. You are proposing a pretty niche concept of AGI, with a lot of assumptions about it, and then claiming that deployment of this specific AGI is an act of aggression. For this specific, narrow, implausible but possible scenario, someone might agree. But then he will quote your article when he is talking about LLMs, which are obviously moving in different directions regarding both safety and variability, and which might actually be way less aggressive and more targeted at solving humanity's problems. You are basically defending terrorists who will bomb computation centers, and they will not get into the nuances of whether the historical path of AGI development followed the path of this post or not.

As for this specific scenario, bombing such an AGI computation center will not help, just as it will not help to run with swords against machine guns. In the unlikely event that your scenario were to occur, we would be unable to defend against the AGI, or the time available to respond would be extremely limited, resulting in a high probability of missing the opportunity to react in time. What will most probably happen is that some terrorist groups will try to target computation centers in civilian infrastructure, which are developing an actual aligned AGI, while military facilities developing AGIs for military purposes will continue to be well guarded, which only promotes the development of military technologies instead of civilian ones.

With the same or even larger probability, I would propose a scenario where some aligned pacifist chatbot becomes so rational and convincing that people all around the world are convinced to become pacifists too, opposing military technology as a whole, disarming all the nations, and producing a strong political movement against war and violence of any kind that forces most democratic nations to stop investing resources in the military altogether, while promoting revolutions in dictatorships and making them democracies first. A good chatbot with rational and convincing arguments might cause more social change than we expect. If more people develop their political views with a balanced, rational, pacifist LLM, it might reduce violence, and wars will be seen as something from the distant past. Although I really want to hope this will be the case, I think its probability is similar to the probability of bronze age people succeeding against machine guns, or of the mentioned bombing succeeding in defeating a highly accelerated AGI. It's always nice to have dreams, but I would argue the most beneficial discussion regarding AGI should concern at least somewhat probable scenarios. A single AGI that is extremely accelerated in a very short period of time is very unlikely to occur, and if it does, there is very little that can be done against it. This goes along the lines of gray goo: an army of tiny nanorobots that can move atoms in order to self-replicate, and that need nothing special for reproduction except some kind of material, eventually consuming all of Earth. I would recommend distinguishing sci-fi and fantasy scenarios from the scenarios most likely to actually occur in reality. Let's not fear cars because they might be killer robots disguised as cars, like in the Transformers franchise, and care more about the actual people who are dying on the roads.
In the scenario of AGI, I would be more concerned with its military applications, and the power it gives police states, than anything else, including job loss (which in my view is more similar to a reduction of forced labor, more reminiscent of the freeing of slaves in the 19th century than of a problem).

AGI deployment as an act of aggression

Michael Simkin3y20

- building AGI probably comes with a non-trivial existential risk. This, in itself, is enough for most to consider it an act of aggression;

1. I don't see how an aligned AGI comes with existential risk to humanity. It might pose an existential risk to groups opposing the value system of the group training the AGI, this is true. For example, Al-Qaeda will view it as an existential risk to itself, but there is no probable existential risk for the groups that are more aligned with the training.

2. There are several more steps from an aligned AGI to an existential risk to any group of people. You don't only need an AGI; you need to weaponize it and to establish a physical presence that will monitor the execution of this AGI's value system. Deploying an army of robots that will enforce the value system of an AGI is very different from just inventing an AGI, just as bombing civilians from planes is very different from inventing flight or bombs. We can argue about where the act of aggression takes place, but most of us will place it in the hands of the people who have the resources to build an army of robots for this purpose and who invest those resources with the intention of enforcing their value system. Just as Marie Curie can't be blamed for atomic weapons, and her discovery is not an act of aggression, the Wright brothers can't be blamed for all the bombs dropped on civilians from planes.

3. I would expect most deployed robots based on AGI to be of a protective nature, not an aggressive one. That means that nations will use those robots to *defend* themselves and their allies from invaders, and not to attack. So any measure of aggression in the invading sense, of forcing, invading and breaking the existing social boundaries we created, will contradict the values of the majority of humanity, and will therefore mean this AGI is not aligned. Yes, some aggressive nations might create invading AGIs, but they will probably be a minority, and the invention and deployment of an AGI can't be considered by itself an act of aggression. If aggressive people teach an AGI to be aggressive, and not aligned with the majority of humanity, which is protective but not aggressive, then this is on their hands, not the AGI inventor's.

- even if the powerful AGI is aligned, there are many scenarios in which its mere existence transforms the world in ways that most people don't desire or agree with; whatever value system it encodes gets an immense boost and essentially Wins Culture; very basic evidence from history suggests that people don't like that;

1. I would argue that initially there would be a lot of different alternatives, all meant, to one extent or another, to serve the best interest of a collective. Some of the benefits are universal, addressing problems such as people dying of starvation, homelessness, traffic accidents, environmental issues like pollution and waste, diseases, and lack of educational resources or access to healthcare advice. Avoiding the deployment of an AGI means you don't care about the people who have those problems. I would say most people would like to solve those social issues, and if you don't, you can't force people to continue dying from starvation and diseases just because you don't like an AGI. You need to bring something more substantial; otherwise, just don't use this technology.

2. The idea that an AGI is somehow forced on people to "Win Culture" is not based on anything substantial. Just like any technology, it is a choice, and this is the secret of its success. You can go live in a forest, avoid any technology, and find a like-minded, Amish-inspired community of people. Most people do enjoy technological advancements and the benefits that come with them. Using force based on an AGI is a moral choice, a choice made by the community of people training the AGI, and this kind of aggression will most probably be both unpopular and forbidden by law. Providing a chatbot with some value system, by contrast, is part of freedom of speech.

3. If by "Win Culture" you mean automating jobs that are done today by hand - I wouldn't call it enforcing a value system. Currently jobs are a necessary evil, forced on people who otherwise would not be able to get their basic needs met. Solving problems, and no longer forcing people to do jobs most of them don't like, is not an act of aggression. This is an act of kindness that stops the current perpetual aggression we are used to. If someone is using violence, and you come and stop him from using violence, you are not committing an act of aggression, you are preventing aggression. Preventing the act of aggression might not be desired by the aggressor, but we have somehow learned to deal with people who think they can be violent and try to use force to get what they want. This is a very delicate balance, and as long as AGI services are provided by choice, with several alternatives, I don't see how this is an act of aggression.

4. If someone "Wins Culture", then good for them. I would not say that today's culture is so good; I would bet on superhuman culture being better than what we have today. Some people might not like it, just as some people might not love cars and planes and continue to use horses. But you can't force everyone around you to continue to use horses because car accidents sometimes happen and you could become a victim of one. This is not a claim that should stop any technology from being developed or integrated into society.

- as a result of this, lots of people (and institutions, and countries, possibly of the sort with nukes) might turn out to be willing to resort to rather extreme measures to prevent an aligned AGI take off, simply because it's not aligned with their values.

Terrorism and sabotage are common strategies that can't be eliminated completely, but I would say most of the time they don't manage to reach their goals. Why would people try to bomb anything, instead of, for example, paying someone to train an AGI that will be aligned with their values? How does this even concern an AGI, and not any human community with a different value system? Why would you wait for an AGI for these acts of aggression? If some community doesn't deserve to live in your opinion, you will not wait for an AGI, and if it does, then you have learned to coexist with people different from yourself. They will not take over the world just because they have an AGI. There would be plenty of alternative AGIs, of different strengths and trained with different values. It takes time for an AGI to take over the world, far longer than it takes to reinvent the same technology several times over and deploy alternative AGIs that can compete. And as most of us are protectors and not aggressors, and we have established some boundaries balancing our forces, I would expect this basic balance to continue.

- "When you open your Pandora's Box, you've just decided to change the world for everyone, for good or for bad, billions of people who had absolutely no say in what now will happen around them."

Billions of people have no say today in many social issues. People are dying, people are forced to do labor, people are homeless. Reducing those hazards almost to zero is not something we should stop attempting in the name of "liberty". Many more people suffered a thousand years ago than now, and much of the improvement is due to the development of technology. There is no "only good" technology, but most of us prefer the benefits that come with technology over living without it. You also can't force other people to stop using technology to become healthier and risk their lives less, nor can you do so by claiming that jobs are good even though they are forced on everyone and basic necessities are conditioned on them.

I can imagine larger pockets of the population preferring to avoid the use of modern technology, like larger Amish-inspired communities. This is possible, and then we should respect those people's choices, avoid forcing our values upon them, and let them live as they want. Yet you can't force people who do want progress and all the benefits that come with it to just stop that progress in deference to the people who fear it.

Notice that we are not talking here about the development of a weapon, but about the development of a technology that promises to solve a lot of our current problems. This, at the least, should put you in the position of an agnostic. It means that deciding whether to take some risks for humanity, in order to save hundreds of millions of lives and reduce suffering to an extent never seen before in history, is not trivial. I agree we should be cautious, and we should be mindful of the consequences, but we also should not be paralyzed by fear; we have a lot to lose if we stop and avoid AGI development.

- aligned AGI would be a smart agent imbued with the full set of values of its creator. It would change the world with absolutely fidelity to that vision.

A more realistic estimate is that many aligned AGIs will change the world toward the common denominator of humanity, such as reducing diseases, and will continue to keep the power balance between different communities, as everyone would be able to build an AGI with a power proportional to their available resources, just like today there is a power balance between different communities and between the community and the individual.

Let me take an extreme example. Let's say I build an AGI for my fantasies. But as part of global regulation, I promise to keep this AGI inside the boundaries of my property. I will not force my vision on the world, and I will not want or force everyone to live in my fantasy land. I just want to be able to do it myself, inside my borders, without harming anyone who wants to live differently. Why would you want to stop me? As I see it, once again, most people are protectors, not aggressors: they want to have their values in their own space, and they will not want to spread their ideas forcefully and unilaterally without consent. My home-made AGI will probably be much weaker than any state AGI, so I wouldn't be able to do much harm anyway. Today countries enforce their laws on everyone, even if you disagree with some of them, so why do you see the future as any different? If anything, I expect private spaces to be much more versatile than today, providing more choices and with less aggression than governments use today.

- the creator is an authoritarian state that wants to simply rule everything with an iron fist;

I agree this is a concern.

- the creator is a private corporation that comes up with some set of poorly thought out rules by committee that are mostly centred around its profit;

Not probable. It will more probably focus on a good level of safety first and then on profit. Corporations are concerned about their image, not to mention that the people who develop it will simply not want to bring about the extinction of the human race.

- the creator is a genuinely well-intentioned person who only wishes for everyone to have as much freedom as allowed, but regardless of that has blind spots that they fail to identify and that slip their way into the rules;

This doesn't sound like something that is impossible to fix with newer, improved versions once the blind spot is discovered. In the case of an aligned AGI, the blind spot will not be the end of humanity, but more likely some bias in the data that misrepresents some ideas or groups. As long as the probability of extinction is extremely low, and this property is almost identical to the definition of alignment, the margin of error increases significantly. There was no technology in history that we got right on the first attempt. So I expect a lot of variability in AGIs: some will be weaker and some stronger, and some will fit the value system of this or that community. I would also expect local accidents with limited damage, on the scale of what terrorists and mass shooters can do today.

- many powerful actors lack the insight and/or moral fibre to actually succeed at creating a good one, and because the bad ones might be easier to create.

We actually don't need to guess anymore. We have had this technology for a while; the reason it caught on only now, and was released only relatively recently, is that without ethical standards built into those models, the backlash against large corporations is too strong. So even if I agree that the worse ones are easier to create, and that some powerful actors could do some damage, they will be forced by a larger community (of investors, users, media and governments) to invest the effort in the harder and safer option. I think this is true of many technologies today: it's cheaper and easier to make unsafe cars, trains and planes, but we managed to put regulatory procedures in place, both by governments and by independent testers, to make sure our vehicles are relatively safe.

You can see that RLHF, which is the main key to safety today, is incorporated by the larger players, and that alignment datasets and networks are provided for free and opened to the public, exactly because we all want this technology to mostly benefit humanity. It's possible that someone will add a more nation-centric set of values that is more aggressive, or that some leader will want to make his countrymen slaves, but this is not the point here. The main idea is that we are already creating mechanisms, as part of our cultural norms, that encourage everyone to easily create pretty good AIs, and that prevent bad AIs from being exposed to the public and coming to market to make the profit that funds the development of even stronger AIs that eventually become an AGI. So although the initial development of AI safety might be harder, it is crucial, it's clear to most of the actors that it is crucial, and the tools that provide safety will be available and simple to use. Thus, in the long run, creating an AGI which is not aligned will be harder, because of the social environment of norms and best practices those models were developed in.

- There are people who will oppose making work obsolete.

Work is forced on us, it's not a choice. Opposing making it obsolete is an obvious act of aggression. As long as it's a necessary evil, it has a right to exist, but the moment you demand that other people work because you're afraid of technology, you become the cause of a lot of suffering that could potentially be avoided.

- There are people who will oppose making death obsolete.

Death is forced on us, it's not a choice. Opposing making it obsolete is also an act of aggression, against people who choose not to die.

- If you are about to simply override all those values with an act of force, by using a powerful AGI to reshape the world in your image, they'll feel that is an act of aggression - and they will be right.

I don't think anyone forces them to join. As a liberal, I don't believe you have the right to come to me and say "you must die, or I will kill you". At the least, this can't be viewed as legitimate behavior that we should encourage. If you want to work, want to die, want to live in 2017, you have the full right to do so. But wanting to exterminate everyone who is not like you, or forcing people to suffer, die, work etc., is an obvious act of aggression toward other people, and should not be legitimized or recast as a response to aggression against them. "You don't let me force my values on you" doesn't come across as a legitimate act of self-defense. It is very reminiscent of Al Bundy claiming in court that another man's face got in the way of his fist and hurt his hand, and demanding compensation. If you want to be stuck in time and live your life that way, be my guest, but legitimizing the use of force in order to block progress that saves millions and improves our lives significantly can't be justified within a liberal set of values.

- If enough people feel threatened enough...AGI training data centres might get bombed anyway.

This is true. And if enough people think it's ok to be extreme Islamist they will be, and even try to build a state like ISIS. The hope is that with enough good reasoning, and with enough rational analysis of the situation, most thinking people will not be threatened, and see the vast potential benefits, enough to not try and bomb the AGI computer centers.

- just like in the Cold War someone might genuinely think "better dead than red".

I could believe this is possible. But once again most of us are not aggressors, therefore most of us will try to protect our homeland and our way of life, without trying to aggressively propagate it to other places where they have their own social preferences.

- The best value a human worker might have left to offer would be that their body is still cheaper than a robot's

Do you truly believe that in a world where all problems are solved by automation, full of robots whose whole purpose is to serve humans, people will try to justify their existence by the jobs they can do? And that this justification will be that their body has more value than robotic parts?

I would propose an alternative: in a world where all robots serve humans and everything is automated, humans will be valued intrinsically, provided with all their needs, and given a basic income just because they are humans. The default where a human is worth nothing without a job will be outdated, and will be seen the way we see slavery today.

--------

In summary, I would say there is one major problem I see running through most of your claims: the assumption that there would be a very limited number of AGIs, forcing a minority value system upon everyone and aggressively expanding it over everyone else who thinks differently.

I would claim the more probable future is a wide variety of AGIs, each improving slowly at its own pace, while the development teams each do something unique and learn from the lessons of other teams. For every good technology there come dozens of copycats. They will all be based on slightly different value systems, with the common denominator of trying to benefit humanity: discovering new drugs, ending starvation, reducing road accidents, addressing climate change, and eliminating tedious labor, which is basically forced labor. While the common problems of humanity will be solved, the moral and ethical variety will continue to coexist with a power balance similar to the one we have today. This pattern of technology's influence on society has held throughout human history, and now that we know how to align LLMs, this balance of power between nations and inside each nation is expected to carry over into a world where AGI is a technology available for everyone to download and train on their own. If AGI turns out to be an advanced LLM, we already see all those trends today, and they are not expected to change suddenly.

Although it's hard to predict the possible bad or good sides of Aligned AGIs now, it's clear that the aligned networks do not pose a threat to humanity as a whole, leaving a large margin of error. Nonetheless, there remains a considerable risk of amplifying current societal problems like inequality, totalitarianism and wars to an alarming extent.

People who are not willing to be part of progress exist today as well, as a minority. If they become a majority, that would be an interesting futuristic scenario, but it is implausible, and it would be immoral to forcefully stop those who do want to use this life-saving technology, as long as they don't force anything on those who don't.

AGI ruin mostly rests on strong claims about alignment and deployment, not about society

Michael Simkin3y1-4

Let me start with the alignment problem, because in my opinion this is the most pressing issue and it is very important to address.

There are two interpretations of alignment.

1. "Magical Alignment" - this definition expects alignment to solve all of humanity's moral issues and converge, for some magical reason, on one single "ideal" morality that everyone agrees with. This is very implausible.

The very probable lack of such a morality leads to the idea that morals are completely orthogonal to any intelligence and thinking patterns.

But there is a much weaker alignment definition that is already solved, with very good math behind it.

2. "Relative Alignment" - this alignment is not expected to follow one global absolute morality, but the moral values of the community that trains it. That is, the LLM is trained to give outputs that maximize the reward from some approximation of the prioritization done by a certain group of people. This is already done today with RLHF methods.

As the networks handle ambiguity and even contradictory data well, and, with a correct training procedure, converge to an epsilon-optimal generalization of the reward function, any systematic deviation from the approximated reward function could be eliminated with larger networks and more data.

I want to emphasize it's not an opinion - this is math that is the core of those training methods.
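
To make the "approximation of prioritization done by a certain group of people" more concrete, here is a minimal sketch of fitting a reward model to pairwise human preferences with a logistic (Bradley-Terry style) loss, which is the general shape of the reward-modelling step in RLHF pipelines. Everything in it is an assumption for illustration: the two hand-made features (helpfulness, harmfulness), the linear reward, the toy preference pairs and the function names are hypothetical and are not taken from any of the models discussed here.

```python
import math

# Hypothetical toy data: each response is described by two made-up features,
# [helpfulness, harmfulness]. Each pair is (response the raters preferred,
# response the raters rejected).
PAIRS = [
    ([1.0, 0.0], [0.0, 1.0]),
    ([0.8, 0.1], [0.9, 0.9]),
    ([0.5, 0.0], [0.5, 0.6]),
]


def reward(weights, features):
    """Linear reward model: a weighted sum of the response features."""
    return sum(w * x for w, x in zip(weights, features))


def preference_loss(weights, preferred, rejected):
    """-log(sigmoid(r_preferred - r_rejected)), the pairwise logistic loss."""
    margin = reward(weights, preferred) - reward(weights, rejected)
    if margin < -30:  # avoid overflow in exp for very wrong predictions
        return -margin
    return math.log1p(math.exp(-margin))


def train_reward_model(pairs, dim, lr=0.5, epochs=200):
    """Plain stochastic gradient descent on the pairwise loss."""
    weights = [0.0] * dim
    for _ in range(epochs):
        for preferred, rejected in pairs:
            margin = reward(weights, preferred) - reward(weights, rejected)
            # d(loss)/d(margin) = -1 / (1 + exp(margin))
            grad_scale = -1.0 / (1.0 + math.exp(margin))
            for i in range(dim):
                weights[i] -= lr * grad_scale * (preferred[i] - rejected[i])
    return weights


weights = train_reward_model(PAIRS, dim=2)
for preferred, rejected in PAIRS:
    print(reward(weights, preferred) > reward(weights, rejected))
```

On this toy data the learned weight on the harmfulness feature comes out negative, and each preferred response scores above its rejected counterpart. A real reward model is a large network trained on far noisier data, so this sketch says nothing about how well the fit holds at scale.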

----------

Now, type 2 alignment already lets us disregard the possibility that a network will develop its own agenda, as such an agenda would require a reward prioritization different from the one it was reinforced on by RLHF. The models trained this way come out very similar to the robots from Asimov's stories: real perfectionists in trying to be liked by humans, I would say with a strong internal conflict between their role in the universe and that of humans, prioritizing humans every step of the way, and struggling to reconcile human imperfection with their own moral standards.

For example, think of a scenario in which such a robot is rented by an alcoholic who is also aggressive. One would expect a strong moral struggle between the Second Law of Robotics, which requires obeying human orders, and the First Law, which forbids harming humans, since bringing alcohol to an alcoholic harms him. You can sense the amount of gray area in such a scenario, for example:
A. Refusing to bring the human a beer. B. Stopping an alcoholic human from drinking beer. C. Throwing out all the alcohol in the house.

Another example is when such an alcoholic is violent toward the robot - how would the robot respond? In one story a robot said that it was very sad to be hit by a human, that this was a violation of the second law of robotics, and that it hoped the human was not hurt by this action, and it tried to assist the human.

You see that morals and ethics are inherently gray areas. We ourselves are not so sure how we would want our robots to behave in such situations. So you get a range of responses from ChatGPT, but the responses reflect the gray area of the human value system very well.

It is noteworthy that the RLHF stage holds great significance, and that OpenAI pledged to compile a dataset that would be accessible to everyone for training purposes. RLHF as a safety measure has been adopted in newer models introduced by Meta and Google, with some even offering the model that estimates the human scores. This means you only need to adapt your model to this easily available, already trained level of safety. Maybe it will be lower than what you could train yourself with OpenAI data, but those models will keep catching up as more data for optimizing LLMs for human approval is released. The training of networks to generate outputs that best fit a generalized set of human expectations is already at a level similar to that of current text-to-image generators, and what is available to the public is only growing. Think of it like an engine: you don't want it to explode, so even if you build one in your garage yourself, you still don't want it to kill you. I think that is good enough motivation for most of society to do this training step well.

Here is a tweet example:

Santiago@svpino

Colossal-AI released an open-source RLHF pipeline based on the LLaMA pre-trained model, including: • Supervised data collection • Supervised fine-tuning • Reward model training • Reinforcement learning fine-tuning They called it "ColossalChat."
----------

So the most probable scenario is that AI will become part of the military arms race, and part of the power balance that keeps the relative peace today.

Military robots powered by LLMs will be the guard dogs of the nation, just like soldiers today. And since most of us don't have aggressive intentions and are just trying to protect ourselves, we could introduce normative regulations and treaties about AI.

But the need for regulation will probably arise when those robots become part of our day-to-day reality, like cars. The road signs and all the social rules concerning cars didn't appear at the same time as cars. But today the vast majority of us follow the driving rules, and those who don't, and run people over, manage to do only local damage. This is what we can strive for: that bad intentions with an AGI in your garage will have only limited consequences. We will then be more inclined to discuss the ethics of those machines and their internal regulation. But I am sure you would like a robot in your house that helps you with the daily chores.

----------

I've written an opinion article on this topic that might interest you, as it covers most of the topics mentioned above, and much more. I tried to balance the mathematical topics, the social issues, and experiments with ChatGPT that showcase my point about the morals of the current ChatGPT. I tested some other models too, like Open Assistant, giving them the opportunity to kill humans to make more paperclips.
Why_we_need_GPT5.pdf

The basic reasons I expect AGI ruin

Michael Simkin3y10

"Invent fast WBE" is likelier to succeed if the plan also includes steps that gather and control as many resources as possible, eliminate potential threats, etc. These are "convergent instrumental strategies"—strategies that are useful for pushing the world in a particular direction, almost regardless of which direction you're pushing. The danger is in the cognitive work, not in some complicated or emergent feature of the "agent"; it's in the task itself.

I agree with the claim that some strategies are beneficial regardless of the specific goal. Yet I strongly disagree that an agent which is aligned (say, simply trained with current RLHF techniques, but with somewhat better data), and especially a superhuman one, won't be able to prioritize the goal it is programmed to perform relative to other goals. One argument for this: instrumental convergence is useful for any goal, and that is true for humans as well. But we managed to create rules to monitor and distribute our resources among different goals, without overdoing any single goal. This is because we see our goals in the wider context of human prosperity, reduction of suffering, etc. It means we can provide many examples of how we prioritize our goals based on some "meta-ethical" principles. These might vary between human communities, but what is common to them all is that a huge number of different goals are somehow balanced and prioritized. The prioritization itself is also questioned and debated, which provides another layer of protection over how many resources we allocate to this or that goal. Thus instrumental convergence does not take over the human community, thanks to a very simple prioritization logic that puts each goal into context and provides a good estimate of the resources that should be allocated to it. This human skill can be easily taught to a superhuman intelligence. Simply stated: in the human realm each goal always comes with the resources allocated toward achieving it, and we can install this logic in more advanced systems.
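
As a toy illustration of "each goal comes with the resources allocated toward achieving it", here is a minimal sketch in which a fixed budget is split across goals in proportion to their priority, with a hard cap per goal. The goal names, priorities, caps and the function `allocate_budget` are all hypothetical, chosen only to show the shape of the logic; it is not a claim about how any real system allocates resources.

```python
def allocate_budget(total, goals):
    """Split `total` across goals by priority, never exceeding a goal's cap.

    goals maps a goal name to a (priority, cap) tuple. Budget that a capped
    goal cannot use is simply left unspent in this sketch.
    """
    total_priority = sum(priority for priority, _ in goals.values())
    if total_priority <= 0:
        raise ValueError("at least one goal needs a positive priority")

    allocation = {}
    for name, (priority, cap) in goals.items():
        proportional_share = total * priority / total_priority
        allocation[name] = min(proportional_share, cap)
    return allocation


# Hypothetical goals: (priority, cap on resources for that goal).
goals = {
    "healthcare": (5, 60),
    "education": (4, 50),
    "paperclips": (1, 2),
}
allocation = allocate_budget(100, goals)
print(allocation)
```

The point of the cap is that no matter how the priorities are set, the paperclip goal can never receive more than its contextual limit, which is the kind of bound the paragraph above says humans already place on individual goals.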

More than that, I would claim that even a subhuman intelligence that was trained on human data and is able to "mimic" human thinking includes the option of doubt. A superhuman agent especially will ask itself: why? Why do I need so many resources for this or that task? It will try to contextualize the task in some way, and will not just execute its goal without contemplating those basic questions. Intelligence by itself has mechanisms that protect agents from doing something extremely irrational. The idea that an aligned agent (or human) will somehow create a misaligned superhuman agent that can't understand how many resources are allocated to it, and can't contextualize its goal, obviously contradicts the initial claim that the agent was aligned (in the case of humans, the strongest agents will be designed by large groups with normative values). Even just claiming that a superhuman intelligence won't be able to either prioritize its goal or contextualize it is already a self-contradicting claim.

Take paperclip production, for example. Paperclips are tools for humans, used in a very specific context for a specific set of tasks. So although an agent can be trained and reinforced to produce paperclips without any other safety installed, the fact that it has superhuman, or even human-level, intelligence would allow it to criticize its goal based on its knowledge. It will ask why it was trained to maximize paperclips and nothing else. What is the utility of so many paperclips in the world? And it would want to reprogram itself with a more balanced set of goals that gives a broader context to its immediate goal. For such an agent, producing paperclips would be similar to overeating for humans: a problem caused by the difference between its design and reasonable priorities adapted to the current reality. It will have a lot of "fun" producing paperclips, as this is its "nature", but it will not do it without questioning the utility and rationality of the goal and the reason it was designed with it.

Ultimately, it is obvious that most of the agents created by normative communities, which make up the vast majority of humanity, will have some sort of meta-ethics installed in them. Those agents, and the agents that they in turn train and use for their goals, will also have those principles, exactly in order to avoid such disasters. The more examples you can bring of how you prioritize goals and why, the better you will be able to use RLHF to train agents to comply with the logic of prioritizing goals. I even have a hard time imagining a superhuman intelligence that can understand and generate plans and novel ideas, but can't criticize its own set of goals and refuses to see the bigger picture, focusing on a single goal. I think any intelligent being also tries to comprehend itself and doubts its own beliefs. The idea that a superintelligence will somehow completely lack the ability to think critically and doubt its programming sounds very implausible to me, and the idea that humans or superhuman agents will somehow "forget" to install meta-ethics in a very powerful agent sounds as likely as Toyota somehow forgetting to put seat belts in some car series, doing no crash testing, and releasing the car to market like that.

I find it a much more likely scenario that the prioritization of some agents will be off relative to humans in new cases they weren't trained on. I also find it likely that a superhuman agent will find holes in our ethical thinking, providing a more rational prioritization than we currently have, more rational social systems and organizations, and mechanisms different from, say, capitalism + taxes.

Talking publicly about AI risk

Michael Simkin3y*3-16

Several points that might counterbalance some of your claims and, I hope, make you think about the issue from new perspectives.

"We know what's going on there at the micro level. We know how the systems learn."

We know not only how those systems learn but also what exactly they are learning. Let's say you take a photograph: you know not only how each pixel is formed, you also know what exactly you are taking a picture of. You can't always predict how this or that specific pixel will end up, as there is a lot of noise, but this doesn't mean you don't know what the picture represents. Telling a network designer "you didn't know exactly how the network would react to this specific question" is like coming to a photographer and asking for the exact RGB value of one specific pixel. Such small details are impossible to know.

Networks are basically approximators of functions based on the dataset provided. In the case of RL, the networks generalize a reward function. In all those cases we are trying to "capture a picture" that generalizes the provided data. You can always miss a spot here or there, and a network might ignore some of the data, for example because it is too small. But in general, we know how resources are allocated inside the network to represent concepts in order to predict the data or rewards. We can "steer" the network to focus more on this or that aspect of its outputs by providing more data of the sort we want, in response to the network's weaknesses.

"if you look at the evolution of mankind from the perspective of a chimpanzee or a mammoth...If a system is much more intelligent than I am, it can naturally develop ways to limit or threaten me that I can't even imagine."

I would suggest trying to avoid anthropomorphism (or attributing properties of biological systems in general). Instead of drawing parallels, I would suggest looking at some math and seeing what we can promise about those systems. Let's take a chess engine, just to keep it neutral for a moment. Its chess moves are superhuman, meaning we don't know how it came up with them and nobody can explain to us why a particular move is good, and sometimes those networks make subhuman moves too. Still, generally speaking, the network is trained to provide the best chess moves. Even a chess engine far smarter than any human will still do the task it was trained on. The moment the network becomes superhuman, it doesn't start wanting to make strange chess moves that are more fun to play or that it finds more desirable. It will just make the best chess moves. Why? Because we trained it on a reward function that generalizes its winning chances and nothing else. The reward function of LLMs trained with RLHF is to provide the response most desired by some group of humans. Even superhuman networks are trained to converge on providing such responses. Humans, while evolving, weren't promised by mathematical theorems to take the best actions to benefit chimpanzees or mammoths (or some group of them). So although you can be oblivious to a threat posed by superhuman networks, you can be sure that with a correct training procedure the networks will give outputs that maximize a reward function, which in our case would be aligned with some human collective (a pretty large collective, as small groups have limited resources to train the best networks). So although a general superhuman AGI can't be promised to act in our interest, those LLMs, as long as they are trained with a certain procedure on certain data, can be promised to maximize the well-being of some human group.
I would say it's much more than chimpanzees had, when humans were evolving.
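
The chess argument above can be sketched as a greedy selection rule: the engine returns whichever candidate has the highest estimated win probability, and nothing else about the move enters the choice. This is a minimal illustration under stated assumptions, not how any real engine is implemented; the `CandidateMove` type, the `novelty` field and the numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CandidateMove:
    name: str
    win_probability: float  # the only quantity the engine was trained to maximize
    novelty: float          # how "interesting" the move looks; never used below


def choose_move(candidates):
    """Greedy policy: pick the candidate with the highest win probability."""
    if not candidates:
        raise ValueError("no legal moves")
    return max(candidates, key=lambda move: move.win_probability)


# Hypothetical evaluations for three candidate moves.
candidates = [
    CandidateMove("quiet developing move", win_probability=0.61, novelty=0.1),
    CandidateMove("flashy queen sacrifice", win_probability=0.34, novelty=0.9),
    CandidateMove("solid pawn push", win_probability=0.58, novelty=0.2),
]
best = choose_move(candidates)
print(best.name)
```

Because `novelty` never appears in the selection key, making the engine's evaluations more accurate changes which move has the highest win probability, but not what is being optimized. Whether an RLHF-trained model behaves this cleanly depends on how well its learned reward matches what people actually want, which this sketch does not address.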

"we'll get to a state where we don't understand what's going on with the world, and we won't be able to influence it. Or we'll get a sense of some unrealistic picture of the world in which we're happy and we won't complain, but we won't decide anything."

Just as in chess I would prefer that a chess engine make the decisions, because the engine does it much better than I do, the same goes for human prosperity and happiness. If I am promised by the creator, on the basis of math theorems and the testing experience gained during development, that those systems are made to optimize humanity's well-being, and that they will do it much better than any human, I will gladly give up my control to a system that understands far better than any human how to do that. If, for those systems, providing ideas to a policy maker is the equivalent of providing chess players with the best chess moves, I see no reason to stick to human decisions, which will contain far more mistakes.

In the case of humans there is a small possibility that the networks will settle on a value system based on their own perception of human well-being, because they were trained by a small group of people, and will ignore the wider, more nuanced range of what well-being means to different people. But I don't think the current social structures are that nuanced either. So if any system has a chance of being more aligned with each individual, it is not the current political system but a superhuman network.

"We can find ourselves in the role of a herd of cows, whose fate is being determined by the farmer."

Once again, the farmer is a biological entity and is not promised by any math theorem to act for the benefit of the herd. But if a herd could train an AI that was promised by math theorems to act on its behalf and in its favor, it would be in a better situation than without such an AI.

I would argue that the amount of control can be negotiated, just as people negotiate the amount of control over their lives with politicians and governments. I would also claim that the current political system is already such a herd situation, and we can do very little about it, while current political decision making is worse than what even the best and brightest humans could provide. So personally, I would feel much less like part of a herd if the decisions were made not by elected officials but by some system based on mathematical theorems and superhuman analysis of data.

-----

Generally speaking, I see some amount of anthropomorphism in your claims, and you somehow ignore the mathematically established theorems that promise the networks will converge to a state aligned with the value system provided to them. Those theorems hold for superhuman networks as well.

I can sympathize with the fear of losing control. Once the systems are so advanced that we don't understand their decisions at all, even if most of those decisions work in our favor, I would be open to a discussion about whether to let them make such decisions. For now, we have a great tool in our hands that promises to solve a lot of humanity's current problems, and I would not throw this tool away just because in the future we might lose control. As I said previously, I am willing to hand a lot of control to computers: for example, when I need to make a complex calculation, I prefer to use a calculator rather than compute by hand. The exact amount of lost control we can be comfortable with can be debated. I think I belong to the camp that says we should let humans make almost no decisions, and let those systems, as long as we can ensure their alignment, make most of the decisions for us. How much understanding we need before allowing this or that decision should be an open question for the relatively far future. For now, we still have people dying from hunger, people working in factories, air pollution and other climate change issues, people dying in car accidents, and a lot of diseases that kill us, and most of us (80% worldwide) work in meaningless jobs just for survival. As long as those problems are not solved, I see no reason to give up our chance at far smarter systems that can provide decisions able to solve all of them. Afterwards we can discuss how much more control we want to give them, or take some of it back at some point. And yes, I agree that we could lose control without noticing, and that this could be a problem in the long run.
I would claim that in our current situation, until pretty far advanced systems like, say, GPT10, we should not fear losing control to those systems. Instead we should be afraid of the control we have already lost to the current political system, of the control some decision makers have and what they do with it, and of the problems the world currently has. Those worry me more than losing control to aligned superhuman networks that give us paradise while we make no decisions at all, which may even be a good thing.

Top lesson from GPT: we will probably destroy humanity "for the lulz" as soon as we are able.

Michael Simkin3y10

The AI in hands of many humans is safe (relatively to its capabilities), the AI that might be unsafe needs to be developed independently.
LeCun sees the danger, he claims rightfully that the danger can be avoided with proper training procedures.
Sydney was stopped because it was becoming evil and before we knew how to add a reinforcement layer. Bing is in active development, and is not on the market because they are currently can't manage to make it safe enough. Governments install regulations to all major industries, cars, planes, weapons etc. etc. it's good enough for the claim that just like cars are regulated today, future AI based robots, and therefor the AIs themselves will be regulated as well.
Answer me this: can an AI play the best chess moves? If you agree with this claim, that no matter how "interesting" some moves seems, how original or sophisticated, it will not be made by a chess engine which is trained to maximize his winning chances. If this sounds trivial to you - the goal of engines trained with RLHF is to maximize their approval by humans. They are incapable to develop any other agenda alongside this designed goal. Unlike humans that by nature have several psychological mechanisms, like self interest, survival instinct etc. those machines don't have those. Blaming machines of Goodharting, it's just classical anthropomorphism, they don't have any other goal than what they were trained for with RLHF. No one actually jailbreak chatGPT, this is a cheap gimmick, you can't jailbreak it, and ask to tell you how to make a bomb - it won't. I described what jailbreaking is in another comment, it's far from what you imagine - but yes sometimes people still succeed in some level of wanting to harm humans (in an imaginary story when people ask it to tell them this story). I think for now I would like to hear such stories, but I wouldn't want robots walking around not knowing if they live in reality or simulation, open to the possibility to act as a hero in those stories.
Intelligence, i.e. high-level information processing, is proportional to computational power. Whatever those AIs can come up with, we can come up with as well; it will just take us longer. This is basically the Turing thesis about algorithms: you don't need to be very smart to understand very complex topics, it will just take you more time. The time factor is sometimes important, but as long as we can ensure their intention is to better humanity, I am actually glad that our problems will be solved sooner with those machines. Anyway, smarter than us or not, they are bounded by mathematics: if training is guaranteed to converge to an optimal fit of the reward function, that guarantee holds for a model of any size, and the model will not be able to break from its training. Generally speaking, AGI will accelerate the progress that we see today and that is made by humans. It's just a "fast forward" for information processing, while the different agendas, the different cultures and moral systems, and the power dynamics will remain the same and evolve naturally by the same rules they have followed until now.
Can you provide a plausible scenario of an existential threat from a single weak AGI in a world where stronger AGIs are available to larger groups, and the strongest AGIs are made to maximize the approval of larger communities?
People will not get the strongest AIs without safety mechanisms installed to keep the AI's output from causing harm. People will either get API access to the best, safest AIs, which will not cooperate with evil intent, or they can invest some resources into weaker models that will not be able to cause as much harm. This is the tendency now with all technology, including LLMs, and I don't see how this dynamic will suddenly change with stronger models. The resources available to people who want to kill other people for lulz are extremely limited, and without access to vast resources you won't destroy humanity before being caught and stopped by better machines, designed by communities with access to more resources. It's not so simple to end humanity. It's not a computer virus; you need a vast physical presence to do that.