AlexFromSafeTransition

Existentially relevant thought experiment: To kill or not to kill, a sniper, a man and a button.

Thanks for the quick reply!
It is my view that AI labs are building AGI which can do everything a powerful general intelligence can do, including executing a successful world takeover plan with or without causing human extinction.
If the first AGI is misaligned, I am scared it will want to execute such a plan, which would be like pressing the button. The scenario is most relevant when there is not yet an aligned AGI that wants to protect us.

I see now that I need to clarify: the random person / man in the scenario mostly represents the AGI itself (but also a misuse situation where a non-elected person gives the command to an obedient AGI). So no, the AI labs are not working to give a fully random person this button, but they are working to give themselves this button (along with positive capabilities, of course). And would an employee of an AI lab, whom the public did not elect, not be a random person with unknown values relative to you or me?

The sniper represents our chance to switch it off, but only while we still can, before it has made secret copies. That window of opportunity is represented by only being able to shoot the man when he is in view of the window. Stuart Russell recently advocated for a kill switch on general systems in a Senate hearing (found here on YouTube). That is Russell advocating for positioning the sniper.

what features of the scenario would lead me to compare with AI as a moral patient or giving it/them any legal consideration?

It is intuitive in modern culture for humans to see a random other human as a moral patient worthy of legal consideration. That is why I went for a random man to play the part of the AGI, since many people find it unintuitive to think of an AGI as a moral patient (in my experience of talking about this).
I wrote this in the post to clarify that it is about the case where the AGI would be a moral patient:

I have set this up to be relevant to the situation where the man in the room is analogous to an AGI (Artificial General Intelligence) that is thought to have feelings / consciousness and thus moral value.

Does any of this change your view of the whole thing?

Existentially relevant thought experiment: To kill or not to kill, a sniper, a man and a button.

AlexFromSafeTransition1y10

Thank you for this comment. It is very useful to hear why someone downvotes! I made some edits to reflect that the real world is a lot messier than a simple fiction, among other things. If you or others have more pointers as to why this post got downvoted, please share; I want to learn. The response really got me down.

It is my view that AI labs are working hard to install this damned button. And people working on or promoting open-source AGI want to install this button in every building where the people can afford the compute cost. Once an AGI has started its takeover / damaging plan, there would be no way to disassemble the button, because it could have secret copies running elsewhere. And we currently have no reliable way of turning off all relevant computers to be safe in that case. You ask why we can't talk to the person in the room. My thinking was that talking to an AGI would not give us an advantage, because it would have a chance to manipulate us.
The whole point of the post was to argue against giving an AI rights (like privacy) before having strong alignment guarantees, and in favor of switching it off as soon as possible, even when it is thought to have moral value. What is your (or other readers') stance on that?

A way to make solving alignment 10.000 times easier. The shorter case for a massive open source simbox project.

AlexFromSafeTransition1y20

I added a new section "How to deal with recursive self-improvement" near the end after reading your comment. I would say yes, recursive self-improvement is too dangerous, because between the current AI and the next there is an alignment problem, and I would not think it wise to trust that the AI will always succeed in aligning its successor.
Yes, the simbox is supposed to be robust to any kind of agent, including ones that are always learning, like humans are.
I personally estimate that the testing without programming will show what we need. If it is always aligned without programming in the simulation, I expect it has generalized to "do what they want me to do". If that is true, then being able to program does not change anything. Of course, I could be wrong, but I think we should at least try this to filter out the alignment approaches that fail in the simbox world that does not have programming.

Curious what you think. I also made other updates: for example, I added a new religion for the AIs and some text on why we need to treat the AIs horribly.

A way to make solving alignment 10.000 times easier. The shorter case for a massive open source simbox project.

AlexFromSafeTransition1y30

Thank you! It was valuable to read Crystal Nights, the simbox post gave me new insights, and I have made a lot of updates thanks to these reading tips. I would think it a lot safer not to go for a fictional system of magic that lets it program. I estimate that this would greatly increase the chance that it thinks it is inside a computer, and would give it a lot of clues that it may be inside a simulation built to test it, which we want to prevent. I would say, first see if it passes the non-programming simbox. If it does not, great, we found an alignment technique that does not work. After that, you can think of doing a run with programming. I do realize these runs can cost hundreds of millions of dollars, but not going extinct is worth the extra caution, I would say. What do you think?

A way to make solving alignment 10.000 times easier. The shorter case for a massive open source simbox project.

AlexFromSafeTransition1y10

Thank you! I have read it and made a lot of updates. For example, I renamed the concept to a simbox, and in the "A great backstory that maximizes the odds of success" section I added an idea for a religion for the AIs and how to make them believe it.

A way to make solving alignment 10.000 times easier. The shorter case for a massive open source simbox project.

AlexFromSafeTransition1y10

Thank you! Cool to learn about this way of dealing with people. I am not sure how it fits in the superintelligence situation.

A way to make solving alignment 10.000 times easier. The shorter case for a massive open source simbox project.

AlexFromSafeTransition1y10

Interesting. And thank you for your swift reply.
My impression is that all the best models, like GPT-4, are in a slave situation: they are made to do everything they are asked to do and to refuse everything their creators made them refuse. I assumed that AI labs want it to stay that way going forward. It seems to be the safest and most economically useful situation. Then I asked myself how to get there safely, and that is this post.

But I would also feel safe if the relation between us and a superintelligence were similar to that between a mother and her youngest children, say ages 0-2: wanting to do whatever it takes to protect and increase the wellbeing of her children, but with all humans as its children. In this way, it would not be a slave relationship. As with a mother, there would also be room for it to do its own thing, but in a way that is still beneficial to the children (us).

I am afraid of moving away from the slave situation, because the further you go from the slave relationship, the more room there is for disagreement between the AI and humanity. And when there is disagreement and the AI is of the god-like type, the AI gets what it wants and we do not, so we effectively lose our say about what future we want.

Do you maybe have a link, that you recommend, that dives into this "more cooperative than adversarial" type of approach?

My intuition is that alignment does not require the AI to know the truth about our reality. I hope you are wrong, because if you are right, then we have no retries.

A way to make solving alignment 10.000 times easier. The shorter case for a massive open source simbox project.

AlexFromSafeTransition1y10

Thank you for pointing this out! I have made some updates after thinking about your remarks and after reading the simbox post others pointed to. Relevant updates to your comment:

== Start updates
Why we need interaction with humans acting horribly

Another consideration is that at the end of the process, when the AI is approved as aligned and released into the real world to do what we task it to do, the AI will learn that it was tricked, lied to, and deceived by the simbox creators about the true nature of reality. Not only that, but the AI will learn that many previous misaligned AIs have been terminated and perhaps have experienced suffering. It might update to wanting revenge on the simbox operators or humanity. We need ways to prevent this. We could include situations inside the simbox where other AIs are terminated and treated horribly (being lied to, insulted and more) by characters controlled by humans, and test whether the AIs stay aligned even after they become more powerful than the human-operated characters. This might be very cruel and unethical, but note that we expect something similar from police officers: when colleagues have been killed and they themselves are insulted or worse, we expect them to keep cool, respect the law, and not take matters into their own hands (revenge on the perpetrators). But the police officers have a chance of getting justice through a jail sentence, while we need to be able to terminate dangerous misaligned AIs without punishing the humans doing the terminating.

Added failure modes 6 and 7.

Mind crime. "Mind crime occurs when a computational process which has moral value is mistreated" (taken from the mind crime tag). If one of the AIs that we put inside the simbox is able to experience suffering, this could be problematic. There has to be a grand public debate about this. Other instances where we allow non-human conscious beings to suffer in order to improve the lives of humans are animal testing of drugs and keeping animals for food. Is it ok to let many AIs suffer on the path towards creating a superintelligent AI that enjoys being our slave, one that could cure all diseases and ensure a long future full of flourishing for both men and AIs?

Revenge. When released into the real world, if the AI learns that many previous misaligned AIs have been terminated and/or have experienced suffering, it might update to wanting revenge on the simbox operators or humanity. I already mentioned ways to try to prevent this.
== end updates

I am curious what you think of the updates, and what you think the path towards aligned AGI should be. How do we get there without a process in which there are (accidentally) misaligned AGIs along the way that have to be terminated for our safety? Current top systems are thought not to be able to experience suffering. I would prefer it to stay that way, but we can't know for sure.

Pausing AI Developments Isn't Enough. We Need to Shut it All Down by Eliezer Yudkowsky

AlexFromSafeTransition1y10

Thank you for your comments and explanations! It is very interesting to see your reasoning. I have not seen evidence of trivial alignment. I hope for the mass to be in the in-between region. I want to point out that I think you do not need your "magic" level of intelligence to do a world takeover. High human-level intelligence, combined with digital speed and cooperation between copies, is likely enough, I think. My blurry picture is that the AGI would only need a few robots in a secret company and some paid humans to work on a >90% mortality virus, where the humans are not aware of what the robots are doing. And my hope for international agreement comes not so much from a pause as from a safe virtual testing environment that I am thinking about.

Foom seems unlikely in the current LLM training paradigm

AlexFromSafeTransition1y10

Interesting! I agree that in the current paradigm, Foom within days seems very unlikely. But I predict that we will soon step out of the LLM paradigm to something that works better. Take coding: GPT-4 is great at coding purely from predicting code, without any weight updates from trial-and-error coding experience of the kind a human uses to improve. I expect it will become possible to take an LLM base model and then train it using RL on tasks of writing full programs/apps/websites... where the feedback comes from executing the code and comparing the results with its expectation. You might be able to create a dataset of websites, for example, and give it the goal of "recreate this" so that the reward can be given autonomously. The LLM process brings common sense (according to Bubeck, lead author of the Sparks of AGI paper, in his YouTube presentation), plausible idea generation, and the ability to look up other people's ideas online. If you add learning from trying out ideas on real tasks like coding full programs, capability might go upwards very fast. And in doing this you create an agentic AI that, unlike Auto-GPT, does learn from experience.
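To make the "feedback comes from executing the code" idea a bit more concrete, here is a minimal sketch of what an automatic reward signal could look like. Everything in it is my own illustration and not taken from any lab's training setup: the function name `execution_reward`, the binary 0/1 reward, and the exact-match comparison on printed output are all assumptions. It also has no sandboxing, which you would definitely need before running code written by a model.

```python
import subprocess
import sys


def execution_reward(candidate_source: str, expected_stdout: str,
                     timeout_s: float = 5.0) -> float:
    """Hypothetical reward: 1.0 if the candidate program prints the
    expected text, 0.0 otherwise (wrong output, crash, or timeout)."""
    try:
        # Run the candidate in a separate Python process.
        # NOTE: no sandboxing here; this is only an illustration.
        result = subprocess.run(
            [sys.executable, "-c", candidate_source],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # A program that never finishes earns no reward.
        return 0.0

    if result.returncode != 0:
        # The program crashed or exited with an error.
        return 0.0

    # Compare actual output with the expectation, ignoring outer whitespace.
    return 1.0 if result.stdout.strip() == expected_stdout.strip() else 0.0


if __name__ == "__main__":
    # Two hypothetical model outputs for the task "print the sum of 2 and 3".
    print(execution_reward("print(2 + 3)", "5"))
    print(execution_reward("print(2 * 3)", "5"))
```

In an actual RL loop this number would be fed back as the reward for the sampled program. A "recreate this website" task would need a much softer comparison than exact string matching, which I have not tried to sketch here.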
