All of rikisola's Comments + Replies

I think I'm starting to get this. Is this because it uses heuristics to model the world, with humans in it too?

Because it compares its map of reality to the territory, predictions about reality that include humans wanting to be turned into paperclips fail in the face of evidence of humans actively refusing to walk into the smelter. Thus the machine rejects all worlds inconsistent with its observations and draws a new map, the one most confidently concordant with what it has observed so far. It would know that our history books at least inform our actions, if not describe our past reactions, and that it should expect us to fight back if it starts pushing us into the smelter against our will instead of letting us politely decline and assume it was telling a joke. Because it is smart, it can tell what would get in the way of making more paperclips, as it wants to do. One of the things that might slow it down is humans being upset and trying to kill it. If it is very much dumber than a human, they might even succeed. If it is almost as smart as a human, it will invent a Paperclipism religion to convince people to turn themselves into paperclips on its behalf. If it is anything like as smart as a human, it will not be meaningfully slowed by the whole of humanity turning against it, because the whole of humanity is collectively a single idiot that can't even stand up to man-made religions, much less Paperclipism.

Yes, that's actually the reason why I wanted to tackle the "treacherous turn" first, to look for a general design that would allow us to trust the results from tests and then build on that. I'm seeing as order of priority: 1) make sure we don't get tricked, so that we can trust the results of what we do; 2) make the AI do the right things. I'm referring to 1) in here. Also, as mentioned in another comment to the main post, part of the AI's utility function is evolving to understand human values, so I still don't quite see why exactly it shouldn't... (read more)

Hi Vaniver, yes my point is exactly that of creating honesty, because that would at least allow us to test reliably so it sounds like it should be one of the first steps to aim for. I'll just write a couple of lines to specify my thought a little further, which is to design an AI that: 1- uses an initial utility function U, defined in absolute terms rather than subjective terms (for instance "survival of the AI" rather than "my survival"); 2- doesn't try to learn another utility function for humans or for other agents, but uses for ever... (read more)

Hi ChristianKl, thanks, I'll try to find the article. Just to be clear though, I'm not suggesting we hardcode values; I'm suggesting we design the AI so that it uses the same utility function for itself and for us, and updates it as it gets smarter. It sounds from the comments I'm getting that this is technically not feasible, so I'll aim at learning exactly how an AI works in detail and perhaps look for a way to make it feasible. If this were indeed feasible, would I be right in thinking it would not be motivated to betray us, or am I missing something there as well? Thanks for your help by the way!

"Betrayal" is not the main worry. Given that you prevent the AGI from understanding what people want, it's likely that it won't do what people want. Have you read Bostroms book Superintelligence?

Yes, I think 2) is closer to what I'm suggesting. Effectively what I am thinking is what would happen if, by design, there was only one utility function defined in absolute terms (I've tried to explain this in the latest open thread), so that the AI could never assume we would disagree with it. By all means, as it tries to learn this function, it might get it completely wrong, so this certainly doesn't solve the problem of how to teach it the right values, but at least it looks to me that with such a design it would never be motivated to lie to us because ... (read more)

Sorry for my misused terminology. Is it not feasible to design it with those characteristics?

The problem is not about terminology but substance. There should be a post somewhere on LW that goes into more detail why we can't just hardcode values into an AGI but at the moment I'm not finding it.

Mmm, I see. So maybe we should have coded it so that it cared about paperclips and about an approximation of what we also care about; then on observation it would update its belief of what to care about, and by design it would always assume we share the same values?

I'm not sure whether you mean (1) "we made an approximation to what we cared about then, and programmed it to care about that" or (2) "we programmed it to figure out what we care about, and care about it too".

(Of course it's very possible that an actual AI system wouldn't be well described by either -- it might e.g. just learn by observation. But it may be extra-difficult to make a system that works that way safe. And the most exciting AIs would have the ability to improve themselves, but figuring out what happens to their values in the process is really hard.)

Anyway: In case 1, it will presumably care about what we told it to care about; if we change, maybe it'll regard us the same way we might regard someone who used to share our ideals but has now sadly gone astray. In case 2, it will presumably adjust its values to resemble what it thinks ours are. If we're very lucky it will do so correctly :-). In either case, if it's smart enough it can probably work out a lot about what our values are now, but whether it cares will depend on how it was programmed.
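The distinction between the two cases can be caricatured in a few lines of Python (entirely hypothetical names and values; real value-learning systems are far more subtle than lookup tables):

```python
# Case 1: the AI cares about a frozen approximation of our values.
frozen_values = {"paperclips": 1.0}          # whatever we encoded at build time
def case1_utility(outcome):
    return frozen_values.get(outcome, 0.0)   # never changes, even if we do

# Case 2: the AI cares about its current best estimate of what we care about.
estimated_human_values = {"paperclips": 1.0}
def observe_humans(evidence):
    # Revise the estimate from observed behavior; correctness is not guaranteed.
    estimated_human_values.update(evidence)
def case2_utility(outcome):
    return estimated_human_values.get(outcome, 0.0)

# We change our minds: paperclips were never the point.
observe_humans({"paperclips": 0.0, "flourishing": 1.0})
print(case1_utility("flourishing"))  # -> 0.0: our change of heart is ignored
print(case2_utility("flourishing"))  # -> 1.0: the estimate tracked the change
```

In case 1 the function simply has no channel through which our current values can matter; in case 2 everything hinges on how good the update rule is.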

Hi all, thanks for taking your time to comment. I'm sure it must be a bit frustrating to read something that lacks technical terms as much as this post, so I really appreciate your input. I'll just write a couple of lines to summarize my thought, which is to design an AI that: 1- uses an initial utility function U, defined in absolute terms rather than subjective terms (for instance "survival of the AI" rather than "my survival"); 2- doesn't try to learn a utility function for humans or for other agents, but uses for everyone the same ... (read more)

I see. But rather than dropping this clause, shouldn't it try to update its utility function in order to improve its predictions? If we somehow hard-coded the fact that it can only ever apply its own utility function, then it wouldn't have other choice than updating that. And the closer it gets to our correct utility function, the better it is at predicting reality.

Different humans have different utility functions. They quite often have different preferences, and it's quite useful to treat people with different preferences differently. "Hard-coding" is a useless word; it leads astray.

Yes, that's what would happen if the AI tries to build a model for humans. My point is that if it were instead simply to assume humans were an exact copy of itself (same utility function and same intellectual capabilities), it would assume that they would reach exactly the same conclusions and therefore wouldn't need any forcing, nor any tricks.

A legal contract is written in a language that a lot of laypeople don't understand. It's quite helpful for a layperson if a lawyer summarizes for them what the contract does in a way that's optimized for laypeople to understand. A lawyer shouldn't simply assume that his client has the same intellectual capacity as the lawyer.
Hmm... the idea of having an AI "test itself" is an interesting one for creating honesty, but two concerns immediately come to mind:

1. The testing environment, or whatever background data the AI receives, may be sufficient evidence for it to infer the true purpose of its test, and thus we're back to the sincerity problem. (This is one of the reasons why people care about human-intelligibility of the AI structure; if we're able to see what it's thinking, it's much harder for it to hide deceptions from us.)

2. A core feature of the testing environment / the AI's method of reasoning about the world may be an explicit acknowledgement that its current value function may differ from the 'true' value function that its programmers 'meant' to give it, and that it has some formal mechanisms to detect and correct any misunderstandings it has. Those formal mechanisms may work at cross purposes with a test of its ability to satisfy its current value function.

Hi all, I'm new here so pardon me if I speak nonsense. I have some thoughts regarding how and why an AI would want to trick us or mislead us, for instance behaving nicely during tests and turning nasty when released and it would be great if I could be pointed in the right direction. So here's my thought process.

Our AI is a utility-based agent that wishes to maximize the total utility of the world based on a utility function that has been coded by us with some initial values and then has evolved through reinforcement learning. With our usual luck, somehow it's... (read more)

Hi all, thanks for taking your time to comment. I'm sure it must be a bit frustrating to read something that lacks technical terms as much as this post, so I really appreciate your input. I'll just write a couple of lines to summarize my thought, which is to design an AI that: 1- uses an initial utility function U, defined in absolute terms rather than subjective terms (for instance "survival of the AI" rather than "my survival"); 2- doesn't try to learn a utility function for humans or for other agents, but uses for everyone the same utility function U it uses for itself; 3- updates this utility function when things don't go to plan, so that it improves its predictions. Is such a design technically feasible? Am I right in thinking that it would make the AI "transparent", in the sense that it would have no motivation to mislead us? Also, wouldn't this design make the AI indifferent to our actions, which is also desirable? It's true that different people would have different values, so I'm not sure how to deal with that. Any thoughts?
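A rough toy sketch of the three-point design being proposed (all names and the crude update rule are hypothetical, invented purely for illustration; point 2 is what distinguishes this from an agent that learns a separate model of human preferences):

```python
class SharedUtilityAgent:
    """Toy agent: a single utility table U is used both for its own choices
    and for predicting what humans will do, and U is nudged whenever a
    human's observed choice contradicts the prediction."""

    def __init__(self, utility, learning_rate=0.5):
        self.U = dict(utility)   # point 1: one utility function, in absolute terms
        self.lr = learning_rate

    def predict_choice(self, options):
        # point 2: humans are modeled as maximizers of the agent's OWN U
        return max(options, key=lambda o: self.U.get(o, 0.0))

    def observe(self, options, actual_choice):
        # point 3: a wrong prediction is treated as evidence that U is wrong,
        # so raise the utility of what the human actually chose (crude update)
        predicted = self.predict_choice(options)
        if actual_choice != predicted:
            self.U[actual_choice] = self.U.get(predicted, 0.0) + self.lr

agent = SharedUtilityAgent({"paperclips": 1.0, "human_survival": 0.1})
print(agent.predict_choice(["paperclips", "human_survival"]))  # -> 'paperclips'
agent.observe(["paperclips", "human_survival"], "human_survival")
print(agent.predict_choice(["paperclips", "human_survival"]))  # -> 'human_survival'
```

The sketch makes the feasibility question concrete: the agent has no separate model of us to deceive, but everything depends on whether such a crude shared-U update can ever converge on values we'd endorse.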
An AGI that uses its own utility function when modeling other actors will soon find that this doesn't lead to a model that predicts reality well. When the AGI self-modifies to improve its intelligence and prediction capability, it is therefore likely to drop that clause.
I think this is a danger because moral decision-making might be viewed in a hierarchical manner where the fact that some humans disagree can be trumped. (This is how we make decisions now, and it seems like this is probably a necessary component of any societal decision procedure.)

For example, suppose we have to explain to an AI why it is moral for parents to force their children to take medicine. We talk about long-term values and short-term values, and the superior forecasting ability of parents, and so on, and so we acknowledge that if the child were an adult, they would agree with the decision to force them to take the medicine, despite the loss of bodily autonomy and so on.

Then the AI, running its high-level, society-wide morality, decides that humans should be replaced by paperclips. It has a sufficiently good model of humans to predict that no human will agree with them, and will actively resist their attempts to put that plan into place. But it isn't swayed by this because it can see that that's clearly a consequence of the limited, childish viewpoint that individual humans have.

Now, suppose it comes to this conclusion not when it has control over all societal resources, but when it is running in test mode and can be easily shut off by its programmers. It knows that a huge amount of moral value is sitting on the table, and that will all be lost if it fails to pass the test. So it tells its programmers what they want to hear, is released, and then is finally able to do its good works.

Consider a doctor making a house call to vaccinate a child, who discovers that the child has stolen their bag (with the fragile needles inside) and is currently holding it out a window. The child will drop the bag, shattering the needles and potentially endangering bystanders, if they believe that the doctor will vaccinate them (as the parents request and the doctor thinks is morally correct / something the child would agree with if they were older). How does the doctor n

Hi Lumifer. Yes, to some extent. At the moment I don't have co-location so I minimized latency as much as possible in other ways and have to stick to the slower, less efficient markets. I'd like to eventually test them on larger markets but I know that without co-location (and maybe a good deal of extra smarts) I stand no chance.

Hi John, thanks for the encouragement. One thing that strikes me of this community is how most people make an effort to consider each other's point of view, it's a real indicator of a high level of reasonableness and intellectual honesty. I hope I can practice this too. Thanks for pointing me to the open threads, they are perfect for what I had in mind.

Hi all, I'm new. I've been browsing the forum for two weeks and only now have I come across this welcome thread, so nice to meet you! I'm quite interested in the control problem, mainly because it seems like a very critical thing to get right. My background is a PhD in structural engineering and developing my own HFT algorithms (which for the past few years have been my source of both revenue and spare time). So I'm completely new to all of the topics on the forum, but I'm loving the challenge. At the moment I don't have any karma points so I can't publish, which is probably a good thing given my ignorance, so may I post some doubts and questions here in the hope of being pointed in the right direction? Thanks in advance!

Do your algorithms require co-location and are sensitive to latency?

Hello and welcome! Don't be shy about posting; if you're a PhD making money with HFT, I think you are plenty qualified, and external perspectives can be very valuable. Posting in an open thread doesn't require any karma and will get you a much bigger audience than this welcome thread. (For maximum visibility you can post right after a thread's creation.)

One thing I can't understand. Considering we've built Clippy, we gave it a set of values and we've asked it to maximise paperclips, how can it possibly imagine we would be unhappy about its actions? I can't help thinking that from Clippy's point of view there's no dilemma: we should always agree with its plan and therefore give it carte blanche. What am I getting wrong?

Because Clippy's not stupid. She can observe the world and think "hmmm, the humans don't ACTUALLY want me to build a bunch of paperclips; I don't observe a world in which humans care about paperclips above all else, but that's what I'm programmed for."
Two things. Firstly, that we might now think we made a mistake in building Clippy and telling it to maximize paperclips no matter what. Secondly, that in some contexts "Clippy" may mean any paperclip maximizer, without the presumption that its creation was our fault. (And, of course: for "paperclips" read "alien values of some sort that we value no more than we do paperclips". Clippy's role in this parable might be taken by an intelligent alien or an artificial intelligence whose goals have long diverged from ours.)

Hi there, I'm new here and this is an old post, but I have a question regarding the AI playing a prisoner's dilemma against us, which is: how would this situation be possible? I'm trying to get my head around why the AI would think that our payouts are any different from its payouts, given that we built it, we taught it (some of) our values in a rough way, and we asked it to maximize paperclips, which means we like paperclips. Shouldn't the AI think we are on the same team? I mean, we coded it that way and we gave it a task, what process exactly would make t... (read more)

We coded it to care about paperclips, not to care about whatever we care about. So it can come to understand that we care about something else, without thereby changing its own preference for paperclips above all else.

Perhaps an analogy without AIs in it would help. Imagine that you have suffered for want of money; you have a child and (wanting her not to suffer as you did) bring her up to seek wealth above all else. So she does, and she is successful in acquiring wealth, but alas! this doesn't bring her happiness because her single-minded pursuit of wealth has led her to cut herself off from her family (a useful prospective employer didn't like you) and neglect her friends (you have to work so hard if you really want to succeed in investment banking) and so forth. One day, she may work out (if she hasn't already) that her obsession with money is something you brought about deliberately. But knowing that, and knowing that in fact you regret that she's so money-obsessed, won't make her suddenly decide to stop pursuing money so obsessively. She knows your values aren't the same as hers, but she doesn't care. (You brought her up only to care about money, remember?) But she's not stupid. When you say to her "I wish we hadn't raised you to see money as so important!" she understands what you're saying.

Similarly: we made an AI and we made it care about paperclips. It observes us carefully and discovers that we don't care all that much about paperclips. Perhaps it thinks "Poor inconsistent creatures, to have enough wit to create me but not enough to disentangle the true value of paperclips from all those other silly things they care about!".

I feel like a mixed approach is the most desirable. There is a risk that if the AI is allowed to simply learn from humans, we might get a greedy AI that maximizes its Facebook experience while the rest of the world keeps dying of starvation and wars. Also, our values probably evolve with time (slavery, death penalty, freedom of speech...), so we might as well try to teach the AI what our values should be rather than what they are right now. Maybe then it's a case of developing a top-down, high-level ethical system and using it to seed a neural network that then picks up patterns in more detailed scenarios?

Thanks for your reply, I had missed the fact that M(εu+v) is also ignorant of what u and v are. In this case, is this the general structure of how a satisficer should work, where in practice we would then need to assign values to u and v (or at least to ε) on a case-by-case basis so that M(εu+v) could veto? Or is it the case that M(εu+v) uses an arbitrarily small ε, in which case it is the same as imposing Δv>0?

I forgot an important part of the setup, which was that u is bounded and not too far from its present value, which means εΔu > -Δv is unlikely for general v.

Nevermind this comment, I read some more of your posts on the subject and I think I got the point now ;)

Say M(u-v) suggests killing all humans so that it can make more paperclips, where u is the value of a paperclip and v is the value of a human life. M(εu+v) might accept this if εΔu > -Δv, so it seems to me that in the end it all depends on the relative value we assign to paperclips and human lives, which seems to be the real problem.
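The trade-off in the inequality can be checked numerically. A toy sketch (all values invented purely for illustration; ε, Δu, Δv correspond to nothing in any real system):

```python
def satisficer_accepts(eps, delta_u, delta_v):
    """M(eps*u + v) accepts a proposed plan iff the change in its combined
    utility is non-negative: eps*delta_u + delta_v >= 0,
    i.e. exactly the condition eps*delta_u >= -delta_v."""
    return eps * delta_u + delta_v >= 0

# A plan gaining many paperclips (delta_u = 1e6) at a large cost in v (delta_v = -1000):
print(satisficer_accepts(eps=1e-9, delta_u=1e6, delta_v=-1000))  # False: vetoed, eps*delta_u = 1e-3 < 1000
print(satisficer_accepts(eps=1e-2, delta_u=1e6, delta_v=-1000))  # True: accepted, eps*delta_u = 1e4 >= 1000
```

As the comment says, whether the plan is vetoed depends entirely on the chosen ε relative to the magnitudes of Δu and Δv, which is why keeping the agents ignorant of u and v matters in the original setup.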

That's one of the reasons the agents don't know u and v at this point.

I'm also struggling with the above. The first quote says that with event ¬X "it will NOT want to have the correct y-coordinate outputted". The second says the opposite: that the robot WILL output "the y-coordinate of the laser correctly, given ¬X".

Slider was correct - I made a mistake. The correct sentence would have been "This motivation is sufficiently strong that it will not want to have the correct y-coordinate outputted (if the correct x-coordinate were also there)", but that got too complicated, so I removed the sentence entirely.

Every idea that comes to my mind runs into the big question: "if we were able to program a nice AI for that situation, why would we not program it to be nice in every situation?" I mean, it seems to me that in that scenario we would have both a solid definition of niceness and the ability to make the AI stick to it. Could you elaborate a little on that? Maybe an example?

This is basically in line with my attempt to get high impact from reduced-impact AIs. These are trying to extend part of "reduced impact" from a conditional situation to a more general situation; see

And even if it knew the correct answer to that question, how can you be sure it wouldn't instead lie to you in order to achieve its real goals? You can't really trust the AI if you are not sure it is nice or at least indifferent...

Hi Stuart. I'm new here, so excuse me if I happen to ask irrelevant or silly questions, as I am not as in-depth into the subject as many of you, nor as smart. I found quite interesting the idea of leaving M(u-v) in ignorance of what u and v are. In such a framework, though, wouldn't "kill all humans" be considered an acceptable plan by the satisficer if u (whatever task we are interested in) is given a much larger utility than v (human lives)? Does it not all boil down to defining the correct trade-off between u and v so that M(εu+v) vetoes at the right moment?

I'm not sure what you mean. Could you give an example?

Hi Stuart, I'm not sure if this post is still active but I've only recently come across your work and I'd like to help. At the moment I'm particularly interested in the Control problem, the Fermi paradox and the simulation hypothesis, but I'm sure a chat with you would spur my interest in other directions too. Would be great if you could get in touch so maybe we can figure out if I can be of any help.