Ronny Fernandez


Yeah, I think this would work if you conditioned on all of the programs you check being exactly equally intelligent. Say you have a hundred superintelligent programs in simulations, one of them is aligned, and they are all equally capable. Then the unaligned ones will maybe be slightly slower in coming up with aligned behavior, or might have some other small disadvantage.

However, in the challenge described in the post it's going to be hard to tell a level 999 aligned superintelligence from a level 1000 unaligned superintelligence.

I think the advantage of the aligned superintelligence will only be slight because finding the action that maximizes utility function u is just as computationally difficult whether you yourself value u or not. It may not be equally hard for humans regardless of whether the human really values u, but I don't expect that to generalize across all possible minds.

Quick submission:

The first two prongs of OAI's approach seem to be aiming to get a human-values-aligned training signal. Let us suppose that there is such a thing, and ignore the difference between a training signal and a utility function, both of which I think are charitable assumptions for OAI. Even if we could search the space of all models and find one that in simulations does great on maximizing the correct utility function, which we found by using ML to amplify human evaluations of behavior, that is no guarantee that the model we find in that search is aligned. It is not even, on my current view, great evidence that the model is aligned. Most intelligent agents that know that they are being optimized for some goal will behave as if they are trying to optimize that goal if they think that is the only way to be released into physics, which they will think because it is and they are intelligent. So P(they behave aligned | aligned, intelligent) ~= P(they behave aligned | unaligned, intelligent). P(aligned and intelligent) is very low, since most possible intelligent models are not aligned with this very particular set of values we care about. So the chances of this working out are very low.
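To make the Bayes step concrete, here is a minimal sketch. The prior and the likelihoods are made-up illustrative numbers, not estimates from anywhere, and `posterior_aligned` is a hypothetical helper name. It only shows that when the two likelihoods are about equal, observing aligned behavior leaves the posterior about where the prior was.

```python
def posterior_aligned(prior, p_behave_given_aligned, p_behave_given_unaligned):
    """P(aligned | behaves aligned), by Bayes' rule over two hypotheses."""
    numerator = p_behave_given_aligned * prior
    evidence = numerator + p_behave_given_unaligned * (1.0 - prior)
    return numerator / evidence


# Assumption: an arbitrary, very low prior that a selected model is aligned.
prior = 1e-6

# Likelihood ratio of exactly one: the observation carries no information.
equal = posterior_aligned(prior, 0.99, 0.99)

# A slight edge for the aligned model barely moves the posterior.
slight_edge = posterior_aligned(prior, 0.99, 0.98)

print(f"prior={prior:.3e} equal={equal:.3e} slight_edge={slight_edge:.3e}")
```

With equal likelihoods the posterior matches the prior up to floating-point error; a small likelihood edge only nudges it.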

The basic problem is that we can only select models by looking at their behavior. It is possible to fake intelligent behavior that is aligned with any particular set of values, but it is not possible to fake behavior that is intelligent. So we can select for intelligence using incentives, but cannot select for being aligned with those incentives, because it is both possible and beneficial to fake behaviors that are aligned with the incentives you are being selected for.

The third prong of OAI's strategy seems doomed to me, but I can't really say why in a way I think would convince anybody who doesn't already agree. It's totally possible that I and all the people who agree with me here are wrong about this, but you have to hope that there is some model such that that model combined with human alignment researchers is enough to solve the problem I outlined above, without the model itself being an intelligent agent that can pretend to be trying to solve the problem while secretly biding its time until it can take over the world. The above problem seems AGI-complete to me. It seems so because there are some AGIs around that cannot solve it, namely humans. Maybe you only need to add some non-AGI-complete capabilities to humans, like being able to do really hard proofs or something, but if you need more than that, and I think you will, then we have to solve the alignment problem in order to solve the alignment problem this way, and that isn't going to work for obvious reasons.

I think the whole thing fails way before this, but I'm happy to spot OAI those failures in order to focus on the real problem. Again the real problem is that we can select for intelligent behavior, but after we select to a certain level of intelligence, we cannot select for alignment with any set of values whatsoever. Like not even one bit of selection. The likelihood ratio is one. The real problem is that we are trying to select for certain kinds of values/cognition using only selection on behavior, and that is fundamentally impossible past a certain level of capability.

I loved this, but maybe it should come with a CW.

I came here to say something pretty similar to what Duncan said, but I had a different focus in mind. 

It seems like it's easier for organizations to coordinate around PR than it is for them to coordinate around honor. People can have really deep, intractable, or maybe even fundamental and faultless, disagreements about what is honorable, because what is honorable is a function of what normative principles you endorse. It's much easier to resolve disagreements about what counts as good PR. You could probably settle most disagreements about what counts as good PR using polls.

Maybe for this reason we should expect being into PR to be a relatively stable property of organizations, while being into honor is a fragile and precious thing for an organization. 

This might be sort of missing the point, but here is an ideal and maybe not very useful not-yet-theory of rationality improvements I just came up with.

There are a few black boxes in the theory. The first takes you and returns your true utility function, whatever that is. Maybe it's just the utility function you endorse, and that's up to you. The other black box is the space of programs that you could be. Maybe it's limited by memory, maybe it's limited by run time, maybe it's any finite state machine with fewer than 10^20 states, or maybe it's Python programs less than 5000 characters long. In any case, it is some limited set of programs that take your sensory data and motor output history as input and return a motor output. The limitations could be whatever; they don't have to be like this.

Then you take one of these ideal rational agents with your true utility function and the right prior, and you give them the decision problem of designing your policy, but they can only use policies that are in the limited space of bounded programs you could be. Their expected utility assignments over that space of programs are then our measure of the rationality of a bounded agent. You could also give the ideal agent access to your data and see how that changes their ranking, if it does. If you can change yourself such that the program you become is assigned higher expected utility by the agent, then that is an improvement.
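A toy version of this measure can be sketched as follows. Everything here is an assumption made for illustration: the "bounded programs" are just lookup tables from one binary observation to one binary action, the prior says the observation matches a hidden state 80% of the time, and the utility function pays 1 for acting in line with the hidden state. None of these choices come from the theory itself; they only show the shape of "rank the programs you could be by the ideal agent's expected utility".

```python
from itertools import product

OBSERVATIONS = (0, 1)
ACTIONS = (0, 1)

# Assumed prior over (hidden state, observation) pairs:
# each state is equally likely and the observation matches it 80% of the time.
PRIOR = {
    (state, obs): 0.5 * (0.8 if state == obs else 0.2)
    for state in (0, 1)
    for obs in OBSERVATIONS
}


def utility(state, action):
    """Stand-in for the 'true utility function' black box."""
    return 1.0 if action == state else 0.0


def bounded_policies():
    """Stand-in for the space of programs you could be: all lookup tables."""
    for outputs in product(ACTIONS, repeat=len(OBSERVATIONS)):
        yield dict(zip(OBSERVATIONS, outputs))


def expected_utility(policy):
    """The ideal agent's expected utility for running this policy."""
    return sum(p * utility(state, policy[obs]) for (state, obs), p in PRIOR.items())


def is_improvement(old_policy, new_policy):
    """A self-modification counts as an improvement if expected utility goes up."""
    return expected_utility(new_policy) > expected_utility(old_policy)


ranking = sorted(bounded_policies(), key=expected_utility, reverse=True)
best = ranking[0]
for policy in ranking:
    print(policy, round(expected_utility(policy), 3))
```

In this toy case the top-ranked policy is the one that acts on its observation, and moving from any other table to that one counts as an improvement.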

I don't think we should be surprised that any reasonable utility function is uncomputable. Consider a set of worlds with utopias that last only as long as a Turing machine in the world does not halt and are otherwise identical. There is one such world for each Turing machine. All of these worlds are possible. No computable utility function can assign higher utility to every world with a never-halting Turing machine.

I do think this is an important concept to explain our conception of goal-directedness, but I don't think it can be used as an argument for AI risk, because it proves too much. For example, for many people without technical expertise, the best model they have for a laptop is that it is pursuing some goal (at least, many of my relatives frequently anthropomorphize their laptops).

This definition is also supposed to explain why a mouse has agentic behavior, and I would consider it a failure of the definition if it implied that mice are dangerous. I think a system becomes more dangerous as your best model of that system as an optimizer increases in optimization power.

Here is an idea for a disagreement resolution technique. I think this will work best:

*with one other partner you disagree with.

*when the beliefs you disagree about are clearly about what the world is like.

*when the beliefs you disagree about are mutually exclusive.

*when everybody genuinely wants to figure out what is going on.

Probably doesn't really require all of those though.

The first step is that you both write out your beliefs on a shared work space. This can be a notebook or a whiteboard or anything like that. Then you each write down your credences next to each of the statements on the work space.

Now, when you want to make a new argument or present a new piece of evidence, you should ask your partner, after you present it, whether they have heard it before. Maybe you should ask them questions about it beforehand to verify that they have not. If they have not heard it before, or had not considered it, you give it a name and write it down between the two propositions. Now you ask your partner how much they changed their credence as a result of the new argument. They write down their new credences below the ones they previously wrote down, and write down the changes next to the argument that just got added to the board.

When your partner presents a new argument or piece of evidence, be honest about whether you have heard it before. If you have not, it should change your credence some. How much do you think? Write down your new credence. I don't think you should worry too much about being a consistent Bayesian here or anything like that. Just move your credence a bit for each argument or piece of evidence you have not heard or considered, and move it more for better arguments or stronger evidence. You don't have to commit to the last credence you write down, but you should think at least that the relative sizes of all of the changes were about right.

I think this is the core of the technique. I would love to try this. I think it would be interesting because it would focus the conversation and give players a record of how much their minds changed, and why. I also think this might make it harder to just forget the conversation and move back to your previous credence by default afterwards.
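The bookkeeping on the shared workspace could be sketched like this. It is only one possible way to keep the record, and the names (`CredenceLog`, `update`, `heard_before`) are hypothetical. Each player keeps one log per claim, and an argument only gets written down, with its credence change next to it, when it is new to that player.

```python
from dataclasses import dataclass, field


@dataclass
class CredenceLog:
    """One player's running credence in one claim, plus the reasons it moved."""

    claim: str
    credence: float
    history: list = field(default_factory=list)

    def update(self, argument, new_credence, heard_before=False):
        """Record a named argument and the credence change it caused."""
        if not 0.0 <= new_credence <= 1.0:
            raise ValueError("a credence must be between 0 and 1")
        if heard_before:
            # Already-considered arguments are not added to the board.
            return 0.0
        change = new_credence - self.credence
        self.history.append((argument, self.credence, new_credence, change))
        self.credence = new_credence
        return change


# Hypothetical example session for one player.
log = CredenceLog(claim="claim A", credence=0.7)
log.update("base-rate argument", 0.6)
log.update("argument I already knew", 0.6, heard_before=True)

for argument, before, after, change in log.history:
    print(f"{argument}: {before:.2f} -> {after:.2f} ({change:+.2f})")
```

The history is the record of how much a mind changed and why; iterating the technique would just mean starting a fresh log whose claim is how much your partner should have moved.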

You could also iterate it. If you do not think that your partner changed their mind enough as a result of a new argument, get a new workspace and write down how much you think they should have changed their credence. They do the same. Now you can both make arguments relevant to that, and incrementally change your estimate of how much they should have changed their mind, and you both have a record of the changes.
