It doesn't matter how many fake versions of you hold the wrong conclusion about their own ontological status, since those fake beliefs exist in fake versions of you. The moral harm caused by a single real Chantiel thinking they're not real is infinitely greater than infinitely many non-real Chantiels thinking they are real.

Interesting. When you say "fake" versions of myself, do you mean simulations? If so, I'm having a hard time seeing how that could be true. Specifically, what's wrong with me thinking I might not be "real"? I mean, if I thought I was in a simulation, I think I'd do pretty much the same things I would do if I thought I wasn't in a simulation. So I'm not sure what the moral harm is.

Do you have any links to previous discussions about this?

"If the real Chantiel is so correlated with you that they will do what you will do, then you should believe you're real so that the real Chantiel will believe they are real, too. This holds even if you aren't real."

By "real", do you mean non-simulated? Are you saying that even if 99% of Chantiels in the universe are in simulations, then I should still believe I'm not in one? I don't know how I could convince myself of being "real" if 99% of Chantiels aren't.

Do you perhaps mean I should act as if I were non-simulated, rather than literally being non-simulated?

Thanks for the response, Gwern.

he is explicit that the minds in the simulation may be only tenuously related to 'real'/historical minds;

Oh, I guess I missed this. Do you know where Bostrom said the "simulations" can be only tenuously related to real minds? I was rereading the paper but didn't see any mention of this. I'm just surprised, because normally I don't think zoo-like things would be considered simulations.

This falls under either #1 or #2, since you don't say what human capabilities are in the zoo or explain how exactly this zoo situation matters to running simulations; do we go extinct at some time long in the future when our zookeepers stop keeping us alive (and "go extinct before reaching a “posthuman” stage"), having never become powerful zookeeper-level civs ourselves, or are we not permitted to ("extremely unlikely to run a significant number of simulations")?

In case I didn't make it clear, I'm saying that even if a significant proportion of civilizations reach a post-human stage and a significant proportion of these run simulations, there would still potentially be a non-small chance of not actually being in a simulation and instead being in a game or zoo. For example, suppose each post-human civilization makes 100 proper simulations and 100 zoos. Then even if parts 1 and 2 of the simulation argument are true, you still have a 50% chance of ending up in a zoo.
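To make the arithmetic concrete, here is a minimal sketch of that example. The function name and the `originals_per_civ` parameter are hypothetical, and it assumes you are equally likely to be any one of the worlds being counted. With the original, non-simulated world ignored, the example gives exactly 50%; counting one original world per civilization barely changes the result.

```python
def chance_of_zoo(simulations_per_civ: int, zoos_per_civ: int,
                  originals_per_civ: int = 0) -> float:
    """Fraction of Earth-like worlds that are zoos, assuming every world is equally likely to be yours."""
    total = simulations_per_civ + zoos_per_civ + originals_per_civ
    if total == 0:
        raise ValueError("need at least one world to count")
    return zoos_per_civ / total


# The example from the text: 100 proper simulations and 100 zoos per civilization.
print(chance_of_zoo(100, 100))     # 0.5
# Also counting the one original world per civilization (100 / 201, slightly under 0.5).
print(chance_of_zoo(100, 100, 1))
```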

Does this make sense?

I've realized I'm somewhat skeptical of the simulation argument.

The simulation argument proposed by Bostrom argues, roughly, that either almost all Earth-like worlds fail to reach a posthuman level, almost all civilizations that do reach it don't go on to build many simulations, or we're almost certainly in a simulation.

Now, if we knew that the only two sorts of creatures that experience what we experience are either in simulations or the actual, original, non-simulated Earth, then I can see why the argument would be reasonable. However, I don't know how we could know this.

For example, consider zoos: Perhaps advanced aliens create "zoos" featuring humans in an Earth-like world, for their own entertainment or other purposes. These wouldn't necessarily be simulations of any actual other planet, but might merely have been inspired by actual planets. Similarly, lions in the zoo are similar to lions in the wild, and their enclosure features plants and other environmental features similar to what they would experience in the wild. But I wouldn't call lions in zoos simulations of wild lions, even if the developed parts where humans could view them were completely invisible to them and their enclosure was arbitrarily large.

Similarly, consider games: Perhaps aliens create games or something like them set in Earth-like worlds that aren't actually intended to be simulations of any particular world. Similarly, human fantasy RPGs often have a medieval theme, so maybe aliens would create games set in a modern-Earth-like world, without having in mind any actual planet to simulate.

Now, you could argue that in an infinite universe, these things are all actually simulations, because there must be some actual, non-simulated world that's just like the "zoo" or game. However, by that reasoning, you could argue that a rock you pick up is nothing but a "rock simulation" because you know there is at least one other rock in the universe with the exact same configuration and environment as the rock you're holding. That doesn't seem right to me.

Similarly, you could say, then, that I'm actually in a simulation right now. Because even if I'm in the original Earth, there is some other Chantiel in the universe in a situation identical to my current one, who is logically constrained to do the same thing I do, so thus I am a simulation of her. And my environment is thus a simulation of hers.

For robustness, you have a dataset that's drawn from the wrong distribution, and you need to act in a way that you would've acted if it was drawn from the correct distribution. If you have an amplification dynamic that moves models towards few attractors, then changing the starting point (training distribution compared to target distribution) probably won't matter. At that point the issue is for the attractor to be useful with respect to all those starting distributions/models. This doesn't automatically make sense, comparing models by usefulness doesn't fall out of the other concepts.

Interesting. Do you have any links discussing this? I read Paul Christiano's post on reliability amplification, but couldn't find mention of this. And, alas, I'm having trouble finding other relevant articles online.

Amplification induces a dynamic in the model space, it's a concept of improving models (or equivalently in this context, distributions). This can be useful when you don't have good datasets, in various ways. Also it ignores independence when talking about recomputing things

Yes, that's true. I'm not claiming that iterated amplification doesn't have advantages. What I'm wondering is if non-iterated amplification is a viable alternative. I haven't seen non-iterated amplification proposed before for creating algorithm AI. Amplification without iteration has the disadvantage that it may not have the attractor dynamic iterated amplification has, but it also doesn't have the exponentially increasing unreliability iterated amplification has. So, to me at least, it's not clear whether pursuing iterated amplification is a more promising strategy than amplification without iteration.

I've been thinking about what you've said about iterated amplification, and there are some things I'm unsure of. I'm still rather skeptical of the benefit of iterated amplification, so I'd really appreciate a response.

You mentioned that iterated amplification can be useful when you have only very limited, domain-specific models of human behavior, where such models would be unable to come up with the ability to create code. However, there are two things I'm wondering about. The first is that it seems to me that, for a wide range of situations, you need a general and robustly accurate model of human behavior to perform well. The second is that, even if you don't have a general model of human behavior, it seems to me that it's sufficient to only have one amplification step, which I suppose isn't iterated amplification. And the big benefit to avoiding iterated amplification is that iterated amplification results in exponential decreases in reliability from compounding errors on each distillation step, but with a single amplification step, this exponential decrease in reliability wouldn't occur.
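To illustrate the compounding-error worry, here is a rough sketch under a deliberately simple assumption that is mine, not something established above: each distillation step independently preserves correct behavior with a fixed probability, and errors are never corrected later. The function name and the 0.99 figure are hypothetical.

```python
def reliability_after(steps: int, per_step_reliability: float) -> float:
    """Probability that no step has introduced an error, assuming independent steps."""
    if steps < 0 or not 0.0 <= per_step_reliability <= 1.0:
        raise ValueError("steps must be >= 0 and reliability must be in [0, 1]")
    return per_step_reliability ** steps


# Hypothetical numbers: each step is 99% reliable.
for steps in (1, 10, 50):
    print(steps, round(reliability_after(steps, 0.99), 3))
```

Under these assumptions, one step leaves reliability at 0.99, while 50 steps bring it down to roughly 0.6. Real training need not behave this way, which is part of what I'm unsure about.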

For the first topic, suppose your AI is trained to make movies. I think just about every human value is relevant to the creation of movies, because humans usually like movies with a happy ending, and to make an ending happy you need to understand what humans consider a "happy ending".

Further, you would need an accurate model of human cognitive capabilities. To make a good movie, it needs to be easy enough for humans to understand. But sometimes it also shouldn't be too easy, because that can remove the mystery of it.

And the above is not just true for movies: I think creating other forms of entertainment would involve the same things as above.

Could you do the above with only some domain-limited model of what counts as confusing or a good or bad ending in the context of movies? It's not clear to me that this is possible. Movies involve a very wide variety of situations, and you need to keep things understandable and resulting in a happy ending in all of those circumstances. I don't see how you could robustly do the above without a general model of what people find confusing or otherwise bad.

Further, whenever an AI needs to explain something to humans, it seems to me that it's important that it has an accurate model of what humans can understand and not understand. Is there any way to do this with purely domain-specific models rather than with a general understanding of what people find confusing? It's not clear to me that this is possible. For example, imagine an AI that needs to explain many different things. Maybe it's tasked with creating learning materials or making the news. With such a broad category of things the AI needs to explain, it's really not clear to me how an AI could do this without a general model of what makes things confusing or not.

Also more generally, it seems to me that whenever the AI is involved with human interaction in novel circumstances, it will need an accurate model of what people like and dislike. For example, consider an AI tasked with coming up with a plan for human workers. Doing so has the potential to involve an extremely wide range of values. For example, humans generally value novelty, autonomy, not feeling embarrassed, not being bored, not being overly pressured, not feeling offended, and not seeing disgusting or ugly things.

Could you have an AI learn to avoid these things with only domain-specific models, rather than a general understanding of what people value and disvalue? I'm not sure how to do this. Maybe you could learn models that work for reflecting people's values in limited circumstances. However, I think an essential component of intelligence is to come up with novel plans involving novel situations. And I don't see how an agent could do this without a general understanding of values. For example, the AI might create entire new industries, and it would be important that any human workers in those industries would have satisfactory conditions.

Now, for the second topic: using amplification without iteration.

First off, I want to note that, even without a general model of humans, it's still not really clear to me that you need any amplification at all. As I've said before, even mere human imitation has the potential to result in extremely high intelligence simply by doing the same things humans do, but much faster. As I mentioned previously, consider the human output to be published research papers from top researchers, and the AI is tasked with mimicking it. Then the AI could take the research papers as the human output and use this to create future papers, but far, far faster.

But suppose you do still need amplification. Then I don't see why one amplification step wouldn't be enough. I think that if you put together a sufficiently large number of intelligent humans and give them unlimited time to think, they'd be able to solve pretty much anything that iterated amplification with HCH would be able to solve. So, instead of having multiple amplification and distillation steps, you could instead just have one very large amplification step that would involve a large enough number of human models interacting that it could solve pretty much anything.

If the amplification step involves a sufficiently large number of people, you might be concerned that it would be intractable to emulate them all.

I'm not sure if this would be a problem. Consider again the AI designed to mimic the research papers of top researchers. I think that often a small number of top researchers are responsible for a large proportion of research progress, so the AI could potentially just see what the output of the top, say, 100 or 1000 researchers working together would be. And the AI would potentially be able to produce the outputs of each researcher with far less computation. That sounds plausibly like enough to me.

But suppose that's not enough, and emulating every human individually during the amplification step is intractable. Then here's how I think you can get around this: train not only a human model, but also a system for approximating the output of an expensive computation at much lower computational cost. Then, for the amplification step, you can define a computation involving an extremely large number of interacting emulated humans, and then allow the approximation system to come up with approximations to this without needing to directly emulate every human.

To give a sense of how this might work, note that in a computation, often a small number of the parts of the computation account for a large part of the output. For example, if you are trying to approximate a computation about gravity, commonly only the closest, most massive objects have a significant gravitational effect on something, and you can ignore the rest. Similarly, rather than simulate individual atoms, it's much more efficient to come up with groups of large numbers of atoms, and consider their effect as a group. The same is true for other computations involving many small components.
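As a small illustration of the grouping idea for gravity, here is a sketch that replaces a tight, distant cluster of bodies with a single body at the cluster's center of mass and compares the result to the exact sum. This is loosely in the spirit of tree-code methods such as Barnes-Hut, but it is only a toy: the function names, the cluster layout, and the 2D setup are all assumptions made for the example.

```python
import math
import random

G = 6.674e-11  # gravitational constant in SI units


def acceleration(point, bodies):
    """Exact gravitational acceleration at `point` from a list of (mass, x, y) bodies."""
    ax = ay = 0.0
    for mass, x, y in bodies:
        dx, dy = x - point[0], y - point[1]
        r = math.hypot(dx, dy)
        ax += G * mass * dx / r**3
        ay += G * mass * dy / r**3
    return ax, ay


def as_single_body(bodies):
    """Collapse a group of bodies into one body at the group's center of mass."""
    total = sum(m for m, _, _ in bodies)
    cx = sum(m * x for m, x, _ in bodies) / total
    cy = sum(m * y for m, _, y in bodies) / total
    return (total, cx, cy)


rng = random.Random(0)  # fixed seed so the example is repeatable
# Hypothetical cluster: 1000 unit masses within about 1 m of a point 1000 m away.
cluster = [(1.0, 1000.0 + rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(1000)]

exact = acceleration((0.0, 0.0), cluster)                     # 1000 force evaluations
approx = acceleration((0.0, 0.0), [as_single_body(cluster)])  # 1 force evaluation
relative_error = math.hypot(exact[0] - approx[0], exact[1] - approx[1]) / math.hypot(*exact)
print(relative_error)
```

The grouped version does one force evaluation instead of a thousand, and because the cluster is small compared to its distance, the relative error is tiny. The approximation gets worse as the group gets closer or more spread out, so a real system would need some rule for when grouping is acceptable.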

To emulate humans, you could potentially do the same things as you would when simulating gravity. Specifically, an AI may be able to consider groups of humans and infer what the final output of that group will be, without actually needing to emulate each one individually. Further, for very challenging topics, many people may fail to contribute anything to the final result, so the AI could potentially avoid emulating them at all.

So I still can't really see the benefit of iterated amplification. Of course, I could be missing something, so I'm interested in hearing what you think.

One potential problem is that it might be hard to come up with good training data for an arbitrary-function approximator, since finding the exact output of expensive functions would be expensive. However, it's not clear to me how big of a problem this would be. As I've said before, even the output of 100 or 1000 humans interacting could potentially be all the AI ever needs, and with sufficiently fast approximations of individual humans, this could be tractable to create training data for.

Further, I bet the AI could learn a lot about arbitrary-function approximation just by training on approximating functions that are already reasonably fast to compute. I think the basic techniques for quickly approximating functions are what I mentioned before: come up with abstract objects that involve groups of individual components, and know when to stop performing the computation on a certain object because it's clear it will have little effect on the final result.

I hadn't fully appreciated the difficulty that could result from AIs having alien concepts, so thanks for bringing it up.

However, it seems to me that this would not be a big problem, provided the AI is still interpretable. I'll provide two ways to handle this.

For one, you could potentially translate the human concepts you care about into statements using the AI's concepts. Even if the AI doesn't use the same concepts people do, AIs are still incentivized to form a detailed model of the world. If you can have access to all of the AI's world model, but still can't figure out basic things like whether the model means the world gets destroyed or the AI takes over the world, then that model doesn't seem very interpretable. So I'm skeptical that this would really be a problem.

But, if it is, it seems to me that there's a way to get the AI to have non-alien concepts.

In a comment to another person, I made a modification to the system by saying that the people outputting utilities should be able to refuse to output one in a given query, for example because the situation is too complicated or too vague for humans to understand the desirability of. This could potentially allow people to prevent the AI from having very alien concepts.

To deal with alien concepts, you can just have the people refuse to provide a utility for a possible future if its description is expressed in concepts too alien for them to understand. This way, the AI would have to come up with reasonably non-alien concepts in order to get any of its calls to its utility function to work.

Another problem is that the system cannot represent and communicate the whole predicted future history of the universe to us.

This is a good point and one that I, foolishly, hadn't considered.

However, it seems to me that there is a way to get around this. Specifically, just provide the query-answerers the option to refuse to evaluate the utility of a description of a possible future. If this happens, the AI won't be able to have its utility function return a value for such a possible future.

To see how to do this, note that if a description of a possible future world is too large for the human to understand, then the human can refuse to provide a utility for it.

Similarly, if the description of the future doesn't specify the future with sufficient detail that the person can clearly tell if the described outcome would be good, then the person can also refuse to return a value.

For example, suppose you are making an AI designed to make paperclips. And suppose the AI queries the person asking for the utility of the possible future described by, "The AI makes a ton of paperclips". Then the person could refuse to answer, because the description is insufficient to specify the quality of the outcome, for example, because it doesn't say whether or not Earth got destroyed.

Instead, a possible future would only be rated as high utility if it says something like, "The AI makes a ton of paperclips, and the world isn't destroyed, and the AI doesn't take over the world, and no creatures get tortured anywhere in our Hubble sphere, and creatures in the universe are generally satisfied".
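Here is a toy sketch of the refusal option described above. Everything in it is hypothetical: a real query-answerer is a person, not a keyword check, and the names `human_rating`, `best_rated`, `REQUIRED_DETAILS`, and `MAX_WORDS` are made up for illustration. The only point is the shape of the interface, where a rating can be `None` and unrated futures are never selected.

```python
from typing import Callable, Iterable, Optional

# Hypothetical stand-ins for a person's judgment.
REQUIRED_DETAILS = ("world isn't destroyed", "doesn't take over the world")
MAX_WORDS = 200  # descriptions longer than this count as too large to understand


def human_rating(description: str) -> Optional[float]:
    """Return a utility, or None to refuse when the description is too long or too vague."""
    if len(description.split()) > MAX_WORDS:
        return None  # too large for the person to understand
    lowered = description.lower()
    if not all(detail in lowered for detail in REQUIRED_DETAILS):
        return None  # not enough detail to tell whether the outcome is good
    return 1.0 if "paperclips" in lowered else 0.0


def best_rated(candidates: Iterable[str],
               rate: Callable[[str], Optional[float]]) -> Optional[str]:
    """Pick the highest-rated description, skipping any that the rater refused."""
    rated = [(rate(c), c) for c in candidates]
    rated = [(u, c) for u, c in rated if u is not None]
    if not rated:
        return None
    return max(rated, key=lambda pair: pair[0])[1]


vague = "The AI makes a ton of paperclips"
detailed = ("The AI makes a ton of paperclips, and the world isn't destroyed, "
            "and the AI doesn't take over the world")
print(human_rating(vague))                          # None
print(best_rated([vague, detailed], human_rating))  # the detailed description
```

A keyword check like this is trivially easy to fool, so it says nothing about whether a real person could reliably judge descriptions. It only shows that the vague description gets no utility at all, rather than a low or high one.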

Does this make sense?

I, of course, could always be missing something.

(Sorry for the late response)

Sorry for taking a ridiculously long time to get back to you. I was dealing with some stuff.

This works great when you can recognize good things within the representation the AI uses to think about the world. But what if that's not true?

Yes, that is correct. As I said in the article, a high degree of interpretability is necessary to use the idea.

It's true that interpretability is required, but the key point of my scheme is this: interpretability is all you need for intent alignment, provided my scheme is correct. I don't know of any other alignment strategies for which this is the case. So, my scheme, if correct, basically allows you to bypass what is plausibly the hardest part of AI safety: robust value-loading.

I know of course that I could be wrong about this, but if the technique is correct, it seems like a quite promising AI safety technique to me.

Does this seem reasonable? I may very well just be misunderstanding or missing something.

I've made a few posts that seemed to contain potentially valuable ideas related to AI safety. However, I got almost no feedback on them, so I was hoping some people could look at them and tell me what they think. They still seem valid to me, and if they are, they could potentially be very valuable contributions. And if they aren't valid, then I think knowing the reason for this could potentially help me a lot in my future efforts towards contributing to AI safety.

The posts are:

  1. My critique of a published impact measure.
  2. Manual alignment
  3. Alignment via reverse engineering