Just someone who wants to learn about the world.
I change my views often. Anything I wrote that's more than 10 days old should be treated as potentially outdated.
For the Alignment Newsletter:
Summary: Training a deep reinforcement learning agent from reward samples alone may predictably lead to a proxy alignment issue: the learner could fail to develop a full understanding of what behavior it is being rewarded for, and thus behave unacceptably when it is taken off its training distribution. Since we often use explicit specifications to define our reward functions, Evan Hubinger asks how we can incorporate this information into our deep learning models so that they remain aligned off the training distribution. He names several possibilities for doing so, such as giving the deep learning model access to a differentiable copy of the reward function during training, and fine-tuning a language model so that it can map natural language descriptions of a reward function into optimal actions.
Opinion: I'm unsure, though leaning skeptical, whether incorporating a copy of the reward function into a deep learning model would help it learn. My guess is that if someone did that with a current model it would make the model harder to train, rather than making anything easier. I will be excited if someone can demonstrate at least one feasible approach to addressing proxy alignment that does more than sample the reward function.
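For concreteness, the reward-function-access idea could be sketched roughly like this (a toy construction of mine, not code from the post; all names and the architecture are illustrative):

```python
# Toy sketch (my construction, not from the post): the learner receives the
# output of an explicit reward function as an extra input feature, so the
# reward specification is available directly rather than only through
# reward samples. In a real differentiable setup, gradients would also
# flow through the reward term during training.

def reward(obs):
    # Explicit reward specification: negative squared distance to a target.
    return -sum((o - 1.0) ** 2 for o in obs)

def model(obs, weights):
    # The model's input features include the reward evaluation itself.
    features = list(obs) + [reward(obs)]
    return sum(w * f for w, f in zip(weights, features))

print(model([0.5, 1.5, 1.0], [0.1, 0.2, 0.3, 0.4]))
```

Whether giving the model this extra input actually helps it generalize, rather than just complicating training, is exactly the open question above.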
women and people from all "races" fought for their rights, when pigs and cattle will stand and let us know they've had enough, it'll be time to consider the question.
To generalize the principle you have described, should we never give a group moral consideration unless they can advocate for themselves? If we adopted this principle, then young children would immediately lose moral consideration, as would profoundly mentally disabled people.
Thanks for the elaboration. You quoted Robin Hanson as saying
There are many other factors that influence coordination, after all; even perfect value matching is consistent with quite poor coordination.
My model says that this is about right. It generally takes a few more things for people to cooperate, such as common knowledge of perfect value matching, common knowledge of willingness to cooperate, and an understanding of the benefits of cooperation.
By assumption, AIs will become smarter than humans, which makes me think they will understand the benefits of cooperation better than we do. But this understanding won't be gained "all at once" but will instead be continuous with the past. This is essentially why I think cooperation will be easier in the future, but that it will more-or-less follow a gradual transition from our current trends (I think cooperation has been increasing globally in the last few centuries anyway, for similar reasons).
Abstracting away from the specific mechanism, as a more general argument, AI designers or evolution will (sooner or later) be able to explore a much larger region of mind design space than biological evolution could. Within this region there are bound to be minds much better at coordination than humans, and we should certainly expect coordination ability to be one objective that AI designers or evolution will optimize for since it offers a significant competitive advantage.
I agree that we will be able to search over a larger space of mind-design, and I also agree that this implies that it will be easier to find minds that cooperate.
I don't agree that cooperation necessarily confers a greater competitive advantage. It's worth seeing why this is so in the case of evolution, as I think the reasoning carries over to the AI case. Naively, organisms that cooperate would always enjoy some advantages, since they would never have to fight for resources. However, this naive model ignores the fact that genes are selfish: if there is a way to reap the benefits of cooperation without paying the price of giving up resources, then organisms will pursue this strategy instead.
This is essentially the same argument that evolutionary game theorists have used to explain the evolution of aggression, as I understand it. Of course, there are some simplifying assumptions which could be worth disputing.
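The textbook version of that argument is the hawk-dove game: when the cost of fighting exceeds the value of the contested resource, an all-cooperator population is not stable, and aggression persists at an equilibrium frequency. A minimal sketch (standard game theory, not from the comment; the numbers are arbitrary):

```python
# Hawk-dove payoffs: a hawk (aggressor) takes the full resource from a
# dove (cooperator), two doves split it peacefully, and two hawks fight,
# for an expected payoff of (V - C) / 2 each.
V, C = 2.0, 3.0  # value of the resource, cost of losing a fight (C > V)

def payoffs(p):
    # Expected payoff of each strategy when a fraction p of the
    # population plays hawk.
    hawk = p * (V - C) / 2 + (1 - p) * V
    dove = (1 - p) * V / 2
    return hawk, dove

p = 0.01  # start with almost no aggressors
for _ in range(5000):
    hawk, dove = payoffs(p)
    # Replicator-style update: the better-paying strategy grows.
    p += 0.05 * p * (1 - p) * (hawk - dove)

print(round(p, 3))  # approaches V / C: aggressors don't die out
```

Starting from a nearly all-dove population, aggressors invade and the hawk frequency settles at V / C, which is the sense in which "pure cooperation always wins" fails under selfish selection.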
Summary: There are a number of differences between how humans cooperate and how hypothetical AI agents could cooperate, and these differences have important strategic implications for AI forecasting and safety. The first big implication is that AIs with explicit utility functions will be able to merge their values. This merging may have the effect of rendering laws and norms obsolete, since large conflicts would no longer occur. The second big implication is that our approaches to AI safety should preserve the ability for AIs to cooperate: AIs that *can't* cooperate will be less effective, since they will be outcompeted by factions that cooperate better.
Opinion: My usual starting point for future forecasting is to assume that AI won't alter any long term trends, and then update from there on the evidence. Most technologies haven't disrupted centuries-long trends in conflict resolution, which makes me hesitant to accept the first implication. Here, I think the biggest weakness in the argument is the assumption that powerful AIs should be described as having explicit utility functions. I still think that cooperation will be easier in the future, but it probably won't follow a radical departure from past trends.
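To make the first implication concrete, the simplest form of value merging I can imagine (my toy construction; the post may have a more sophisticated mechanism in mind) is a weighted combination of two explicit utility functions that both agents then jointly optimize:

```python
def merge(u1, u2, w=0.5):
    # Weighted-sum merge of two explicit utility functions; in practice
    # the weight would presumably be set by bargaining.
    return lambda outcome: w * u1(outcome) + (1 - w) * u2(outcome)

u_a = lambda x: -(x - 8) ** 2  # agent A's ideal outcome is 8
u_b = lambda x: -(x - 2) ** 2  # agent B's ideal outcome is 2
u_merged = merge(u_a, u_b)

# Both agents now optimize the merged function, reaching a compromise
# rather than fighting over whose utility function wins.
best = max(range(10), key=u_merged)
print(best)  # 5, midway between the two ideals
```

The argument is that agents with explicit utility functions can commit to the merged function in a verifiable way, which humans, lacking explicit utility functions, cannot.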
I'm not sure this is an accurate characterization of the point; my understanding is that the concern largely comes from the possibility that growth will be faster than exponential.
Sure, if someone were arguing that, then they have a valid understanding of the difference between continuous and discontinuous takeoff. I would just question why we should expect growth to be faster than exponential for any sustained period of time.
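To unpack the distinction being drawn (my illustration, not the commenter's): exponential growth has a constant doubling time, while faster-than-exponential growth, such as hyperbolic growth, has a shrinking doubling time and, in the continuous limit, diverges in finite time:

```python
def doubling_times(series):
    # Number of steps the series takes to double, measured from each
    # successive doubling threshold.
    times, target, t0 = [], 2 * series[0], 0
    for t, x in enumerate(series):
        if x >= target:
            times.append(t - t0)
            target, t0 = 2 * x, t
    return times

# Exponential: growth rate is constant (1% per step).
exp_series = [1.01 ** t for t in range(2000)]

# Hyperbolic: growth rate is proportional to the current level,
# so each doubling takes roughly half as long as the last.
hyp_series = [1.0]
while hyp_series[-1] < 1e6 and len(hyp_series) < 2000:
    x = hyp_series[-1]
    hyp_series.append(x * (1 + 0.01 * x))

print(doubling_times(exp_series))  # roughly constant gaps (~70 steps)
print(doubling_times(hyp_series))  # shrinking gaps: faster than exponential
```

The disagreement, then, is over whether anything in the real economy behaves like the second series for a sustained period, rather than over the continuous/discontinuous distinction itself.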
I have the impression that when people write about the history of science, they tend to write about who discovered what, and how, but very little about what happened to the idea afterwards. I don't expect to find many chapters of history of science titled "How scientific theories are spread through the world", but it's trivial to find an equivalent chapter on religion, or, say, communism.
Bertrand Russell's advice to future generations, from 1959
Interviewer: Suppose, Lord Russell, this film would be looked at by our descendants, like a Dead Sea scroll in a thousand years’ time. What would you think it’s worth telling that generation about the life you’ve lived and the lessons you’ve learned from it?
Russell: I should like to say two things, one intellectual and one moral. The intellectual thing I should want to say to them is this: When you are studying any matter or considering any philosophy, ask yourself only what are the facts and what is the truth that the facts bear out. Never let yourself be diverted either by what you wish to believe, or by what you think would have beneficent social effects if it were believed, but look only — and solely — at what are the facts. That is the intellectual thing that I should wish to say. The moral thing I should wish to say to them is very simple: I should say love is wise, hatred is foolish. In this world which is getting more and more closely interconnected, we have to learn to tolerate each other; we have to learn to put up with the fact that some people say things we don’t like. We can only live together in that way and if we are to live together and not die together, we must learn a kind of charity and a kind of tolerance, which is absolutely vital to the continuation of human life on this planet.
It is generally accepted that there is no experimental evidence that could distinguish between macroscopic decoherence and the collapse postulate.
From the Stanford Encyclopedia of Philosophy,
It has frequently been claimed, e.g. by De Witt 1970, that the MWI is in principle indistinguishable from the ideal collapse theory. This is not so. The collapse leads to effects that do not exist if the MWI is the correct theory. To observe the collapse we would need a super technology which allows for the “undoing” of a quantum experiment, including a reversal of the detection process by macroscopic devices. See Lockwood 1989 (p. 223), Vaidman 1998 (p. 257), and other proposals in Deutsch 1986. These proposals are all for gedanken experiments that cannot be performed with current or any foreseeable future technology. Indeed, in these experiments an interference of different worlds has to be observed. Worlds are different when at least one macroscopic object is in macroscopically distinguishable states. Thus, what is needed is an interference experiment with a macroscopic body. Today there are interference experiments with larger and larger objects (e.g., fullerene molecules C70, see Brezger et al. 2002 ), but these objects are still not large enough to be considered “macroscopic”. Such experiments can only refine the constraints on the boundary where the collapse might take place. A decisive experiment should involve the interference of states which differ in a macroscopic number of degrees of freedom: an impossible task for today's technology. It can be argued, however, that the burden of an experimental proof lies with the opponents of the MWI, because it is they who claim that there is a new physics beyond the well tested Schrödinger equation.
Other than, say, looking at our computers and comparing them to insects, what other signposts should we look for if we want to calibrate progress towards domain-general artificial intelligence?