
Human values and preferences are hard to specify, especially in complex domains. Accordingly, much AGI safety research has focused on approaches to AGI design that refer to human values and preferences indirectly, by learning a model that is grounded in expressions of human values (via stated preferences, observed behaviour, approval, etc.) and/or real-world processes that generate expressions of those values. There are additionally approaches aimed at modelling or imitating other aspects of human cognition or behaviour without an explicit aim of capturing human preferences (but usually in service of ultimately satisfying them). Let us refer to all these models as human models.

In this post, we discuss several reasons to be cautious about AGI designs that use human models. We suggest that the AGI safety research community put more effort into developing approaches that work well in the absence of human models, alongside the approaches that rely on human models. This would be a significant addition to the current safety research landscape, especially if we focus on working out and trying concrete approaches as opposed to developing theory. We also acknowledge various reasons why avoiding human models seems difficult.

Problems with Human Models

To be clear about human models, we draw a rough distinction between our actual preferences (which may not be fully accessible to us) and procedures for evaluating our preferences. The first thing, actual preferences, is what humans actually want upon reflection. Satisfying our actual preferences is a win. The second thing, procedures for evaluating preferences, refers to various proxies for our actual preferences such as our approval, or what looks good to us (with necessarily limited information or time for thinking). Human models are in the second category; consider, as an example, a highly accurate ML model of human yes/no approval on the set of descriptions of outcomes. Our first concern, described below, is about overfitting to human approval and thereby breaking its connection to our actual preferences. (This is a case of Goodhart’s law.)
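As a toy illustration of this concern (purely schematic, with made-up numbers): suppose each candidate outcome has an actual-preference score, the approval proxy is that score plus independent noise, and we select whichever candidate the proxy rates highest.

```python
import numpy as np

rng = np.random.default_rng(0)

def optimise_approval_proxy(n_candidates: int, n_trials: int = 5000):
    """Pick the candidate the approval proxy rates highest; report its mean score
    under the proxy and under actual preferences, averaged over many trials."""
    actual = rng.normal(size=(n_trials, n_candidates))               # actual preferences
    approval = actual + rng.normal(size=(n_trials, n_candidates))    # noisy approval proxy
    picked = np.argmax(approval, axis=1)
    rows = np.arange(n_trials)
    return approval[rows, picked].mean(), actual[rows, picked].mean()

for n in [2, 10, 100, 1000]:
    approval_score, actual_score = optimise_approval_proxy(n)
    print(f"optimisation pressure n={n:4d}: approval={approval_score:5.2f}  "
          f"actual={actual_score:5.2f}  gap={approval_score - actual_score:5.2f}")
```

In this toy setup the selected outcomes do still improve somewhat as optimisation pressure increases, but the proxy increasingly overestimates how well we have done by our actual preferences, which is the sense in which the connection between approval and actual preferences breaks down.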

Less Independent Audits

Imagine we have built an AGI system and we want to use it to design the mass transit system for a new city. The safety problems associated with such a project are well recognised; suppose we are not completely sure we have solved them, but are confident enough to try anyway. We run the system in a sandbox on some fake city input data and examine its outputs. Then we run it on some more outlandish fake city data to assess robustness to distributional shift. The AGI’s outputs look like reasonable transit system designs and considerations, and include arguments, metrics, and other supporting evidence that they are good. Should we be satisfied and ready to run the system on the real city’s data, and to implement the resulting proposed design?

We suggest that an important factor in the answer to this question is whether the AGI system was built using human modelling or not. If it produced a solution to the transit design problem (that humans approve of) without human modelling, then we would more readily trust its outputs. If it produced a solution we approve of with human modelling, then although we expect the outputs to be in many ways about good transit system design (our actual preferences) and in many ways suited to being approved by humans, to the extent that these two targets come apart we must worry about having overfit to the human model at the expense of the good design. (Why not the other way around? Because our assessment of the sandboxed results uses human judgement, not an independent metric for satisfaction of our actual preferences.)

Humans have a preference for not being wrong about the quality of a design, let alone being fooled about it. How much do we want to rely on having correctly captured these preferences in our system? If the system is modelling humans, we strongly rely on the system learning and satisfying these preferences, or else we expect to be fooled to the extent that a good-looking but actually bad transit system design is easier to compose than an actually-good design. On the other hand, if the system is not modelling humans, then the fact that its output looks like a good design is better evidence that it is in fact a good design. Intuitively, if we consider sampling possible outputs and condition on the output looking good (via knowledge of humans), the probability of it being good (via knowledge of the domain) is higher when the system’s knowledge is more about what is good than what looks good.
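The sampling intuition above can also be made concrete with a toy simulation (again purely schematic): each candidate design has a latent "is good" score and a correlated "looks good" score, the system outputs the candidate maximising an internal objective that mixes the two, and we then audit by conditioning on the output looking good.

```python
import numpy as np

rng = np.random.default_rng(1)

def p_good_given_looks_good(weight_on_good: float,
                            n_outputs: int = 50_000,
                            n_candidates: int = 50) -> float:
    """P(output is actually good | it looks good to the auditors)."""
    good = rng.normal(size=(n_outputs, n_candidates))                 # actual quality
    looks = 0.5 * good + rng.normal(size=(n_outputs, n_candidates))   # what human judgement sees
    # The system's internal objective mixes knowledge of what *is* good and
    # knowledge of what *looks* good, then it outputs its top candidate.
    internal = weight_on_good * good + (1.0 - weight_on_good) * looks
    picked = np.argmax(internal, axis=1)
    rows = np.arange(n_outputs)
    passed_audit = looks[rows, picked] > 1.5     # "this looks like a good design"
    actually_good = good[rows, picked] > 1.5     # "this is in fact a good design"
    return float(actually_good[passed_audit].mean())

for w in [0.0, 0.5, 1.0]:
    print(f"weight on 'what is good' = {w:.1f}  ->  "
          f"P(good | looks good) ~ {p_good_given_looks_good(w):.2f}")
```

In this setup, P(good | looks good) is highest when the internal objective is weighted towards what is good rather than what looks good, matching the intuition above.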

Here is a handle for this problem: a desire for an independent audit of the system’s outputs. When a system uses human modelling, the mutual information between its outputs and the auditing process (human judgement) is higher. Thus, using human models reduces our ability to do independent audits.

Avoiding human models does not avoid this problem altogether. There is still an “outer-loop optimisation” version of the problem. If the system produces a weird or flawed design in sandbox, and we identify this during an audit, we will probably reject the solution and attempt to debug the system that produced it. This introduces a bias on the overall process (involving multiple versions of the system over phases of auditing and debugging) towards outputs that fool our auditing procedure.

However, outer-loop optimisation pressures are weaker, and therefore less worrying, than in-loop optimisation pressures. We would argue that the problem is much worse, i.e., the bias towards fooling is stronger, when one uses human modelling. This is because the relevant optimisation is in-loop instead and is encountered more often.

As one more analogy to illustrate this point, consider a classic Goodhart’s law example of teaching to the test. If you study the material, then take a test, your test score reveals your knowledge of the material fairly well. If you instead study past tests, your test score reveals your ability to pass tests, which may be correlated with your knowledge of the material but is increasingly less likely to be so correlated as your score goes up. Here human modelling is analogous to past tests and actual preferences are analogous to the material. Taking the test is analogous to an audit, which we want to be independent from the study regimen.

Risk from Bugs

We might implement our first AGI system incorrectly in a mundane sense. Specifically, even if we fully develop a theory of safe or aligned AGI, we might fail to implement that theory due to bugs or problems with our implementation techniques. In this case, we would be relatively better off if the mutual information between the AGI’s knowledge and human preferences is low. We expect the system’s behaviour to be dependent on its knowledge in some way, and we expect implementation errors to shift the nature of that dependence away from our intentions and expectations. Incorrect behaviour that depends on human preferences seems more dangerous than incorrect behaviour that does not.

Consider the space of AGI system implementations, under a metric like similarity to an intended design (equivalently: severity of deviation from the design due to bugs). We want all the points near the first AGI system we build to be safe, because we may end up with a slightly different design than intended for reasons such as being confused about what we are doing or making implementation errors.

There are at least three ways in which the risk from bugs can manifest.

Incorrectly Encoded Values

Suppose we intend the first use of AGI to be solving some bounded and well-specified task, but we misunderstand or misimplement it badly enough that what we end up with actually optimises some objective function unboundedly. Then it seems better if that objective is something abstract like puzzle solving rather than something more directly connected to human preferences: consider, as a toy example, what would happen if the sign (positive/negative) of the objective were flipped.
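As a purely illustrative sketch (the outcome space and scores below are invented), the sign-flip worry can be stated in a few lines of code: negating an abstract objective produces a merely useless optimiser, while negating a learned human-preference score produces an optimiser that actively seeks the outcomes humans would rate worst.

```python
from dataclasses import dataclass

@dataclass
class Outcome:
    description: str
    puzzles_solved: int        # abstract, human-independent score
    human_preference: float    # score from a (hypothetical) learned model of human preferences

# A toy outcome space; the numbers are made up for illustration.
outcomes = [
    Outcome("sit idle",                      0,   0.0),
    Outcome("solve many puzzles",          100,  +0.1),
    Outcome("build useful infrastructure",  10,  +0.9),
    Outcome("cause widespread suffering",    0,  -1.0),
]

def optimise(objective):
    """Return the outcome that maximises the given objective."""
    return max(outcomes, key=objective)

# Intended objectives vs. the same objectives with a sign-flip bug.
print("abstract objective:           ", optimise(lambda o: o.puzzles_solved).description)
print("abstract objective, flipped:  ", optimise(lambda o: -o.puzzles_solved).description)
print("preference objective:         ", optimise(lambda o: o.human_preference).description)
print("preference objective, flipped:", optimise(lambda o: -o.human_preference).description)
```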

Manipulation

The earlier arguments for independent audits do not just apply to the specific tasks we would plan to audit, but also to any activities an AGI system might carry out that humans might disapprove of. Examples include finding ways to hack into our supposedly secure systems, hiding its intentions and activity from us, or outright manipulating us. These tasks are much easier with access to a good psychological model of humans, which can be used to infer what mistakes we might make, or what loopholes we might overlook, or how we might respond to different behaviour from the system.

Human modelling is very close to human manipulation in design space. A system with accurate models of humans is close to a system which successfully uses those models to manipulate humans.

Threats

Another risk from bugs comes not from the AGI system caring incorrectly about our values, but from having inadequate security. If our values are accurately encoded in an AGI system that cares about satisfying them, they become a target for threats from other actors who can gain from manipulating the first system. More examples and perspectives on this problem have been described here.

The increased risk from bugs of human modelling can be summarised as follows: whatever the risk that AGI systems produce catastrophic outcomes due to bugs, the very worst outcomes seem more likely if the system was trained using human modelling because these worst outcomes depend on the information in human models.

The problems of less independent audits and of risk from bugs can both be mitigated by preserving the system's independence from human-model information, so that it cannot overfit to that information or use it perversely. The remaining two problems we consider, mind crime and unexpected agents, depend more heavily on the claim that modelling human preferences increases the chances of simulating something human-like.

Mind Crime

Many computations may produce entities that are morally relevant because, for example, they constitute sentient beings that experience pain or pleasure. Bostrom calls improper treatment of such entities “mind crime”. Modelling humans in some form seems more likely to result in such a computation than not modelling them, since humans are morally relevant and the system’s models of humans may end up sharing whatever properties make humans morally relevant.

Unexpected Agents

Similar to the mind crime point above, we expect AGI designs that use human modelling to be more at risk of producing subsystems that are agent-like, because humans are agent-like. For example, we note that trying to predict the output of consequentialist reasoners can reduce to an optimisation problem over a space of things that contains consequentialist reasoners. A system engineered to predict human preferences well seems strictly more likely to run into problems associated with misaligned sub-agents. (Nevertheless, we think the amount by which it is more likely is small.)

Safe AGI Without Human Models is Neglected

Given the independent auditing concern, plus the additional points mentioned above, we would like to see more work done on practical approaches to developing safe AGI systems that do not depend on human modelling. At present, this is a neglected area in the AGI safety research landscape. Specifically, work of the form “Here’s a proposed approach, here are the next steps to try it out or investigate further”, which we might term engineering-focused research, is almost entirely done in a human-modelling context. Where we do see some safety work that eschews human modelling, it tends to be theory-focused research, for example, MIRI’s work on agent foundations. This does not fill the gap of engineering-focused work on safety without human models.

To flesh out the claim of a gap, consider the usual formulations of each of the following efforts within safety research: iterated distillation and amplification, debate, recursive reward modelling, cooperative inverse reinforcement learning, and value learning. In each case, there is human modelling built into the basic setup for the approach. However, we note that the technical results in these areas may in some cases be transportable to a setup without human modelling, if the source of human feedback (etc.) is replaced with a purely algorithmic, independent system.

Some existing work that does not rely on human modelling includes the formulation of safely interruptible agents, the formulation of impact measures (or side effects), approaches involving building AI systems with clear formal specifications (e.g., some versions of tool AIs), some versions of oracle AIs, and boxing/containment. Although they do not rely on human modelling, some of these approaches nevertheless make most sense in a context where human modelling is happening: for example, impact measures seem to make most sense for agents that will be operating directly in the real world, and such agents are likely to require human modelling. Nevertheless, we would like to see more work of all these kinds, as well as new techniques for building safe AGI that does not rely on human modelling.

Difficulties in Avoiding Human Models

A plausible reason why we do not yet see much research on how to build safe AGI without human modelling is that it is difficult. In this section, we describe some distinct ways in which it is difficult.

Usefulness

It is not obvious how to put a system that does not do human modelling to good use. At least, it is not as obvious as for the systems that do human modelling, since they draw directly on sources (e.g., human preferences) of information about useful behaviour. In other words, it is unclear how to solve the specification problem---how to correctly specify desired (and only desired) behaviour in complex domains---without human modelling. The “against human modelling” stance calls for a solution to the specification problem wherein useful tasks are transformed into well-specified, human-independent tasks either solely by humans or by systems that do not model humans.

To illustrate, suppose we have solved some well-specified, complex but human-independent task like theorem proving or atomically precise manufacturing. Then how do we leverage this solution to produce a good (or better) future? Empowering everyone, or even a few people, with access to a superintelligent system that does not directly encode their values in some way does not obviously produce a future where those values are realised. (This seems related to Wei Dai’s human-safety problem.)

Implicit Human Models

Even seemingly “independent” tasks leak at least a little information about their origins in human motivations. Consider again the mass transit system design problem. Since the problem itself concerns the design of a system for use by humans, it seems difficult to avoid modelling humans at all in specifying the task. More subtly, even highly abstract or generic tasks like puzzle solving contain information about the sources/designers of the puzzles, especially if they are tuned for encoding more obviously human-centred problems. (Work by Shah et al. looks at using the information about human preferences that is latent in the world.)

Specification Competitiveness / Do What I Mean

Explicit specification of a task in the form of, say, an optimisation objective (of which a reinforcement learning problem would be a specific case) is known to be fragile: there are usually things we care about that get left out of explicit specifications. This is one of the motivations for seeking increasingly high-level and indirect specifications, leaving more of the work of figuring out what exactly is to be done to the machine. However, it is currently hard to see how to automate the process of turning tasks (vaguely defined) into correct specifications without modelling humans.

Performance Competitiveness of Human Models

It could be that modelling humans is the best way to achieve good performance on various tasks we want to apply AGI systems to, for reasons that go beyond understanding the problem specification well. For example, there may be aspects of human cognition that we want to more or less replicate in an AGI system, for competitiveness at automating those cognitive functions, and those aspects may carry a lot of information about human preferences with them in a hard-to-separate way.

What to Do Without Human Models?

We have seen arguments for and against aspiring to solve AGI safety using human modelling. Looking back on these arguments, we note that to the extent that human modelling is a good idea, it is important to do it very well; to the extent that it is a bad idea, it is best to not do it at all. Thus, whether or not to do human modelling at all is a configuration bit that should probably be set early when conceiving of an approach to building safe AGI.

It should be noted that the arguments above are not intended to be decisive, and there may be countervailing considerations which mean we should promote the use of human models despite the risks outlined in this post. However, to the extent that AGI systems with human models are more dangerous than those without, there are two broad lines of intervention we might attempt. Firstly, it may be worthwhile to try to decrease the probability that advanced AI develops human models “by default”, by promoting some lines of research over others. For example, an AI trained in a procedurally-generated virtual environment seems significantly less likely to develop human models than an AI trained on human-generated text and video data.

Secondly, we can focus on safety research that does not require human models, so that if we eventually build AGI systems that are highly capable without using human models, we can make them safer without needing to teach them to model humans. Examples of such research, some of which we mentioned earlier, include developing human-independent methods to measure negative side effects, to prevent specification gaming, to build secure approaches to containment, and to extend the usefulness of task-focused systems.

Acknowledgements: thanks to Daniel Kokotajlo, Rob Bensinger, Richard Ngo, Jan Leike, and Tim Genewein for helpful comments on drafts of this post.

Comments

Wow, now I take the "But what if a bug puts a negation on the utility function" AGI failure mode more seriously:

One of our code refactors introduced a bug which flipped the sign of the reward. Flipping the reward would usually produce incoherent text, but the same bug also flipped the sign of the KL penalty. The result was a model which optimized for negative sentiment while preserving natural language. Since our instructions told humans to give very low ratings to continuations with sexually explicit text, the model quickly learned to output only content of this form. This bug was remarkable since the result was not gibberish but maximally bad output. The authors were asleep during the training process, so the problem was noticed only once training had finished. (From https://openai.com/blog/fine-tuning-gpt-2/)

Might be worth adding a link to this episode in the text?

That does seem interesting and concerning.

Minor: The link didn’t work for me; in case others have the same problem, here is (I believe) the correct link.

While I don't disagree with the reasoning, I disagree with the main thrust of this post. Under my inside view, I think that we should accept that there will be human models in AI systems and figure out how to deal with them. From an outside view perspective, I agree that work on avoiding human models is neglected, but it seems like this is because it only matters in a very particular set of futures. If you want to avoid human models, it seems that a better approach would be to figure out how to navigate into that set of futures.

Avoiding human models necessarily loses a lot of performance

(This point is similar to the ones made in the "Usefulness" and "Specification Competitiveness" sections, stated more strongly and more abstractly. It may be obvious; if so, feel free to skip to the next section.)

Consider the following framework. There is a very large space of behaviors (or even just goals) that an AI system could have, and there need to be a lot of bits of information in order to select the behavior/goal that we actually want from our AI system. Each bit of information corresponds to halving the space of possible behaviors and goals that the AI system could have, if the AI started out as potentially having any possible behavior/goal. (A more formal treatment would consider the entropy of the distribution over possible behaviors/goals.)

Note that this is a very broad definition of "bits of information about the desired behavior/goal": for example, I think that "ceteris paribus, we prefer low impact actions and plans" counts as a (relatively) small number of bits, and these are the bits that impact measures are working with.

It is also important that the bits of information are interpreted correctly by the AI system. I have said before that I worry that an impact measure strong enough to prevent all catastrophes would probably lead to an AI system that never does anything; in this framework, my concern is that the bits of information provided by an impact measure are being misinterpreted as definitively choosing a particular behavior/goal (i.e. providing the maximum possible number of bits, rather than the relatively small number of bits it should be). I'm more excited about learning from the state of the world because there are more bits (since you can tell which impactful behaviors are good vs. bad), and the bits are interpreted more correctly (since it is interpreted as Bayesian evidence rather than a definitive answer).
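A toy quantification of this framing (illustrative only: the prior over goals and both kinds of evidence are invented) measures information as the drop in entropy of a distribution over candidate goals, so a coarse constraint such as an impact measure contributes only a couple of bits, while richer preference data contributes many more.

```python
import numpy as np

def entropy_bits(p: np.ndarray) -> float:
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

# A uniform prior over 1024 candidate goals: 10 bits of initial uncertainty.
n_goals = 1024
prior = np.full(n_goals, 1.0 / n_goals)

# "Ceteris paribus, prefer low-impact plans": consistent with a quarter of the goals.
impact_mask = np.zeros(n_goals)
impact_mask[: n_goals // 4] = 1.0
posterior_impact = prior * impact_mask
posterior_impact /= posterior_impact.sum()

# Rich preference data: concentrates most of the mass on a handful of goals,
# but (treated as Bayesian evidence) never rules the others out entirely.
rich_likelihood = np.full(n_goals, 1e-3)
rich_likelihood[:4] = 1.0
posterior_rich = prior * rich_likelihood
posterior_rich /= posterior_rich.sum()

print(f"prior uncertainty:          {entropy_bits(prior):5.2f} bits")
print(f"after impact-measure bits:  {entropy_bits(posterior_impact):5.2f} bits "
      f"(gained {entropy_bits(prior) - entropy_bits(posterior_impact):.2f})")
print(f"after rich preference data: {entropy_bits(posterior_rich):5.2f} bits "
      f"(gained {entropy_bits(prior) - entropy_bits(posterior_rich):.2f})")
```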

In this framework, the most useful AI systems will be the ones that can get and correctly interpret the largest number of bits about what humans want; and behave reasonably with respect to any remaining uncertainty about behavior/goals. But even having a good idea of what the desired goal/behavior is means that you understand humans very well; which means that you are capable of modeling humans, and leads to all of the problems mentioned in this post. (ETA: Note that these could be implicit human models.) So, in order to avoid these problems, you need to have your AI systems have fewer bits about the desired goal/behavior. Such systems will not be nearly as useful and will have artificial upper limits on performance in particular domains. (Compare our current probably-rule-based Siri with something along the lines of Samantha from Her.)

(The Less Independent Audits point cuts against this slightly, but in my opinion not by much.)

Is it okay to sacrifice performance?

While it is probably technically feasible to create AI systems without human models, it does not seem strategically feasible to me. That said, there are some strategic views under which this seems feasible. The key property you need is that we do not build the most useful AI systems before we have solved issues with human models; i.e. we have to be able to sacrifice the competitiveness desideratum.

This could be done with very strong global coordination, but my guess is that this article is not thinking about that case so I'll ignore that possibility. It could also be done by having a single actor (or aligned group of actors) develop AGI with a discontinuous leap in capabilities, and the resulting AGI then quickly improves enough to execute a pivotal act. That actor can then unilaterally decide not to create the most useful AI systems from that point on, and prevent them from having human models.

How does current research on avoiding human models help in this scenario?

If the hope is to prevent human models after the pivotal act, that doesn't seem to rely much on current technical research -- the most significant challenge is in having a value aligned actor create AGI in the first place; after which you could presumably take your time solving AI safety concerns. Of course having some technical research on what to do after the pivotal act would be useful in convincing actors in the first place, but that's a very different argument for the importance of this research and I would expect to do significantly different things to achieve this goal.

That leads me to conclude that this research would be impactful by preventing human models before a pivotal act. This means that we need to create an AI that (with the assistance of humans) executes a plan that leads the humans + AI to take over the world -- but the AI must do this without being able to consider how human society will respond to any action it takes (since that would require human models). This seems to limit you to plans that humans come up with, which can make use of specific narrow "superpowers" (e.g. powerful new technology). This seems to me particularly difficult to accomplish, but I don't have a strong argument for this besides intuition.

It could be that all the other paths seem even more doomed; if that's the main motivation for this then I think that claim should be added somewhere in this post.

Summary

It seems like work on technical AI safety research without human models is especially impactful only in the scenario where a single actor uses the work in order to create an AI system that without modeling humans is able to execute a pivotal act (which usually also rests on assumptions of discontinuous AI progress and/or some form of fast takeoff). This seems particularly unlikely to me. If this is the main motivating scenario, it also places further constraints on technical safety research that avoids human models: the safety measures need to be loose enough that the AI system is still able to help humans execute a pivotal act.

Another plausible scenario would be strong global coordination around not building dangerous AI systems, including ones that have human models. I don't have strong inside view beliefs on that scenario but my guess is other people are pessimistic about that scenario.

Some existing work that does not rely on human modelling includes the formulation of safely interruptible agents, the formulation of impact measures (or side effects), approaches involving building AI systems with clear formal specifications (e.g., some versions of tool AIs), some versions of oracle AIs, and boxing/containment.

I claim that all of these approaches appear not to rely on human modeling because they are only arguing for safety properties and not usefulness properties, and in order for them to be useful they will need to model humans. (The one exception might be tool AIs + formal specifications, but for the reasons in the parent comment I think that these will have an upper limit on usefulness.)

Re: independent audits, although they're not possible for this particular problem, there are many close variants of this problem such that independent audits are possible. Let's think of human approval as a distorted view of our actual preferences, and our goal is to avoid things which are really bad according to our undistorted actual preferences. If we pass distorted human approval to our AI system, and the AI system avoids things which are really bad according to undistorted human approval, that suggests the system is robust to distortion.

For example:

  • Input your preferences extremely quickly, then see if the result is acceptable when you're given more time to think about it.
  • Input your preferences while drunk, then see if the result is acceptable to your sober self.
  • Tell your friend they can only communicate using gestures. Have a 5-minute "conversation" with them, then go off and input their preferences as you understand them. See if they find the result acceptable.
  • Distort the inputs in code. This lets you test out a very wide range of distortion models and see which produce acceptable performance.
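A minimal sketch of the last suggestion, distorting the inputs in code (illustrative only: preferences are a toy linear score over option features, the distortion is Gaussian noise on the preference weights, and the audit reports the pick's percentile under the undistorted preferences):

```python
import numpy as np

rng = np.random.default_rng(2)

def audit_under_distortion(distortion_sd: float,
                           n_labels: int = 200,
                           n_features: int = 5,
                           n_options: int = 1000) -> float:
    """True preferences are a linear score over option features.  The system only
    sees approval labels generated from a *distorted* copy of those preferences.
    It fits weights to the distorted labels, picks its favourite option, and we
    audit that pick against the undistorted preferences (percentile rank: near 1.0
    is the best available option, 0.5 is a random one)."""
    w_true = rng.normal(size=n_features)
    w_distorted = w_true + distortion_sd * rng.normal(size=n_features)

    # Labelled data as seen through the distortion.
    X = rng.normal(size=(n_labels, n_features))
    y = X @ w_distorted
    w_fit, *_ = np.linalg.lstsq(X, y, rcond=None)

    # The system's favourite option, audited against actual preferences.
    options = rng.normal(size=(n_options, n_features))
    pick = options[np.argmax(options @ w_fit)]
    return float((options @ w_true < pick @ w_true).mean())

for sd in [0.0, 0.2, 0.5, 1.0, 2.0]:
    print(f"distortion sd={sd:3.1f}  percentile of pick under actual preferences: "
          f"{audit_under_distortion(sd):.2f}")
```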

It would be helpful if people could outline some plausible-seeming scenarios for how divergence between approval and actual preferences could cause a catastrophe, in order to get a better sense for the appropriate noise model.

It would be helpful if people could outline some plausible-seeming scenarios for how divergence between approval and actual preferences could cause a catastrophe, in order to get a better sense for the appropriate noise model.

One scenario that comes to mind: an agent generates a manipulative output that is optimized to be approved by the programmers while causing the agent to seize control over more resources (in a way that is against the actual preferences of the programmers).

Some existing work that does not rely on human modelling includes the formulation of safely interruptible agents, the formulation of impact measures (or side effects), approaches involving building AI systems with clear formal specifications (e.g., some versions of tool AIs), some versions of oracle AIs, and boxing/containment.

Most of these require at least partial specification of human preferences, hence partial modelling of humans: https://www.lesswrong.com/posts/sEqu6jMgnHG2fvaoQ/partial-preferences-needed-partial-preferences-sufficient

Another risk from bugs comes not from the AGI system caring incorrectly about our values, but from having inadequate security. If our values are accurately encoded in an AGI system that cares about satisfying them, they become a target for threats from other actors who can gain from manipulating the first system.

I agree that this is a serious risk, but I wouldn't categorise it as a "risk from bugs". Every actor with goals faces the possibility that other actors may attempt to gain bargaining leverage by threatening to deliberately thwart these goals. So this does not require bugs; rather, the problem arises by default for any actor (human or AI), and I think there's no obvious solution. (I've written about surrogate goals as a possible solution for at least some parts of the problem).

the very worst outcomes seem more likely if the system was trained using human modelling because these worst outcomes depend on the information in human models.

What about the possibility that the AGI system threatens others, rather than being threatened itself? Prima facie, that might also lead to worst-case outcomes. Do you envision a system that's not trained using human modelling and therefore just wouldn't know enough about human minds to make any effective threats? I'm not sure how an AI system can meaningfully be said to have "human-level general intelligence" and yet be completely inept in this regard. (Also, if you have such fine-grained control over what your system does or does not know about, or if you can have it do very powerful things without possessing dangerous kinds of knowledge and abilities, then I think many commonly discussed AI safety problems become non-issues anyway, as you can just constrain the system accordingly.)

What about the possibility that the AGI system threatens others, rather than being threatened itself? Prima facie, that might also lead to worst-case outcomes.

I think a good intuition pump for this idea is to contrast an arbitrarily powerful paperclip maximizer with an arbitrarily powerful something-like-happiness maximizer.

A paperclip maximizer might resort to threats to get what it wants; and in the long run, it will want to convert all resources into paperclips and infrastructure, to the exclusion of everything humans want. But the "normal" failure modes here tend to look like human extinction.

In contrast, a lot of "normal" failure modes for a something-like-happiness maximizer might look like torture, because the system is trying to optimize something about human brains, rather than just trying to remove humans from the picture so it can do its own thing.

Do you envision a system that's not trained using human modelling and therefore just wouldn't know enough about human minds to make any effective threats? I'm not sure how an AI system can meaningfully be said to have "human-level general intelligence" and yet be completely inept in this regard.

I don't know specifically what Ramana and Scott have in mind, but I'm guessing it's a combination of:

  • If the system isn't trained using human-related data, its "goals" (or the closest things to goals it has) are more likely to look like the paperclip maximizer above, and less likely to look like the something-like-happiness maximizer. This greatly reduces downside risk if the system becomes more capable than we intended.
  • When AI developers build the first AGI systems, the right move will probably be to keep their capabilities to a bare minimum — often the minimum stated in this context is "make your system just capable enough to help make sure the world's AI doesn't cause an existential catastrophe in the near future". If that minimal goal doesn't require fluency with certain high-risk domains, then developers should just avoid letting their AGI systems learn about those domains, at least until they've gotten a lot of experience with alignment.

The first developers are in an especially tough position, because they have to act under more time pressure and they'll have very little experience with working AGI systems. As such, it makes sense to try to make their task as easy as possible. Alignment isn't all-or-nothing, and being able to align a system with one set of capabilities doesn't mean you can do so for a system with stronger or more varied capabilities.

If you want to say that such a system isn't technically a "human-level general intelligence", that's fine; the important question is about impact rather than definitions, as long as it's clear that when I say "AGI" I mean something like "system that's doing qualitatively the right kind of reasoning to match human performance in arbitrary domains, in large enough quantities to be competitive in domains like software engineering and theoretical physics", not "system that can in fact match human performance in arbitrary domains".

(Also, if you have such fine-grained control over what your system does or does not know about, or if you can have it do very powerful things without possessing dangerous kinds of knowledge and abilities, then I think many commonly discussed AI safety problems become non-issues anyway, as you can just constrain the system [accordingly].)

Yes, this is one of the main appeals of designing systems that (a) make it easy to blacklist or whitelist certain topics, (b) make it easy to verify that the system really is or isn't thinking about a particular domain, and (c) make it easy to blacklist human modeling in particular. It's a very big deal if you can just sidestep a lot of the core difficulties in AI safety (in your earliest AGI systems). E.g., operator manipulation, deception, mind crime, and some aspects of the fuzziness and complexity of human value.

We don't currently know how to formalize ideas like 'whitelisting cognitive domains', however, and we don't know how to align an AGI system in principle for much more modest tasks, even given a solution to those problems.

Thanks for elaborating. There seem to be two different ideas:

1), that it is a promising strategy to try and constrain early AGI capabilities and knowledge

2), that even without such constraints, a paperclipper entails a smaller risk of worst-case outcomes with large amounts of disvalue, compared to a near miss. (Brian Tomasik has also written about this.)

1) is very plausible, perhaps even obvious, though as you say it's not clear how feasible this will be. I'm not convinced of 2), even though I've heard / read many people expressing this idea. I think it's unclear what would result in more disvalue in expectation. For instance, a paperclipper would have no qualms about threatening other actors (with something that we would consider disvalue), while a near-miss might still have such qualms, depending on what exactly the failure mode is. In terms of incidental suffering, it's true that a near-miss is more likely to do something about human minds, but again it's also possible the system is, despite the failure, still compassionate enough to refrain from this, or use digital anesthesia. (It all depends on what plausible failure modes look like, and that's very hard to say.)

I actually think this is pretty wrong (posts forthcoming, but see here for the starting point). You make a separation between the modeled human values and the real human values, but "real human values" are a theoretical abstraction, not a basic part of the world. In other words, real human values were always a subset of modeled human values.

In the example of designing a transit system, there is an unusually straightforward division between things that actually make the transit system good (by concise human-free metrics like reliability or travel time), and things that make human evaluators wrongly think it's good. But there's not such a concise human-free way to write down general human values.

The pitfall of optimization here happens when the AI is searching for an output that has a specific effect on humans. If you can't remove the fact that there is a model of humans involved, then the AI has to be evaluating its output in some other way than modeling the human's reaction to it.

As far as I understand the post, a system that wouldn't contain human values but would still be sufficient to drastically reduce existential risk from AI would not need to execute an action that has a specific effect on humans. If I'm getting the context right, it refers to something like task-directed AGI that would allow the owner to execute a pivotal act – in other words, this is not yet the singleton we want to (maybe) finally build that CEVs us out into the universe, but something that enables us to think long & careful enough to actually build CEV safely (e.g. by giving us molecular nanotechnology or uploading that perhaps doesn't depend on human values, modeled or otherwise).

Or have I misunderstood your comment?

I've heard it claimed that better calibration is not the way to solve AI safety, but it seems like a promising solution to the transit design problem. Suppose we have a brilliant Bayesian machine learning system. Given a labeled dataset of transit system designs we approve/disapprove of, our system estimates the probability that any given model is the "correct" model which separates good designs from bad designs. Now consider two models chosen for the sake of argument: a "human approval" model and an "actual preferences" model. The probability of the "human approval" model will be rated very high. But I'd argue that the probability of the "actual preferences" model will also be rated rather high, because the labeled dataset we provide will be broadly compatible with our actual preferences. As long as the system assigns a reasonably high prior probability to our actual preferences, and the likelihood of the labels given our actual preferences is reasonably high, we should be OK.

Then instead of aiming for a design which is easy to compose, we aim for a design whose probability of being good is maximal when the model gets summed out. This means we're maximizing an objective which includes a wide variety of models which are broadly compatible with the labeled data... including, in particular, our "actual preferences".

In other words, find many reasonable ways of extrapolating the labeled data, and select a transit system which is OK according to all of them. (Or even select a transit system which is OK according to half of them, then use the other half as a test set. Note that it's not necessary for our actual preferences to be among the ensemble of models if for any veto that our actual preferences would make, there's some model in the ensemble that also makes that veto.)
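A minimal sketch of this ensemble idea (illustrative only: the "models" are random perturbations of a shared linear scorer, and the posterior over models is a placeholder for what real Bayesian inference over the labeled data would produce):

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical ensemble of models extrapolating the labeled data: perturbations
# of a shared linear scorer over design features.
n_models, n_designs, n_features = 20, 500, 4
shared = rng.normal(size=n_features)
model_weights = shared + 0.3 * rng.normal(size=(n_models, n_features))
designs = rng.normal(size=(n_designs, n_features))
scores = designs @ model_weights.T                    # shape (n_designs, n_models)
p_good = 1.0 / (1.0 + np.exp(-scores))                # each model's P(design is good)

# Placeholder for P(model | labeled data); a real system would compute this.
posterior = rng.dirichlet(np.ones(n_models))

# Option 1: maximise the model-averaged probability that the design is good.
best_averaged = int(np.argmax(p_good @ posterior))

# Option 2: keep only designs that every reasonably probable model finds acceptable.
credible = posterior > 0.01
ok_to_all = (p_good[:, credible] > 0.5).all(axis=1)

print("design maximising model-averaged P(good):", best_averaged)
print("number of designs acceptable to every credible model:", int(ok_to_all.sum()))
```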

I'd argue from a safety point of view, it's more important to have an acceptable transit system than an optimal transit system. Similarly, the goal with our first AGI should be to put the world on an acceptable trajectory, not the optimal trajectory. If the world is on an acceptable trajectory, we can always work to improve things. If the world shifts to an unacceptable trajectory, we may not be able to improve things. So to a first approximation, our first AGI should work to minimize the odds that the world is on an unacceptable trajectory, according to its subjective estimate of what constitutes an unacceptable trajectory.

This is a very carefully reasoned and detailed post, which lays out a clear framework for thinking about approaches to alignment, and I'm especially excited because it points to one quadrant - engineering-focused research without human models - as highly neglected. For these three reasons I've curated the post.

Eliezer's behaviorist genie is explicitly about not modeling humans, and I think falls under "engineering-focused work" although I'm not sure how much work has gone into it aside from that article.

I'm afraid it is generally infeasible to avoid modelling humans at least implicitly. One reason for that is that basically any practical ontology we use is implicitly human. In a sense the only implicitly non-human knowledge is quantum field theory (and even that is not clear).

For example: while human-independent methods to measure negative side effects seem human-independent on the surface, it seems to me that a lot of ideas about humans creep into the details. The proposals I've seen generally depend on some coarse-graining of states - you at least want to somehow remove time from the state, but generally you do coarse-graining based on ...actually, what humans value. (If this research agenda were really trying to avoid implicit human models, I would expect people to spend a lot of effort on measures of quantum entanglement, decoherence, and similar topics.)

The goal is to avoid particular hazards, rather than to make things human-independent as an end in itself. So if we accidentally use a concept of "human-independent" that yields impractical results like "the only safe concepts are those of fundamental physics", we should just conclude that we were using the wrong conception of "human-independent". A good way to avoid this is to keep revisiting the concrete reasons we started down this path in the first place, and see which conceptions capture our pragmatic goals well.

Here are some examples of concrete outcomes that various AGI alignment approaches might want to see, if they're intended to respond to concerns about human models:

  • The system never exhibits thoughts like "what kind of agent built me?"
  • The system exhibits thoughts like that, but never arrives at human-specific conclusions like "my designer probably has a very small working memory" or "my designer is probably vulnerable to the clustering illusion".
  • The system never reasons about powerful optimization processes in general. (In addition to steering a wide berth around human models, this might be helpful for guarding against AGI systems doing some varieties of undesirable self-modification or building undesirable smart successors.)
  • The system only allocates cognitive resources to solving problems in a specific domain like "biochemistry" or "electrical engineering".

Different alignment approaches can target different subsets of those goals, and of many other similar goals, depending on what they think is feasible and important for safety.

As I see it, a big part of the problem is that there is an inherent tension between "concrete outcomes avoiding general concerns with human models" and "how systems interacting with humans must work". I would expect that the more you want to avoid general concerns with human models, the more "impractical" suggestions you get - or in other words, that the tension between the "Problems with h.m." and "Difficulties without h.m." is a tradeoff you cannot avoid by conceptualisations.

I would suggest using grounding in QFT not as an example of an obviously wrong conceptualisation, but as a useful benchmark of "actually human-model-free". Comparison to the benchmark may then serve as a heuristic pointing to where (at least implicit) human modelling creeps in. In the above-mentioned example of avoiding side effects, the way the "coarse-graining" of the space is done is actually a point where Goodharting may happen, and thinking in that direction can maybe even lead to some intuitions about how much information about humans got in.

One possible counterargument to the conclusion of the OP is that the main "tuneable" parameters we are dealing with are I. "modelling humans explicitly vs modelling humans implicitly", and II. "total amount of human modelling". Then it is possible that competitive systems exist only in some part of this space. And by pushing hard on the "total amount of human modelling" parameter we can get systems which do less human modelling, but when they do it, it happens mostly in implicit, hard-to-understand ways.

That all seems generally fine to me. I agree the tradeoffs are the huge central difficulty here; getting to sufficiently capable AGI sufficiently quickly seems enormously harder if you aren't willing to cut major corners on safety.

Human modelling is very close to human manipulation in design space. A system with accurate models of humans is close to a system which successfully uses those models to manipulate humans.

Trying to communicate why this sounds like magical thinking to me... Taylor is a data scientist for the local police department. Taylor notices that detectives are wasting a lot of time working on crimes which never get solved. They want to train a logistic regression on the crime database in order to predict whether a given crime will ever get solved, so detectives can focus their efforts on crimes that are solvable. Would you advise Taylor against this project, on the grounds that the system will be "too close in design space" to one which attempts to commit the perfect crime?

Although they do not rely on human modelling, some of these approaches nevertheless make most sense in a context where human modelling is happening: for example, impact measures seem to make most sense for agents that will be operating directly in the real world, and such agents are likely to require human modelling.

Let's put AI systems into two categories: those that operate in the real world and those that don't. The odds of x-risk from the second kind of system seem low. I'm not sure what kind of safety work is helpful, aside from making sure it truly does not operate in the real world. But if a system does operate in the real world, it's probably going to learn about humans and acquire knowledge about our preferences. Which means you have to solve the problems that implies.

My steelman of this section is: Find a way to create a narrow AI that puts the world on a good trajectory.

I continue to agree with my original comment on this post (though it is a bit long-winded and goes off on more tangents than I would like), and I think it can serve as a review of this post.

If this post were to be rewritten, I'd be particularly interested to hear example "deployment scenarios" where we use an AGI without human models and this makes the future go well. I know of two examples:

  1. We use strong global coordination to ensure that no powerful AI systems with human models are ever deployed.
  2. We build an AGI that can do science / engineering really well (STEM AI), use it to build technology that allows us to take over the world, and then proceed carefully to make the future good.

I don't know if anyone endorses these as plans for the future; either way I have serious qualms with both of them.

I think this post helped draw attention to an under-appreciated strategy/approach to AI safety such that I am very glad that it exists.

This post raises awareness about some important and neglected problems IMO. It's not flashy or mind-blowing, but it's solid.

A toy model I find helpful is correlated vs uncorrelated safety measures. Suppose we have 3 safety measures. Suppose if even 1 safety measure succeeds, our AI remains safe. And suppose each safety measure has a 60% success rate in the event of an accident. If the safety measures are accurately described by independent random variables, our odds of safety in an accident are 1 - 0.4^3 = 94%. If the successes of the safety measures are perfectly correlated, failure of one implies certain failure of the others, and our odds of safety are only 1 - 0.4 = 60%.
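A quick simulation check of this arithmetic (using the same toy numbers: three measures, each with a 60% success rate):

```python
import numpy as np

rng = np.random.default_rng(4)

def p_safe(success_rate: float, n_measures: int, correlated: bool,
           n_sims: int = 100_000) -> float:
    """P(at least one safety measure succeeds) in an accident."""
    if correlated:
        # Perfectly correlated: all measures succeed or fail together.
        draws = rng.random((n_sims, 1)) < success_rate
        successes = np.repeat(draws, n_measures, axis=1)
    else:
        # Independent: each measure succeeds or fails on its own.
        successes = rng.random((n_sims, n_measures)) < success_rate
    return float(successes.any(axis=1).mean())

print(f"independent safety measures:          P(safe) ~ {p_safe(0.6, 3, correlated=False):.2f}")
print(f"perfectly correlated safety measures: P(safe) ~ {p_safe(0.6, 3, correlated=True):.2f}")
# Analytic values: 1 - 0.4**3 = 0.936 and 0.6 respectively.
```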

In my mind, this is a good argument for working on ideas like safely interruptible agents, impact measures, and boxing. The chance of these ideas failing seems fairly independent from the chance of your value learning system failing.

But I think you could get a similar effect by having your AGI search for models whose failure probabilities are uncorrelated with one another. The better your AGI, the better this approach is likely to work.

[Eli's personal notes. Feel free to ignore or to engage.]

Suppose we intend the first use of AGI to be solving some bounded and well-specified task, but we misunderstand or misimplement it badly enough that what we end up with actually optimises some objective function unboundedly. Then it seems better if that objective is something abstract like puzzle solving rather than something more directly connected to human preferences: consider, as a toy example, what would happen if the sign (positive/negative) of the objective were flipped.

The basic idea here is that if we screw up so badly that what we thought was a safely bounded tool-AI, is actually optimizing to tile the universe with something, it is better if it tiles the universe with data-centers doing math proofs than something that refers to what humans want?

Why would that be?

Because whereas math-proof data-centers might result in our inadvertent death, something that refers to what humans want might result in deliberate torture.

I want to note that either case of screwing up this badly currently feels pretty implausible to me.

[Eli's personal notes. Feel free to ignore or engage.]

We suggest that an important factor in the answer to this question is whether the AGI system was built using human modelling or not. If it produced a solution to the transit design problem (that humans approve of) without human modelling, then we would more readily trust its outputs. If it produced a solution we approve of with human modelling, then although we expect the outputs to be in many ways about good transit system design (our actual preferences) and in many ways suited to being approved by humans, to the extent that these two targets come apart we must worry about having overfit to the human model at the expense of the good design. (Why not the other way around? Because our assessment of the sandboxed results uses human judgement, not an independent metric for satisfaction of our actual preferences.)

Short summary: If an AI system is only modeling the problem that we want it to solve, and it produces a solution that looks good to us, we can be pretty confident that it is actually a good solution.

Whereas, if it is modeling some problem, and modeling us, we can't be sure where the solution lies on the spectrum of "actually good" solutions vs. "bad solutions that appear good to us."

Great summary!

A number of the comments have pointed in the direction of my concern with what I interpret to be the underlying assumption of this post: namely, that it is possible at all to build AI, general or narrow, even restricted to a small domain, using something untouched enough by humans that implicit, partial modeling of humans will not happen as a result. This is not to deny that much current AI safety work is extremely human-centric, going so far as to rely on uniquely human capabilities (at least unique among known things), and that this is in itself a problem for many of the reasons you lay out, but I think it would be a mistake to think we can somehow get away from humans in building AGI.

The reality is that humans are involved in the work of building AGI, involved in the design and construction of the hardware they will run on, the data sets they will use, etc., and even if we think we've removed the latent human-shaped patterns from our algorithms, hardware, and data, we should strongly suspect we are mistaken because humans are tremendously bad at noticing when they are assuming something true of the world when it is actually true of their understanding, i.e. I would expect it to be more likely that humans would fail to notice their latent presence in a "human-model-free" AI than for the AI to actually be free of human modeling.

Thus, to go down the path of building AGI without human models risks failure because we fail to deal with the AGI picking up on the latent patterns of humanity within it. This is not to say that we should stick to a human-centric approach, because it has many problems as you've described, but trying to avoid humans altogether means failing to make our systems robust to the kinds of interference from humans that can push us away from the goal of safe AI, especially unexpected and unplanned-for interference due to hidden human influence. If we instead build expecting to deal with, and be robust to, the influence of humans, we stand a much better chance of producing safe AI than by either being human-centric or overly ignoring humans.

Modelling humans in some form seems more likely to result in such a computation than not modelling them, since humans are morally relevant and the system’s models of humans may end up sharing whatever properties make humans morally relevant.

The moral relevance of human intelligence is the first thing I'll think about here. I wrote an article about it, and as Prof. Gary Francione said:

“[…] cognitive characteristics beyond sentience are morally irrelevant […] being “smart” may matter for some purposes, such as whether we give someone a scholarship, but it is completely irrelevant to whether we use someone as a forced organ donor, as a non-consenting subject in a biomedical experiment.”

To attribute to a living being preferences, desires, and interests, and purposeful action to achieve them, is to attribute mental states that go beyond the mere ability to feel and perceive things. It goes beyond the accepted definition of “sentience”. Yet it seems obvious that not all species possess these attributes in equal degrees.

By the way, Asimov's "Three Laws" – the first attempt to solve AI safety – do not model humans in an explicit way. They need only the concepts of "harm" and "injury". But the robot may still need to create a human model to predict which human actions will cause injury.