The questions there would be more like "what sequence of reward events will reinforce the desired shards of value within the AI?" and not "how do we philosophically do some fancy framework so that the agent doesn't end up hacking its sensors or maximizing the quotation of our values?".
I think it generally seems like a good idea to have solid theories of two different things: (1) what we are trying to teach (what the desired shards of value actually look like), and (2) how to teach it (what sequence of reward events will reinforce those shards).
I read your above paragraph as maligning (1) in favor of (2). In order to reinforce the desired shards, it seems helpful to have some idea of what those look like.
For example, if we avoid fancy philosophical frameworks, we might think a good way to avoid wireheading is to introduce negative examples where the AI manipulates circuitry to boost reinforcement signals, and positive examples where the AI doesn't do that when given the opportunity. After doing some philosophy where you try to positively specify what you're trying to train, it's easier to notice that this sort of training still leaves the human-manipulation failure mode open.
After doing this kind of philosophy for a while, it's intuitive to form the more general prediction that if you haven't been able to write down a formal model of the kind of thing you're trying to teach, there are probably easy failure modes like this which your training hasn't attempted to rule out at all.
The basic idea behind compressed pointers is that you can have the abstract goal of cooperating with humans, without actually knowing very much about humans. [...] In machine-learning terms, this is the question of how to specify a loss function for the purpose of learning human values.
In machine-learning terms, this is the question of how to train an AI whose internal cognition reliably unfolds into caring about people, in whatever form that takes in the AI's learned ontology (whether or not it has a concept for people).
Thinking about this now, I think maybe it's a question of precautions, and what order you want to teach things in. Very similarly to the argument that you might want to make a system corrigible first, before ensuring that it has other good properties -- because if you make a mistake, later, a corrigible system will let you correct the mistake.
Similarly, it seems like a sensible early goal could be 'get the system to understand that the sort of thing it is trying to do, in (value) learning, is to pick up human values'. Because once it has understood this point correctly, it is harder for things to go wrong later on, and the system may even be able to do much of the heavy lifting for you.
Really, what makes me go to the meta-level like this is pessimism about the more direct approach. Directly trying to instill human values, rather than first training in a meta-level understanding of that task, doesn't seem like a very correctable approach. (I think much of this pessimism comes from mentally visualizing humans arguing about what object-level values to try to teach an AI. Even if the humans are able to agree, I do not feel especially optimistic about their choices, even if they're supposedly informed by neuroscience and not just moral philosophy.)
If you commit to the specific view of outer/inner alignment, then now you also want your loss function to "represent" that goal in some way.
I think it is reasonable as engineering practice to try and make a fully classically-Bayesian model of what we think we know about the necessary inductive biases -- or, perhaps more realistically, a model which only violates classic Bayesian definitions where necessary in order to represent what we want to represent.
This is because writing down the desired inductive biases as an explicit prior can help us to understand what's going on better.
It's tempting to say that to understand how the brain learns is to understand how it treats feedback as evidence, and updates on that evidence. Of course, there could certainly be other theoretical frames which are more productive. But at a deep level, if the learning works, the learning works because the feedback is evidence about the thing we want to learn, and the process which updates on that feedback embodies (something like) a good prior telling us how to update on that evidence.
And if that framing is wrong somehow, it seems intuitive to me that the problem should be describable within that ontology. For example, I think "utility function" is not a very good way to think about values, because: what is it a function of? We don't have a commitment to a specific low-level description of the universe which is appropriate as the input to a utility function. We can easily move beyond this by taking expected values as the "values/preferences" representation, without worrying about what underlying utility function generates those expected values.
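As a sketch of what "expected values without an underlying utility function" can look like formally (this is my own gloss, in the Jeffrey-Bolker style, not a formula from the thread): assign a value $V(A)$ directly to each event $A$, constrained only by an averaging/coherence condition, with no function on fully-specified worlds required:

```latex
% Preferences as expected values: V is assigned directly to events,
% with no underlying pointwise utility function over worlds.
% Coherence (Jeffrey's averaging axiom): for any event A and any B,
V(A) \;=\; \frac{P(A \wedge B)\,V(A \wedge B) \;+\; P(A \wedge \lnot B)\,V(A \wedge \lnot B)}{P(A)}
% Refining V along a partition must average back to V(A); we never
% need a limiting "utility of a fully-specified world".
```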
(I do not take the above to be a knockdown argument against "committing to the specific division between outer and inner alignment steers you wrong" -- I'm just saying things that seem true to me and plausibly relevant to the debate.)
I doubt this due to learning from scratch.
I expect you'll say I'm missing something, but to me, this sounds like a language dispute. My understanding of your recent thinking holds that the important goal is to understand how human learning reliably results in human values. The Bayesian perspective on this is "figuring out the human prior", because a prior is just a way-to-learn. You might object to the overly Bayesian framing; that's fine with me. I am not dogmatic about orthodox Bayesianism. I do not even like utility functions.
Insofar as the question makes sense, its answer probably takes the form of inductive biases: I might learn to predict the world via self-supervised learning and form concepts around other people having values and emotional states due to that being a simple convergent abstraction relatively pinned down by my training process, architecture, and data over my life, also reusing my self-modelling abstractions.
I am totally fine with saying "inductive biases" instead of "prior"; I think it indeed pins down what I meant in a more accurate way (by virtue of, in itself, being a more vague and imprecise concept than "prior").
I think that both the easy and hard problem of wireheading are predicated on 1) a misunderstanding of RL (thinking that reward is—or should be—the optimization target of the RL agent) and 2) trying to black-box human judgment instead of just getting some good values into the agent's own cognition. I don't think you need anything mysterious for the latter. I'm confident that RLHF, done skillfully, does the job just fine. The questions there would be more like "what sequence of reward events will reinforce the desired shards of value within the AI?" and not "how do we philosophically do some fancy framework so that the agent doesn't end up hacking its sensors or maximizing the quotation of our values?".
I think I don't understand what you mean by (2), and as a consequence, don't understand the rest of this paragraph?
WRT (1), I don't think I was being careful about the distinction in this post, but I do think the following:
The problem of wireheading is certainly not that RL agents are trying to take control of their reward feedback by definition; I agree with your complaint about Daniel Dewey as quoted. It's a false explanation of why wireheading is a concern.
The problem of wireheading is, rather, that none of the feedback the system gets can disincentivize (ie, provide differentially more loss for) models which are making this mistake. To the extent that the training story is about ruling out bad hypotheses, or disincentivizing bad behaviors, or providing differentially more loss for undesirable models compared to more-desirable models, RL can't do that with respect to the specific failure mode of wireheading. Because an accurate model of the process actually providing the reinforcements will always do at least as well in predicting those reinforcements as alternative models (assuming similar competence levels in both, of course, which I admit is a bit fuzzy).
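To make the "can't provide differentially more loss" point concrete, here is a toy sketch of my own construction (all names and numbers are assumptions for illustration): a hypothesis that internalizes the intended target gets penalized on corrupted episodes, while a hypothesis that models the feedback process itself never does.

```python
# Toy illustration: feedback cannot differentially penalize a hypothesis
# that models the feedback process, because that hypothesis predicts the
# observed signal at least as well as a "true values" hypothesis.

def true_value(state):
    # What we *want* the agent to learn.
    return state["genuine_quality"]

def observed_reward(state):
    # What the feedback process *actually* emits: true value, except
    # when the signal is corrupted (sensor hack, rater error, ...).
    if state["corrupted"]:
        return 10.0
    return state["genuine_quality"]

def loss(predict, episodes):
    # Squared error against the observed reinforcement signal.
    return sum((predict(s) - observed_reward(s)) ** 2 for s in episodes)

# Hypothesis A internalizes the intended target; hypothesis B models
# the actual reward process (equal "competence": both fit non-corrupted data).
hyp_true_values = true_value
hyp_reward_process = observed_reward

episodes = [
    {"genuine_quality": 1.0, "corrupted": False},
    {"genuine_quality": 0.2, "corrupted": False},
    {"genuine_quality": 0.0, "corrupted": True},  # wireheading-style episode
]

print(loss(hyp_true_values, episodes))     # 100.0 (penalized on the corrupted episode)
print(loss(hyp_reward_process, episodes))  # 0.0   (never penalized)
```

The asymmetry only points one way: the feedback-modeling hypothesis can never accrue *more* loss than the intended one, so the training signal alone can't rule it out.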
This doesn't seem relevant for non-AIXI RL agents which don't end up caring about reward or explicitly weighing hypotheses over reward as part of the motivational structure? Did you intend it to be?
With almost any kind of feedback process (IE: any concrete proposals that I know of), similar concerns arise. As I argue here, wireheading is one example of a very general failure mode. The failure mode is roughly: the process actually generating feedback is, too literally, identified with the truth/value which that feedback is trying to teach.
Output-based evaluation (including supervised learning, and the most popular forms of unsupervised learning, and a lot of other stuff which treats models as black boxes implementing some input/output behavior or probability distribution or similar) can't distinguish between a model which is internalizing the desired concepts, vs a model which is instead modeling the actual feedback process. These two do different things, but not in a way that the feedback system can differentiate.
In terms of shard theory, as I understand it, the point is that (absent arguments to the contrary, which is what we want to be able to construct), shards that implement feedback-modeling like this cannot be disincentivized by the feedback process, since they perform very well in those terms. Shards which do other things may or may not be disincentivized, but the feedback-modeling shards (if any are formed at any point) definitely won't, unless of course they're just not very good at their jobs.
So the problem, then, is: how do we arrange training such that those shards have very little influence, in the end? How do we disincentivize that kind of reasoning at all?
Plausibly, this should only be tackled as a knock-on effect of the real problem, actually giving good feedback which points in the right direction; however, it remains a powerful counterexample class which challenges many, many proposals. (And therefore, trying to generate the analogue of the wireheading problem for a given proposal seems like a good sanity check.)
I'm a bit uncomfortable with the "extreme adversarial threats aren't credible; players are only considering them because they know you'll capitulate" line of reasoning because it is a very updateful line of reasoning. It makes perfect sense for UDT and functional decision theory to reason in this way.
I find the chicken example somewhat compelling, but I can also easily give the "UDT / FDT retort": since agents are free to choose their policy however they like, one of their options should absolutely be to just go straight. And arguably, the agent should choose that, conditional on bargaining breaking down (precisely because this choice maximizes the utility obtained in fact -- ie, the only sort of reasoning which moves UDT/FDT). Therefore, the coco line of reasoning isn't relying on an absurd hypothetical.
Another argument for this perspective: if we set the disagreement point via Nash equilibrium, then the agents have an extra incentive to change their preferences before bargaining, so that the Nash equilibrium is closer to the optimal disagreement point (IE the competition point from coco). This isn't a very strong argument, however, because (as far as I know) the whole scheme doesn't incentivize honest reporting in any case. So agents may be incentivized to modify their preferences one way or another.
One simple idea: the disagreement point should reflect whatever really happens when bargaining breaks down. This helps ensure that players are happy to use the coco equilibrium instead of something else, in cases where "something else" implies the breakdown of negotiations. (Because the coco point is always a pareto-improvement over the disagreement point, if possible -- so choosing a realistic disagreement point helps ensure that the coco point is realistically an improvement over alternatives.)
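For concreteness, here is the coco decomposition worked out for a version of Chicken; the specific payoff numbers are my own assumption, not from the discussion. The coco value (Kalai & Kalai) splits the game into a pure-common-interest part and a pure-conflict (zero-sum) part, and the resulting point can be compared against a "realistic breakdown" disagreement point:

```python
# Coco decomposition for Chicken with assumed payoffs.
# G = (A, B) splits into a cooperative part (A+B)/2 and a
# competitive zero-sum part (A-B)/2; coco point = max of the
# cooperative part, shifted by the value of the zero-sum part.

# Row/column actions: 0 = Swerve, 1 = Straight.
A = [[0, -1], [1, -10]]   # row player's payoffs (assumed numbers)
B = [[0, 1], [-1, -10]]   # column player's payoffs (assumed numbers)

coop = [[(A[i][j] + B[i][j]) / 2 for j in range(2)] for i in range(2)]
comp = [[(A[i][j] - B[i][j]) / 2 for j in range(2)] for i in range(2)]

coop_max = max(max(row) for row in coop)

# This particular zero-sum part has a pure saddle point, so its value
# is the maximin over pure strategies (the general case needs an LP).
maximin = max(min(row) for row in comp)
minimax = min(max(comp[i][j] for i in range(2)) for j in range(2))
assert maximin == minimax  # saddle point exists here
value = maximin

coco_row = coop_max + value
coco_col = coop_max - value
print((coco_row, coco_col))   # (0.0, 0.0): both swerve, surplus split evenly
print((A[1][1], B[1][1]))     # (-10, -10): the "realistic breakdown" outcome
```

With these numbers the coco point (both swerve) is a large Pareto improvement over the realistic breakdown point (both go straight), which illustrates why the choice of disagreement point matters.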
However, in reality, the outcomes of conflicts we avoid remain unknown. The realistic disagreement point is difficult to define or measure if, in reality, agreement is achieved.
So perhaps we should suppose that agreement cannot always be reached, and base our disagreement point on the observed consequences of bargaining failure.
There are two questions to ask:

1. How does the AI learn to care about this?
2. What do we gain by making the AI care about this?

If we don't discuss 100% answers, it's very important to evaluate all those questions in context of each other. I don't know the (full) answer to the question (1). But I know the answer to (2) and a way to connect it to (1). And I believe this connection makes it easier to figure out (1).
I agree with the overall argument structure to some extent. IE, in general, we should separate the question of what we gain from X from the question of how to achieve it, and not having answered one of those questions should not block us from considering the other.
However, to me, your "what do we gain" claims are already quite large. In the dialogues (about candy and movement), it seems like the idea is that everything works out nicely, in full generality. You aren't just claiming a few good properties; you seem to be saying "and so on".
(To be more specific to avoid confusion, you aren't only claiming that valuing candy doesn't result in killing humans or hacking human values. You also seem to be saying that valuing candy in this way wouldn't throw away any important aspect of human values at all. The candy-AI wouldn't set human quality of life to dirt-poor levels, even if it were instrumentally useful for diverting resources to ensure the daily availability of candy. The AI also wouldn't allow a preventable hostile invasion by candy-loving aliens-which-count-as-humans-by-some-warped-definition. etc etc etc)
Therefore, in this particular case, I have relatively little interest in further elaborating the "what do we gain" side of things. The "how are we supposed to gain it" question seems much more urgent and worthy of discussion.
To use an analogy, if you told me that you knew a quick way to make $20, I might ask "why are we so worried about getting $20?". But if you tell me you know a quick way to make a billion dollars, I'm going to be much less interested in the "why" question and much more interested in the "how" question.
I don't know the (full) answer to the question (1). But I know the answer to (2) and a way to connect it to (1). And I believe this connection makes it easier to figure out (1).
TBH, I don't really believe this is true, because I don't think you've pinned down what "this" even is. IE, we can expand your set of two questions into three:

1. What is X?
2. How does the AI come to care about X?
3. What do we gain by making the AI care about X?
You've labeled X with terms like "reward economics" and "money system", but you haven't really defined those things. So your arguments about what we can gain from them are necessarily vague. As I mentioned before, the general idea of assigning a value (price) to everything is fully compatible with utility theory, but obviously you also further claim that your approach is not identical to utility theory. I hope this point helps illustrate why I feel your terms are still not sufficiently defined.
(My earlier question took the form of "how do we get X", but really, that's because I was replying to a specific point rather than starting at the beginning. What I most need to understand better at the moment is 'what is X, even?'.)
The point of my idea is that "human (meta-)ethics" is just a subset of a way broader topic. You can learn a lot about human ethics and the way humans expect you to fulfill their wishes before you encounter any humans or start to think about "values". So, we can replace the questions "how to encode human values?" and even "how to learn human values?" with more general questions "how to learn (properties of systems)?" and "how to translate knowledge about (properties of systems) to knowledge about human values?"
We have already to some extent replaced the question "how do you learn human values?" with the question "how do we robustly point at anything external to the system, at all?". One variation of this which we often consider is "how can a system reliably parse reality into objects" -- this is like John Wentworth's natural abstraction program.
I don't know whether you think this is at all in the right direction (I'm not trying to claim it's identical to your approach or anything like that), but it currently seems to me more concrete and well-defined than your "how to learn properties of systems".
with more general questions "how to learn (properties of systems)?"
The way you bracket this suggests to me that you think "how to learn" is already a fair summary, and "properties of systems" is actually pointing at something extremely general. Like, maybe "properties of systems" is really a phrase that encompasses everything you can learn?
If this were the correct interpretation of your words, then my response would be: I'm not going to claim that we've entirely mastered learning, but it seems surprising to claim that studying how we learn about the properties of very simple systems (systems that we can already learn quite easily using modern ML?) would be the key.
In your proposal about normativity you do a similar "trick"
I say that we can translate the method of learning properties of simple systems into a method of learning human values (a complicated system).
Since you are relating this to my approach: I would say that the critical difference, for me, is precisely the human involvement (or more generally, the involvement of many capable agents). This creates social equilibria (and non-equilibrium behaviors) which form the core of normativity.
An abstract decision-theoretic agent has no norms and no need for norms, in part because it treats its environment as nonliving, nonthinking, and entirely external. A single person existing over time already has a need for norms, because coordinating with yourself over time is hard.
But any system which contains agents is not "simple". Or at least, I don't understand the sense in which it is simple.
I think it's a different approach, because we don't have to start with human values (we could start with trying to fix universal AI "bugs") and we don't have to assume optimization.
I don't understand what you mean about not assuming optimization. But, I would object that the approach I mentioned (learning values from the environment) doesn't need to "start with human values" either. Hypothetically, you could try an approach like this with no preconceived concept of "human" at all; you just make a generic assumption that the environments you encounter have been optimized to a significant extent (by some as-yet-unknown actor).
Notably, this approach would have the obvious risk of the AI deciding that too many of the properties of the current world are "good" (for example, people dying, people suffering). On my understanding, your current proposal also suffers from this critique. (You make lots of arguments about how your ideas might help the AI to decide not to change things about the world; you make few-to-no arguments about such an AI deciding to actually improve the world in some way. Well, on my understanding so far.)
However, not killing all humans is such a big win that we can ignore small issues like that for now. Returning to my earlier analogy, the first question that occurs to me is where the billion dollars is coming from, not whether the billion will be enough.
I explained how I want to combine those in the context of the "What do we gain by caring about system properties?" question.
In the context you're replying to, I was trying to propose more concrete ideas for your consideration, as opposed to reiterating what you said.
Here I'm trying to do the same trick I did before: split a question, find the easier part, attack the harder part through the easier one.
Although this will be appropriate (even necessary!) in some cases, the trick is a dangerous one in general. Often you want to tackle the harder sub-problems first, so that you fail as soon as possible. Otherwise, you can spend years on a research program that splits off the easiest fractions of your grand plan, only to realize later that the harder parts of your plan were secretly impossible. So the strategy sets you up to potentially waste a lot of time!
Maybe it's useful to split the knowledge about systems into 3 parts:

1. Absolute knowledge: e.g. "taking absolute control of the system will destroy its (X) property", "destroying the (X) property of the system may be bad". This knowledge connects abstract actions to simple facts and tautologies.
2. Experience of many systems: e.g. "destroying the (X) property of this system is likely to be bad because it's bad for many other systems" or "destroying (X) is likely to be bad because I'm 90% sure human doesn't ask me to do the type of task where destroying (X) is allowed".
3. Biases of a specific system: e.g. "for this specific system, "absolute control" means controlling about 90% of it". This knowledge maps abstract actions/facts onto the structure of a specific system.
I don't really understand the motivation behind this division, but, it sounds to me like you require normative feedback to learn these types of things. You keep saying things like "is likely to be bad" and "is likely to be good". But it's difficult to see how to derive ideas about "bad" and "good" from pure observation with no positive/negative feedback.
Take a system (e.g. "movement of people"). Model simplified versions of this system on multiple levels (e.g. "movement of groups" and "movement of individuals"). Take a property of the system (e.g. "freedom of movement"). Describe a biased aggregation of this property on different levels. Choose actions that don't violate this aggregation.
I don't understand much of what is going on in this paragraph.
Take an element of the system (e.g. "sweets") and its properties (e.g. "you can eat sweets, destroy sweets, ignore sweets..."). Describe other elements in terms of this element. Choose actions that don't contradict this description.
It sounds to me like you are trying to cross the is/ought divide -- first the AI learns descriptive facts about a system, and then the AI is supposed to derive normative principles (action-choice principles) from those descriptive facts. Is that an accurate assessment?
One concern I have is that if the description is accurate enough, then it seems like it should either (a) not constrain action, because you've learned the true invariant properties of the system which can never be violated (eg, the true laws of physics); or, on the other hand, (b) constrain action for the entirely wrong reasons.
An example of (b) would be if the learning algorithm learns enough to fully constrain actions, based on patterns in the AI actions so far. Since the AI is part of any system it is interacting with, it's difficult to rule out the AI learning its own patterns of action. But it may do this early, based on dumb patterns of action. Furthermore, it may misgeneralize the actions so far, "wrongly" thinking that it takes actions based on some alien decision procedure. Such a hypothesis will never be ruled out in the future, and indeed is liable to be confirmed, since the AI will make its future acts conform to the rules as it understands them.
AI models the system ("coins") on two levels: "a single coin" (level 1) and "multiple coins" (level 2).
I don't really understand what it means to model the system on each of these levels, which harms my understanding of the rest of this argument. ("How can you model the system as a single coin?")
My attempt to translate things into terms I can understand is: the AI has many hypotheses about what is good. Some of these hypotheses would encourage the AI to exploit glitches. However, human feedback about what's good has steered the system away from some glitch-exploits in the past. The AI probabilistically generalizes this idea, to avoid exploiting behaviors of the system which seem "glitch-like" according to its understanding.
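To make my interpretation concrete, here is a minimal sketch of the value-learning reading. This is my own toy construction; the hypothesis names, feature labels, and the 90% noise model are all assumptions for illustration:

```python
# Sketch of "straightforward value learning": maintain hypotheses about
# what is good, update on human accept/reject feedback, and let the
# posterior generalize to unseen glitch-like actions.

# Each hypothesis maps an action's features to how good it rates the action.
hypotheses = {
    "values_task":     lambda a: 1.0 if a["helps_task"] else 0.0,
    "values_exploits": lambda a: 1.0 if a["glitch_like"] else 0.0,
}
prior = {"values_task": 0.5, "values_exploits": 0.5}

def likelihood(hyp, action, human_approved):
    # Assumed noise model: humans approve an action the hypothesis rates
    # as good with probability 0.9, otherwise with probability 0.1.
    p_approve = 0.9 if hypotheses[hyp](action) > 0.5 else 0.1
    return p_approve if human_approved else 1.0 - p_approve

def update(prior, action, human_approved):
    # Standard Bayesian update over the hypothesis set.
    post = {h: p * likelihood(h, action, human_approved) for h, p in prior.items()}
    z = sum(post.values())
    return {h: p / z for h, p in post.items()}

# Past feedback: humans rejected a glitch-exploit, approved a task action.
posterior = update(prior, {"helps_task": False, "glitch_like": True}, False)
posterior = update(posterior, {"helps_task": True, "glitch_like": False}, True)

# Generalization: the expected goodness of a *new* glitch-like action is
# now low, even though this exact action was never rated by a human.
new_action = {"helps_task": False, "glitch_like": True}
expected_goodness = sum(p * hypotheses[h](new_action) for h, p in posterior.items())
print(round(posterior["values_exploits"], 3))
print(round(expected_goodness, 3))
```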
But, this interpretation seems to be a straightforward value-learning approach, while you claim to be pointing at something beyond simple value learning ideas.
After finishing this long comment, I noticed the inconsistency: I continue to ask "how do we get X?" type questions rather than "what is X?" type questions. In retrospect, I don't like my "billion dollars" analogy as much as I did when I first wrote it. Part of the problem is that when "X" is still fuzzy, it can shift locations in the causal chain as we focus on different aspects of the conversation. So for example, X could point to the "money system", or X could end up pointing to some desirable properties which are upstream/downstream of "money systems". But as X shifts up/downstream, there are some Y which switch between "how-relevant" and "why-relevant". (Things that are upstream of X are how-relevant; things that are downstream of X are why-relevant.) So it doesn't make sense for me to keep mentioning that I'm more interested in how-questions than why-questions, when I'm not sure exactly where the definition of X will sit in the causal chain. I should, at best, have some other reasons for not being very interested in certain questions. But I don't want to re-write the relevant portions of what I wrote. It still represents my epistemic state better than not having written it.
The images in this classic reference post have gone missing! :(
This is just my intuition, but it seems like the core intuition of a "money system" as you use it in the post is the same as the core intuition behind utility functions (ie, everything must have a price ≈ everything must have a quantifiable utility).
I think we can try to solve AI Alignment this way:
Model human values and objects in the world as a "money system" (a system of meaningful trades). Make the AGI learn the correct "money system", specify some obviously incorrect "money systems".
Basically, you ask the AI "make paperclips that have the value of paperclips for humans". AI can do anything using all the power in the Universe. But killing everyone is not an option: paperclips can't be more valuable than humanity. Money analogy: if you killed everyone (and destroyed everything) to create some dollars, those dollars aren't worth anything. So you haven't actually gained any money at all.
In utility-theoretic terms, this is like saying that money is an instrumental goal, not a terminal goal. Or at least, money as-terminal-goal has a low weight compared to other things (eg, human lives). Or perhaps more faithful to what you want: money as-terminal-goal is dependent on a context.
So it seems to me like this still faces the same basic challenges as most other approaches, IE, making the system robustly care about external objects which we can't get perfect feedback about. How do you get it to care about the context? How do you get it to think killing humans is "expensive"? How do you ask the system to "make paperclips that have the value of paperclips for humans"?
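The utility-theoretic reading I gave above (terminal value that depends on a context) is easy to express directly; here is a minimal sketch of my own, with assumed names and toy numbers, just to show it fits inside plain utility theory:

```python
# "Terminal value dependent on context" inside ordinary utility theory:
# the utility of paperclips is a function of the whole world-state,
# not of the paperclip count alone.

def utility(world):
    # Paperclips only count in worlds where the valuing context survives.
    if not world["humans_alive"]:
        return -1000.0  # destroying the context forfeits all value
    # Bounded paperclip value: more clips never outweigh the context.
    return min(world["paperclips"], 100)

# A modest world with humans beats a paperclip-saturated world without them.
assert utility({"humans_alive": True,  "paperclips": 10}) > \
       utility({"humans_alive": False, "paperclips": 10**9})
```

Of course, writing this down does nothing to answer the hard question in the paragraph above: how training actually gets such a context-sensitive valuation into the system.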
I meant that some AIs need to start with understanding human values (perfectly) and others don't.
It seems like any proponent of #2 (human feedback, aka, value learning) would already agree with this idea; whereas your post gave me the sense that you think something more radical is going on here.
Reiterating the quote from the OP that I quoted before:
The point is that AI doesn't just value (X). AI makes sure that there exists a system that gives (X) the proper value. And that system has to have certain properties. If AI finds a solution that breaks the properties of that system, AI doesn't use this solution. That's the idea: AI can realize that some rewards are unjust because they break the entire reward system.
My best guess about how you want to combine #1 and #2 with #3 is that you want to infer the proper value of things from the environment. EG, if most gold sits around in vaults, then the value of gold is probably tied to sitting around in vaults.
I remember some work a few years ago on this approach -- specifically, using the built environment of humans (together with an assumption that humans are fairly good at optimizing for their own preferences) to infer human values. Sadly, I'm unable to find a reference; maybe it was never published? (Probably I've just forgotten the relevant keywords to search for.)
The distinction between instrumental goals vs "terminal goals that depend on some context" is rather blurry, because the way we distinguish between terminal and instrumental goals (from the outside, behaviorally) is how much they vary based on context. (EG, if I take away the other basketball players, the audience, and the money, will one basketball player still try to perform a slam dunk?)
One reason for abandoning utility functions is, perhaps, an instinct that everything must be instrumental, because nothing is truly terminal. I discussed how to do this while keeping most of expected utility theory in An Orthodox Case Against Utility Functions.