Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

The goal of this post is to clarify a few concepts relating to AI Alignment under a common framework. The main concepts to be clarified:

  • Optimization. Specifically, this will be a type of Vingean agency. It will split into Selection vs Control variants.
  • Reference (the relationship which holds between map and territory; aka semantics, aka meaning). Specifically, this will be a teleosemantic theory.

The main new concepts employed will be endorsement and legitimacy.

TLDR: 

  • Endorsement of a process is when you would take its conclusions for your own, if you knew them. 
  • Legitimacy relates to endorsement in the same way that good relates to utility. (IE utility/endorsement are generic mathematical theories of agency; good/legitimate refer to the specific thing we care about.) 
  • We perceive agency when something is better at doing something than us; we endorse some aspect of its reasoning or activity. (Endorse as a way of achieving its goals, if not necessarily our own.)
  • We perceive meaning (semantics/reference) in cases where something has been optimized for accuracy -- that is, the goal we endorse a conclusion with respect to is some notion of accuracy of representation.

This write-up owes a large debt to many conversations with Sahil, although the views expressed here are my own.

Broader Context

The basic idea is to investigate agency as a natural phenomenon, and have some deep insights emerge from that, relevant to the analysis of AI Risk. Representation theorems -- also sometimes called coherence theorems or selection theorems depending on what you want to emphasize -- start from a very normative place, typically assuming preferences as a basic starting object. Perhaps these preferences are given a veneer of behaviorist justification: they supposedly represent what the agent would choose if given the option. But actually giving agents the implied options would typically be very unnatural, taking the agent out of its environment.

There are many ways to drive toward more naturalistic representation theorems. The work presented here takes a particular approach: cognitive reduction. I want to model what it means for one agent to think of something as being another agent. Daniel Dennett called this the intentional stance.

The goal of the present essay is to sketch a formal picture of the intentional stance. This is not yet a representation theorem -- it does not establish a type signature for agency based on the ideas. However, for me at least, it seems like an important piece of deconfusion. It paints a picture of agents who understand each other, as thinking things. I hope that it can contribute to a useful representation theorem and a broader picture of agency.

Meaning

In Signalling & Simulacra, I argued that the signal-theoretic[1] analysis of meaning (which is the most common Bayesian analysis of communication) fails to adequately define lying, and fails to offer any distinction between denotation and connotation or literal content vs conversational implicature. A Human's Guide to Words gives a related view:[2]

No dictionary, no encyclopedia, has ever listed all the things that humans have in common.  We have red blood, five fingers on each of two hands, bony skulls, 23 pairs of chromosomes—but the same might be said of other animal species.
[...]
A dictionary is best thought of, not as a book of Aristotelian class definitions, but a book of hints for matching verbal labels to similarity clusters, or matching labels to properties that are useful in distinguishing similarity clusters.
- Eliezer Yudkowsky, Similarity Clusters

While I agree that words do not have clear-cut definitions (except, perhaps, in a few cases such as mathematics), I think it would be a mistake to conclude that there is nothing to the denotation/connotation distinction. In my view, words cannot just be probabilistic clusters.

Teleosemantics

In Teleosemantics!, I argue that the denotative meaning is what a symbol is optimized to correspond to.

The signal-theoretic analysis of meaning gives us the information-theoretic content of a communicative act; but any observation carries information in this sense. What distinguishes symbolic communication from other information-carrying signals is that some care has been taken to convey specific information. Sometimes the probabilistic information is all we care about; but tracking literal meaning is also an important part of language.

Commonly confused words, such as imply vs infer, probabilistically carry some of each other's signal-theoretic meaning (based on the probability we assign to the speaker having confused the two words). But the community of speakers overall puts optimization pressure on avoiding this confusion, by correcting mistakes noticed in conversation, by writing dictionaries, and through articles about common mistakes.

Similarly, lying happens, so the signal-theoretic meaning of an utterance includes some probability of it being a lie (and the probabilistic implications thereof). But the linguistic community optimizes against lying, so there's a sense in which the lie is not part of the intended meaning of the utterance. 

If a liar claims (falsely) "I am very wealthy", there's a commonsense way in which they mean that they're very wealthy. That's the meaning they're trying to convey. I want to be clear that this does not fit with the technical theory of meaning which I am advocating here.

Plurality of Teleosemantic Meaning

Meaning, on my account, comes from optimizing symbols to have a specific correspondence to reality; that is, to be accurate under a specific intended reading. This is entirely different from optimizing words to create a specific belief in their audience, which is what the liar is doing. On my account, optimizing for impact on an audience rather than accuracy fails to impart those words with (teleosemantic) meaning.

Since the liar's use of words does not impart them with meaning in itself, the only relevant optimization process which gives meaning to the words is the broader linguistic community.

However, there are cases where multiple optimization processes may be trying to give meaning to the same words, perhaps at cross-purposes. For example, the scientific community has a distinct notion of "theory" which people sometimes claim is the real meaning. (Wikipedia's disambiguation page provides a large list of communities with their own preferred notion of theory.)

So in general, the theory here requires meaning to be indexed by an agent (or optimization process); we can differentiate what an individual person means by their words from what larger linguistic communities mean by those words.

Now, obviously, in order to understand the teleosemantic definition of meaning, we need to understand what "agent" or "optimization process" or "optimized" means.

Agency

In Belief in Intelligence, Eliezer sketches the peculiar mental state which regards something else as intelligent:

Imagine that I'm visiting a distant city, and a local friend volunteers to drive me to the airport.  I don't know the neighborhood. Each time my friend approaches a street intersection, I don't know whether my friend will turn left, turn right, or continue straight ahead.  I can't predict my friend's move even as we approach each individual intersection - let alone, predict the whole sequence of moves in advance.

Yet I can predict the result of my friend's unpredictable actions: we will arrive at the airport. 
[...]
I can predict the outcome of a process, without being able to predict any of the intermediate steps of the process.

In Measuring Optimization Power, he formalizes this idea by taking a preference ordering and a baseline probability distribution over the possible outcomes. In the airport example, the preference ordering might be how fast they arrive at the airport. The baseline probability distribution might be Eliezer's probability distribution over which turns to take -- so we imagine the friend turning randomly at each intersection. The optimization power of the friend is measured by how well they do relative to this baseline. 
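As a rough illustration (a sketch with my own toy numbers; I am reading Eliezer's measure as the negative log, under the baseline distribution, of the probability of doing at least this well):

```python
import math

def optimization_power_bits(baseline_outcomes, preference_key, achieved):
    """Bits of optimization: how surprising it is, under the baseline model,
    to do at least as well as the achieved outcome."""
    as_good = sum(1 for o in baseline_outcomes
                  if preference_key(o) >= preference_key(achieved))
    frac = as_good / len(baseline_outcomes)
    return -math.log2(frac) if frac > 0 else float("inf")

# Toy airport example: travel times (minutes) sampled from the baseline model
# in which the friend turns randomly at each intersection; shorter is better.
baseline_times = [55, 80, 45, 95, 60, 120, 70, 65, 85, 50]
print(optimization_power_bits(baseline_times, lambda t: -t, achieved=48))
# ~3.3 bits: only 1 of the 10 baseline trips was this fast or faster.
```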

I think this can be a useful notion of agency, but constructing this baseline model does strike me as rather artificial. We're not just sampling from Eliezer's world-model. If we sampled from Eliezer's world-model, the friend would turn randomly at each intersection, but they'd also arrive at the airport in a timely manner no matter which route they took -- because Eliezer's actual world-model believes the friend is capably pursuing that goal.

So to construct the baseline model, it is necessary to forget the existence of the agency we're trying to measure while holding other aspects of our world-model steady. While it may be clear how to do this in many cases, it isn't clear in general. I suspect if we tried to write down the algorithm for doing it, it would involve an "agency detector" at some point; you have to be able to draw a circle around the agent in order to selectively forget it. So this is more of an after-the-fact sanity check for locating agents, rather than a method of locating agents in the first place.

I will propose a variation of Eliezer's definition which does not have this problem. First, though, I have to define some terminology.

Endorsement

I will say that one set of beliefs[3] $P_1$ endorses[4] another $P_2$ (with respect to topic $X$) in the case that, if we condition $P_1$ on $P_2$'s belief about a thing, then $P_1$ adopts that belief as its own:

Belief Endorsement: Probability distribution $P_1$ belief-endorses $P_2$ (with respect to $X$) in the case that $P_1(X \mid \ulcorner P_2(X) = p \urcorner) = p$.

This is already enough to get my proposed formal definition of meaning off the ground. When the above condition holds, $P_1$ sees $P_2(X)$ as a belief about $X$. The notation somewhat obscures the dependence on a particular translation between $P_1$'s ontology and $P_2$'s ontology. The quotation marks $\ulcorner \cdot \urcorner$ are communicating a translation into $P_1$'s event algebra. We can't just abstractly relate two probability distributions to each other, because Alice endorsing Bob's beliefs has to do with how Bob relates to the world, not just what his probability distribution is as a pure mathematical abstraction. Imagine Alice seeing Bob's beliefs initially as just some random variable; interpreting them as a probability distribution is an extra step.

For example, Alice and Bob both start out regarding a coinflip as 50-50. Bob gets to see the coin after the flip, and updates to near-certainty about which way the coin landed. Because of the way Alice knows Bob's beliefs about the coin have been correlated with the coin itself, and because Alice's beliefs have not been so correlated, Alice now endorses Bob's belief about "the coin landed heads" as meaning the coin landed heads, and Bob's "the coin landed tails" as meaning the coin landed tails. This is not because Alice thinks Bob's beliefs are perfectly accurate; only more accurate than Alice's.
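Here is a minimal numerical sketch of that example (the 1% noise level, the variable names, and the setup are my own illustration of the definition):

```python
EPS = 0.01  # chance Bob's observation of the coin is wrong (toy number)

# Alice's joint distribution over (coin, bobs_signal).
joint = {(coin, signal): 0.5 * ((1 - EPS) if signal == coin else EPS)
         for coin in ("heads", "tails") for signal in ("heads", "tails")}

def bobs_belief_in_heads(signal):
    """Bob's posterior P_2(heads) after seeing his possibly-wrong signal."""
    like = {c: (1 - EPS) if signal == c else EPS for c in ("heads", "tails")}
    unnorm = {c: 0.5 * like[c] for c in like}
    return unnorm["heads"] / sum(unnorm.values())

def alice_given_bobs_belief(p):
    """P_1(heads | "P_2(heads) = p"): Alice conditions on the event that
    Bob's belief (a random variable in her ontology) takes the value p."""
    event = {w: m for w, m in joint.items()
             if abs(bobs_belief_in_heads(w[1]) - p) < 1e-9}
    total = sum(event.values())
    return sum(m for (coin, _), m in event.items() if coin == "heads") / total

for signal in ("heads", "tails"):
    p = bobs_belief_in_heads(signal)
    print(f"Bob reports P_2(heads) = {p:.2f}; "
          f"Alice's conditional = {alice_given_bobs_belief(p):.2f}")
# The two numbers agree, so Alice belief-endorses Bob about the coin,
# even though Bob is not perfectly accurate.
```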

However, to see why this is a teleosemantic notion of meaning, I need to generalize "endorsement" more, in order to properly relate it to optimization. I will generalize it in several steps.

We can generalize the definition of endorsement to handle random variables, rather than only probabilities:

Expectation Endorsement: $E_1(V \mid \ulcorner E_2(V) = v \urcorner) = v$.

Here, $E_i(V) = \sum_{\omega \in \Omega} P_i(\omega) V(\omega)$ gives the expected value of random variable $V$ when worlds $\omega$ are sampled from $P_i$. This notion of endorsement generalizes the previous, because we can take $V$ to be the indicator variable for $X$, which has value 1 when $X$ holds and zero otherwise; in this case, $E_i(V) = P_i(X)$.
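Spelling out the indicator-variable reduction (a one-line check in the notation above):

```latex
\begin{align*}
E_i(V) &= \sum_{\omega \in \Omega} P_i(\omega)\, V(\omega) \\
       &= \sum_{\omega \in X} P_i(\omega)\cdot 1 \;+\; \sum_{\omega \notin X} P_i(\omega)\cdot 0 \\
       &= P_i(X),
\end{align*}
```

so expectation endorsement applied to the indicator of $X$ is exactly belief endorsement of $X$.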

However, this is not yet a full notion of endorsement of decisions, because the above doesn't necessarily make sense for non-convex optimization. If $V$ represents a decision between three doors, $V \in \{1, 2, 3\}$, then an expected value of 2 could indicate a belief that the optimal choice is door 2, or it could indicate complete uncertainty between the three options! We want to instead say that one agent would endorse another agent's choice rather than its expectation.

Just as $P_1$ gives an agent's beliefs and $E_1$ gives its expected values, I introduce $C_1$ to represent an agent's choices. This is similar to the choice functions which sometimes appear in representation theorems. Given a set of possible 'actions' $A$, the choice function $C_1(A, U)$ gives agent 1's choice $a \in A$ under beliefs $P_1$ and objective function $U$. $C_1$ can be thought of as the mathematical optimization algorithm which the agent uses, such as simulated annealing. A particularly simple choice of $C_1$ is the argmax of the expectation: $C_1(A, U) = \operatorname{argmax}_{a \in A} E_1(U(a))$. When the set of possible choices $A$ is small, it will often make sense to assume that $C_1$ is this argmax. However, we need to avoid making the assumption if we want to reason about things like quantilization. For notational convenience, $C_1(A, U \mid X)$ means the choice computed from $P_1(\cdot \mid X)$; that is, the probability distribution is first conditioned on $X$ before optimizing.
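As a concrete rendering of this notation, here is a minimal sketch of an argmax choice function in Python (the function and variable names are mine; for convenience the utility already takes a world argument, which a pure selection objective simply ignores):

```python
def argmax_choice(actions, utility, prob, condition=None):
    """C(A, U | X): argmax-of-expectation choice, conditioning first.

    `prob` maps worlds to probabilities, `condition` is an optional event
    (a set of worlds) to condition on before optimizing, and
    `utility(world, action)` is the objective; a pure "selection"
    objective simply ignores the world argument.
    """
    worlds = [w for w in prob if condition is None or w in condition]
    total = sum(prob[w] for w in worlds)
    posterior = {w: prob[w] / total for w in worlds}

    def expected_utility(action):
        return sum(posterior[w] * utility(w, action) for w in worlds)

    return max(actions, key=expected_utility)
```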

Having established this notation, our next notion of endorsement is a straightforward parallel of the previous definitions:

Selection Endorsement: Agent 1, having beliefs $P_1$ and a choice function $C_1$, selection-endorses Agent 2 as optimizing $U$ via the choice $a$ if and only if $C_1(A, U \mid \ulcorner C_2(A, U) = a \urcorner) = a$.

So, Alice thinks of Bob as optimizing for $U$ if Alice would copy Bob's answer (if she were optimizing for $U$).

Much like the previous definitions, the quotations emphasize that Bob's choice is really a variable in Alice's ontology, and treating it as a choice is a matter of interpretation. $\ulcorner C_2(A, U) = a \urcorner$ is really an event of a random variable taking on a specific value, $a$.

A previous revision of this essay defined selection endorsement via an argmax instead of introducing the concept of choice functions. This amounts to a hidden assumption of unbounded computational resources on the Alice side, which would harm our ability to recognize Bobs who are agentic in more bounded ways.

Selection endorsement somewhat resembles The ground of optimization if we take the random variable $a$ to be a system's state at a single time-slice. I selection-endorse the whole system (as optimizing some $U$) if, ignorant of what scores highly in $U$ myself, I would take the system's state as a plausible answer.

Since $U$ is a mathematically pure function $U : A \to \mathbb{R}$, this is selection in a Selection vs Control sense.[5] If we think of the variable $a$ (which we interpret as Bob's choice) as some little piece of the world, it is being optimized to maximize some property of itself, not some property of the broader world. 

If $U$ is an impure function, its values depend on the wider world: $U : \Omega \times A \to \mathbb{R}$, where $\Omega$ is the sample space.[6] Now, the optimization has to be explicitly defined with a dependence on worlds $\omega \in \Omega$. The argmax would look like $\operatorname{argmax}_{a \in A} E_1(U(\omega, a))$ instead of $\operatorname{argmax}_{a \in A} E_1(U(a))$.

We have to slightly revise our notion of endorsement once again, to represent an agent thinking of another agent as optimizing in this broader sense:

Control Endorsement: $C_1(A, U \mid \ulcorner C_2(A, U) = a \urcorner) = a$, where $U$ is now a two-argument function, taking worlds and actions, rather than a one-argument function. 

Sorry, the current notation doesn't highlight the difference between this definition and the previous one. But this is a big distinction.[5]

Control endorsement is the notion of intentional stance that I have been driving towards. The formula tracks someone tracking agency. If Alice control-endorses Bob (as optimizing ), Alice sees Bob as an agent.

This resembles ideas in Vingean Agency and Optimization at a Distance. Like Eliezer's Measuring Optimization Power, agency is defined relative to a baseline distribution ($P_1$); but I think the sort of baseline distribution we need to get this definition to work is much less artificial than the sort which Eliezer's definition needed. It can just be our honest beliefs.
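To see the definition in action, here is a worked toy example (my own setup and numbers, reusing `argmax_choice` from the sketch above): Alice's honest beliefs are 50-50 about a hidden coin, $U$ rewards guessing the coin correctly, and Bob sees the coin and matches it except for a 1% slip.

```python
# Worlds are pairs (coin, bob_choice); Alice never sees the coin directly.
P_COIN = {"heads": 0.5, "tails": 0.5}
SLIP = 0.01  # chance Bob fails to match the coin (toy number)

alice_beliefs = {(c, b): P_COIN[c] * ((1 - SLIP) if b == c else SLIP)
                 for c in P_COIN for b in P_COIN}

def match_coin(world, action):
    """Control objective U(world, action): 1 if the guess matches the coin."""
    return 1.0 if action == world[0] else 0.0

actions = ("heads", "tails")
for b in actions:
    # The event "Bob chose b", as seen inside Alice's ontology.
    bob_chose_b = {w for w in alice_beliefs if w[1] == b}
    my_choice = argmax_choice(actions, match_coin, alice_beliefs, bob_chose_b)
    print(f"Conditional on Bob choosing {b}, Alice chooses {my_choice}")
# Alice's conditioned choice copies Bob's, so Alice control-endorses Bob.
# Unconditionally her expected utilities are tied at 0.5; it is exactly the
# information carried by Bob's choice that moves her (the Vingean flavor).
```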

Control endorsement generalizes selection endorsement, since $U$ can ignore its first argument; and generalizes expectation endorsement, since $U$ can be some loss function for which the expected value is the loss-minimizing answer; and generalizes belief endorsement when that loss function is a proper scoring rule:

Belief Endorsement Revisited: $C_1(A, U \mid \ulcorner P_2(X) = p \urcorner) = p$, where $\ulcorner P_2(X) = p \urcorner$ is the event $\ulcorner C_2(A, U) = p \urcorner$, and $U(\omega, p)$ is a proper scoring rule judging the accuracy of $p$ with respect to target event $X$.

(Here, the choice function is assumed to be argmax for simplicity, since results about proper scoring rules incentivizing honest belief reports generally assume we can argmax.)
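As a quick check (my own derivation, using the log score as the example proper scoring rule and writing $q = P_1(X)$ for the observer's own probability), maximizing the expected score over reports $p$ recovers the honest probability:

```latex
\begin{align*}
E_1\big[S(p, X)\big] &= q \log p + (1 - q)\log(1 - p), \\
\frac{d}{dp}\, E_1\big[S(p, X)\big] &= \frac{q}{p} - \frac{1 - q}{1 - p} = 0
  \quad\Longleftrightarrow\quad p = q,
\end{align*}
```

so the argmax report is $P_1(X)$ itself, which is why control endorsement under a proper scoring rule reduces to belief endorsement.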

As another important variation on endorsement, we can fix $U$ to be the radical interpreter's[7] own utility function:

"Absolute" Endorsement: control endorsement where  is the utility function of the observer whose beliefs are .

Speculatively, this might have some useful connection to corrigibility. If Alice were to absolutely endorse Bob's actions, then Alice should be fine with Bob modifying Alice's source code.

Conditional Endorsement

At first, it might seem like an endorsement-based notion of optimization process loses something relative to Eliezer's measurement of optimization power: endorsement just gives a binary yes/no, rather than measuring a degree of optimization.

However, consider an example with Alice, Bob, and Carol. Alice endorses both Bob and Carol, but she continues to endorse Carol even after learning Bob's decision; the reverse is not the case. Obviously, she trusts Carol more than Bob. We can formalize this via conditional endorsement:

Conditional (Control) Endorsement: Agent 1 endorses Agent 2 given Agent 3 iff $C_1(A, U \mid X_2 \wedge X_3) = a$, where $X_2$ is the event $\ulcorner C_2(A, U) = a \urcorner$, and $X_3$ is the event $\ulcorner C_3(A, U) = b \urcorner$.

(We can easily adapt this to notions of conditional endorsement for the various sorts of endorsement.)
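To make the Alice/Bob/Carol example concrete, here is a small enumeration (my own toy numbers and names): Bob matches a hidden fair coin 75% of the time and Carol 99% of the time, with independent errors. Alice's conditioned choice always follows Carol, even when Bob disagrees, so Carol remains endorsed conditional on Bob's choice but not vice versa:

```python
from itertools import product

def alice_choice(bob, carol):
    """argmax_a P_1(coin = a | Bob guessed `bob`, Carol guessed `carol`)."""
    mass = {0: 0.0, 1: 0.0}
    for coin, bob_ok, carol_ok in product((0, 1), (True, False), (True, False)):
        p = 0.5 * (0.75 if bob_ok else 0.25) * (0.99 if carol_ok else 0.01)
        if (coin if bob_ok else 1 - coin) == bob and \
           (coin if carol_ok else 1 - coin) == carol:
            mass[coin] += p
    return max((0, 1), key=lambda a: mass[a])

for bob, carol in product((0, 1), repeat=2):
    print(f"Bob={bob}, Carol={carol} -> Alice chooses {alice_choice(bob, carol)}")
# Alice's choice always equals Carol's guess, including the disagreement cases.
```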

Thus, holding the utility function fixed but changing the random variable considered, we can (partially) order different random variables by how endorsed they are. (I will not try to prove transitivity here, since my goal is to get the overall picture across; I have not checked it, although I expect it to hold.)

Radical Probabilism & Other Generalizations

So far, I've used a utility function to represent values, which may surprise those familiar with An Orthodox Case Against Utility Functions. I expect these concepts to generalize well to concepts beyond simple utility functions, but I didn't want to overcomplicate things here -- I am focusing on getting the basic picture across. Similarly, these notions already work well with Radical Probabilism, but I haven't emphasized that here. Infrabayesian versions of all this could also be interesting, and I expect, not particularly difficult to define.

Legitimacy

Legitimacy is to endorsement as goodness is to utility: "utility" is an abstract notion of an agent's preferences, whereas "good" is the thing we actually care about. Similarly, "endorsement" is an abstract notion intended to apply to agency in general, while "legitimacy" is the thing we care about as humans.

So, a mode of reasoning is legitimate if it has a reliable tendency toward the truth. Correct mathematical proofs are legitimate. The scientific method, when carried out well, is legitimate. 

Wireheading is not legitimate. Taking a murder pill is not legitimate. Some drugs impact your reasoning in legitimate ways, while others are illegitimate. 

This is supposed to be the same concept of legitimacy from Nora's sequence on the value change problem. Hence, aligning an AI toward legitimacy is supposed to be a solution to the problems mentioned there.

I think there may be a technical sense (yet to be articulated) in which we'd prefer to align AI to legitimacy, but we can only align it to specific forms of endorsement and hope that's close enough. An AI aligned to endorsement can (more or less...) only do things that humans would consent to. But humans are sometimes grateful in retrospect about things which seemed terrible at the time; for example, if math homework was an important step in your coming to love math, but at the time felt like something you were coerced into.

By analogy, our current endorsement attitudes could be mistaken. We know of some cases, such as addictive drugs, where we anti-endorse a change and would not say the change is actually legitimate; but there may be some other cases where we would never consent to the change, but (in some difficult-to-pin-down sense) the change would actually be correct, not only in retrospect, but legitimately.

However, anyone who purposefully built a superintelligent AGI aligned to some notion of legitimacy designed to trample on endorsement in select cases would be unilaterally imposing their own guess at legitimate values. I claim this is bad behavior.

Updatelessness

I think it is clear that endorsement gives a picture where, when Alice considers whether Bob is an agent, Bob being updateless will boost his agency and any updateful behavior from Bob will be to the detriment of Bob's agency. After all, Alice is judging from the perspective of $P_1$. Any optimization will be judged by how well it optimizes the expectations of $P_1$. (Although, as I indicated in the beginning, fleshing this out into an actual representation theorem is part of the goal here but isn't something accomplished in the present work.)

This suggests that Alice should probably be updateless as well; that is, $P_1$ should be Alice's prior, not Alice's posterior.

In general, though, I think it will be important to consider a variety of probability distributions from which to judge endorsement. Conditional endorsement is already an example of this. Resource-bounded agents can be studied by considering baseline distributions from different complexity classes. 

The Van Fraassen Reflection Principle

Students of philosophy may notice the resemblance between my notion of "endorsement" and the Reflection Principle. In my language, the Reflection Principle says that if $P_1$ is an agent's beliefs at one time, and $P_2$ is the agent's beliefs at a future time, then $P_1$ should belief-endorse $P_2$.

As such, some might argue that "endorsement" should be called "reflection". I would argue that "reflection" means too many things, and my account differs enough from van Fraassen's -- the Reflection Principle is about insisting that X should endorse Y in specific cases, whereas I am focusing more on studying the relation in general. Indeed, I would say there are cases where it is not rational to endorse our future beliefs, such as when we plan to be inebriated at a specific time.

Questions & Conjectures

  • How well can we use conditional endorsement to characterize optimization power (or more generally, level of endorsement)? Is it transitive? 
  • What further generalizations or alternative definitions of endorsement might be important?
  • How do we build a useful representation theorem out of this idea?
  • Can we prove something within this framework along the lines of capable agents have beliefs?
  • How can we integrate value change into this picture?
  • There's a big difference between endorsing some beliefs as a posterior (which is accuracy-centric) and endorsing them as a prior for use in updateless decisionmaking. How should this be characterized?
  • It seems sensible to call selection "myopic" and control "nonmyopic". However, although epistemic accuracy falls on the control side, it doesn't feel very control-oriented, and it feels to me like there's a strong sense in which it is myopic. Can this be characterized?
  • Looking at things through an algorithmic information theory lens, it makes sense to say that endorsement, as an interpretation of something as an optimization process, is a "better" interpretation when the utility function used to interpret something as an agent is simpler. However, because the utility function in control endorsement has access to strictly more data than that in selection endorsement, interpreting something as a control process can't be much more complex than interpreting it as a selection process -- worst-case, you just have to discard the $\omega$ and then apply the $U$ you used to interpret it as a selection process. So the interesting case is where the control-process interpretation can be much simpler than the selection-process interpretation.
  • I have an intuition that insights for eliciting latent knowledge can be uncovered by examining what happens when we translate back and forth between interpreting something as a selection process vs a control process. A control process has to track information about the external world (has to "have beliefs") in order to do its job; a selection process has no need to do this. So in order for Alice to see Bob as a selection process rather than a control process, she has to understand how he selects: she has to include the belief calculations Bob uses to select as part of the utility function instead of abstracting them away. If we can understand something as a selection process instead of a control process, it seems intuitively more trustable. If the control process is hiding something from us, can we reveal the information by translating it to a selection process?
  • Can something interesting be said about inner alignment via this framework?
  • What does it look like to build a system with endorsement as the target? 
  • I imagine there is something to be gained by thinking about different ways of varying all the parameters of endorsement, the way conditional endorsement varies the baseline probability distribution. 
  1. ^

    I'm saying "signal-theoretic" rather than "signalling-theoretic" here because it sounds better to me; but the field is called "signalling theory" not "signal theory".

  2. ^

    I don't recall the appropriate reference now, but I think at one point Eliezer defines lying as communicating with the intent to mislead, IE the intent to make your audience's beliefs less accurate. 

    While I think this is a pretty good pragmatic definition, it fails to differentiate lying from filtering evidence or other clever ways to mislead. 

    Of course, what sorts of standards you hold your definitions to is part of the debate, here. 

  3. ^

    For our purposes here, it is fine to imagine that "beliefs" are probability distributions as normally defined. But I also have in mind more computationally bounded notions, which may be only approximately probabilistically coherent; and we can also consider other variations, such as infradistributions. 

    In any case, for now, my analysis is outsourcing the notion of "belief" to other theories. A more fully-developed version with its own representation theorem should wrap around and justify whatever notion of "belief" is used, by going on to establish that the notion of "belief" assumed at this stage can then be derived as part of the representation theorem. In other words, my goal is a picture of an agent thinking about an agent, where the two have the same type signature.

  4. ^

    Credit for the term "endorsement" goes to Scott Garrabrant.

  5. ^

    We can slide between selection and control if we are happy to vary how $a$ and $U$ are defined. Since random variables are a function of the whole event-space anyway, we can pack as much information about the world into $a$ as we like, so long as we are happy to make $U$ just not care about those extra bits of information. So we could model a smart thermostat turning on and off as an attempt to control the whole house's temperature (a utility function which involves variables it must estimate rather than observe directly), by selecting the on-or-off state with the highest expected value (a mathematically pure function of variables which the thermostat either directly observes or directly controls). So it can be seen as a selector in one sense or a controller in another sense.

    However, seeing an agent as a selector rather than a controller means that $U$ has to encode the beliefs of the agent, to calculate the expected values. So beliefs are being represented as part of the preferences. When seeing an agent as a controller, the beliefs are instead seen as a part of the mechanism for optimizing the preferences. I expect this means it is often simpler to view something as a controller, if it is good at controlling; the selection view is overcomplicated by the belief information.

  6. ^

    It feels a bit weird that the utility function takes the world and the action, here. We have to think of this as being in world $\omega$, but then counterfacting on action $a$. I could see this aspect of the theory being fiddled with.

  7. ^

    If Alice is interpreting Bob as an agent, Alice is the "radical interpreter". This terminology comes from philosophy; the "radical interpreter" is like an alien looking at human brains and trying to interpret the meaning of human beliefs.

Comments

(I will not try to prove transitivity here, since my goal is to get the overall picture across; I have not checked it, although I expect it to hold.)

Transitivity doesn't hold, here's a counterexample.

The intuitive story is: X's action tells you whether Z failed, Y fails sometimes, and Z fails more rarely.

The full counterexample (all of the following is according to your beliefs $P_1$): Say available actions are 0 and 1. There is a hidden fair coin, and your utility is high if you manage to match the coin, and low if you don't. Y peeks at the coin, and takes the correct action, except when it fails, which has a 1/4 chance. Z does the same, but it only fails with a 1/100 chance. X plays 1 iff Z has failed.
Given X's and Y's actions, you always go with Y's action, since X tells you nothing about the coin, and Y gives you some information. Given Z's and Y's actions, you always go with Z's, because it's less likely to have failed (even when they disagree). But given Z's and X's, there will be some times (1/100 of the time) when you see X played 1, and then you will not play the same as Z.

The same counterexample works for beliefs (or continuous actions) instead of discrete actions (where you will choose a probability $p \in [0, 1]$ to believe, instead of an action $a \in \{0, 1\}$), but needs a couple small changes. Now both Z and Y fail with 1/4 probability (independently). Also, Y outputs its guess as 0.75 or 0.25 (instead of 1 or 0), because YOU (that is, $P_1$) will be taking into account the possibility that it has failed (and Y better output whatever you will want to guess after seeing it). Instead of Z, consider A as the third expert, which outputs 0.5 if Z and Y disagree, 15/16 if they agree on yes, and 1/16 if they agree on no. X still tells you whether Z failed. Seeing Y and X, you always go with Y's guess. Seeing A and Y, you always go with A's guess. But if you see A = 15/16 and X = 1, you know both failed, and guess 0. (In fact, even when you see X = 0, you will guess 1 instead of 15/16.)
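For concreteness, a brute-force enumeration of the discrete counterexample (a quick sketch, function names my own) confirms all three claims:

```python
from itertools import product
from collections import defaultdict

def scenarios():
    """Enumerate (probability, coin, experts' actions) for the setup above."""
    for coin, y_fail, z_fail in product((0, 1), (False, True), (False, True)):
        p = 0.5 * (0.25 if y_fail else 0.75) * (0.01 if z_fail else 0.99)
        y = (1 - coin) if y_fail else coin
        z = (1 - coin) if z_fail else coin
        x = 1 if z_fail else 0
        yield p, coin, {"X": x, "Y": y, "Z": z}

def my_choice(observed):
    """argmax_a P(coin = a | the observed experts' actions)."""
    mass = defaultdict(float)
    for p, coin, acts in scenarios():
        if all(acts[k] == v for k, v in observed.items()):
            mass[coin] += p
    return max((0, 1), key=lambda a: mass[a])

def endorsed_given(expert, other):
    """Do I always copy `expert` after seeing both experts' actions?"""
    pairs = {(acts[expert], acts[other]) for _, _, acts in scenarios()}
    return all(my_choice({expert: e, other: o}) == e for e, o in pairs)

print(endorsed_given("Y", "X"))  # True:  Y endorsed given X
print(endorsed_given("Z", "Y"))  # True:  Z endorsed given Y
print(endorsed_given("Z", "X"))  # False: Z is NOT endorsed given X
```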

Ah, very interesting, thanks! I wonder if there is a different way to measure relative endorsement that could achieve transitivity.

My intuition says the natural thing would be to assume something about the experts not talking about each other (which probably means being independent, which sounds too strong). I feel like whenever they can talk about each other an example like this will exist. But not sure! Maybe you can have a relative endorsement definition that's more like "modulo the other information I'm receiving about you from the environment, I treat the additional bits you're giving me as the best information".

I feel like there's a key concept that you're aiming for that isn't quite spelled out in the math.

I remember reading somewhere that there's a typically unmentioned distinction between "Bayes' theorem" and "Bayesian inference". Bayes' theorem is the statement that $P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$, which is true from the axioms of probability theory for any $A$ and $B$ whatsoever. Notably, it has nothing to do with time, and it's still true even after you learn $B$. On the other hand, Bayesian inference is the premise that your beliefs should change in accordance with Bayes' theorem. Namely that $P_{\text{new}}(A) = P_{\text{old}}(A \mid B)$, where $B$ is an observation. That is, when you observe something, you wholesale replace your probability space $P_{\text{old}}$ with a new probability space $P_{\text{new}}$, which is calculated by applying the conditional (via Bayes' theorem).

And I think there's a similar thing going on with your definitions of endorsement. While trying to understand the equations, I found it easier to visualize $P_1$ and $P_2$ as two separate distributions on the same $\Omega$, where endorsement is simply a consistency condition. For belief consistency, you would just say that $P_1$ endorses $P_2$ on event $X$ if $P_1(X) = P_2(X)$.

But that isn't what you wrote; instead you wrote this thing with conditioning on a quoted thing. And of course, the thing I said is symmetrical between $P_1$ and $P_2$, whereas your concept of endorsement is not symmetrical. It seems like the intention is that $P_1$ "learns" or "hears about" $P_2$'s belief, and then $P_1$ updates (in the above Bayesian inference sense) to have a new $P_1$ that has the consistency condition with $P_2$.

By putting $\ulcorner P_2(X) = p \urcorner$ in the conditional, you're saying that it's an event on $\Omega$, a thing with the same type as $X$. And it feels like that's conceptually correct, but also kind of the hard part. It's as if $P_1$ is modelling $P_2$ as an agent embedded into $\Omega$.

There are several compromises I made for the sake of getting the idea across as simply as I could. 

  • I think the graduate-level-textbook version of this would be much more clear about what the quotes are doing. I was tempted to not even include the quotes in the mathematical expressions, since I don't think I'm super clear about why they're there.
  • I totally ignored the difference between $P(A \mid B)$ (probability conditional on $B$) and the probability actually assigned to $A$ after learning $B$.
  • I neglect to include quantifiers in any of my definitions; the reader is left to guess which things are implicitly universally quantified.

I think I do prefer the version I wrote, which uses the conditional probability rather than the post-update probability, but obviously the English-language descriptions ignore this distinction and make it sound like what I really want is the post-update probability.

It seems like the intention is that $P_1$ "learns" or "hears about" $P_2$'s belief, and then $P_1$ updates (in the above Bayesian inference sense) to have a new $P_1$ that has the consistency condition with $P_2$.

Obviously we can consider both possibilities and see where that goes, but I think maybe the conditional version makes more sense as a notion of whether you right now endorse something. A conditional probability is sort of like a plan for updating. You won't necessarily follow the plan exactly when you actually update, but the conditional probability is your best estimate.

To throw some terminology out there, let's call my thing "endorsement" and a version which uses actual updates rather than conditionals "deference" (because you'd actually defer to their opinions if you learn them). 

  • You can know whether you endorse something, since you can know your current conditional probabilities (to within some accuracy, anyway). It is harder to know whether you defer to something, since in the case where updates don't equal conditionals, you must not know what you are going to update to. I think it makes more sense to define the intentional stance in terms of something you can more easily know about yourself. 
  • Using endorsement to define agency makes it about how you reason about specific hypotheticals, whereas using deference to try and define agency would make it about what actually happens in those hypotheticals (ie, how you would actually update if you learned a thing). Since you might not ever get to learn that thing, this makes endorsement more well-defined than deference. 

Bayes' theorem is the statement that $P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$, which is true from the axioms of probability theory for any $A$ and $B$ whatsoever.

I actually prefer the view of Alan Hajek (among others) who holds that P(A|B) is a primitive, not defined as in Bayes' ratio formula for conditional probability. Bayes' ratio formula can be proven in the case where P(B)>0, but if P(B)=0 it seems better to say that conditional probabilities can exist rather than necessarily being undefined. For example, we can reason about the conditional probability that a meteor hits land given that it hits the equator, even if hitting the equator is a measure zero event. Statisticians learn to compute such things in advanced stats classes, and it seems sensible to unify such notions under the formal P(A|B) rather than insisting that they are technically some other thing.

By putting $\ulcorner P_2(X) = p \urcorner$ in the conditional, you're saying that it's an event on $\Omega$, a thing with the same type as $X$. And it feels like that's conceptually correct, but also kind of the hard part. It's as if $P_1$ is modelling $P_2$ as an agent embedded into $\Omega$.

Right. This is what I was gesturing at with the quotes. There has to be some kind of translation from $P_2(X) = p$ (which is a mathematical concept 'outside' $P_1$) to an event inside $P_1$'s event algebra. So the quotes are doing something similar to a Goedel encoding.

While trying to understand the equations, I found it easier to visualize $P_1$ and $P_2$ as two separate distributions on the same $\Omega$, where endorsement is simply a consistency condition. For belief consistency, you would just say that $P_1$ endorses $P_2$ on event $X$ if $P_1(X) = P_2(X)$.

But that isn't what you wrote; instead you wrote this thing with conditioning on a quoted thing. And of course, the thing I said is symmetrical between $P_1$ and $P_2$, whereas your concept of endorsement is not symmetrical.

The asymmetry is quite important. If we could only endorse things that have exactly our opinions, we could never improve.

I argued that the signal-theoretic[1] analysis of meaning (which is the most common Bayesian analysis of communication) fails to adequately define lying, and fails to offer any distinction between denotation and connotation or literal content vs conversational implicature.

In case you haven't come across this, here are two papers on lying by the founders of the modern economics literature on communication. I've only skimmed your discussion but if this is relevant, here's a great non-technical discussion of lying in that framework. A common thread in these discussions is that the apparent "no-lying" implication of the analysis of language in the Lewis-Skyrms/Crawford-Sobel signalling tradition relies importantly on common knowledge of rationality and, implicitly, on common knowledge of the game being played, i.e. of the available actions and all the players' preferences.

Thanks! 

I have some comments on the arbitrariness of the "baseline" measure in Yudkowsky's measure of optimization.

Sometimes, I am surprised in the moment about how something looks, and I quickly update to believing there's an optimization process behind it. For example, if I climb a hill expecting to see a natural forest, and then instead see a grid of suburban houses or an industrial logging site, I'll immediately realize that there's no way this is random and instead there's an optimization process that I wasn't previously modelling. In cases like this, I think Yudkowsky's measure accurately captures the measure of optimization.

Alternatively, sometimes I'm thinking about optimization processes that I've always known are there, and I'm wondering to myself how powerful they are. For example, sometimes I'll be admiring how competent one of my friends is. To measure their competence, I can imagine what a "typical" person would do in that situation, and check the Yudkowsky measure as a diff. I can feel what you mean about arbitrarily drawing a circle around the known optimizer and then "deleting" it, but this just doesn't feel that weird to me? Like I think the way that people model the world allows them to do this kind of operation with pretty substantially meaningful results.

While it may be clear how to do this in many cases, it isn't clear in general. I suspect if we tried to write down the algorithm for doing it, it would involve an "agency detector" at some point; you have to be able to draw a circle around the agent in order to selectively forget it.

I think this is where Flint's framework was insightful. Instead of "detecting" and "deleting" the optimization process and then measuring the diff, you consider the system of every possible trajectory, measure the optimization of each (with respect to the ordering over states), take the average, and then compare your potential optimizer to this. The potential optimization process will be in that average, but it will be washed out by all the other trajectories (assuming most trajectories don't go up the ordering nearly as much; if they did, then your observed process would rightly not register as an optimizer).

(Obviously this is not helpful for e.g. looking into a neural network and figuring out whether it contains something that will powerfully optimize the world around you. But that's not what this level of the framework is for; this level is for deciding what it even means for something to powerfully optimize something around you.)

Of course, to run this comparison you need a "baseline" of a measure over every possible trajectory. But I think this is just reflecting the true nature of optimization; I think it's only meaningful relative to some other expectation.

I can feel what you mean about arbitrarily drawing a circle around the known optimizer and then "deleting" it, but this just doesn't feel that weird to me? Like I think the way that people model the world allows them to do this kind of operation with pretty substantially meaningful results.

I agree, but I am skeptical that there could be a satisfying mathematical notion here. And I am particularly skeptical about a satisfying mathematical notion that doesn't already rely on some other agent-detector piece which helps us understand how to remove the agent.

I think this is where Flint's framework was insightful. Instead of "detecting" and "deleting" the optimization process and then measuring the diff, you consider the system of every possible trajectory, measure the optimization of each (with respect to the ordering over states), take the average, and then compare your potential optimizer to this.

Looking back at Flint's work, I don't agree with this summary. His idea is more about spotting attractor basins in the dynamics. There is no "compare your optimizer to this" step which I can see, since he studies the dynamics of the entire system. He suggests that in cases where it is meaningful to make an optimizer/optimized distinction, this could be detected by noticing that a specific region (the 'optimizer') is sensitive to very small perturbations, which can take the whole system out of the attractor basin. 

In any case, I agree that Flint's work also eliminates the need for an unnatural baseline in which we have to remove the agent. 

Overall, I expect my definition to be more useful to alignment, but I don't currently have a well-articulated argument for that conclusion. Here are some comparison points:

  • Flint's definition requires a system with stable dynamics over time, so that we can define an iteration rule. My definition can handle that case, but does not require it. So, for example, Flint's definition doesn't work well for a goal like "become President in 2030" -- it works better for continual goals, like "be president".
  • Flint's notion of robustness involves counterfactual perturbations which we may never see in the real world. I feel a bit suspicious about this aspect. Can counterfactual perturbations we'll never see in practice be really relevant and useful for reasoning about alignment?
  • Flint's notion is based more on the physical system, whereas mine is more about how we subjectively view that system. 
  • I feel that "endorsement" comes closer to a concept of alignment. Because of the subjective nature of endorsement, it comes closer to formalizing when an optimizer is trusted, rather than merely good at its job. 
  • It seems more plausible that we can show (with plausible normative assumptions about our own reasoning) that we (should) absolutely endorse some AI, in comparison to modeling the world in sufficient detail to show that building the AI would put us into a good attractor basin.
  • I suspect Flint's definition suffers more from the value change problem than mine, although I think I haven't done the work necessary to make this clear.

Looking back at Flint's work, I don't agree with this summary.

Ah, sorry, I wasn't intending for that to be a summary. I found Flint's framework very insightful, but after reading it I sort of just melded it into my own overall beliefs and understanding around optimization. I don't think he intended it to be a coherent or finished framework on its own, so I don't generally try to think "what does Flint's framework say about X?". I think its main influence on me was the whole idea of using dynamical systems and phase space as the basis for optimization. So for example;

In any case, I agree that Flint's work also eliminates the need for an unnatural baseline in which we have to remove the agent.

I would say that working in the framework of dynamical systems is what lets one get a natural baseline against which to measure optimization, by comparing a given trajectory with all possible trajectories.

I think I could have some more response/commentary about each of your bullet points, but there's a background overarching thing that may be more useful to prod at. I have a clear (-feeling-to-me) distinction between "optimization" and "agent", which doesn't seem to be how you're using the words. The dynamical systems + Yudkowsky measure perspective is a great start on capturing the optimization concept, but it is agnostic about (my version of) the agent concept (except insofar as agents are a type of optimizer). It feels to me like the idea of endorsement you're developing here is cool and useful and is... related to optimization, but isn't the basis of optimization. So I agree that e.g. "endorsement" is closer to alignment, but also I don't think that "optimization" is supposed to be all that close to alignment; I'd reserve that for "agent". I think we'll need a few levels of formalization in agent foundations, and you're working toward a different level than those, and so these ideas aren't in conflict.

Breaking that down just a bit more; let's say that "alignment" refers to aligning the intentional goals of agents. I'd say that "optimization" is a more general phenomenon where some types of systems tend to move their state up an ordering; but that doesn't mean that it's "intentional", nor that that goal is cleanly encoded somewhere inside the system. So while you could say that two optimizing systems "are more aligned" if they move up similar state orderings, it would be awkward to talk about aligning them.

(My notion of) optimization has its own version of the thing you're calling "Vingean", which is that if I believe a process optimizes along a certain state ordering, but I have no beliefs about how it works on the inside, then I can still at least predict that the state will go up the ordering. I can predict that the car will arrive at the airport even though I don't know the turns. But this has nothing to do with the (optimization) process having beliefs or doing reasoning of any kind (which I think of as agent properties). For example I believe that there exists an optimization process such that mountains get worn down, and so I will predict it to happen, even though I know very little about the chemistry of erosion or rocks. And this is kinda like "endorsement", but it's not that the mountain has probability assignments or anything.

In fact I think it's just a version of what makes something a good abstraction; an abstraction is a compact model that allows you to make accurate predictions about outcomes without having to predict all intermediate steps. And all abstractions also have the property that if you have enough compute/etc. then you can just directly calculate the outcome based on lower-level physics, and don't need the abstraction to predict the outcome accurately.

I think that was a longer-winded way to say that I don't think your concepts in this post are replacements for the Yudkowsky/Flint optimization ideas; instead it sounds like you're saying "Assume the optimization process is of the kind that has beliefs and takes actions. Then we can define 'endorsement' as follows; ..."

I'll also note that I think what you're calling "Vingean agency" is a notable sub-type of optimization process that you've done a good job at analyzing here. But it's definitely not the definition of optimization or agency to me. For example, in the post you say

We perceive agency when something is better at doing something than us; we endorse some aspect of its reasoning or activity.

This doesn't feel true to me (in the carve-nature-at-its-joints sense). I think children are strongly agents, even though I do everything more competently than they do.

Yeah, the stuff in the updatelessness section was supposed to gesture at how to handle this with my definition. 

First of all, I think children surprise me enough in pursuit of their own goals that they do often count as agents by the definition in the post.

But, if children or animals who are intuitively agents often don't fit the definition in the post, my idea is that you can detect their agency by looking at things with increasingly time/space/data bounded probability distributions. I think taking on "smaller" perspectives is very important.

Wouldn't the granularity of the action space also impact things? For example, even if a child struggles to pick up some object, you would probably do an even worse job if your action space was picking joint angles, or forces for muscles to apply, or individual timings of action potentials to send to separate nerves.

habryka:

Promoted to curated! This post is denser in math than what I would usually consider for curation, but I feel like this kind of topic is quite important and also of broader relevance than more ML-focused alignment work often is. I particularly like the set of careful definitions at the top in the TLDR. I am not sure how much they will hold up as I try to use them more in my thinking, but I already feel like they helped me understand the relevant concepts more. 

I'm new here. Where would I post something like this for discussion? It seems applicable to this article. I believe there is a simpler approach but might require quantum computing for it to be useful since the number of beliefs to be updated is so large.

(1) No writer lies nor intentionally disrespects another living being unless they believe that they are justified. Ergo, all statements are true within a belief context.

(2) If (1) is true, then the belief context of the writer of any statement is required.

(3) If belief context cannot be known completely, then the closest context that would allow the statement to be true should be assumed. This requires evaluating existing beliefs for every statement that is added to the corpus.

(4) It also requires an exhaustive encyclopedia of beliefs. However, this is a solvable problem if beliefs follow a binary decision tree having to do with self perception, perception of others, etc.

(5) All beliefs about a given state can be expressed in a number with N bits, where the total number of bits is the number of belief decisions that can be made about self, experience, reality, justice. The subject need not be known, just the beliefs about the subject. Beliefs do need to be ordered starting with beliefs about the self, then others, then the world. In the end, order doesn't really matter, but grouping does.

(6) When a given state results in multiple possible belief patterns, the one with the least amount of judgement must be taken as the truth. That is, the matching belief set must be as neutral as possible, otherwise we are inferring intent without evidence. Neutral is defined as matching at the most significant bit possible.

(7) When learning occurs, any beliefs about the new belief must be re-evaluated taking in the new knowledge.

These definitions seem like they only allow Alice to recognize agents that are "as strong as" Alice in the sense that Alice doesn't think she could improve on their decisions.  For instance, Alice won't endorse Bob's chess moves if Alice can play chess better than Bob (even if Carol would endorse Bob's moves).  Have I understood correctly?

Interesting post! As a technical matter, I think the notion you want is not reflection (or endorsement) but some version of Total Trust, where (leaving off some nuance) Agent 1 totally trusts Agent 2 if $E_1[V \mid E_2[V] \geq t] \geq t$ for all $V$ and $t$. In general, that's going to be equivalent to Alice being willing to outsource all decision-making to Bob if she's certain Bob has the same basic preferences she does. (It's also equivalent to expecting Bob to be better on all absolutely continuous strictly proper scoring rules, and a few other things.)