# Ω 7

Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Abram Demski has been writing about Normativity. The models suggested so far have mostly looked at actions rather than semantics, despite suggestions that a semantic treatment is possible and despite language learning being used as a motivating example. There is a simple mechanism that seems to me to mostly fit that bill.

## Model

There is an interpreter and an assumed speaker. The interpreter receives observation data as input, which contains, among other things, some things put there by the speaker. The interpreter has a language in which it can express all its beliefs. Since we want to form beliefs about meaning, this language can talk about meaning: it can form propositions of the form "[Observation] means [proposition]". Note that this is different from "[Observation] implies [proposition]". At least initially, "means" is not related to anything else. The interpreter also has an epistemology module that forms its beliefs about things other than meaning.

We follow a simple prior-update paradigm. We start out with a prior over hypotheses about meaning. Each such hypothesis generates, for each observation, a probability distribution over all propositions of the form "[Observation] means [proposition]" (including the possibility that the observation means nothing). Updating is based on the principle that the speaker is authoritative about what he means. Our interpretation of what he's saying should make the things we interpret him to say about what he's saying true. To update on an observation history, first we compute for each observation in it our summed prior distribution over what it means. Then, for each hypothesis in the prior and for each observation, take the hypothesis's distribution over that observation's meaning, combine it with the prior distribution over the meanings of all the other observations, and calculate the probability that the speaker's statements about what he meant were right. After you've done that for all observations, multiply the results to get the score of that hypothesis. Multiply each hypothesis's score by its prior and renormalize. Then take the resulting probabilities as your new prior, and repeat indefinitely.
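One way to write this update down, as a sketch rather than a definitive formalization: the symbols below are my own assumptions, not notation from the post. Let $P(h)$ be the prior weight of hypothesis $h$, let $h_i(m)$ be the probability $h$ gives to observation $i$ meaning $m$ (with "nothing" as one possible $m$), and let $C(m_1,\dots,m_n)$ be the probability, according to the epistemology module, that the speaker's statements about what he meant are right under the joint interpretation $(m_1,\dots,m_n)$.

$$\bar{P}_j(m) = \sum_{h} P(h)\, h_j(m)$$

$$S_i(h) = \sum_{m_1,\dots,m_n} h_i(m_i) \prod_{j \neq i} \bar{P}_j(m_j)\; C(m_1,\dots,m_n)$$

$$P'(h) = \frac{P(h) \prod_i S_i(h)}{\sum_{h'} P(h') \prod_i S_i(h')}$$

Here $\bar{P}_j$ is the summed prior distribution over the meaning of observation $j$, $S_i(h)$ is the per-observation factor of the score of $h$, and $P'$ is the new prior that the next round starts from.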

Let's run an example of this. Your prior is 50% hypothesis A, 50% hypothesis B. There are only two observations. A gives observation1 a 90% chance of meaning "It's cold" and 10% of meaning nothing, and gives observation2 an 80% chance of meaning "When I describe a local state, the state I mean always obtains" and 20% of meaning nothing. B gives observation1 a 90% chance of meaning "It's raining" and 10% of meaning nothing, and gives observation2 an 80% chance of meaning "When I describe a local state, the state I mean obtains with 60% probability" and 20% of meaning nothing. The epistemology module says there is a 10% chance it's cold and a 30% chance it's raining.

First we create the total prior distribution: it gives observation1 0.45 of meaning "It's cold", 0.45 of meaning "It's raining", and 0.1 of meaning nothing, and gives observation2 0.4 of meaning "When I describe a local state, the state I mean always obtains", 0.4 of meaning "When I describe a local state, the state I mean obtains with 60% probability", and 0.2 of meaning nothing.

Evaluating hypothesis A: for the first observation, in the 0.1 of cases where it means nothing, everything is consistent. In the 0.9 of cases where it means "It's cold", there are three sub-cases. In the 0.2 of cases where the second observation means nothing, it's consistent. In the 0.4 where the second means he's always right, it's consistent only in the 0.1 of cases where it really is cold. In the 0.4 of cases where the second means he's right 60% of the time, there's a 0.1 chance he's right, to which he gave 0.6, and a 0.9 chance he's wrong, to which he gave 0.4. Overall this comes out to $0.1 + 0.9 \cdot (0.2 + 0.4 \cdot 0.1 + 0.4 \cdot (0.1 \cdot 0.6 + 0.9 \cdot 0.4)) = 0.4672$.

For the second observation, a similar breakdown gives $0.2 + 0.8 \cdot (0.1 + 0.45 \cdot 0.1 + 0.45 \cdot 0.3) = 0.424$. Together that makes $0.4672 \cdot 0.424 \approx 0.198$. For hypothesis B, for the first observation it's $0.1 + 0.9 \cdot (0.2 + 0.4 \cdot 0.3 + 0.4 \cdot (0.3 \cdot 0.6 + 0.7 \cdot 0.4)) = 0.5536$. For the second observation it's $0.2 + 0.8 \cdot (0.1 + 0.45 \cdot 0.42 + 0.45 \cdot 0.46) = 0.5968$, for a total score of $0.5536 \cdot 0.5968 \approx 0.330$. Multiplying each by 0.5 and normalizing, we get 37.48% for A and 62.52% for B.

This gets us new total probabilities: observation1 has 0.34 "It's cold", 0.56 "It's raining", 0.1 nothing; observation2 has 0.3 "always right", 0.5 "60% right", 0.2 nothing. Calculating through the hypotheses again gives us 28% A and 72% B. And if we keep going, this will trend to 100% B. Is it bad that we became certain after just two sentences? Not necessarily, because that's assuming those are the entire history of observation. If you think there will be future observations, you'll have to do this interpretation process for each possible future separately, and then aggregate them by the probability you give those futures.
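The example can be reproduced with a short script. This is a minimal sketch under my own assumptions about representation: the names `BELIEF`, `HYPOTHESES`, `consistency`, `mixture`, `score` and `update` are hypothetical, meanings of observation1 are encoded as state labels, and meanings of observation2 are encoded as the claimed probability that a described state obtains (`None` stands for "means nothing"). It is specialized to this two-observation case and is not a general implementation of the update rule.

```python
# Credences of the epistemology module, taken from the example.
BELIEF = {"cold": 0.1, "rain": 0.3}

# Each hypothesis gives a distribution over the meaning of each observation.
# obs1 meanings are state claims; obs2 meanings are reliability claims,
# encoded as the claimed probability that a described state obtains.
HYPOTHESES = {
    "A": {"obs1": {"cold": 0.9, None: 0.1}, "obs2": {1.0: 0.8, None: 0.2}},
    "B": {"obs1": {"rain": 0.9, None: 0.1}, "obs2": {0.6: 0.8, None: 0.2}},
}


def consistency(state, reliability):
    """Probability that the speaker's claim about himself comes out right."""
    if state is None or reliability is None:
        return 1.0  # nothing was claimed, so nothing can be violated
    p = BELIEF[state]
    # He is right with probability p (which he rated `reliability`)
    # and wrong with probability 1 - p (which he rated 1 - reliability).
    return reliability * p + (1 - reliability) * (1 - p)


def mixture(prior, obs):
    """Summed prior distribution over the meaning of one observation."""
    mix = {}
    for name, weight in prior.items():
        for meaning, prob in HYPOTHESES[name][obs].items():
            mix[meaning] = mix.get(meaning, 0.0) + weight * prob
    return mix


def expected_consistency(dist1, dist2):
    """Average consistency over independent meanings of obs1 and obs2."""
    return sum(
        p1 * p2 * consistency(state, reliability)
        for state, p1 in dist1.items()
        for reliability, p2 in dist2.items()
    )


def score(name, prior):
    """Product of the per-observation factors for one hypothesis."""
    h = HYPOTHESES[name]
    s1 = expected_consistency(h["obs1"], mixture(prior, "obs2"))
    s2 = expected_consistency(mixture(prior, "obs1"), h["obs2"])
    return s1 * s2


def update(prior):
    """One round: weight each hypothesis by its score, then renormalize."""
    unnorm = {name: w * score(name, prior) for name, w in prior.items()}
    total = sum(unnorm.values())
    return {name: v / total for name, v in unnorm.items()}


prior = {"A": 0.5, "B": 0.5}
for step in range(1, 4):
    prior = update(prior)
    print(step, {name: round(w, 4) for name, w in prior.items()})
```

The first round should reproduce the 37.48% / 62.52% split from the text, and further rounds keep shifting weight toward B.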

## Evaluation

Why would this be a good way to learn meaning? The updating rule is based on taking the speaker's feedback about what he means. This means that if we already have a pretty good idea of what he means, we can improve it further. The correct meaning will be a fixed point of this updating rule - that is, there's no marginal reinterpretation of it where your explanations of what you mean would better fit what you're saying. So if your prior is close enough to correct that it's in the basin of attraction of the correct fixed point, you're set. However, unlike the problems with getting a good enough prior that came up in the context of the no-free-lunch theorems, these priors boil down to just a few fixed points, so there is a limit to how much precision is needed in the prior.

Let's see how this scores on the desiderata for normativity learning (most recent version here):

1. No Perfect Feedback: we want to be able to learn with the possibility that any one piece of data is corrupt.
    1. Uncertain Feedback: data can be given in an uncertain form, allowing 100% certain feedback to be given (if there ever is such a thing), but also allowing the system to learn significant things in the absence of any certainty. (Achieved. Feedback can be given in any form allowed by the interpreter language.)
    2. Reinterpretable Feedback: ideally, we want rich hypotheses about the meaning of feedback, which help the system to identify corrupt feedback, and interpret the information in imperfect feedback. To this criterion, I add two clarifying criteria:
        1. Robust Listening: in some sense, we don't want the system to be able to "entirely ignore" humans. If the system goes off-course, we want to be able to correct that. (Questionable. This criterion essentially depends on what correct meaning is and so is hard to evaluate. That said, unlike with recursive quantilizers, there is no pre-given "literal" meaning that gets improved upon - there is only the prior, which gets narrowed down. If the prior is good enough, it works; if not, it doesn't.)
        2. Arbitrary Reinterpretation: at the same time, we want the AI to be able to entirely reinterpret feedback based on a rich model of what humans mean. This criterion stands in tension with Robust Listening. However, the proposal in the present post is, I think, a plausible way to achieve both. (Achieved.)
2. No Perfect Loss Function: we don't expect to perfectly define the utility function, or what it means to correctly learn the utility function, or what it means to learn to learn, and so on. At no level do we expect to be able to provide a single function we're happy to optimize. This is largely due to a combination of Goodhart and corrupt-feedback concerns. (Achieved. Our reward is determined by what the speaker says about meaning, which is itself under investigation. The following two are arguably achieved. The "levels" aren't as exposed in this model - they would be inside the interpreter language, which is presumably strong enough to talk about them individually or in general.)
    1. Learning at All Levels: Although we don't have perfect information at any level, we do get meaningful benefit with each level we step back and say "we're learning this level rather than keeping it fixed", because we can provide meaningful approximate loss functions at each level, and meaningful feedback for learning at each level. Therefore, we want to be able to do learning at each level.
    2. Between-Level Sharing: Because this implies an infinite hierarchy of levels to learn, we need to share a great deal of information between levels in order to learn meaningfully. For example, Occam's razor is an important heuristic at each level, and information about what malign inner optimizers look like is the same at each level.
3. Process Level Feedback: we want to be able to give feedback about how to arrive at answers, not just the answers themselves.
    1. Whole-Process Feedback: we don't want some segregated meta-level which accepts/implements our process feedback about the rest of the system, but which is immune to process feedback itself. Any part of the system which is capable of adapting its behavior, we want to be able to give process-level feedback about. (Achieved. This is the central function of the interpreter language - it captures everything the interpreter thinks, and so all of it can get feedback.)
    2. Learned Generalization of Process Feedback: we don't just want to promote or demote specific hypotheses. We want the system to learn from our feedback, making generalizations about which kinds of hypotheses are good or bad. (Achieved. You do have to tell it that you want that, but then it does it.)

## Ontology

There were two more things mentioned as goals. The first is having a concept of superhuman performance, which the model clearly does. The second is to preserve the meaning of feedback through ontological crisis. Here I think we can make progress, because the concept of the interpreter language gives us a handle on it.

If you transition to a new ontology, you already need to have some way to express what you're doing in the old one. You need to have an explanation of what using the new system consists in and why you think that's a good idea. The interpreter language can express everything the interpreter believes, including this. So you can already talk about the new ontology, if not in it. And you can make statements about statements in the new ontology, like "[NewProposition] is true in NewOntology". For example, humans developed ZFC, and we can make statements like "'1+1=2' is a theorem of ZFC". And X and "X is true" are more or less the same statement, so you already have a handle on particular propositions of the new ontology. In particular, your prior already assigns probabilities to observations meaning propositions of the new ontology (e.g. "ABC means ['1+1=2' is a theorem of ZFC]" is an interpreter-language statement that priors have opinions on). So ontological changes are already formally accounted for, and the process can run through them smoothly. But does this transition to the new ontology correctly? Why would our prior put any significant weight on hypotheses in the new ontology?

In Reductive Reference, Eliezer introduces the idea of promissory notes in semantics:

> It seems to me that a word like "snow" or "white" can be taken as a kind of promissory note—not a known specification of exactly which physical quark configurations count as "snow", but, nonetheless, there are things you call snow and things you don't call snow, and even if you got a few items wrong (like plastic snow), an Ideal Omniscient Science Interpreter would see a tight cluster in the center and redraw the boundary to have a simpler definition.

Such a promissory note consists of some ideas about what the right ontology to interpret its terms in is, and some ideas about how to interpret them in that ontology once you have it. We can do something like this in our model - and the "some ideas" can themselves come from the speaker for us to interpret. And so the reason our hypotheses pay attention to new ontologies is that the statements themselves demand an ontology to be interpreted in, and the interpreter will determine which one that is in part through its own investigations into the world and mathematics. Now this is a bit different from the standard idea of ontological crisis, which seems to have the impetus for the new ontology come more from the machine side. But insofar as we are worried that something might not survive an ontological shift, we have reasons why we want it to be interpreted in a new ontology in the first place - for example, we want to make new information accessible to human values, and so we need to interpret human values in the ontology the information is in. And insofar as we have those reasons, we already bring criteria for when and which ontological shifts are necessary for them. I don't know if this covers all worries about ontology, but it certainly seems like a step forward.

## Afoundationalism

After reading, you might have some doubts about the details of my updating strategy. For example, it puts equal weight on the interpretation of each proposition - shouldn't we maybe adjust this by how important they are? Or maybe we should have the hypotheses rated against each other instead of against the total distribution of the prior. These are fair things to think about, but part of our goal with the normativity approach was that we wouldn't have to do that, and indeed I think we don't. This is because we can also interpret meaning itself as a promissory note. That is, what we say about meaning would not be directly interpreted as claims in terms of "means" in the interpreter language. Rather, "meaning" would be a promissory note to be interpreted by a to-be-determined theory of meaning (in line with our communicated criteria about what would make such theories good), which will compile them down to statements in terms of interpreter-meaning. This ties up the last loose ends and gives us a fully afoundational theory of meaning: not only can it correct mistakes in its input, it can even correct mistakes we made in developing the theory, provided we're not too far off. This seems quite a bit stronger than what was originally expected:

> I've updated to a kind of quasi-anti-foundationalist position. I'm not against finding a strong foundation in principle (and indeed, I think it's a useful project!), but I'm saying that as a matter of fact, we have a lot of uncertainty, and it sure would be nice to have a normative theory which allowed us to account for that (a kind of afoundationalist normative theory -- not anti-foundationalist, but not strictly foundationalist, either). This should still be a strong formal theory, but one which requires weaker assumptions than usual (in much the same way reasoning about the world via probability theory requires weaker assumptions than reasoning about the world via pure logic).

But it seems that we don't even need that. While we will of course need a particular formal theory to start the ascent, we don't need to assume that anything in particular about that theory is correct. We just need to believe that it's good enough to converge to the correct interpretation. There are probably quite a few ways to get into this dynamic that have the correct theory as a fixed point. The goal then is to find one where its basin of attraction is especially large or easy to understand.


I'm still vague on how the interpretation actually works. What connects the English sentence "it's raining" to the epistemology module's rainfall indicator? Why can't "it's raining" be taken to mean the proposition 2+2=4?

I'm not sure what you don't understand, so I'll explain a few things in that area and hope I hit the right one:

I give sentences their English name in the example to make it understandable. Here are two ways you could give more detail on the example scenario, each of which is consistent:

1. "It's raining" is just the English name for a complicated construct in a database query language, used to be understandable. It's connected to the epistemology module because the machine stores its knowledge in that database.
2. Actually, you are the interpreter and I'm the speaker. In that case, English is the interpreter language, and "It's raining" is literally how you interpret observation1. It's connected to your epistemology module's rainfall indicator... somehow? By your knowledge of English? In that example, observation1 might be "Bunthut sent the string 'Es regnet'".

Sentences in the interpreter language are connected to the epistemology engine simply by supposition. The interpreter language is how the interpreter internally expresses its beliefs; otherwise it's not the interpreter language.

"It's raining" as a sentence of the interpreter language can't be taken to mean "2+2=4" because the interpreter language doesn't need to be interpreted; the interpreter already understands it. "It's raining" as a string sent by the speaker can be taken to mean "2+2=4". It really depends on the prior - if you start out with a prior that's too wrong, you'll end up with nonsense interpretations.

I don't mean the internal language of the interpreter, I mean the external language, the human literally saying "it's raining." It seems like there's some mystery process that connects observations to hypotheses about what some mysterious other party "really means" - but if this process ever connects the observations to propositions that are always true, it seems like that gets most favored by the update rule, and so "it's raining" (spoken aloud) meaning 2+2=4 (in internal representation) seems like an attractor.

> It seems like there's some mystery process that connects observations to hypotheses about what some mysterious other party "really means"

The hypotheses do that. I said:

> We start out with a prior over hypotheses about meaning. Each such hypothesis generates, for each observation, a probability distribution over all propositions of the form "[Observation] means [proposition]" (including the possibility that the observation means nothing).