# All of Stuart_Armstrong's Comments + Replies

Which counterfactuals should an AI follow?

I like the subagent approach there.

Counterfactual control incentives

Thanks. I think we mainly agree here.

Preferences and biases, the information argument

No. But I expect that it would be much more in the right ballpark than other approaches, and I think it might be refined to be correct.

Preferences and biases, the information argument

Look at the paper linked for more details ( https://arxiv.org/abs/1712.05812 ).

Basically "humans are always fully rational and always take the action they want to" is a full explanation of all of human behaviour, that is strictly simpler than any explanation which includes human biases and bounded rationality.
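A toy illustration of the point, not the construction from the linked paper: several different (planner, reward) pairs can reproduce exactly the same observed behaviour, and the "fully rational" pair is among the simplest of them. All names and the tea/coffee setup below are hypothetical.

```python
# Toy setup (hypothetical): the human always picks "tea".
ACTIONS = ["tea", "coffee"]


def observed_policy(state):
    return "tea"


def reward_likes_tea(state, action):
    return 1.0 if action == "tea" else 0.0


def reward_hates_tea(state, action):
    return -reward_likes_tea(state, action)


def reward_zero(state, action):
    return 0.0


def rational_planner(reward):
    """Always picks the reward-maximising action."""
    return lambda state: max(ACTIONS, key=lambda a: reward(state, a))


def anti_rational_planner(reward):
    """Always picks the reward-minimising action."""
    return lambda state: min(ACTIONS, key=lambda a: reward(state, a))


def indifferent_planner(reward):
    """Ignores the reward and simply replays the observed behaviour."""
    return observed_policy


# Three decompositions with very different "preferences"...
decompositions = {
    "rational + likes tea": rational_planner(reward_likes_tea),
    "anti-rational + hates tea": anti_rational_planner(reward_hates_tea),
    "indifferent + zero reward": indifferent_planner(reward_zero),
}

# ...that are indistinguishable from behaviour alone.
states = ["morning", "afternoon"]
all_match = all(
    policy(s) == observed_policy(s)
    for policy in decompositions.values()
    for s in states
)
print(all_match)
```

Behaviour alone cannot separate these three stories, so any preference for a "biased human with different values" explanation has to come from extra assumptions rather than from the data.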

Why sigmoids are so hard to predict

"Aha! We seem to be past the inflection point!"

It's generally possible to see where the inflection point is, when we're past it.
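A minimal sketch of why the inflection point is visible only in hindsight, assuming a noiseless logistic curve sampled at integer times. The function names and parameter values are illustrative assumptions, not something from the post.

```python
import math


def logistic(t, ceiling=1.0, rate=1.0, midpoint=0.5):
    """Standard logistic curve; its inflection point is at t = midpoint."""
    return ceiling / (1.0 + math.exp(-rate * (t - midpoint)))


def first_slowdown_index(values):
    """Index of the first increment that is smaller than the one before it."""
    increments = [b - a for a, b in zip(values, values[1:])]
    for i in range(1, len(increments)):
        if increments[i] < increments[i - 1]:
            return i
    return None


times = list(range(-10, 11))
values = [logistic(t) for t in times]

idx = first_slowdown_index(values)
# The shrinking increment starts at times[idx], which lies after the midpoint:
# growth looks like it is still accelerating until the inflection is behind us.
print(times[idx], times[idx] > 0.5)
```

With noisy data the comparison of consecutive increments would need smoothing, which pushes the detection point even later.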

2 Daniel Kokotajlo, 2mo: Ah, right, of course. Well, what about when the trend is noisy though? With periods of slower and faster growth? What about "Aha! We are clearly nowhere near the inflection point!"?

Why sigmoids are so hard to predict

Possibly. What would be the equivalent of a dampening term for a superexponential? A further growth term?

2 Daniel Kokotajlo, 2mo: I don't know, that's one of the things I'm interested in. I guess the situation is something like: There are a bunch of positive feedback loops and a bunch of negative feedback loops. For most of human history, the positives have outweighed the negatives, and the result has been a more or less steady straight line on a log-log plot. [https://www.lesswrong.com/posts/L23FgmpjsTebqcSZb/how-roodman-s-gwp-model-translates-to-tai-timelines] Though the slope of the line changes from period to period, presumably because at some times the positive feedback loops are a lot stronger than the negative and at other times only a little. We know that eventually growth will be limited by the lightspeed expansion of a sphere. Before that, growth might be limited to e.g. a one-month doubling time because that's about as fast as grass can reproduce, or maybe a one-hour doubling time because that's about as fast as microorganisms can reproduce? Idk. Maybe nanotech could double even faster than that. The question is whether there's any way to look at our history so far, our trajectory, and say "Aha! We seem to be past the inflection point!" or something like that. By analogy to the exponentials case you've laid out, my guess is the answer is "no," but I'm hopeful.

Model splintering: moving from one imperfect model to another

But if you are expecting a 100% guarantee that the uncertainty metrics will detect every possible bad situation

I'm more thinking of how we could automate the navigating of these situations. The detection will be part of this process, and it's not a Boolean yes/no, but a matter of degree.

Model splintering: moving from one imperfect model to another

I agree that once you have landed in the bad situation, mitigation options might be much the same, e.g. switch off the agent.

I'm most interested in mitigation options the agent can take itself, when it suspects it's out-of-distribution (and without being turned off, ideally).

Model splintering: moving from one imperfect model to another

Thanks! Lots of useful insights in there.

So I might classify moving out-of-distribution as something that happens to a classifier or agent, and model splintering as something that the machine learning system does to itself.

Why do you think it's important to distinguish these two situations? It seems that the insights for dealing with one situation may apply to the other, and vice versa.

3 Koen.Holtman, 3mo: The distinction is important if you want to design countermeasures that lower the probability that you land in the bad situation in the first place. For the first case, you might look at improving the agent's environment, or at making the agent detect when its environment moves off the training distribution. For the second case, you might look at adding features to the machine learning system itself, so that dangerous types of splintering become less likely. I agree that once you have landed in the bad situation, mitigation options might be much the same, e.g. switch off the agent.

Generalised models as a category

Cheers! My opinion on category theory has changed a bit, because of this post; by making things fit into the category formulation, I developed insights into how general relations could be used to connect different generalised models.

3 Koen.Holtman, 3mo: Definitely, it has also been my experience that you can often get new insights by constructing mappings to different models or notations.

Generalised models as a category

Thanks! Corrected both of those; is a subset of .

Stuart_Armstrong's Shortform

Thanks! That's useful to know.

Stuart_Armstrong's Shortform

Partial probability distribution

A concept that's useful for some of my research: a partial probability distribution.

That's a $Q$ that defines $Q(A \mid B)$ for some but not all $A$ and $B$ (with $Q(A) = Q(A \mid \Omega)$ for $\Omega$ being the whole set of outcomes).

This is a partial probability distribution iff there exists a probability distribution $p$ that is equal to $Q$ wherever $Q$ is defined. Call this $p$ a full extension of $Q$.

Suppose that $Q(C \mid D)$ is not defined. We can, however, say that $Q(C \mid D) = x$ is a logical implication of $Q$ if every full extension $p$ has $p(C \mid D) = x$.

Eg: , , w...
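A small sketch of the idea on a four-outcome space, under assumptions of my own choosing (the outcome set, the events `A` and `B`, and the numbers are hypothetical). It checks candidate full extensions against a partial `Q` using the linear form Q(A∩B) = t·Q(B), and shows one quantity that the extensions agree on and one that they do not. It only compares two extensions, so it illustrates rather than proves the implication.

```python
from fractions import Fraction as F

OUTCOMES = frozenset({1, 2, 3, 4})
A = frozenset({1, 2})
B = frozenset({1, 3})

# Partial distribution: only Q(A) = Q(A | OUTCOMES) and Q(B | A) are defined.
Q = {(A, OUTCOMES): F(1, 2), (B, A): F(1, 2)}


def prob(p, event):
    """Probability of an event under a full distribution p (outcome -> mass)."""
    return sum((p[o] for o in event), F(0))


def is_full_extension(p, partial):
    """True if p is a probability distribution agreeing with `partial`."""
    if any(v < 0 for v in p.values()) or sum(p.values()) != 1:
        return False
    # Q(X | Y) = t is checked in its linear form p(X & Y) = t * p(Y).
    return all(prob(p, X & Y) == t * prob(p, Y) for (X, Y), t in partial.items())


# Two different full extensions of Q.
p1 = {1: F(1, 4), 2: F(1, 4), 3: F(1, 2), 4: F(0)}
p2 = {1: F(1, 4), 2: F(1, 4), 3: F(0), 4: F(1, 2)}

print(is_full_extension(p1, Q), is_full_extension(p2, Q))
print(prob(p1, A & B), prob(p2, A & B))  # the extensions agree here
print(prob(p1, B), prob(p2, B))          # but not here
```

Here the probability of `A & B` is pinned down by `Q` (it must equal Q(B|A)·Q(A)), whereas the probability of `B` on its own is left open.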

3 Diffractor, 1mo: Sounds like a special case of crisp infradistributions (ie, all partial probability distributions have a unique associated crisp infradistribution). Given some Q, we can consider the (nonempty) set of probability distributions equal to Q where Q is defined. This set is convex (clearly, a mixture of two probability distributions which agree with Q about the probability of an event will also agree with Q about the probability of an event). Convex (compact) sets of probability distributions = crisp infradistributions.
4 Vanessa Kosoy, 3mo: This is a special case of a crisp infradistribution [https://www.lesswrong.com/s/CmrW8fCmSLK7E25sa/p/idP5E5XhJGh9T5Yq9]: Q(A|B)=t is equivalent to Q(A∩B)=tQ(B), a linear equation in Q, so the set of all Q's satisfying it is convex closed.

Introduction to Cartesian Frames

I like it. I'll think about how it fits with my ways of thinking (eg model splintering).

Counterfactual control incentives

Cheers; Rebecca likes the "instrumental control incentive" terminology; she claims it's more in line with control theory terminology.

We agree that lack of control incentive on X does not mean that X is safe from influence from the agent, as it may be that the agent influences X as a side effect of achieving its true objective. As you point out, this is especially true when X and a utility node are probabilistically dependent.

I think it's more dangerous than that. When there is mutual information, the agent can learn to behave as if it was specifically manipulating X; the counterfactual approach doesn't seem to do what it was intended to do.

1 tom4everitt, 1mo: Glad she likes the name :) True, I agree there may be some interesting subtleties lurking there. (Sorry btw for slow reply; I keep missing alignmentforum notifications.)
1 Koen.Holtman, 2mo: On recent terminology innovation: For exactly the same reason, in my own recent paper Counterfactual Planning [https://arxiv.org/abs/2102.00834], I introduced the terms direct incentive and indirect incentive, where I frame the removal of a path to value in a planning world diagram as an action that will eliminate a direct incentive, but that may leave other indirect incentives (via other paths to value) intact. In section 6 of the paper and in this post of the sequence [https://www.alignmentforum.org/posts/BZKLf629NDNfEkZzJ/creating-agi-safety-interlocks] I develop and apply this terminology in the case of an agent emergency stop button. In high-level descriptions of what the technique of creating indifference via path removal (or balancing terms) does, I have settled on using the terminology suppresses the incentive instead of removes the incentive. I must admit that I have not read many control theory papers, so any insights from Rebecca about standard terminology from control theory would be welcome. Do they have some standard phrasing where they can say things like 'no value to control' while subtly reminding the reader that 'this does not imply there will be no side effects?'
2 Chris_Leong, 4mo: Fantastic!

AI Alignment, Philosophical Pluralism, and the Relevance of Non-Western Philosophy

(I don't think I can explain why here, though I am working on a longer explanation of what framings I like and why.)

Cheers, that would be very useful.

AI Alignment, Philosophical Pluralism, and the Relevance of Non-Western Philosophy

(I do think ontological shifts continue to be relevant to my description of the problem, but I've never been convinced that we should be particularly worried about ontological shifts, except inasmuch as they are one type of possible inner alignment / robustness failure.)

I feel that the whole AI alignment problem can be seen as problems with ontological shifts: https://www.lesswrong.com/posts/k54rgSg7GcjtXnMHX/model-splintering-moving-from-one-imperfect-model-to-another-1

4 rohinmshah, 4mo: I think I agree at least that many problems can be seen this way, but I suspect that other framings are more useful for solutions. (I don't think I can explain why here, though I am working on a longer explanation of what framings I like and why.) What I was claiming in the sentence you quoted was that I don't see ontological shifts as a huge additional category of problem that isn't covered by other problems, which is compatible with saying that ontological shifts can also represent many other problems.

Just another day in utopia

Enjoyed writing it, too.

Extortion beats brinksmanship, but the audience matters

Because a reputation for following up brinksmanship threats means that people won't enter into deals with you at all; extortion works because, to some extent, people have to "deal" with you even if they don't want to.

This is why I saw a Walmart-monopsony (monopolistic buyer) as closer to extortion, since not trading with them is not an option.

Extortion beats brinksmanship, but the audience matters

I'm thinking of it this way: investigating a supplier to check they are reasonable costs $1 to Walmart. The minimum price any supplier will offer is $10. After investigating, one supplier offers $10.5. Walmart refuses, knowing the supplier will not go lower, and publicises the exchange. The reason this is extortion, at least in the sense of this post, is that Walmart takes a cost (it will cost them at least $11 to investigate and hire another supplier) in order to build a reputation.
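The arithmetic in this example can be laid out as a short sketch. The variable names are mine, and it assumes the first investigation is a sunk cost that is excluded from both options.

```python
INVESTIGATION_COST = 1.0   # cost to Walmart of vetting one supplier
MIN_SUPPLIER_PRICE = 10.0  # lowest price any supplier will offer
CURRENT_OFFER = 10.5       # offer from the supplier already investigated

# Accepting: pay the offer on the table.
cost_if_accept = CURRENT_OFFER

# Refusing: vet a new supplier and pay at least the minimum price (a lower bound).
cost_if_refuse = INVESTIGATION_COST + MIN_SUPPLIER_PRICE

# What Walmart gives up, at minimum, to build its reputation.
reputation_premium = cost_if_refuse - cost_if_accept
print(cost_if_accept, cost_if_refuse, reputation_premium)
```

Refusing is the more expensive option in this single exchange, which is what makes it a costly signal aimed at other suppliers.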

1 Darmani, 6mo: Okay. I think you're saying this is extortion because Walmart's goal is to build a reputation for only agreeing to deals absurdly favorable to them. If the focus on building a reputation is the distinguishing factor, then how does that square with the following statement: "it is not useful for me to have a credible reputation for following up on brinksmanship threats?"

Extortion beats brinksmanship, but the audience matters

The connection to AI alignment is combining the different utilities of different entities without extortion ruining the combination, and dealing with threats and acausal trade.

Extortion beats brinksmanship, but the audience matters

I think the distinction is, from the point of view of the extortioner, "would it be in my interests to try and extort X, even if I know for a fact that X cannot be extorted and would force me to act on my threat, to the detriment of myself in that situation?"

If the answer is yes, then it's extortion (in the meaning of this post). Trying to extort the un-extortable, then acting on the threat, makes sense as a warning to others.

1 Darmani, 6mo: I see. In that case, I don't think the Walmart scenario is extortion. It is not to the detriment of Walmart to refuse to buy from a supplier who will not meet their demands, so long as they can find an adequate supplier who will.

Extortion beats brinksmanship, but the audience matters

That's a misspelling that's entirely my fault, and has now been corrected.

Extortion beats brinksmanship, but the audience matters

(1) You say that releasing nude photos is in the blackmail category. But who's the audience?

The other people of whom you have nude photos, who are now incentivised to pay up rather than kick up a fuss.

(2) For n=1, m large: Is an example of brinkmanship here a monopolistic buyer who will only choose suppliers giving cutrate prices?

Interesting example that I hadn't really considered. I'd say it fits more under extortion than brinksmanship, though. A small supplier has to sell, or they won't stay in business. If there's a single buyer, "I won't buy ...

2 Darmani, 6mo: I'd really appreciate a more rigorous/formal/specific definition of both. I'm not seeing what puts the Walmart example in the "extortion" category, and, without a clear distinction, this post dissolves.
3 romeostevensit, 6mo: Releasing one photo from a set previously believed to be secure, where other photos in the same set are compromising, can suffice for the single-member audience case.

Humans are stunningly rational and stunningly irrational

"within the limits of their intelligence" can mean anything, excuse any error, bias, and failure. Thus, they are not rational, and (from one perspective) very very far from it.

1 Nacruno96, 6mo: For a human being a view can be right, wrong or, as a third option, the energy you would need to put into deciding if something is right or wrong is so high that it would not make sense to go all the way trying to come up with an answer. This is exactly the case with establishing a V value. Unless you are all knowing, hence a god. Therefore given the means humans have, what they are doing is not quite that irrational. You can not be rational beyond your means. You can not say a human is irrational because he doesn’t fly away if he sees a lion. Because there are limits. And every being can just be rational or irrational within his limits. Irrationality within your limits would be to go into a forest without a gun, for example.

Knowledge, manipulation, and free will

Some people (me included) value a certain level of non-manipulation. I'm trying to cash out that instinct. And it's also needed for some ideas like corrigibility. Manipulation also combines poorly with value learning, see eg our paper here https://arxiv.org/abs/2004.13654

I do agree that saving the world is a clearly positive case of that ^_^

The Presumptuous Philosopher, self-locating information, and Solomonoff induction

I have an article on "Anthropic decision theory", with the video version here.

Basically, it's not that the presumptuous philosopher is more likely to be right in a given universe, it's that there are far more presumptuous philosophers in the large universe. So if we count "how many presumptuous philosophers are correct", we get a different answer to "in how many universes is the presumptuous philosopher correct". These things only come apart in anthropic situations.
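The two ways of counting can be made concrete with a small sketch. The setup is an assumption for illustration only: two equally likely universes, one philosopher in the small universe and a million in the large one, all of whom guess "large".

```python
from fractions import Fraction

# Hypothetical numbers, chosen only to make the two counts differ visibly.
universes = {
    "small": {"prior": Fraction(1, 2), "philosophers": 1},
    "large": {"prior": Fraction(1, 2), "philosophers": 1_000_000},
}
GUESS = "large"  # every presumptuous philosopher bets on the large universe

# "In how many universes is the presumptuous philosopher correct?"
universes_correct = sum(
    u["prior"] for name, u in universes.items() if name == GUESS
)

# "How many presumptuous philosophers are correct?" (expected fraction)
expected_total = sum(u["prior"] * u["philosophers"] for u in universes.values())
expected_correct = sum(
    u["prior"] * u["philosophers"]
    for name, u in universes.items()
    if name == GUESS
)
philosophers_correct = expected_correct / expected_total

print(universes_correct)     # fraction of universes where the guess is right
print(philosophers_correct)  # expected fraction of philosophers who are right
```

With one philosopher per universe the two fractions coincide, which is the sense in which they only come apart in anthropic situations.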

Comparing reward learning/reward tampering formalisms

Stuart, by " is complex" are you referring to...

I mean that defining can be done in many different ways, and hence has a lot of contingent structure. In contrast, in , the $\rho$ is a complex distribution on , conditional on ; hence itself is trivial and just encodes "apply to and in the obvious way".

Stuart_Armstrong's Shortform

This is a link to "An Increasingly Manipulative Newsfeed" about potential social media manipulation incentives (eg Facebook).

I'm putting the link here because I keep losing the original post (since it wasn't published by me, but I co-wrote it).

Anthropomorphisation vs value learning: type 1 vs type 2 errors

A boundedly-rational agent is assumed to be mostly rational, failing to be fully rational because of a failure to figure things out in enough detail.

Humans are occasionally rational, often biased, often inconsistent, sometimes consciously act against their best interests, often follow heuristics without thinking, sometimes do think things through. This doesn't seem to correspond to what is normally understood as "boundedly-rational".

2 Steven Byrnes, 7mo: Gotcha, thanks. I have corrected my comment two above [https://www.lesswrong.com/posts/LkytHQSKbQFf6toW5/anthropomorphisation-vs-value-learning-type-1-vs-type-2?commentId=aN6BRpDtqC6pHg7eG#2jhXuQhx7d2qPBKx5] by striking out the words "boundedly-rational", but I think the point of that comment still stands.

Anthropomorphisation vs value learning: type 1 vs type 2 errors

that paper was about fitting observations of humans to a mathematical model of "boundedly-rational agent pursuing a utility function"

It was "any sort of agent pursuing a reward function".

4 Steven Byrnes, 8mo: Sorry for the stupid question, but what's the difference between "boundedly-rational agent pursuing a reward function" and "any sort of agent pursuing a reward function"?

Anthropomorphisation vs value learning: type 1 vs type 2 errors

We don't need a special module to get an everyday definition of doorknobs, and likewise I don't think we need a special module to get an everyday definition of human motivation.

I disagree. Doorknobs exist in the world (even if the category is loosely defined, and has lots of edge cases), whereas goals/motivations are interpretations that we put upon agents. The main result of the Occam's razor paper is that the goals of an agent are not something that you can know without putting your own interpretation on it - even if you know every physical...

2 Steven Byrnes, 8mo: It's your first day working at the factory, and you're assigned to shadow Alice as she monitors the machines on the line. She walks over to the Big Machine and says, "Looks like it's flooping again," whacks it, and then says "I think that fixed it". This happens a few times a day perpetually. Over time, you learn what flooping is, kinda. When the Big Machine is flooping, it usually (but not always) makes a certain noise, it usually (but not always) has a loose belt, and it usually (but not always) has a gear that shifted out of place. Now you know what it means for the Big Machine to be flooping, although there are lots of edge cases where neither you nor Alice has a good answer for whether or not it's truly flooping, vs sorta flooping, vs not flooping. By the same token, you could give some labeled examples of "wants to take a walk" to the aliens, and they can find what those examples have in common and develop a concept of "wants to take a walk", albeit with edge cases. Then you can also give labeled examples of "wants to braid their hair", "wants to be accepted", etc., and after enough cycles of this, they'll get the more general concept of "want", again with edge cases. I don't think I'm saying anything that goes against your Occam's razor paper. As I understood it (and you can correct me!!), that paper was about fitting observations of humans to a mathematical model of "boundedly-rational agent pursuing a utility function", and proved that there's no objectively best way to do it, where "objectively best" includes things like fidelity and simplicity. (My perspective on that is, "Well yeah, duh, humans are not boundedly-rational agents pursuing a utility function! The model doesn't fit! There's no objectively best way to hammer a square peg into a round hole! (ETA: the model doesn't fit except insofar as the model is tautologically applicable to anything [https://www.lesswrong.com/s/4dHMdK5TLN6xcqtyc/p/NxF5G6CJiof6cemTw])") I don't see how the paper rules o

Anthropomorphisation vs value learning: type 1 vs type 2 errors

For instance throughout history people have been able to model and interact with traders from neighbouring or distant civilizations, even though they might think very differently.

Humans think very very similarly to each other, compared with random minds from the space of possible minds. For example, we recognise anger, aggression, fear, and so on, and share a lot of cultural universals https://en.wikipedia.org/wiki/Cultural_universal

1 spkoc, 8mo: Is the space of possible minds really that huge (or maybe really that alien?), though? I agree about humans having ... an instinctive ability to intuit the mental state of other humans. But isn't that partly learnable as well? We port this simulation ability relatively well to animals once we get used to their tells. Would we really struggle to learn the tells of other minds, as long as they were somewhat consistent over time and didn't have the ability to perfectly lie? Like what's a truly alien mind? At the end of the day we're Turing complete, we can simulate any computational process, albeit inefficiently.

Why haven't we celebrated any major achievements lately?

There haven’t been as many big accomplishments.

I think we should look at the demand side, not the supply side. We are producing lots of technological innovations, but there aren't so many major problems left for them to solve. The flush toilet was revolutionary; a super-flush ecological toilet with integrated sensors that can transform into a table... is much more advanced from the supply side, but barely more from the demand side: it doesn't fulfil many more needs than the standard flush toilet.

6 jasoncrawford, 8mo: I disagree, there are plenty of problems left to solve: https://rootsofprogress.org/the-plight-of-the-poor

Learning human preferences: black-box, white-box, and structured white-box access

Humans have a theory of mind, that makes certain types of modularizations easier. That doesn't mean that the same modularization is simple for an agent that doesn't share that theory of mind.

Then again, it might be. This is worth digging into empirically. See my post on the optimistic and pessimistic scenarios; in the optimistic scenario, preferences, human theory of mind, and all the other elements, are easy to deduce (there's an informal equivalence result; if one of those is easy to deduce, all the others are).

So we need to figure out if we're in the optimistic or the pessimistic scenario.

Learning human preferences: black-box, white-box, and structured white-box access

My understanding of the OP was that there is a robot [...]

That understanding is correct.

Then my question was: what if none of the variables, functions, etc. corresponds to "preferences"? What if "preferences" is a way that we try to interpret the robot, but not a natural subsystem or abstraction or function or anything else that would be useful for the robot's programmer?

I agree that preferences is a way we try to interpret the robot (and how we humans try to interpret each other). The programmer themselves could label the variables; but it's also po...

Learning human preferences: black-box, white-box, and structured white-box access

modularization is super helpful for simplifying things.

The best modularization for simplification will not likely correspond to the best modularization for distinguishing preferences from other parts of the agent's algorithm (that's the "Occam's razor" result).

2 John_Maxwell, 8mo: Let's say I'm trying to describe a hockey game. Modularizing the preferences from other aspects of the team algorithm makes it much easier to describe what happens at the start of the second period, when the two teams switch sides. The fact that humans find an abstraction useful is evidence that an AI will as well. The notion that agents have preferences helps us predict how people will change their plans for achieving their goals when they receive new information. Same for an AI.

Learning human preferences: black-box, white-box, and structured white-box access

but the function f is not part of the algorithm, it's only implemented by us onlookers. Right?

Then isn't that just a model at another level, a (labelled) model in the heads of the onlookers?

2 G Gordon Worley III, 8mo: Any model is going to be in the head of some onlooker. This is the tough part about the white box approach: it's always an inference about what's "really" going on. Of course, this is true even of the boundaries of black boxes, so it's a fully general problem. And I think that suggests it's not a problem except insofar as we have normal problems setting up correspondence between map and territory.

Learning human preferences: optimistic and pessimistic scenarios

Thanks! Useful insights in your post, to mull over.

2 SDM, 9mo: Glad you think so! I think that methods like using multiple information sources might be a useful way to reduce the number of [https://www.lesswrong.com/posts/hjPEw6HDnzmNvZAcH/sdm-s-shortform?commentId=epe7nrLbiP7tc3AqW] (potentially mistaken) normative assumptions you need in order to model a single human's preferences. The other area of human preference learning where you seem, inevitably, to need a lot of strong normative assumptions is in preference aggregation. If we assume we have elicited the preferences of lots of individual humans, and we're then trying to aggregate their preferences (with each human's preference represented by a separate model) I think the same basic principle applies, that you can reduce the normative assumptions you need by using a more complicated voting mechanism, in this case one that considers agents' ability to vote strategically as an opportunity to reach stable outcomes. I talk about this idea here [https://www.lesswrong.com/posts/hjPEw6HDnzmNvZAcH/sdm-s-shortform?commentId=epe7nrLbiP7tc3AqW]. As with using approval/actions to improve the elicitation of an individual's preferences, you can't avoid making any normative assumptions by using a more complicated aggregation method, but perhaps you end up having to make fewer of them. Very speculatively, if you can combine a robust method of eliciting preferences with few inbuilt assumptions with a similarly robust method of aggregating preferences, you're on your way to a full solution to ambitious value learning [https://www.lesswrong.com/posts/5eX8ko7GCxwR5N9mN/what-is-ambitious-value-learning].

Learning human preferences: optimistic and pessimistic scenarios

An imminent incoming post on this very issue ^_^