AI Alignment at MIRI




I’m no longer sure that I buy dutch book arguments and this makes me skeptical of the "utility function" abstraction

I have an intuition that the dutch-book arguments still apply in very relevant ways. I mostly want to talk about how maximization appears to be convergent. Let's see how this goes as a comment.

My main point: if you think an intelligent agent forms and pursues instrumental goals, then I think that agent will be doing a lot of maximization inside, and will prefer to not get dutch-booked relative to its instrumental goals.


First, an obvious take on the pizza non-transitivity thing.

If I'm that person desiring a slice of pizza, I'm perhaps desiring it because it will leave me full + taste good + not cost too much.

Is there something wrong with me paying some money to switch the pizza slice back and forth? Well, if the reason I cared about the pizza was that it was low-cost tasty food, then I guess I'm doing a bad job at getting what I care about.

If I enjoy the process of paying for a different slice of pizza, or am indifferent to it, then that's a different story. And it doesn't hurt much to pay 1 cent a couple of times anyway.
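To make the money-pump worry concrete, here's a minimal sketch. The three toppings, the cyclic preference table, and the 1-cent fee are all made up for illustration. The only point is that an agent who always pays to trade up along a preference cycle can end up holding the slice it started with, with less money.

```python
# Hypothetical cyclic preferences: each key is strictly preferred to its value.
PREFERS_OVER = {
    "mushroom": "pepperoni",
    "pepperoni": "anchovy",
    "anchovy": "mushroom",
}


def run_money_pump(start_slice, cents, fee_cents=1, offers=6):
    """Offer the agent, up to `offers` times, the slice it prefers to its current one.

    The agent accepts whenever it can afford the fee, because it strictly
    prefers the offered slice. Returns (final_slice, remaining_cents).
    """
    # Invert the table: for each slice, which slice is preferred to it?
    better_than = {worse: better for better, worse in PREFERS_OVER.items()}
    current = start_slice
    for _ in range(offers):
        if cents < fee_cents:
            break
        current = better_than[current]
        cents -= fee_cents
    return current, cents


final_slice, cents_left = run_money_pump("mushroom", cents=100)
print(final_slice, cents_left)
```

If the agent only cared about cheap tasty food, this is a pure loss. If it values the swapping itself, the table above is simply not a full description of its preferences.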


Second, suppose I'm trying to get to the moon. How would I go about it?

I might start with estimates about how valuable different suboutcomes are, relative to my attempt to get to the moon. For instance, I might begin with the theory that I need to have a million dollars to get to the moon, and that I'll need to acquire some rocket fuel too.

If I'm trying to get to the moon soon, I will be open to plans that make me money quickly, and teach me how to get rocket fuel. I would also like better ideas about how I should get to the moon, and if you told me how calculus and finite-element analysis would be useful, I'd update my plans. (And if I were smarter, I might have figured that out on my own.)

If I think that I need a much better grasp of calculus, I might then dedicate some time to learning about it. If you offer me a plan for learning more about calculus, better and faster, I'll happily update and follow it. If I'm smart enough to find a better plan on my own, by thinking, I'll update and follow it.


So, you might think that I can be an intelligent agent, and basically not do anything in my mind that looks like "maximizing". I disagree! In my above parable, it should be clear that my mind is continually selecting options that look better to me. I think this kind of selection is ubiquitous in my mind, and also in the minds of generally intelligent agents.

2021 New Year Optimization Puzzles

My best so far on puzzle 1:

Score: 108

This is a variant on  but we get  via , where we implement divide by 2 with sqrt.

Reality-Revealing and Reality-Masking Puzzles

Having a go at pointing at "reality-masking" puzzles:

There was the example of discovering how to cue your students into signalling they understand the content. I think this is about engaging with a reality-masking puzzle that might show up as "how can I avoid my students probing at my flaws while teaching" or "how can I have my students recommend me as a good tutor", etc.

It's a puzzle in the sense that it's an aspect of reality you're grappling with. It's reality-masking in that the pressure was away from building true/accurate maps.

Having a go at the analogous thing for "disabling part of the epistemic immune system": the cluster of things we're calling an "epistemic immune system" is part of reality and in fact important for people's stability and thinking, but part of the puzzle of "trying to have people be able to think/be agenty/etc" has tended to have us ignore that part of things.

Rather than, say, instinctively trusting that the "immune response" is telling us something important about reality and the person's way of thinking/grounding, one might be looking to avoid or disable the response. This feels reality-masking; like not engaging with the data that's there in a way that moves toward greater understanding and grounding.

AI Alignment Open Thread August 2019

It wasn't meant as a reply to a particular thing - mainly I'm flagging this as an AI-risk analogy I like.

On that theme, one thing "we don't know if the nukes will ignite the atmosphere" has in common with AI-risk is that the risk is from reaching new configurations (e.g. temperatures of the sort you get out of a nuclear bomb inside the Earth's atmosphere) that we don't have experience with. Which is an entirely different question than "what happens with the nukes after we don't ignite the atmosphere in a test explosion".

I like thinking about coordination from this viewpoint.

AI Alignment Open Thread August 2019

There is a nuclear analog for accident risk. A quote from Richard Hamming:

Shortly before the first field test (you realize that no small scale experiment can be done—either you have a critical mass or you do not), a man asked me to check some arithmetic he had done, and I agreed, thinking to fob it off on some subordinate. When I asked what it was, he said, "It is the probability that the test bomb will ignite the whole atmosphere." I decided I would check it myself! The next day when he came for the answers I remarked to him, "The arithmetic was apparently correct but I do not know about the formulas for the capture cross sections for oxygen and nitrogen—after all, there could be no experiments at the needed energy levels." He replied, like a physicist talking to a mathematician, that he wanted me to check the arithmetic not the physics, and left. I said to myself, "What have you done, Hamming, you are involved in risking all of life that is known in the Universe, and you do not know much of an essential part?" I was pacing up and down the corridor when a friend asked me what was bothering me. I told him. His reply was, "Never mind, Hamming, no one will ever blame you."


Coherent behaviour in the real world is an incoherent concept
First problem with this argument: there are no coherence theorems saying that an agent needs to maintain the same utility function over time.

This seems pretty false to me. If you can predict in advance that some future you will be optimizing for something else, you could trade with future "you" and merge utility functions, which seems strictly better than not. (Side note: I'm pretty annoyed with all the use of "there's no coherence theorem for X" in this post.)

As a separate note, the "further out" your goal is and the more that your actions are for instrumental value, the more it should look like world 1 in which agents are valuing abstract properties of world states, and the less we should observe preferences over trajectories to reach said states.

(This is a reason in my mind to prefer the approval-directed-agent frame, in which humans get to inject preferences that are more about trajectories.)

Diagonalization Fixed Point Exercises

Q7 (Python):

Y = lambda s: eval(s)(s)
Y('lambda s: print(f"Y = lambda s: eval(s)(s)\\nY({s!r})")')
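Here's a hedged way to sanity-check a quine of this shape. `is_quine` is a helper name I'm making up. The sketch assumes the program is exactly two lines and that the inner string is an f-string, so that `{s!r}` actually gets interpolated.

```python
import contextlib
import io

# The two-line program as text; the raw string keeps its backslashes literal.
SOURCE = (
    "Y = lambda s: eval(s)(s)\n"
    r"""Y('lambda s: print(f"Y = lambda s: eval(s)(s)\\nY({s!r})")')"""
)


def is_quine(source):
    """Run `source` with stdout captured and compare the output to `source`."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(source, {})
    # print() appends one trailing newline, so strip it before comparing.
    return buffer.getvalue().rstrip("\n") == source


print(is_quine(SOURCE))
```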

Q8 (Python):

Not sure about the interpretation of this one. Here's a way to have it work for any fixed (python function) f:

f = 'lambda s: "\\n".join(s.splitlines()[::-1])'
go = 'lambda s: print(eval(f)(eval(s)(s)))'
eval(go)('lambda src: f"f = {f!r}\\ngo = {go!r}\\neval(go)({src!r})"')
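A quick check of the Q8 program, under the assumption that it is exactly three lines with no blank lines between them (the reconstructed source has none, so blank lines would break the match). `run_and_capture` and `reverse_lines` are names I'm introducing for the check. `reverse_lines` just restates the example `f` as an ordinary function.

```python
import contextlib
import io

# The three-line program as text; raw strings keep the backslashes literal.
LINES = [
    r"""f = 'lambda s: "\\n".join(s.splitlines()[::-1])'""",
    r"""go = 'lambda s: print(eval(f)(eval(s)(s)))'""",
    r"""eval(go)('lambda src: f"f = {f!r}\\ngo = {go!r}\\neval(go)({src!r})"')""",
]
SOURCE = "\n".join(LINES)


def run_and_capture(source):
    """Execute `source` in a fresh namespace and return what it printed."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(source, {})
    # print() appends one trailing newline, so strip it.
    return buffer.getvalue().rstrip("\n")


def reverse_lines(text):
    """The example f from the program: reverse the order of the lines."""
    return "\n".join(text.splitlines()[::-1])


# The program should print f applied to its own source.
print(run_and_capture(SOURCE) == reverse_lines(SOURCE))
```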

Rationalist Lent

I've recently noticed something about myself: attempting to push away or not have an experience actually means pushing away the parts of myself that have that experience.

I then feel an urge to remind readers of a view of Rationalist Lent as an experiment. Don't let this be another way that you look away from what's real for you. But do let it be a way to learn more about what's real for you.

Beta-Beta Testing: Frontpage Rework [Update - further tweak]

Just a PSA: right-clicking or middle-clicking the posts on the frontpage toggles whether the preview is open. Please make them only expand on left clicks, or equivalent!

Against Instrumental Convergence

Let's go a little meta.

It seems clear that an agent that "maximizes utility" exhibits instrumental convergence. I think we can state a stronger claim: any agent that "plans to reach imagined futures", with some implicit "preferences over futures", exhibits instrumental convergence.

The question then is how much can you weaken the constraint "looks like a utility maximizer", before instrumental convergence breaks? Where is the point in between "formless program" and "selects preferred imagined futures" at which instrumental convergence starts/stops applying?


This moves in the direction of working out exactly which components of utility-maximizing behaviour are necessary. (Personally, I think you might only need to assume "backchaining".)
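As a toy illustration of the "backchaining" guess, here is a sketch in which two unrelated goals are chained backwards through their prerequisites. The goals and the `REQUIRES` table are entirely hypothetical, and I've rigged them to share prerequisites, so this illustrates the claim rather than being evidence for it.

```python
# Hypothetical toy world: each goal lists the subgoals it directly requires.
REQUIRES = {
    "reach the moon": ["rocket fuel", "engineering skill"],
    "cure a disease": ["lab equipment", "biology skill"],
    "rocket fuel": ["money"],
    "lab equipment": ["money"],
    "engineering skill": ["study time"],
    "biology skill": ["study time"],
    "money": [],
    "study time": [],
}


def backchain(goal):
    """Return every subgoal reachable by chaining backwards from `goal`."""
    needed = set()
    frontier = [goal]
    while frontier:
        current = frontier.pop()
        for subgoal in REQUIRES[current]:
            if subgoal not in needed:
                needed.add(subgoal)
                frontier.append(subgoal)
    return needed


# Subgoals that show up no matter which of the two final goals we picked.
shared = backchain("reach the moon") & backchain("cure a disease")
print(sorted(shared))
```

Nothing in `backchain` mentions a utility function. The shared subgoals fall out of the structure of the prerequisites, which is the sense in which backchaining alone might be enough.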

So, I'm curious: What do you think a minimal set of necessary pieces might be, before a program is close enough to "goal directed" for instrumental convergence to apply?

This might be a difficult question to answer, but it's probably a good way to understand why instrumental convergence feels so real to other people.
