AI Alignment at MIRI




Reality-Revealing and Reality-Masking Puzzles

Having a go at pointing at "reality-masking" puzzles:

There was the example of discovering how to cue your students into signalling that they understand the content. I think this is about engaging with a reality-masking puzzle, one that might show up as "how can I avoid my students probing at my flaws while teaching", or "how can I get my students to recommend me as a good tutor", and so on.

It's a puzzle in the sense that it's an aspect of reality you're grappling with. It's reality-masking in that the pressure is away from building true/accurate maps.

Having a go at the analogous thing for "disabling part of the epistemic immune system": the cluster of things we're calling an "epistemic immune system" is part of reality, and is in fact important for people's stability and thinking. But part of the puzzle of trying to have people be able to think, be agenty, etc. has tended to make us ignore that part of things.

Rather than, say, instinctively trusting that the "immune response" is telling us something important about reality and the person's way of thinking/grounding, one might be looking to avoid or disable the response. This feels reality-masking; like not engaging with the data that's there in a way that moves toward greater understanding and grounding.

AI Alignment Open Thread August 2019

It wasn't meant as a reply to a particular thing - mainly I'm flagging this as an AI-risk analogy I like.

On that theme, one thing "we don't know if the nukes will ignite the atmosphere" has in common with AI risk is that the danger comes from reaching new configurations (e.g. temperatures of the sort a nuclear bomb produces inside the Earth's atmosphere) that we have no experience with. That's an entirely different question from "what happens with the nukes once a test explosion has shown the atmosphere doesn't ignite".

I like thinking about coordination from this viewpoint.

AI Alignment Open Thread August 2019

There is a nuclear analog for accident risk. A quote from Richard Hamming:

Shortly before the first field test (you realize that no small scale experiment can be done—either you have a critical mass or you do not), a man asked me to check some arithmetic he had done, and I agreed, thinking to fob it off on some subordinate. When I asked what it was, he said, "It is the probability that the test bomb will ignite the whole atmosphere." I decided I would check it myself! The next day when he came for the answers I remarked to him, "The arithmetic was apparently correct but I do not know about the formulas for the capture cross sections for oxygen and nitrogen—after all, there could be no experiments at the needed energy levels." He replied, like a physicist talking to a mathematician, that he wanted me to check the arithmetic not the physics, and left. I said to myself, "What have you done, Hamming, you are involved in risking all of life that is known in the Universe, and you do not know much of an essential part?" I was pacing up and down the corridor when a friend asked me what was bothering me. I told him. His reply was, "Never mind, Hamming, no one will ever blame you."


Coherent behaviour in the real world is an incoherent concept
First problem with this argument: there are no coherence theorems saying that an agent needs to maintain the same utility function over time.

This seems pretty false to me. If you can predict in advance that some future you will be optimizing for something else, you could trade with that future "you" and merge utility functions, which seems strictly better than not doing so. (Side note: I'm pretty annoyed by all the uses of "there's no coherence theorem for X" in this post.)

As a separate note, the "further out" your goal is, and the more your actions carry only instrumental value, the more it should look like world 1, in which agents value abstract properties of world states, and the less we should observe preferences over the trajectories used to reach those states.

(This is a reason in my mind to prefer the approval-directed-agent frame, in which humans get to inject preferences that are more about trajectories.)

Diagonalization Fixed Point Exercises

Q7 (Python):

Y = lambda s: eval(s)(s)
Y('lambda s: print(f"Y = lambda s: eval(s)(s)\\nY({s!r})")')
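As a sanity check (my own addition, not part of the original answer), you can run the quine and compare its output against its own source. Note the `f` prefix on the inner string: without it, `{s!r}` would be printed literally instead of being interpolated, and the output wouldn't match.

```python
import io
from contextlib import redirect_stdout

# The two-line quine as a single source string (with the f-string prefix
# so that {s!r} gets interpolated).
src = 'Y = lambda s: eval(s)(s)\nY(\'lambda s: print(f"Y = lambda s: eval(s)(s)\\\\nY({s!r})")\')'

buf = io.StringIO()
with redirect_stdout(buf):
    exec(src)  # running the program prints its own source

assert buf.getvalue() == src + "\n"
print("quine verified")
```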

Q8 (Python):

Not sure about the intended interpretation of this one. Here's a way to make it work for any fixed Python function f:

# f reverses the lines of a string
f = 'lambda s: "\\n".join(s.splitlines()[::-1])'

# go prints f applied to the program's own source
go = 'lambda s: print(eval(f)(eval(s)(s)))'

# the quoted lambda reconstructs the full source, including these definitions
eval(go)('lambda src: f"f = {f!r}\\ngo = {go!r}\\neval(go)({src!r})"')

Rationalist Lent

I've recently noticed something about myself: attempting to push away or not have an experience actually means pushing away the parts of me that are having that experience.

I then feel an urge to remind readers of the view of Rationalist Lent as an experiment. Don't let this be another way that you look away from what's real for you; do let it be a way to learn more about what's real for you.

Beta-Beta Testing: Frontpage Rework [Update - further tweak]

Just a PSA: right-clicking or middle-clicking a post on the frontpage toggles whether its preview is open. Please make previews expand only on left clicks, or equivalent!

Against Instrumental Convergence

Let's go a little meta.

It seems clear that an agent that "maximizes utility" exhibits instrumental convergence. I think we can state a stronger claim: any agent that "plans to reach imagined futures", with some implicit "preferences over futures", exhibits instrumental convergence.

The question then is how much can you weaken the constraint "looks like a utility maximizer", before instrumental convergence breaks? Where is the point in between "formless program" and "selects preferred imagined futures" at which instrumental convergence starts/stops applying?


This moves in the direction of working out exactly which components of utility-maximizing behaviour are necessary. (Personally, I think you might only need to assume "backchaining".)

So, I'm curious: What do you think a minimal set of necessary pieces might be, before a program is close enough to "goal directed" for instrumental convergence to apply?

This might be a difficult question to answer, but it's probably a good way to understand why instrumental convergence feels so real to other people.

Against Instrumental Convergence

Hm, I think an important piece of "intuitionistic proof" didn't transfer, or is broken. Drawing attention to that part:

Regardless of the details of how "decisions" are made, it seems easy for the choice to be one of the massive array of outcomes possible once you have control of the light-cone, made possible by acquiring power.

So here, I realize, I am relying on something like "the AI implicitly moves toward an imagined realizable future". I think that's a lot easier to get than the pipeline you sketch.

I think I'm being pretty unclear - I'm having trouble conveying my thought structure here. I'll go make a meta-level comment instead.

Against Instrumental Convergence

I think there's an important thing to note, if it doesn't already feel obvious: the concept of instrumental convergence applies to roughly anything that exhibits consequentialist behaviour, i.e. anything that does something like backchaining in its thinking.

Here's my rough attempt at an intuitionistic proof:

If you have some kind of program that understands consequences, backchains, etc., then perhaps it's capable of recognizing that "acquire lots of power" will then let it choose from a much larger set of possibilities. Regardless of the details of how "decisions" are made, it seems easy for the choice to be one of the massive array of outcomes possible once you have control of the light-cone, made possible by acquiring power. And thus I'm worried about "instrumental convergence".


At this point, I'm already much more worried about instrumental convergence, because backchaining feels damn useful. It's the sort of thing I'd expect most competent mind-like programs to be using in some form somewhere. It certainly seems more plausible to me that a random mind does backchaining than that a random mind looks like "utility function over here" and "maximizer over there".

(For instance, even setting aside how AI researchers are literally building backchaining/planning into RL agents, one might expect most powerful reinforcement learners to benefit a lot from being able to reason in a consequentialist way about actions. If you can't literally solve your domain with a lookup table, then causality and counterfactuals let you learn more from data, and better optimize your reward signal.)
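To make the backchaining intuition concrete, here's a toy sketch (entirely my own construction; the world model and action names are made up for illustration): a backward-chaining planner over a tiny deterministic world. The point is that very different terminal goals all yield plans that begin with the same resource-acquiring step.

```python
# Toy world: each action maps preconditions (facts required) to effects
# (facts added). All names are illustrative, not from the post.
ACTIONS = {
    "gather_resources": (set(),                    {"resources"}),
    "build_factory":    ({"resources"},            {"factory"}),
    "make_paperclips":  ({"factory", "resources"}, {"paperclips"}),
    "launch_probe":     ({"factory", "resources"}, {"probe"}),
    "fund_research":    ({"resources"},            {"research"}),
}

def backchain(fact, plan=None):
    """Backward-chain from a goal fact: pick an action that produces it,
    recursively satisfy that action's preconditions, then append the action."""
    if plan is None:
        plan = []
    for name, (pre, add) in ACTIONS.items():
        if fact in add:
            for p in sorted(pre):
                backchain(p, plan)
            if name not in plan:
                plan.append(name)
            return plan
    return plan  # no action produces this fact

# Very different terminal goals all route through the same first step:
for goal in ("paperclips", "probe", "research"):
    print(goal, "->", backchain(goal))
```

Nothing in `backchain` mentions power or resources explicitly; the convergence falls out of the structure of the world, which is the shape of the worry.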


Finally, I should point at some relevant thinking around how consequentialists probably dominate the universal prior. (Meaning: if you do an AIXI-like random search over programs, you get back mostly-consequentialists). See this post from Paul, and a small discussion on agentfoundations.
