Ramana Kumar


Coherence arguments do not entail goal-directed behavior

Actually, no matter what the policy is, we can view the agent as an EU maximizer. The construction is simple: the agent can be thought of as optimizing the utility function U, where U(h, a) = 1 if the policy would take action a given history h, else 0. Here I’m assuming that U is defined over histories that are composed of states/observations and actions.

This is not the type signature for a utility function that matters for the coherence arguments (by which I don't mean VNM - see this comment). It does often fit the type signature in the way those arguments are formulated/formalised, but intuitively, it's not getting at the point of the theorems. I suggest you consider utility functions defined as functions of the state of the world only, not including the action taken. (Yes I know actions could be logged in the world state, the agent is embedded in the state, etc. - this is all irrelevant for the point I'm trying to make - I'm suggesting to consider the setup where there's a Cartesian boundary, an unknown transition function, and environment states that don't contain a log of actions.) I don't think the above kind of construction works in that setting. I think that's the kind of setting it's better to focus on.
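For concreteness, the quoted construction can be sketched in a few lines of Python. This is only an illustration under stated assumptions: histories are modelled as tuples, the policy is deterministic, and the names `utility_from_policy`, `argmax_action`, and `odd_policy` are hypothetical, not taken from the original post.

```python
from typing import Callable, Hashable, Sequence, Tuple

History = Tuple[Hashable, ...]
Action = Hashable
Policy = Callable[[History], Action]


def utility_from_policy(policy: Policy) -> Callable[[History, Action], int]:
    """Build U(h, a) = 1 if the policy would take action a given history h, else 0."""
    def U(history: History, action: Action) -> int:
        return 1 if policy(history) == action else 0
    return U


def argmax_action(U, history: History, actions: Sequence[Action]) -> Action:
    """Pick the action that maximizes U at this history."""
    return max(actions, key=lambda a: U(history, a))


# Hypothetical, arbitrary policy: the choice depends only on history length.
def odd_policy(history: History) -> Action:
    return "left" if len(history) % 2 == 0 else "right"


ACTIONS = ("left", "right")
U = utility_from_policy(odd_policy)

# Maximizing U reproduces the policy on every history we try.
histories = [(), ("s0",), ("s0", "left", "s1")]
agreement = [argmax_action(U, h, ACTIONS) == odd_policy(h) for h in histories]
```

Note that `U` takes the action as an argument. A utility function of the world state alone would have the signature `U(state)`, and the trick above has no obvious analogue there.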

Life and expanding steerable consequences

Now, four billion years later, we are about to set in motion a second seed.

We can also view this as part of the effects of the initial seed.


I'm a little confused about what the large-scale predictable, and steerable, consequences of the initial seed are. For predictable consequences, I can imagine things like proliferation of certain kinds of molecules (like proteins). But where's the steerability?

Ngo and Yudkowsky on alignment difficulty

A couple of direct questions I'm stuck on:

  • Do you agree that Flint's optimizing systems are a good model (or even definition) of outcome pumps?
  • Are black holes and fires reasonable examples of outcome pumps?

I'm asking these to understand the work better.

Currently my answers are:

  • Yes. Flint's notion is one I came to independently when thinking about "goal-directedness". It could be missing some details, but I find it hard to snap out of the framework entirely.
  • Yes. But maybe not the most informative examples. They're highly non-retargetable.

Yudkowsky and Christiano discuss "Takeoff Speeds"

I wonder what effect there is from selecting for reading the third post in a sequence of MIRI conversations from start to end and also looking at the comments and clicking links in them.

Ngo and Yudkowsky on alignment difficulty

Thanks for the replies! I'm still somewhat confused but will try again to both ask the question more clearly and summarise my current understanding.

What, in the case of consequentialists, is analogous to the water funnelled by literal funnels? Is it possibilities-according-to-us? Or is it possibilities-according-to-the-consequentialist? Or is it neither (or both) of those?

To clarify a little what the options in my original comment were, I'll say what I think they correspond to for literal funnels. Option 1 corresponds to the fact that funnels are usually nearby (in spacetime) when water is in a small space without having spilled, and Option 2 corresponds to the characteristic funnel shape (in combination with facts about physical laws maybe).


I think your and Eliezer's replies are pointing me at a sense in which both Option 1 and Option 2 are correct, but they are used in different ways in the overall story. To tell this story, I want to draw a distinction between outcome-pumps (behavioural agents) and consequentialists (structural agents). Outcome-pumps are effective at achieving outcomes, and this effectiveness is measured according to our models (Option 1). Consequentialists do (or have done in their causal history) the work of selecting actions according to expected consequences in coherent pursuit of an outcome, and the expected consequences are therefore their own (Option 2).

Spelling this out a little more - Outcome-pumps are optimizing systems: there is a space of possible configurations, a much smaller target subset of configurations, and a basin of attraction such that if the system+surroundings starts within the basin, it ends up within the target. There are at least two ways of looking at the configuration space. Firstly, there's the range of situations in which we actually observe the same (or similar) outcome-pump system and that it achieved its outcome. Secondly, there's the range of hypothetical possibilities we can imagine and reason about putting the outcome-pump system into, and extrapolating (using our own models) that it will achieve the outcome. Both of these ways are "Option 1".
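As a toy illustration of the configuration space / target / basin picture (a sketch only, not Flint's formal definition; the dynamics `step`, the sets `CONFIGS` and `TARGET`, and the step budget are all made up for the example):

```python
def step(x: int) -> int:
    """Toy dynamics: configurations 1..9 drift toward 5; 0 and 10 are stuck."""
    if x in (0, 10):
        return x
    if x < 5:
        return x + 1
    if x > 5:
        return x - 1
    return x


CONFIGS = range(0, 11)   # space of possible configurations
TARGET = {5}             # much smaller target subset


def final_config(x: int, max_steps: int = 20) -> int:
    """Extrapolate the system forward using our own model of its dynamics."""
    for _ in range(max_steps):
        x = step(x)
    return x


# Basin of attraction: every starting configuration that ends up in the target.
BASIN = {x for x in CONFIGS if final_config(x) in TARGET}
```

Here `BASIN` is computed by extrapolating with our model of the system, which is the second of the two "Option 1" readings above.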

Consequentialists (structural agents) do the work, somewhere somehow - maybe in their brains, maybe in their causal history, maybe in other parts of their structure and history - of maintaining and updating beliefs and selecting actions that lead to (their modelled) expected consequences that are high in their preference ordering (this is all Option 2).

It should be somewhat uncontroversial that consequentialists are outcome pumps, to the extent that they’re any good at doing the consequentialist thing (and have sufficiently achievable preferences relative to their resources etc).

The more substantial claim I read MIRI as making is that outcome pumps are consequentialists, because the only way to be an outcome pump is to be a consequentialist. Maybe you wouldn't make this claim so strongly, since there are counterexamples like fires and black holes -- and there may be some restrictions on what kind of outcome pumps the claim applies to (such as some level of retargetability or robustness?).

How does this overall take sound?

Scott Garrabrant’s question on whether agent-like behaviour implies agent-like architecture seems pretty relevant to this whole discussion -- Eliezer, do you have an answer to that question? Or at least do you think it’s an important open question?

Ngo and Yudkowsky on alignment difficulty

A couple of other arguments the non-MIRI side might add here:

  • The things AI systems today can do are already hitting pretty narrow targets. E.g., generating English text that is coherent is not something you’d expect from a random neural network. Why is corrigibility so much more of a narrow target than that? (I think Rohin may have said this to me at some point.)
  • How do we imagine scaled up humans [e.g. thinking faster, thinking in more copies, having more resources, or having more IQ] to be effective? Wouldn’t they be corrigible? Wouldn't they have nice goals? What can we learn from the closest examples we already have of scaled up humans? (h/t Shahar for bringing this point up in conversation).

Ngo and Yudkowsky on alignment difficulty

Here Daniel Kokotajlo and I try to paraphrase the two sides of part of the disagreement and point towards a possible crux about the simplicity of corrigibility.

We are training big neural nets to be effective. (More on what effective means elsewhere; it means something like “being able to steer the future better than humans can.”) We want to have an effective&corrigible system, and we are worried that instead we’ll get an effective&deceptive system. Ngo, Shah, etc. are hopeful that it won’t be “that hard” to get the former and avoid the latter; maybe if we just apply selection pressure in the various ways that have been discovered so far (adversarial training, oversight, process-based feedback, etc.) it’ll work. Yudkowsky is more pessimistic; he thinks that the ways that have been discovered so far really don’t seem good enough. Instead of creating an effective&corrigible system, they’ll create either an ineffective&corrigible system, or an effective&deceptive system that deceives us into thinking it is corrigible.

What are the arguments they give for their respective positions?

Yudkowsky (we think) says that corrigibility is both (a) significantly more complex than deception, and (b) at cross-purposes to effectiveness.

Ngo and Yudkowsky on alignment difficulty

I am interested in the history-funnelling property -- the property of being like a consequentialist, or of being effective at achieving an outcome -- and have a specific confusion I'd love to get insight on from anyone who has any.

Question: Possible outcomes are in the mind of a world-modeller - reality just is as it is (exactly one way) and isn't made of possibilities. So in what sense do the consequentialist-like things Yudkowsky is referring to funnel history?

Option 1 (robustness/behavioural/our models): They achieve narrow outcomes with respect to an externally specified set of counterfactuals. E.g., relative to what we consider "could have happened", the consequentialists selected an excellent course of action for their purposes. This would make consequentialists optimizing systems in Flint's sense.

Option 2 (agency/structural/their models): They are structured in such a way that they do their own considering and evaluating and deciding. We observe mechanisms that implement the processes of predicting and evaluating outcomes in these systems (and/or their history). So the possibilities that are narrowed down are the consequentialist's possibilities, the counterfactuals are produced by their models which may or may not line up with some externally specified ones (like ours).

I mostly think Yudkowsky is referring to Option 2, but I get confused by phrases (e.g. from Soares's summary) like "manage to actually funnel history" or "apparent consequentialism", that seem to me to make most sense under Option 1.

Optimization Concepts in the Game of Life

Thanks! I'd had a bit of a look through that book before and agree it's a great resource. One thing I wasn't able to easily find is examples of robust patterns. Does anyone know if there's been much investigation of robustness in the Life community? The focus I've seen seems to be more on particular constructions (each used in its entirety as the initial state for a computation), rather than on how patterns fare when placed in a range of different contexts.

Recommending Understand, a Game about Discerning the Rules

Can this be played on Linux? I was about to buy on Steam but noticed it only listed Windows and macOS system requirements and I don't want to buy something that'll sit in my library unplayable.
