All of jbkjr's Comments + Replies

Redwood Research’s current project

I think it's really cool you're posting updates as you go and writing about uncertainties! I also like the fiction continuation as a good first task for experimenting with these things.

My life is a relentless sequence of exercises in importance sampling and counterfactual analysis

This made me laugh out loud :P

2 Buck 1mo: Thanks, glad to hear you appreciate us posting updates as we go.

Grokking the Intentional Stance

If you then deconfuse agency as "its behavior is reliably predictable by the intentional strategy", I then have the same question: "why is its behavior reliably predictable by the intentional strategy?" Sure, its behavior in the set of circumstances we've observed is predictable by the intentional strategy, but none of those circumstances involved human extinction; why expect that the behavior will continue to be reliably predictable in settings where the prediction is "causes human extinction"?

Overall, I generally agree with the intentional stance as an ...
3 rohinmshah 1mo: Yeah, I agree with all of that.

Goal-Directedness and Behavior, Redux

What's my take? I think that when we talk about goal-directedness, what we really care about is a range of possible behaviors, some of which we worry about in the context of alignment and safety.

  • (What I'm not saying) We shouldn't ascribe any cognition to the system, just find rules of association for its behavior (aka Behaviorism)
  • That's not even coherent with my favored approach to goal-directedness, the intentional stance. Dennett clearly ascribes beliefs and desires to beings and systems; his point is that the ascription is done based on the behavi...
3 adamShimi 2mo: I'm glad, you're one of the handful of people I wrote this post for. ;) Definitely. I have tended to neglect this angle, but I'm trying to correct that mistake.

How many parameters do self-driving-car neural nets have?

I think they do some sort of distillation type thing where they train massive models to label data or act as “overseers” for the much smaller models that actually are deployed in cars (as inference time has to be much better to make decisions in real time)… so I wouldn’t actually expect them to be that big in the actual cars. More details about this can be found in Karpathy’s recent CLVR talk, iirc, but not about parameter count/model size?

Re-Define Intent Alignment?

To try to understand a bit better: does your pessimism about this come from the hardness of the technical challenge of querying a zillion-particle entity for its objective function? Or does it come from the hardness of the definitional challenge of exhaustively labeling every one of those zillion particles to make sure the demarcation is fully specified? Or is there a reason you think constructing any such demarcation is impossible even in principle? Or something else?

Probably something like the last one, although I think "even in principle" is doing so...

1 Edouard Harris 3mo: I'm with you on this, and I suspect we'd agree on most questions of fact around this topic. Of course demarcation is an operation on maps and not on territories. But as a practical matter, the moment one starts talking about the definition of something such as a mesa-objective, one has already unfolded one's map and started pointing to features on it. And frankly, that seems fine! Because historically, a great way to make forward progress on a conceptual question has been to work out a sequence of maps that give you successive degrees of approximation to the territory. I'm not suggesting actually trying to imbue an AI with such concepts — that would be dangerous (for the reasons you alluded to) even if it wasn't pointless (because prosaic systems will just learn the representations they need anyway). All I'm saying is that the moment we started playing the game of definitions, we'd already started playing the game of maps. So using an arbitrary demarcation to construct our definitions might be bad for any number of legitimate reasons, but it can't be bad just because it caused us to start using maps: our earlier decision to talk about definitions already did that. (I'm not 100% sure if I've interpreted your objection correctly, so please let me know if I haven't.)

Re-Define Intent Alignment?

(Because you'd always be unable to answer the legitimate question: "the mesa-objective of what?")

All I'm saying is that, to the extent you can meaningfully ask the question, "what is this bit of the universe optimizing for?", you should be able to clearly demarcate which bit you're asking about.

I totally agree with this; I guess I'm just (very) wary about being able to "clearly demarcate" whichever bit we're asking about and therefore fairly pessimistic we can "meaningfully" ask the question to begin with? Like, if you start asking yourself questions li...

1 Edouard Harris 3mo: Yeah I agree this is a legitimate concern, though it seems like it is definitely possible to make such a demarcation in toy universes (like in the example I gave above). And therefore it ought to be possible in principle to do so in our universe. To try to understand a bit better: does your pessimism about this come from the hardness of the technical challenge of querying a zillion-particle entity for its objective function? Or does it come from the hardness of the definitional challenge of exhaustively labeling every one of those zillion particles to make sure the demarcation is fully specified? Or is there a reason you think constructing any such demarcation is impossible even in principle? Or something else?

Re-Define Intent Alignment?

Btw, if you're aware of any counterpoints to this — in particular anything like a clearly worked-out counterexample showing that one can't carve up a world, or recover a consistent utility function through this sort of process — please let me know. I'm directly working on a generalization of this problem at the moment, and anything like that could significantly accelerate my execution.

I'm not saying you can't reason under the assumption of a Cartesian boundary, I'm saying the results you obtain when doing so are of questionable relevance to reality, bec...

1 Edouard Harris 3mo: Ah I see! Thanks for clarifying. Yes, the point about the Cartesian boundary is important. And it's completely true that any agent / environment boundary we draw will always be arbitrary. But that doesn't mean one can't usefully draw such a boundary in the real world — and unless one does, it's hard to imagine how one could ever generate a working definition of something like a mesa-objective. (Because you'd always be unable to answer the legitimate question: "the mesa-objective of what?") Of course the right question will always be [https://www.alignmentforum.org/posts/znfkdCoHMANwqc2WE/the-ground-of-optimization-1]: "what is the whole universe optimizing for?" But it's hard to answer that! So in practice, we look at bits of the whole universe that we pretend are isolated. All I'm saying is that, to the extent you can meaningfully ask the question, "what is this bit of the universe optimizing for?", you should be able to clearly demarcate which bit you're asking about. (i.e. I agree with you that duality is a useful fiction, just saying that we can still use it to construct useful definitions.)

An Orthodox Case Against Utility Functions

I definitely see it as a shift in that direction, although I'm not ready to really bite the bullets -- I'm still feeling out what I personally see as the implications. Like, I want a realist-but-anti-realist view ;p

You might find Joscha Bach's view interesting...

Refactoring Alignment (attempt #2)

I didn't really take the time to try and define "mesa-objective" here. My definition would be something like this: if we took long enough, we could point to places in the big NN (or whatever) which represent goal content, similarly to how we can point to reward systems (/ motivation systems) in the human brain. Messing with these would change the apparent objective of the NN, much like messing with human motivation centers.

This sounds reasonable and similar to the kinds of ideas for understanding agents' goals as cognitively implemented that I've been e...

2 abramdemski 3mo: Seems fair. I'm similarly conflicted. In truth, both the generalization-focused path and the objective-focused path look a bit doomed to me.

Re-Define Intent Alignment?

I haven't engaged that much with the anti-EU-theory stuff, but my experience so far is that it usually involves a pretty strict idea of what is supposed to fit EU theory, and often, misunderstandings of EU theory. I have my own complaints about EU theory, but they just don't resonate at all with other people's complaints, it seems.

For example, I don't put much stock in the idea of utility functions, but I endorse a form of EU theory which avoids them. Specifically, I believe in approximately coherent expectations: you assign expected values to events, and ...
2 abramdemski 3mo: Right, exactly. (I should probably have just referred to that, but I was trying to avoid reference-dumping.)

Did they or didn't they learn tool use?

One idea as to the source of the potential discrepancy... did any of the task prompts for the tasks in which it did figure out how to use tools tell it explicitly to "use the objects to reach a higher floor," or something similar? I'm wondering if the cases where it did use tools are examples where doing so was instrumentally useful to achieving a prompted objective that didn't explicitly require tool use.

2 Daniel Kokotajlo 3mo: None of the prompts tell it what to do, they aren't even in English. (Or so I think? correct me if I'm wrong!) Instead they are in propositional logic, using atoms that refer to objects, colors, relations, and players. They just give the reward function in disjunctive normal form (i.e. big chain of disjunctions) and present it to the agent to observe.
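For anyone unfamiliar with the format, here is a minimal sketch of what checking a goal written in disjunctive normal form could look like. Strictly speaking, DNF is a disjunction of conjunctions of (possibly negated) atoms. The atom strings, the `dnf_satisfied` helper, and the 0/1 reward are all hypothetical illustrations, not the environment's actual encoding.

```python
# Hypothetical sketch: a goal in disjunctive normal form (an OR of AND-clauses).
# Each clause is a list of (atom, wanted) pairs; wanted=False means the atom is negated.
# All atom names below are invented for illustration.

def dnf_satisfied(goal, true_atoms):
    """Return True if at least one clause has every literal satisfied."""
    return any(
        all((atom in true_atoms) == wanted for atom, wanted in clause)
        for clause in goal
    )

goal = [
    # Clause 1: the player is near the purple sphere.
    [("near(player, purple_sphere)", True)],
    # Clause 2: the player holds the yellow cube AND does not see the opponent.
    [("holds(player, yellow_cube)", True), ("sees(player, opponent)", False)],
]

state_a = {"holds(player, yellow_cube)"}
state_b = {"holds(player, yellow_cube)", "sees(player, opponent)"}

# Assumed reward convention: 1.0 when the goal holds, 0.0 otherwise.
reward_a = 1.0 if dnf_satisfied(goal, state_a) else 0.0
reward_b = 1.0 if dnf_satisfied(goal, state_b) else 0.0
```

Here `state_a` satisfies the second clause, while `state_b` satisfies neither clause because the negated atom is true.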
Refactoring Alignment (attempt #2)

I'm not too keen on (2) since I don't expect mesa objectives to exist in the relevant sense.

Same, but how optimistic are you that we could figure out how to shape the motivations or internal "goals" (much more loosely defined than "mesa-objective") of our models via influencing the training objective/reward, the inductive biases of the model, the environments they're trained in, some combination of these things, etc.?

These aren't "clean", in the sense that you don't get a nice formal guarantee at the end that your AI system is going to (try to) do wha...
2 rohinmshah 3mo: That seems great, e.g. I think by far the best thing you can do is to make sure that you finetune using a reward function / labeling process that reflects what you actually want (i.e. what people typically call "outer alignment"). I probably should have mentioned that too, I was taking it as a given but I really shouldn't have. For inductive biases + environments, I do think controlling those appropriately would be useful and I would view that as an example of (1) in my previous comment.

Refactoring Alignment (attempt #2)

Intent Alignment: A model is intent-aligned if it has a mesa-objective, and that mesa-objective is aligned with humans. (Again, I don't want to get into exactly what "alignment" means.)

This path apparently implies building goal-oriented systems; all of the subgoals require that there actually is a mesa-objective.

I pretty strongly endorse the new diagram with the pseudo-equivalences, with one caveat (much the same comment as on your last post)... I think it's a mistake to think of only mesa-optimizers as having "intent" or being "goal-oriented" unless...

2 abramdemski 3mo: I too am a fan of broadening this a bit, but I am not sure how to. I didn't really take the time to try and define "mesa-objective" here. My definition would be something like this: if we took long enough, we could point to places in the big NN (or whatever) which represent goal content, similarly to how we can point to reward systems (/ motivation systems) in the human brain. Messing with these would change the apparent objective of the NN, much like messing with human motivation centers. I agree with your point about using "does this definition include humans" as a filter, and I think it would be easy to mess that up (and I wasn't thinking about it explicitly until you raised the point). However, I think possibly you want a very behavioral definition of mesa-objective. If that's true, I wonder if you should just identify with the generalization-focused path instead. After all, one of the main differences between the two paths is that the generalization-focused path uses behavioral definitions, while the objective-focused path assumes some kind of explicit representation of goal content within a system.

Re-Define Intent Alignment?

The behavioral objective, meanwhile, would be more like the thing the agent appears to be pursuing under some subset of possible distributional shifts. This is the more realistic case where we can't afford to expose our agent to every possible environment (or data distribution) that could possibly exist, so we make do and expose it to only a subset of them. Then we look at what objectives could be consistent with the agent's behavior under that subset of environments, and those count as valid behavioral objectives.

The key here is that the set of allowed m...

which stems from the assumption that you are able to carve an environment up into an agent and an environment and place the "same agent" in arbitrary environments. No such thing is possible in reality, as an agent cannot exist without its environment

I might be misunderstanding what you mean here, but carving up a world into agent vs environment is absolutely possible in reality, as is placing that agent in arbitrary environments to see what it does. You can think of the traditional RL setting as a concrete example of this: on one side we have an agen...

3 abramdemski 3mo: This makes some sense, but I don't generally trust some "perturbation set" to in fact capture the distributional shift which will be important in the real world. There has to at least be some statement that the perturbation set is actually quite broad. But I get the feeling that if we could make the right statement there, we would understand the problem in enough detail that we might have a very different framing. So, I'm not sure what to do here.

Re-Define Intent Alignment?

However, we could instead define "intent alignment" as "the optimal policy of the mesa objective would be good for humans".

I agree that we need a notion of "intent" that doesn't require a purely behavioral notion of a model's objectives, but I think it should also not be limited strictly to mesa-optimizers, which neither Rohin nor I expect to appear in practice. (Mesa-optimizers appear to me to be the formalization of the idea "what if ML systems, which by default are not well-described as EU maximizers, learned to be EU maximizers?" I suspect MIRI peop...

2 abramdemski 3mo: For myself, my reaction is "behavioral objectives also assume a system is well-described as EU maximizers". In either case, you're assuming that you can summarize a policy by a function it optimizes; the difference is whether you think the system itself thinks explicitly in those terms. I haven't engaged that much with the anti-EU-theory stuff, but my experience so far is that it usually involves a pretty strict idea of what is supposed to fit EU theory, and often, misunderstandings of EU theory. I have my own complaints about EU theory, but they just don't resonate at all with other people's complaints, it seems. For example, I don't put much stock in the idea of utility functions, but I endorse a form of EU theory which avoids them. Specifically, I believe in approximately coherent expectations: you assign expected values to events, and a large part of cognition is devoted to making these expectations as coherent as possible (updating them based on experience, propagating expectations of more distant events to nearer, etc). This is in contrast to keeping some centrally represented utility function, and devoting cognition to computing expectations for this utility function. In this picture, there is no clear distinction between terminal values and instrumental values. Something is "more terminal" if you treat it as more fixed (you resolve contradictions by updating the other values), and "more instrumental" if its value is more changeable based on other things. (Possibly you should consider my "approximately coherent expectations" idea)

Refactoring Alignment (attempt #2)

So, for example, this claims that either intent alignment + objective robustness or outer alignment + robustness would be sufficient for impact alignment.

Shouldn’t this be “intent alignment + capability robustness or outer alignment + robustness”?

Btw, I plan to post more detailed comments in response here and to your other post, just wanted to note this so hopefully there’s no confusion in interpreting your diagram.

2 abramdemski 3mo: Yep, fixed.

Looking Deeper at Deconfusion

Great post. My one piece of feedback is that not calling the post "Deconfusing 'Deconfusion'" might've been a missed opportunity. :)

I even went to this cooking class once where the chef proposed his own deconfusion of the transformations of food induced by different cooking techniques -- I still use it years later.

Unrelatedly, I would be interested in details on this.

2 adamShimi 3mo: To be fair, that was the original title. But after talking with Nate, I agreed that this perspective, although quite useful IMO, falls short of deconfusion because it hasn't paid its due in making the application (doing deconfusion) better/easier yet. Doesn't mean I don't expect it to eventually. :)

Why Subagents?

The way I'd think of it, it's not that you literally need unanimous agreement, but that in some situations there may be subagents that are strong enough to block a given decision.

Ah, I think that makes sense. Is this somehow related to the idea that the consciousness is more of a "last stop for a veto from the collective mind system" for already-subconsciously-initiated thoughts and actions? Struggling to remember where I read this, though.

It gets a little handwavy and metaphorical but so does the concept of a subagent.

Yeah, considering the fact tha...

Why Subagents?

Wouldn't decisions about e.g. which objects get selected and broadcast to the global workspace be made by a majority or plurality of subagents? "Committee requiring unanimous agreement" feels more like what would be the case in practice for a unified mind, to use a TMI term. I guess the unanimous agreement is only required because we're looking for strict/formal coherence in the overall system, whereas e.g. suboptimally-unified/coherent humans with lots of akrasia can have tug-of-wars between groups of subagents for control.

2 Kaj_Sotala 3mo: The way I'd think of it, it's not that you literally need unanimous agreement, but that in some situations there may be subagents that are strong enough to block a given decision. And then if you only look at the subagents that are strong enough to exert a major influence on that particular decision (and ignore the ones either who don't care about it or who aren't strong enough to make a difference), it kind of looks like a committee requiring unanimous agreement. It gets a little handwavy and metaphorical but so does the concept of a subagent. :)

Why Subagents?

The arrows show preference: our agent prefers A to B if (and only if) there is a directed path from A to B along the arrows.

Shouldn't this be "iff there is a directed path from B to A"? E.g. the agent prefers pepperoni to cheese, so there is a directed arrow from cheese to pepperoni.

4 johnswentworth 3mo: Nice catch, thanks.
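To make the direction of the arrows concrete, here is a small sketch of the reading proposed in the comment above: the agent prefers A to B iff there is a directed path from B to A. The `prefers` helper and the "plain" option are hypothetical, added only for illustration; only the cheese-to-pepperoni arrow comes from the thread.

```python
from collections import deque

def prefers(arrows, a, b):
    """Return True if `a` is preferred to `b`, i.e. a directed path runs from b to a.

    `arrows` maps each option to the options its outgoing arrows point at
    (each arrow points from the less preferred option to the more preferred one).
    """
    if a == b:
        return False
    seen = {b}
    queue = deque([b])
    while queue:  # breadth-first search starting from b
        node = queue.popleft()
        for nxt in arrows.get(node, ()):
            if nxt == a:
                return True
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False

# The cheese -> pepperoni arrow is from the thread; "plain" is a made-up extra option.
arrows = {
    "plain": ["cheese"],
    "cheese": ["pepperoni"],
}

pepperoni_over_cheese = prefers(arrows, "pepperoni", "cheese")
cheese_over_pepperoni = prefers(arrows, "cheese", "pepperoni")
pepperoni_over_plain = prefers(arrows, "pepperoni", "plain")  # via a two-step path
```

Options with no directed path between them in either direction come out as incomparable under this check: `prefers` returns False both ways.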
Taboo "Outside View"

Great post. That Anakin meme is gold.

“Whenever you notice yourself saying ‘outside view’ or ‘inside view,’ imagine a tiny Daniel Kokotajlo hopping up and down on your shoulder chirping ‘Taboo outside view.’”

Somehow I know this will now happen automatically whenever I hear or read “outside view.” 😂

The Hard Work of Translation (Buddhism)

The Buddha taught one specific concentration technique and a simple series of insight techniques

Any pointers on where I can find information about the specific techniques as originally taught by the Buddha?

2 romeostevensit 5mo: https://www.accesstoinsight.org/tipitaka/mn/mn.118.than.html
on interpretations: https://en.wikipedia.org/wiki/Ānāpānasati_Sutta
insight techniques: https://en.wikipedia.org/wiki/Satipatthana

A non-mystical explanation of "no-self" (three characteristics series)

I've found this interview with Richard Lang about the "headless" method of interrogation helpful and think Sam's discussion provides useful context to bridge the gap to the scientific skeptics as well as to other meditative techniques and traditions (some of which are touched upon in this post). It also includes a pointing out exercise.

2 abramdemski 5mo: Thanks!

Deducing Impact

Late to the party, but here's my crack at it (ROT13'd since markdown spoilers made it an empty box without my text):

Fbzrguvat srryf yvxr n ovt qrny vs V cerqvpg gung vg unf n (ovt) vzcnpg ba gur cbffvovyvgl bs zl tbnyf/inyhrf/bowrpgvirf orvat ernyvmrq. Nffhzvat sbe n zbzrag gung gur tbnyf/inyhrf ner jryy-pncgherq ol n hgvyvgl shapgvba, vzcnpg jbhyq or fbzrguvat yvxr rkcrpgrq hgvyvgl nsgre gur vzcnpgshy rirag - rkcrpgrq hgvyvgl orsber gur rirag. Boivbhfyl, nf lbh'ir cbvagrq bhg, fbzrguvat orvat vzcnpgshy nppbeqvat gb guvf abgvba qrcraqf obgu ba gur inyhrf naq ba ubj "bowrpgviryl vzcnpgshy" vg vf (v.r. ubj qenfgvpnyyl vg punatrf gur frg bs cbffvoyr shgherf).
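For readers who want to decode the ROT13 text above locally, a minimal sketch using Python's standard `codecs` module follows. The `rot13` wrapper name is just a convenience chosen here; only the first word of the comment is decoded, so the spoiler stays hidden.

```python
import codecs

def rot13(text: str) -> str:
    """Apply ROT13, which is its own inverse, so this both encodes and decodes."""
    return codecs.decode(text, "rot_13")

first_word = rot13("Fbzrguvat")
print(first_word)  # → Something
```

Passing the full encoded paragraph to `rot13` returns the plain text; applying it twice returns the original input.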

Old post/writing on optimization daemons?

Ah, it was John's post I was thinking of; thanks! (Apologies to John for not remembering it was his writing, although I suppose mistaking someone's visual imagery on a technical topic for Eliezer's might be considered an accidental compliment :).)