I agree that motivation should reduce to low-level, primitive things, and also that changing the agent's belief about where the cheese is lets you retarget behavior. However, I don't expect edits to beliefs to let you scalably control what the agent does: if it's smart enough and making sufficiently complicated plans, you won't have a reliable mapping from (world model state) to (abstract class of behavior executed by the agent). By "abstract class of behavior" I mean things like "put the red balls in the blue basket" or "pet all the cats in the environment."

It also seems plausible to me that there exist parts of the agent that do allow for scalable control through modification, and this is what I would refer to as "the values" (the classic example here is a utility function, though things like RL agents might not have those). 

But maybe you're studying the structure of motivational circuitry with a downstream objective other than "scalable control," in which case this objection doesn't necessarily apply.

My main advice to avoid this failure mode is to leverage your Pareto frontier. Apply whatever knowledge, or combination of knowledge, you have which others in the field don’t.


This makes sense if you already have knowledge which other people don't, but what if you don't? How much should "number of people in the alignment community who already know X" factor into what you decide to study, relative to other factors like "how useful is X, ignoring what everyone else is doing?" For instance, there are probably fewer people who know a lot about geology than who know a lot about economics, but I would expect that learning about economics would still be more valuable for doing agent foundations research.

(My guess is that the answer is "don't worry much at all about the Pareto frontier stuff when deciding what to study," especially because there aren't that many alignment researchers anyways, but I'm not actually sure.) 


Great post! I'm looking forward to seeing future projects from Team Shard. 

I'm curious why you frame channel 55 as being part of the agent's "cheese-seeking motivation," as opposed to simply encoding the agent's belief about where the cheese is. Unless I'm missing something, I'd expect the latter to be at least as likely, since when you change the cheese's location, the thing that should straightforwardly change is the agent's model of the cheese's location.

Frames can be wrong, and using a wrong frame is costly, even if (especially if) everyone agrees on the frame.


It seems to me that having a wrong shared frame when studying a problem might still be useful as long as it's not too wrong (as long as the way it divides the world up isn't too far away from the "real" lines), because the world is high-dimensional and having a frame makes thinking about it more tractable. And it can be useful to share this wrong-but-not-too-wrong frame with other people because then you and your colleagues can talk to each other more easily.

Maybe it's the "everyone agrees" part that's doing the work here? It seems important that people acknowledge that they're using the frame because it's useful and not because there's strong evidence for it being correct. Or maybe you consider all useful frames to be correct frames?

Curious to know if you disagree.