Research analyst at Open Philanthropy. Doctoral student in philosophy at the University of Oxford. Opinions my own.
Cool (though FWIW, if you're going to lean on the notion of policies being aligned with humans, I'd be inclined to define that as well, in addition to defining what it is for agents to be aligned with humans. But maybe the implied definition is clear enough: I'm assuming you have in mind something like "a policy is aligned with humans if an agent implementing that policy is aligned with humans").
Regardless, sounds like your definition is pretty similar to: "An agent is intent aligned if its behavioral objective is such that an arbitrarily powerful and competent agent pursuing this objective to arbitrary extremes wouldn't act in ways that humans judge bad"? If you see it as importantly different from this, I'd be curious to hear how.
Aren't they now defined in terms of each other?
"Intent alignment: An agent is intent aligned if its behavioral objective is outer aligned.
Outer alignment: An objective function r is outer aligned if all models that perform optimally on r in the limit of perfect training and infinite data are intent aligned."
Thanks for writing this up. Quick question re: "Intent alignment: An agent is intent aligned if its behavioral objective is aligned with humans." What does it mean for an objective to be aligned with humans, on your view? You define what it is for an agent to be aligned with humans, e.g.: "An agent is aligned (with humans) if it doesn't take actions that we would judge to be bad/problematic/dangerous/catastrophic." But you don't say explicitly what it is for an objective to be aligned: I'm curious if you have a preferred formulation.
Is it something like: "the behavioral objective is such that, when the agent does 'well' on this objective, the agent doesn't act in a way we would view as bad/problematic/dangerous/catastrophic"? If so, it seems like a lot might depend on exactly how "well" the agent does, and what opportunities it has in a given context. That is, an "aligned" agent might not stay aligned if it becomes more powerful but continues optimizing for the same objective (for example, a weak robot optimizing for beating me at chess might be "aligned" because it only focuses on making good chess moves, but a stronger one might not be, because it figures out how to drug my tea). Is that an implication you'd endorse?
Or is the thought something like: "the behavioral objective is such that, no matter how powerfully the agent optimizes for it, and no matter its opportunities for action, it doesn't take actions we would view as bad/problematic/dangerous/catastrophic"? My sense is that something like this is often the idea people have in mind, especially in the context of anticipating things like intelligence explosions. If this is what you have in mind, though, maybe worth saying so explicitly, since intent alignment in this sense seems like a different constraint than intent alignment in the sense of e.g. "the agent's pursuit of its behavioral objective does not in fact give rise to bad actions, given the abilities/contexts/constraints that will in fact be relevant to its behavior."
Interesting; I hadn't really considered that angle. Seems like this could also apply to other mental phenomena that might seem self-recommending (pleasure? rationality?), but which plausibly have other, more generally adaptive functions as well, so I would continue to wonder about other functions regardless.
I meant mental states in something more like the #1 sense -- and so, I think, does Frankish.
My sense is that the possibility of dynamics of this kind would be on people's radar in the philosophy community, at least.
Thanks :). I do think clinging often functions as an unnoticed lens on the world; though noticing it, in my experience, is also quite distinct from it "releasing." I also would've thought that depression can be an unnoticed (or at least, unquestioned) lens as well: e.g., a depressed person who is convinced that everything in the world is bad, that they'll never feel better again, etc.
Glad to hear you found it useful.
Thanks :) Re blog name: it isn't. "Hands" comes from a Martin Buber quote, and "Cities" from a phrase I believe I heard from A.J. Julius. I chose them partly as a personal reminder about the blog's aims.
That's the one :)