Jessica Taylor. CS undergrad and Master's at Stanford; former research fellow at MIRI.

I work on decision theory, social epistemology, strategy, naturalized agency, mathematical foundations, decentralized networking systems and applications, theory of mind, and functional programming languages.





All you need is to construct an appropriate probability space and use basic probability theory instead of inventing clever reasons why it doesn’t apply in this particular case.

I don't see how to do that, but maybe your plan is to get to that at some point.

Am I missing something? How is it at all controversial?

It's not; it's just a modification of the usual halfer argument that "you don't learn anything upon waking up".

  • Halfers have to condition on there being at least one observer in the possible world. If the coin can come up 0, 1, or 2 at 1/3 each, and Sleeping Beauty wakes up that number of times, halfers still think the 0 outcome is 0% likely upon waking up.
  • Halfers also have to construct the reference class carefully. If there are many events of people with amnesia waking up once or twice, and SSA's reference class consists of the set of awakenings from these, then SSA and SIA will agree on a 1/3 probability. This is because, in a large population, about 1/3 of awakenings are in worlds where the coin came up such that there would be one awakening.
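
The two bullets above can be checked with exact arithmetic. This is only a sketch: the function names are mine, and I'm assuming the halfer update just drops the zero-awakening world and renormalizes, while the SIA update weights each world by its number of awakenings.

```python
from fractions import Fraction

def ssa_posterior(prior):
    """Halfer/SSA-style update (assumed form): worlds with no awakening
    get probability 0, and the rest are renormalized."""
    total = sum(p for n, p in prior.items() if n > 0)
    return {n: (p / total if n > 0 else Fraction(0)) for n, p in prior.items()}

def sia_posterior(prior):
    """Thirder/SIA-style update: weight each world by its number of awakenings."""
    weighted = {n: p * n for n, p in prior.items() if n > 0}
    total = sum(weighted.values())
    return {n: w / total for n, w in weighted.items()}

def awakening_fraction(prior, n_awakenings):
    """Long-run fraction of all awakenings, pooled over many independent
    experiments, that occur in worlds with exactly n_awakenings."""
    expected_total = sum(p * n for n, p in prior.items())
    return prior[n_awakenings] * n_awakenings / expected_total

# First bullet: the "coin" gives 0, 1, or 2 awakenings at 1/3 each.
three_way = {0: Fraction(1, 3), 1: Fraction(1, 3), 2: Fraction(1, 3)}
halfer = ssa_posterior(three_way)
thirder = sia_posterior(three_way)

# Second bullet: the standard problem, with awakenings pooled across many runs.
standard = {1: Fraction(1, 2), 2: Fraction(1, 2)}
pooled = awakening_fraction(standard, 1)

print(halfer[0], halfer[1], halfer[2])
print(thirder[1], thirder[2])
print(pooled)
```

With the pooled reference class, the fraction of awakenings in one-awakening worlds is the same quantity the SIA weighting computes, which is why the two agree there.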

I don't have a better solution right now, but one problem to note is that this agent will strongly bet that the button will be independent of the human pressing the button. So it could lose money to a different agent that thinks these are correlated, as they are.
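
To make the money-losing concern concrete, here is a toy calculation. All the numbers and names are hypothetical: I'm assuming a made-up joint distribution in which the human pressing and the button ending up pressed are strongly correlated, and an agent that prices a bet using the product of the marginals instead.

```python
from fractions import Fraction

# Hypothetical true joint distribution over (human_presses, button_pressed).
true_joint = {
    (1, 1): Fraction(9, 20),
    (1, 0): Fraction(1, 20),
    (0, 1): Fraction(1, 20),
    (0, 0): Fraction(9, 20),
}

def marginal(joint, index, value):
    """Marginal probability that coordinate `index` equals `value`."""
    return sum(p for outcome, p in joint.items() if outcome[index] == value)

def independent_version(joint):
    """The joint an agent would believe if it treated the two variables as independent."""
    return {
        (h, b): marginal(joint, 0, h) * marginal(joint, 1, b)
        for (h, b) in joint
    }

def prob_match(joint):
    """Probability that the button state matches what the human did."""
    return sum(p for (h, b), p in joint.items() if h == b)

# The independence-believing agent sells a contract paying 1 if the two match,
# at the price it thinks is fair.
price = prob_match(independent_version(true_joint))
true_value = prob_match(true_joint)
expected_loss = true_value - price

print(price, true_value, expected_loss)
```

An agent that models the correlation would happily buy this contract at the independence price, and the seller loses the difference in expectation on each such bet.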

Nice job with the bound! I've heard a number of people in my social sphere say very positive things about DACs, so this is mainly my response to them.

You mentioned wanting to get the game theory of love correct. Understanding a game involves understanding the situations and motives of the involved agents. So getting the game theory of love correct with respect to some agent implies understanding that agent's situation.

This seems more like "imagining being nice to Hitler, as one could be nice to anyone" than "imagining what Hitler was in fact like and why his decisions seemed to him like the thing to do". Computing the game theoretically right strategy involves understanding different agents' situations, the kind of empathy that couldn't be confused with being a doormat, sometimes called "cognitive empathy".

I respect Sarah Constantin's attempt to understand Hitler's psychological situation.

If you define "human values" as "what humans would say about their values across situations", then yes, predicting "human values" is a reasonable training objective. Those just aren't really what we "want" as agents, and agentic humans would have motives not to let the future be controlled by an AI optimizing for human approval.

That's also not how I defined human values, which is based on the assumption that the human brain contains one or more expected utility maximizers. It's possible that the objectives of these maximizers are affected by socialization, but they'll be less affected by socialization than verbal statements about values, because they're harder to fake and so less affected by preference falsification.

Children learn some sense of what they're supposed to say about values, but have some pre-built sense of "what to do / aim for" that's affected by evopsych and so on. It seems like there's a huge semantic problem with talking about "values" in a way that's ambiguous between "in-built evopsych-ish motives" and "things learned from culture about what to endorse", but Yudkowsky's writing on complexity of value is clearly talking about stuff affected by evopsych. I think it was a semantic error for the discourse to use the term "values" rather than "preferences".

In the section on subversion I made the case that terminal values make much more difference in subversive behavior than compliant behavior.

It seems like, to get at the values of approximate utility maximizers located in the brain, you would need something like Goal Inference as Inverse Planning rather than just predicting behavior.
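
As a rough illustration of what inverse planning means here (not the method from that paper, just a toy in the same spirit): assume an agent on a number line that noisily moves toward a hidden goal, and infer a posterior over goals from its observed moves. The line world, the Boltzmann-rational action model, the `beta` parameter, and the function names are all my assumptions.

```python
import math

def action_likelihood(position, action, goal, beta=2.0):
    """P(action | position, goal) for a noisily rational agent on a line.
    Assumed model: softmax over the negative distance to the goal after moving."""
    actions = (-1, +1)
    utilities = {a: -abs((position + a) - goal) for a in actions}
    normalizer = sum(math.exp(beta * u) for u in utilities.values())
    return math.exp(beta * utilities[action]) / normalizer

def infer_goal(start, actions, goals, beta=2.0):
    """Posterior over candidate goals given observed actions, from a uniform prior."""
    posterior = {g: 1.0 / len(goals) for g in goals}
    position = start
    for action in actions:
        # Bayes update: multiply by how likely this action is under each goal.
        for g in goals:
            posterior[g] *= action_likelihood(position, action, g, beta)
        total = sum(posterior.values())
        posterior = {g: p / total for g, p in posterior.items()}
        position += action
    return posterior

# The agent mostly moves right, with one "mistake" to the left.
posterior = infer_goal(start=0, actions=[+1, +1, -1, +1], goals=(-3, 3))
print(posterior)
```

The output is a distribution over goals (a mind-state-like variable) rather than a prediction of the next action, which is the distinction being drawn above.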

How would you design a task that incentivizes a system to output its true estimates of human values? We don't have ground truth for human values, because they're mind states, not behaviors.

It seems easier to create incentives for things like "wash dishes without breaking them", where you can just tell whether it was done.

I'm mainly trying to communicate with people familiar with AI alignment discourse. If other people can still understand it, that's useful, but not really the main intention.

I do think this part is speculative. The degree of "inner alignment" to the training objective depends on the details.

Partly the degree to which "try to model the world well" leads to real-world agency depends on the details of this objective. For example, doing a scientific experiment would result in understanding the world better, and if there's RL training towards "better understand the world", that could propagate to intending to carry out experiments that increase understanding of the world, which is a real-world objective.

If, instead, the AI's dataset is fixed and it's trying to find a good compression of it, that's less directly a real-world objective. However, depending on the training objective, the AI might get a reward from thinking certain thoughts that would result in discovering something about how to compress the dataset better. This would be "consequentialism" at least within a limited, computational domain.

An overall reason for thinking it's at least uncertain whether AIs that model the world would care about it is that an AI that did care about the world would, as an instrumental goal, compliantly solve its training problems and some test problems (before it has the capacity for a treacherous turn). So, good short-term performance doesn't by itself say much about goal-directed behavior in generalizations.

The distribution of goals with respect to generalization, therefore, depends on things like which mind-designs are easier to find by the search/optimization algorithm. It seems pretty uncertain to me whether agents with general goals might be "simpler" than agents with task-specific goals (it probably depends on the task), and therefore easier to find while getting ~equivalent performance. I do think that gradient descent is relatively more likely to find inner-aligned agents (with task-specific goals), because the internal parts are gradient descended towards task performance; it's not just a black box search.

Yudkowsky mentions evolution as an argument that inner alignment can't be assumed. I think there are quite a lot of disanalogies between evolution and ML, but the general point that some training processes result in agents whose goals aren't aligned with the training objective holds. I think, in particular, supervised learning systems like LLMs are unlikely to exhibit this, as explained in the section on myopic agents.
