Currently doing the SERI MATS 4.0 Training Program (multipolar stream). Former summer research fellow at Center on Long-Term Risk, former intern at Center for Reducing Suffering, Wild Animal Initiative, Animal Charity Evaluators. Former co-head of Haverford Effective Altruism.

Research interests: • AI alignment • animal advocacy from a longtermist perspective • acausal interactions • artificial sentience • s-risks • updatelessness

Feel free to contact me for whatever reason! You can set up a meeting with me here.

Wiki Contributions


Answer by JamesFaville30

How to deal with crucial considerations and deliberation ladders (link goes to a transcript + audio).

I like this post a lot! Three other reasons came to mind, which might be technically encompassed by some of the current ones but seemed to mostly fall outside the post's framing of them at least.

Some (non-agentic) repeated selections won't terminate until they find a bad thing
In a world with many AI deployments, an overwhelming majority of deployed agents might be unable to mount a takeover, but the generating process for new deployed agents might not halt until a rare candidate that can mount a takeover is found. More specifically, consider a world where AI progress slows (either due to governance interventions or a new AI winter), but people continue conducting training runs at a fairly constant level of sophistication. Suppose that for these state-of-the-art training runs that (i) there is only a negligible chance of finding a non-gradient-hacked AI that can mount a takeover or enable a pivotal act, but (ii) there is a tiny but nonnegligible chance of finding a gradient hacker that can mount a takeover.[1] Then eventually we will stumble across an unlikely training run that produces a gradient hacker.

This problem mostly seems like a special case of You're being optimised against, though here you are not optimised against by an agent, but rather by the nature of the problem. Alternatively, this example could be lumped into The space you’re selecting over happens to mostly contain bad things if we either (i) reframe the space under consideration from "deployed AIs" to "AIs capable of mounting a takeover" (h/t Thomas Kehrenberg), or (ii) reframe The space you’re selecting over happens to mostly contain bad things to The space you’re selecting over happens to mostly contain bad things, relative to the number of selections made.  But I think the fact that a selection may not terminate until a bad thing has been found is an important thing to pay attention to when it comes up, and weakly think it'd be useful to have a separate conceptual handle for it.

Aiming your efforts at worst-case scenarios
As long as some failure states are worse than others, optimising for the satisfaction of a binary success criterion won't generally be sufficient to maximise your marginal impact. Instead, you should target worlds based in part on how bad failure within them would be, along with the change in success probability for a marginal contribution. For example, maybe many low P(doom) worlds are such because intent-aligning AI turns out to be pretty straightforward in them. But easy intent-alignment may imply higher misuse risk, such that if misuse risk is more concerning than accident risk then contributing towards solving alignment problems in ways robust to misuse may remain very high impact in easy-intent-alignment worlds.[2]

One alternative way to state this consideration is that in most domains, there are actually multiple overlapping success criteria. Sometimes the more easily satisfied ones will be much higher-priority to target—even if your marginal contributions result in smaller changes to the odds of satisfying them—because they are more important.

This consideration is the main reason I prioritise worst-case AI outcomes (i.e. s-risks) over ordinary x-risk from AI.

Some bad things might be really bad
In a similar vein, for The space you’re selecting over happens to mostly contain bad things, it's not the raw probability of selecting a bad thing that matters, but the product of that with the expected harm of a bad thing. Since some bad things are Really Very Terrible, sometimes it will make sense to use worst-case assumptions even when bad things are quite rare, as long as the risk of finding one isn't Pascalian. I think the EU of an insecure selection is at particular risk of being awful whenever the left tail of the utility distribution of things you're selecting for is much thicker than the right.

  1. ^

    This is plausible to me because gradient-hacking could yield a "sharp left turn", taking us very OOD for the sort of models runs had previously been producing. Some other sharp left turn candidates should work just as well in this example.

  2. ^

    This is an interesting example, because in low P(doom) worlds of this sort marginal efforts to advance intent-alignment seem more likely to be harmful. If that were the case, alignment researchers would want to prioritise developing techniques that differentially help align AI to widely endorsed values rather than to the intent of an arbitrary deployer. Efforts to more directly intervene to prevent misuse would also look pretty valuable.

    But because of effects like these, it's not obvious that you would want to prioritise low P(doom) worlds even if you were convinced that failure within them was worse than in high P(doom) worlds, since advancing-intent-alignment interventions might be helpful in most other worlds where it might be harder for malevolent users to make use of them. (And it's definitely not apparent to me in reality that failure in low P(doom) worlds is worse than in high P(doom) worlds for this reason; I just thought this would make for a good example!)

Another way interpretability work can be harmful: some means by which advanced AIs could do harm require them to be credible. For example, in unboxing scenarios where a human has something an AI wants (like access to the internet), the AI might be much more persuasive if the gatekeeper can verify the AI's statements using interpretability tools. Otherwise, the gatekeeper might be inclined to dismiss anything the AI says as plausibly fabricated. (And interpretability tools provided by the AI might be more suspect than those developed beforehand.)

It's unclear to me whether interpretability tools have much of a chance of becoming good enough to detect deception in highly capable AIs. And there are promising uses of low-capability-only interpretability -- like detecting early gradient hacking attempts, or designing an aligned low-capability AI that we are confident will scale well. But to the extent that detecting deception in advanced AIs is one of the main upsides of interpretability work people have in mind (or if people do think that interpretability tools are likely to scale to highly capable agents by default), the downsides of those systems being credible will be important to consider as well.

There is another very important component of dying with dignity not captured by the probability of success: the badness of our failure state. While any alignment failure would destroy much of what we care about, some alignment failures would be much more horrible than others. Probably the more pessimistic we are about winning, the more we should focus on losing less absolutely (e.g. by researching priorities in worst-case AI safety).

I feel conflicted about this post. Its central point as I'm understanding it is that much evidence we commonly encounter in varied domains is only evidence about the abundance of extremal values in some distribution of interest, and whether/how we should update our beliefs about the non-extremal parts of the distribution is very much dependent on our prior beliefs or gears-level understanding of the domain. I think this is a very important idea, and this post explains it well.

Also, felt inspired to search out other explanations of the moments of a distribution - this one looks pretty good to me so far.

On the other hand, the men's rights discussion felt out of place to me, and unnecessarily so since I think other examples would be able to work just as well. Might be misjudging how controversial various points you bring up are, but as of now I'd rather see topics of this level of potential political heat discussed in personal blogposts or on other platforms, so long as they're mostly unrelated to central questions of interest to rationalists / EAs.

This is super interesting!

Quick typo note (unless I'm really misreading something): in your setups, you refer to coins that are biased towards tails, but in your analyses, you talk about the coins as though they are biased towards heads.

One is the “cold pool”, in which each coin comes up 1 (i.e. heads) with probability 0.1 and 0 with probability 0.9. The other is the “hot pool”, in which each coin comes up 1 with probability 0.2

 random coins with heads-probability 0.2

We started with only  tails

full compression would require roughly  tails, and we only have about 

As far as I'm aware, there was not (in recent decades at least) any controversy that word/punctuation choice was associative. We even have famous psycholinguistics experiments telling us that thinking of the word "goose" makes us more likely to think of the word "moose" as well as "duck" (linguistic priming is the one type of priming that has held up to the replication crisis as far as I know). Whenever linguists might have bothered to make computational models, I think those would have failed to produce human-like speech because their associative models were not powerful enough.

This comment does not deserve to be downvoted; I think it's basically correct. GPT-2 is super-interesting as something that pushes the bounds of ML, but is not replicating what goes on under-the-hood with human language production, as Marcus and Pinker were getting at. Writing styles don't seem to reveal anything deep about cognition to me; it's a question of word/punctuation choice, length of sentences, and other quirks that people probably learn associatively as well.

Why should we say that someone has "information empathy" instead of saying they possess a "theory of mind"?

Possible reasons: "theory of mind" is an unwieldy term, it might be useful to distinguish in fewer words a theory of mind with respect to beliefs from a theory of mind with respect to preferences, you want to emphasise a connection between empathy and information empathy.

I think if there's established terminology for something we're interesting in discussing, there should be a pretty compelling reason why it doesn't suffice for us.

It felt weird to me to describe shorter timeline projections as "optimistic" and longer ones as "pessimistic"- AI research taking place over a longer period is going to be more likely to give us friendly AI, right?

Load More