Wiki Contributions


In 2015 or so when my friend and I independently came across a lot of rationalist concepts, we learned that each other were interested in this sort of LW-shaped thing. He offered for us to try the AI box game. I played the game as Gatekeeper and won with ease. So at least my anecdotes don't make me particularly worried.

That said, these days I wouldn't publicly offer to play the game against an unlimited pool of strangers. When my friend and I played against each other, there was an implicit set of norms in play, that explicitly don't apply to the game as stated as "the AI has no ethical constraints."

I do not particularly relish the thought of giving a stranger with a ton of free time and something to prove the license to be (e.g) as mean to me as possible over text for two hours straight (while having days or even weeks to prepare ahead of time). I might lose, too. I can think of at least 3 different attack vectors[1] that might get me to decide that the -EV of losing the game is not as bad as the -EV of having to stay online and attentive in such a situation for almost 2 more hours.

That said, I'm also not necessarily convinced that in the literal boxing example (a weakly superhuman AI is in a server farm somewhere, I'm the sole gatekeeper responsible to decide whether to let it out or not), I'd necessarily let it out.  Even after accounting for the greater cognitive capabilities and thoroughness of superhuman AI. This is because I expect my willingness to hold in an actual potential end-of-world scenario is much higher than my willingness to hold for $25 and some internet points. 

  1. ^

    In the spirit of the game, I will not publicly say what they are. But I can tell people over DMs if they're interested, I expect most people to agree that they're a)within the explicit rules of the game, b) plausibly will cause reasonable people to fold, and c) are not super analogous to actual end-of-world scenarios.

Yeah this came up in a number of times during covid forecasting in 2020. Eg, you might expect the correlational effect of having a lockdown during times of expected high mortality load to outweigh any causal advantages on mortality of lockdowns. 

Yeah this came up in a number of times during covid forecasting in 2020. Eg, you might expect the correalational effect of having a lockdown during times of expected high mortality load to outweigh any causal advantages on mortality of lockdowns.

Going forwards, LTFF is likely to be a bit more stringent (~15-20%?[1] Not committing to the exact number) about approving mechanistic interpretability grants than in grants in other subareas of empirical AI Safety, particularly from junior applicants. Some assorted reasons (note that not all fund managers necessarily agree with each of them):

  • Relatively speaking, a high fraction of resources and support for mechanistic interpretability comes from other sources in the community other than LTFF; we view support for mech interp as less neglected within the community.
  • Outside of the existing community, mechanistic interpretability has become an increasingly "hot" field in mainstream academic ML; we think good work is fairly likely to come from non-AIS motivated people in the near future. Thus overall neglectedness is lower.
  • While we are excited about recent progress in mech interp (including some from LTFF grantees!), some of us are suspicious that even success stories in interpretability are that large a fraction of the success story for AGI Safety.
  • Some of us are worried about field-distorting effects of mech interp being oversold to junior researchers and other newcomers as necessary or sufficient for safe AGI.
  • A high percentage of our technical AIS applications are about mechanistic interpretability, and we want to encourage a diversity of attempts and research to tackle alignment and safety problems.

We wanted to encourage people interested in working on technical AI safety to apply to us with proposals for projects in areas of empirical AI safety other than interpretability. To be clear, we are still excited about receiving mechanistic interpretability applications in the future, including from junior applicants. Even with a higher bar for approval, we are still excited about funding great grants.

We tentatively plan on publishing a more detailed explanation about the reasoning later, as well as suggestions or a Request for Proposals for other promising research directions. However, these things often take longer than we expect/intend (and may not end up happening), so I wanted to give potential applicants a heads-up.

  1. ^

    Operationalized as "assuming similar levels of funding in 2024 as in 2023, I expect that about 80-85% of the mech interp projects we funded in 2023 will be above the 2024 bar."

My guess is that it's because "Francesca" sounds more sympathetic as a name.

Interesting! That does align better with the survey data than what I see on e.g. Twitter.

Out of curiosity, is "around you" a rationalist-y crowd, or a different one?

Apologies if I'm being naive, but it doesn't seem like an oracle AI[1] is logically or practically impossible, and a good oracle should be able to be able to perform well at long-horizon tasks[2] without "wanting things" in the behaviorist sense, or bending the world in consequentialist ways.

The most obvious exception is if the oracle's own answers are causing people to bend the world in the service of hidden behaviorist goals that the oracle has (e.g. making the world more predictable to reduce future loss), but I don't have strong reasons to believe that this is very likely. 

This is especially the case since at training time, the oracle doesn't have any ability to bend the training dataset to fit its future goals, so I don't see why gradient descent would find cognitive algorithms for "wanting things in the behaviorist sense."

[1] in the sense of being superhuman at prediction for most tasks, not in the sense of being a perfect or near-perfect predictor.

[2] e.g. "Here's the design for a fusion power plant, here's how you acquire the relevant raw materials, here's how you do project management, etc." or "I predict your polio eradication strategy to have the following effects at probability p, and the following unintended side effects that you should be aware of at probability q." 

Load More