
Daniel Kokotajlo's Comments

The Main Sources of AI Risk?

Ha! I woke up this morning to see my own name as author; that wasn't what I had in mind, but it sure does work to motivate me to walk the talk! Thanks!

Is backwards causation necessarily absurd?

Perhaps I was too hasty. What I had in mind was the effective-strategy strategy: if you define causation by reference to what's an effective strategy for achieving what, then you are assuming a certain decision theory in order to define causation. So, e.g., one-boxing will cause you to get a million if EDT is true, but not if CDT is true.

If instead you have another way to define causation, then I don't know. But for some ways, you are just fighting the hypothetical--OK, so maybe in the original Newcomb's Problem as stated, backwards causation saves the day and makes CDT and EDT agree on what to do. But then what about a modified version where the backwards causation is not present?

Predictors exist: CDT going bonkers... forever

Dagon, I sympathize. CDT seems bonkers to me for the reasons you have pointed out. My guess is that academic philosophy has many people who support CDT for three main reasons, listed in increasing order of importance:

(1) Even within academic philosophy, many people aren't super familiar with these arguments. They read about CDT vs. EDT, they read about a few puzzle cases, and they form an opinion and then move on--after all, there are lots of topics to specialize in, even in decision theory, and so if this debate doesn't grip you, you might not dig too deeply.

(2) Lots of people have pretty strong intuitions that CDT vindicates. E.g. iirc Newcomb's Problem was originally invented to prove that EDT was silly (because, silly EDT, it would one-box, which is obviously stupid!). My introductory decision theory textbook was an attempt to build for CDT an elegant mathematical foundation to rival the Jeffrey-Bolker axioms for EDT. And why do this? It said, basically, "EDT gives the wrong answer in Newcomb's Problem and other problems, so we need to find a way to make some version of CDT mathematically respectable."

(3) EDT has lots of problems too. Even hardcore LWer fans of EDT like Caspar Oesterheld admit as much, and even waver back and forth between EDT and CDT for this reason. And the various alternatives to EDT and CDT that have been proposed thus far also seem to have problems.

Predictors exist: CDT going bonkers... forever

To summarize my confusion, does CDT require that the agent unconditionally believe in perfect free will independent of history (and, ironically, with no causality for the exercise of will)? If so, that should be the main topic of dispute - the frequency of actual cases where it makes bad predictions, not that it makes bad decisions in ludicrously-unlikely-and-perhaps-impossible situations.

Sorta, yes. CDT requires that you choose actions not by thinking "conditional on my doing A, what happens?" but rather by some other method (there are different variants), such as "For each causal graph that I think could represent the world, what happens when I intervene (in Pearl's sense) on the node that is my action, to set it to A?" or "Holding fixed the probability of all variables not causally downstream of my action, what happens if I do A?"

In the first version, notice that you are choosing actions by imagining a Pearl-style intervention into the world--but this is not something that actually happens; the world doesn't actually contain such interventions.

In the second version, well, notice that you are choosing actions by imagining possible scenarios that aren't actually possible--or at least, you are assigning the wrong probabilities to them. ("holding fixed the probability of all variables not causally downstream of my action...")

So one way to interpret CDT is that it believes in crazy stuff like hardcore incompatibilist free will. But the more charitable way to interpret it is that it doesn't believe in that stuff, it just acts as if it does, because it thinks that's the rational way to act. (And they have plenty of arguments for why CDT is the rational way to act, e.g. the intuition pump "If the box is already either full or empty and you can't change that no matter what you do, then no matter what you do you'll get more money by two-boxing, so...")
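
To make the contrast concrete, here is a minimal sketch (my own toy model with assumed numbers, not anything from the CDT literature) of how conditioning on the action (EDT) and intervening on it (CDT) come apart in Newcomb's Problem:

```python
# A toy Newcomb model (assumed setup): a hidden disposition D ("one-box"/"two-box"),
# a predictor that reads D with accuracy ACC and fills the opaque box iff it
# predicts one-boxing, and an agent whose actual action equals its disposition.

ACC = 0.9                                     # assumed predictor accuracy
MILLION, THOUSAND = 1_000_000, 1_000
PRIOR_D = {"one-box": 0.5, "two-box": 0.5}    # assumed prior over dispositions

def p_box_full(disposition):
    # The opaque box is full iff the predictor predicts one-boxing.
    return ACC if disposition == "one-box" else 1 - ACC

def payout(action, box_full):
    return (MILLION if box_full else 0) + (THOUSAND if action == "two-box" else 0)

def edt_value(action):
    # EDT conditions on the action. Since the action equals the disposition in
    # this toy model, learning the action is strong evidence about the box.
    p_full = p_box_full(action)
    return p_full * payout(action, True) + (1 - p_full) * payout(action, False)

def cdt_value(action):
    # CDT intervenes: it holds the distribution over D (and hence the box
    # contents, which are not downstream of the action) fixed at the prior.
    value = 0.0
    for d, p_d in PRIOR_D.items():
        p_full = p_box_full(d)
        value += p_d * (p_full * payout(action, True)
                        + (1 - p_full) * payout(action, False))
    return value

if __name__ == "__main__":
    for a in ("one-box", "two-box"):
        print(f"{a}: EDT {edt_value(a):>9,.0f}   CDT {cdt_value(a):>9,.0f}")
    # EDT prefers one-boxing; CDT prefers two-boxing by exactly THOUSAND.
```

With these numbers EDT prefers one-boxing (roughly $900,000 vs. $101,000 in expectation), while CDT prefers two-boxing by exactly the thousand dollars in the transparent box, whatever prior over dispositions you plug in.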

Is backwards causation necessarily absurd?

I don't think backwards causation is absurd, more or less for the reasons you sketch. Another minor reason: Some philosophers like "effective strategy" accounts of causation, according to which we define causation via its usefulness for agents trying to achieve goals. On these accounts, backwards causation is pretty trivial--just suppose you live in a deterministic universe and your goal is to "make the state of the universe at the Big Bang such that I eat breakfast tomorrow." Eating breakfast tomorrow is then an effective strategy for achieving that goal, so on this account your action counts as causing a fact about the past. The philosopher Gary Drescher argues something similar in Good and Real if I recall correctly.

That said, I don't think we are really explaining or de-confusing anything if we appeal to backwards causation to understand Newcomb's Problem or argue for a particular solution to it.

The Main Sources of AI Risk?

Thank you for making this list. I think it is important enough to be worth continually updating and refining; if you don't do it, then I will do it myself someday. Ideally there'd be a whole webpage or something, with the list refined so as to be disjunctive, and each element of the list catchily named, concisely explained, and accompanied by a memorable and plausible example. (As well as lots of links to literature.)

I think the commitment races problem is mostly but not entirely covered by #12 and #19, and at any rate might be worth including since you are OK with overlap.

Also, here's a good anecdote to link to for the "coding errors" section: https://openai.com/blog/fine-tuning-gpt-2/

Predictors exist: CDT going bonkers... forever

Well said.

I had a similar idea a while ago and am working it up into a paper ("CDT Agents are Exploitable"). Caspar Oesterheld and Vince Conitzer are also doing something like this. And then there is Ahmed's Betting on the Past case.

In their version, the Predictor offers bets to the agent, at least one of which the agent will accept (for the reasons you outline), and thus it gets money-pumped. In my version, there is no Predictor, but instead there are several very similar CDT agents, and a clever human bookie can extract money from them by exploiting their inability to coordinate.
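
To give a flavor of the mechanism, here is a toy simulation; the numbers, the payoff rule, and the framing are assumptions of mine for illustration, not the details of their actual construction:

```python
# Toy exploit against a CDT agent (illustrative numbers assumed by me): the
# Predictor offers two priced bets and pays out on a bet only if it predicted
# the agent would NOT take it. CDT holds its beliefs about the already-made
# prediction fixed, so at least one bet always looks profitable to it.
import random

PRICE, PRIZE = 1.0, 3.0   # assumed: each bet costs $1 and pays $3 if it wins

def cdt_picks_a_bet(self_credence_bet1):
    # self_credence_bet1: the agent's own credence that it will take bet 1.
    # CDT treats the prediction as causally fixed, so the causal EV of bet i
    # is PRIZE * P(predictor predicted the other bet) - PRICE.
    evs = {1: PRIZE * (1 - self_credence_bet1) - PRICE,
           2: PRIZE * self_credence_bet1 - PRICE}
    best = max(evs, key=evs.get)
    # Whatever the credence, max(q, 1-q) >= 0.5, so the best EV is at least
    # 3 * 0.5 - 1 = 0.5 > 0 and the agent always accepts some bet.
    return best if evs[best] > 0 else None

def play_round(predictor_accuracy=0.99):
    choice = cdt_picks_a_bet(self_credence_bet1=0.5)
    predicted = choice if random.random() < predictor_accuracy else 3 - choice
    won = (choice != predicted)   # the bet pays out only if it was NOT predicted
    return (PRIZE if won else 0.0) - PRICE

if __name__ == "__main__":
    rounds = 10_000
    avg = sum(play_round() for _ in range(rounds)) / rounds
    print(f"Average payoff per round: {avg:.2f}")   # roughly -0.97
```

The point of the sketch is just that, because CDT holds its credence about the already-made prediction fixed, at least one offer always looks positive-EV to it, so an accurate predictor can collect roughly the full price every round.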

Long story short, I would bet that an actual AGI which was otherwise smarter than me but which doggedly persisted in doing its best to approximate CDT would fail spectacularly one way or another, "hacked" by some clever bookie somewhere (possibly in its hypothesis space only!). Unfortunately, arguably the same is true for all decision theories I've seen so far, but for different reasons...

What determines the balance between intelligence signaling and virtue signaling?

I don't think you are crazy; I worry about this too. I think I should go read a book about the Cultural Revolution to learn more about how it happened--it can't have been just Mao's doing, because e.g. Barack Obama couldn't make the same thing happen in the USA right now (or even in a deep-blue part of the USA!) no matter how hard he tried. Some conditions must have been different.*

*Off the top of my head, some factors that seem relevant: Material deprivation. Overton window so narrow and extreme that it doesn't overlap with everyday reality. Lack of an outgroup that is close enough to blame for everything yet also powerful enough not to be crushed swiftly.

I don't think it could happen in the USA now, but I think maybe in 20 years it could if trends continue and/or get worse.

Then there are the milder forms, which don't involve actually killing anybody but just involve getting people fired, harassed, shamed, discriminated against, etc. That seems much more likely to me--it already happens in very small, very ideologically extreme subcultures/communities--but also much less scary. (Then again, from the perspective of reducing AI risk, this scenario would be almost as bad, maybe? If the AI safety community undergoes a "soft cultural revolution" like this, it might seriously undermine our effectiveness.)

How to Identify an Immoral Maze


What’s better than having skin in the game? Having soul in the game. Caring deeply about the outcome for reasons other than money, or your own liability, or being potentially scapegoated. Caring for existential reasons, not commercial ones.
Soul in the game is incompatible with mazes. Mazes will eliminate anyone with soul in the game. Therefore, if the people you work for have soul in the game, you’re safe. If you have it too, you’ll be a lot happier, and likely doing something worthwhile. Things will be much better on most fronts. 

In general I worry that the advice you are giving is phrased too confidently. This quote about soul in particular stood out to me. I have a few friends who have worked for big hierarchical non-profits, and their experience seems to contradict it: plenty of people who do seem pretty passionate about 'the cause', and yet lots of dysfunction, bureaucracy, office politics, metric-gaming, etc. Maybe these problems didn't rise to the level of a true moral maze, or maybe the people weren't actually passionate but really were just posturing. But maybe not, and at any rate how do you tell at a glance?
