Thomas Kwa

Member of technical staff at METR.

Previously: MIRI → interp with Adrià and Jason → METR.

I have signed no contracts or agreements whose existence I cannot mention.

Sequences

  • Catastrophic Regressional Goodhart

Posts (sorted by new)

  • Thomas Kwa's Shortform (Ω, 6y)

Comments (sorted by newest)
Thomas Kwa's Shortform
Thomas Kwa · 5d

Does this meaningfully reduce the probability that you jump out of the way of a car or get screened for heart disease? The important thing isn't whether you have an emotional fear response, but how the pattern of avoidance behavior generalizes.

Thomas Kwa's Shortform
Thomas Kwa · 5d

Much of my hope is that by the time we reach the superintelligence level where we need to instill reflectively endorsed values to be optimized in a very hands-off way, rather than just constitutions, behaviors, or goals, we'll have figured something else out. I'm not claiming the optimizer advantage alone is enough to be decisive in saving the world.

To the point about tighter feedback loops, I see the main benefit as coming in conjunction with adapting to new problems. Suppose we notice AIs taking some bad but non-world-ending action, like murdering people; then we can add a big dataset of situations in which AIs shouldn't murder people to the training data. If we were instead breeding animals, we would have to wait dozens of generations for mutations that reduce the murder rate to appear and reach fixation. Since those mutations affect behavior through brain architecture, they would have a higher chance of deleterious effects. And if we're also selecting for intelligence, they would be competing against mutations that increase intelligence, producing a higher alignment tax. All this means that, in the breeding scenario, we would have fewer chances to detect whether our proxies hold up. (Capabilities researchers have many of these advantages too, but the AGI would be able to automate capabilities training anyway.)
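As a rough illustration of the generation counts involved (my own toy sketch using the textbook deterministic selection recurrence, not something from the original comment; the fitness advantages are made-up parameters):

```python
# Toy illustration: generations for a favored allele to go from rare to near-fixation,
# assuming haploid selection with relative fitness (1 + s) vs 1.
def generations_to_fixation(s, p0=0.01, p_target=0.99):
    p, gens = p0, 0
    while p < p_target:
        p = p * (1 + s) / (1 + s * p)  # standard deterministic selection update
        gens += 1
    return gens

print(generations_to_fixation(s=0.5))   # ~23 generations even with a huge 50% fitness advantage
print(generations_to_fixation(s=0.05))  # ~190 generations with a more modest 5% advantage
```

Selection on a behavioral trait would usually be far weaker than a 50% fitness advantage, so the wait is typically even longer; the analogous gradient update happens within a single training run.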

If we expect problems to get worse at some rate until an accumulation of unsolved alignment issues culminates in disempowerment, it seems to me there is a large band of rates where we can stay ahead of them with AI training but evolution wouldn't be able to.

Thomas Kwa's Shortform
Thomas Kwa · 5d

Noted. I'm somewhat surprised you believe in quantum immortality; is there a particular reason?

Why Corrigibility is Hard and Important (i.e. "Whence the high MIRI confidence in alignment difficulty?")
Thomas Kwa · 5d

EJT's incomplete preferences proposal. But as far as I can make out from the comments, you need to define a decision rule in addition to the utility function of an agent with incomplete preferences, and only some choices of decision rule are compatible with shutdownability.
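A minimal sketch of why a decision rule is needed on top of incomplete preferences (my own toy illustration, not EJT's actual construction; the option names and rules are made up): when some options are incomparable, the preference relation alone doesn't pin down behavior, and two different rules over the same preferences can choose differently.

```python
# Toy partial preference order over options; any pair not listed is incomparable.
options = ["shutdown", "continue", "escalate"]
strictly_preferred = {("continue", "escalate")}  # only one comparison is defined

def is_dominated(x):
    return any((y, x) in strictly_preferred for y in options)

maximal = [x for x in options if not is_dominated(x)]  # ["shutdown", "continue"]

# Decision rule A: pick the first maximal option in a fixed listing order.
rule_a = maximal[0]   # "shutdown"
# Decision rule B: pick the maximal option that appears last.
rule_b = maximal[-1]  # "continue"

print(maximal, rule_a, rule_b)  # same preferences, different behavior under different rules
```

Here rule A happens to pick shutdown and rule B doesn't, which is the flavor of why only some decision rules over incomplete preferences are compatible with shutdownability.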

Omelas Is Perfectly Misread
Thomas Kwa · 5d

When I read it in school, the story frustrated me: I immediately wanted to create Omelas, seeing as it's a thousand times better than our society, so I didn't really get the point of the intended and/or common interpretations.

Thomas Kwa's Shortform
Thomas Kwa · 6d

Gradient descent (when applied to train AIs) allows much more fine-grained optimization than evolution, for these reasons:

  • Evolution by natural selection acts on the genome, which can only crudely affect behavior and only very indirectly affect values, whereas gradient descent acts on the weights, which much more directly affect the AI's behavior and may be able to affect values
  • Evolution can only select among discrete alleles, whereas gradient descent operates over a continuous space
  • Evolution has a minimum feedback loop of one organism generation, whereas RL has a much shorter minimum feedback loop of one episode
  • Evolution can only combine information from different individuals inefficiently through sex, whereas we can combine gradients from many episodes to produce one AI that has learned strategies from all of them
  • We can adapt our alignment RL methods, data, hyperparameters, and objectives as we observe problems in the wild
  • We can do adversarial training against other AIs, but ancestral humans didn't have to contend with animals whose goal was to trick them into not reproducing by any means necessary; the closest was animals that tried to kill us. (Our fear of death is therefore much more robust than our desire to maximize reproductive fitness.)
  • On current models, we can observe the chain of thought (although the amount we can train against it while maintaining faithfulness is limited)
  • We can potentially do interpretability (if that ever works out)

It's unclear to what degree these will solve inner alignment problems or make AI goals more robust to distributional shift than animal goals are, but we're in much better shape than evolution was.
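As a toy sketch of the first few points (entirely my own illustration, with a made-up one-dimensional "behavior" parameter and loss): gradient descent can average noisy per-episode gradients into one fine-grained continuous update, while an evolution-style process can only accept or reject a discrete variant once per "generation".

```python
import random

# Imaginary "misalignment" loss over a single behavior parameter (illustrative only).
def loss(theta):
    return (theta - 2.7) ** 2

# Gradient descent: combine noisy gradients from many episodes into one continuous update.
def gd_step(theta, lr=0.1, episodes=32):
    grads = [2 * (theta - 2.7) + random.gauss(0, 0.5) for _ in range(episodes)]
    return theta - lr * sum(grads) / len(grads)

# Evolution-style step: propose one discrete variant and keep it only if it does better,
# paying one full generation per comparison.
def selection_step(theta, step=1.0):
    candidate = theta + random.choice([-step, step])
    return candidate if loss(candidate) < loss(theta) else theta

theta_gd = theta_ev = 0.0
for _ in range(20):
    theta_gd = gd_step(theta_gd)
    theta_ev = selection_step(theta_ev)

# GD ends up close to 2.7; the coarse discrete process typically gets stuck at 3.0.
print(round(theta_gd, 2), round(theta_ev, 2))
```

After 20 steps the gradient-descent parameter typically lands within a few percent of 2.7, while the unit-step discrete process can do no better than the nearest integer.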

Thomas Kwa's Shortform
Thomas Kwa · 7d (edited)

If you disagree with much of IABIED but are still worried about AI risk, maybe the question to ask is "will the radical flank effect be positive or negative on mainstream AI safety movements?", which seems more useful than "do I on net agree or disagree?" or "will people taking this book at face value do useful or anti-useful things?" Here's what Wikipedia has to say on the sign of a radical flank effect:

It's difficult to tell without hindsight whether the radical flank of a movement will have positive or negative effects.[2] However, following are some factors that have been proposed as making positive effects more likely:

  • Greater differentiation between moderates and radicals in the presence of a weak government.[2][13][14]:411 As Charles Dobson puts it: "To secure their place, the new moderates have to denounce the actions of their extremist counterparts as irresponsible, immoral, and counterproductive. The most astute will quietly encourage 'responsible extremism' at the same time."[15]
  • Existing momentum behind the cause. If change seems likely to happen anyway, then governments are more willing to accept moderate reforms in order to quell radicals.[2]
  • Radicalism during the peak of activism, before concessions are won.[16] After the movement begins to decline, radical factions may damage the image of moderate organizations.[16]
  • Low polarization. If there's high polarization with a strong opposing side, the opposing side can point to the radicals in order to hurt the moderates.[2]

Of course it's still useful to debate which factual points the book gets right, but judging the book's overall value requires modeling other parts of the world.

Why Corrigibility is Hard and Important (i.e. "Whence the high MIRI confidence in alignment difficulty?")
Thomas Kwa · 7d

Yeah, I expect corrigibility to get a lot worse by the 10x-economy level with at least 15% probability, since my uncertainty is very large; I just don't expect it in the median case. The main reason is that we don't yet need to try very hard to get sufficient corrigibility from models. My very rough model: even if the amount of corrigibility training required grows 2x with every time horizon doubling, while the amount of total training required grows only 1.5x per doubling, then 10 more time horizon doublings need only a (2/1.5)^10 ≈ 18x increase in the relative effort dedicated to corrigibility training. This seems doable given that relatively little of the system cards of current models is dedicated to shutdownability.
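For concreteness, the arithmetic behind that 18x figure (a quick sketch using the same assumed growth rates as above):

```python
corrigibility_growth = 2.0   # assumed growth in corrigibility-training cost per time-horizon doubling
total_growth = 1.5           # assumed growth in total training cost per doubling
doublings = 10
print((corrigibility_growth / total_growth) ** doublings)  # ≈ 17.8, i.e. roughly an 18x relative increase
```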

As for (b), my guess is that with NO corrigibility training at all, models would start doing things like disabling shutdown scripts in the wild, or locking users out of their computers and changing their passwords to prevent them from manually shutting the agent down; there would be outcry from the public and from B2B customers that hurts their profits, as well as a dataset of examples to train against. It's plausible that fixing this still wouldn't let them stop disempowerment. Maybe a crux is whether naively training for corrigibility is more effective or less effective on more egregious incidents.

After the 10x-the-economy stage, corrigibility plausibly stops being useful for users / aligned with the profit motive, because humans will largely stop understanding what models are doing, so I get a lot more worried if we haven't figured something out by then.

Why Corrigibility is Hard and Important (i.e. "Whence the high MIRI confidence in alignment difficulty?")
Thomas Kwa · 7d

After thinking about it more, it might take more than 3% even if things scale smoothly, because I'm not confident corrigibility is only a small fraction of labs' current safety budgets.

Why Corrigibility is Hard and Important (i.e. "Whence the high MIRI confidence in alignment difficulty?")
Thomas Kwa · 8d

Sorry, I mean corrigibility as opposed to CEV, and narrow in the sense that it follows user instructions rather than optimizing for all the user's inferred preferences in every domain, not in the sense of an AI that only understands physics. Under the median default trajectory I don't expect unsolvable corrigibility problems at the capability level where AI can 10x the economy; rather I expect something like today, where companies undershoot or overshoot on how much the agent tries to be helpful vs. corrigible, what's good for business is reasonably well aligned, and getting there requires something like 3% of the lab's resources.

More posts

  • Claude, GPT, and Gemini All Struggle to Evade Monitors (Ω, 2mo)
  • METR: How Does Time Horizon Vary Across Domains? (3mo)
  • Tsinghua paper: Does RL Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model? (5mo)
  • Should CA, TX, OK, and LA merge into a giant swing state, just for elections? (1y)
  • The murderous shortcut: a toy model of instrumental convergence (Ω, 1y)
  • Goodhart in RL with KL: Appendix (Ω, 1y)
  • Catastrophic Goodhart in RL with KL penalty (Ω, 1y)
  • Is a random box of gas predictable after 20 seconds? (Q, 2y)
  • Will quantum randomness affect the 2028 election? (Q, 2y)
  • Thomas Kwa's research journal (Ω, 2y)