LESSWRONG
LW

1634
Thomas Kwa
7065Ω566268110
Message
Dialogue
Subscribe

Member of technical staff at METR.

Previously: MIRI → interp with Adrià and Jason → METR.

I have signed no contracts or agreements whose existence I cannot mention.

Sequences

Posts

Sorted by New

Wikitag Contributions

Comments

Sorted by
Newest
11Thomas Kwa's Shortform
Ω
6y
Ω
293
Catastrophic Regressional Goodhart
No wikitag contributions to display.
Why Corrigibility is Hard and Important (i.e. "Whence the high MIRI confidence in alignment difficulty?")
Thomas Kwa20m20

After thinking about it more, it might take more than 3% even if things scale smoothly because I'm not confident corrigibility is only a small fraction of labs' current safety budgets

Reply
Why Corrigibility is Hard and Important (i.e. "Whence the high MIRI confidence in alignment difficulty?")
Thomas Kwa21h20

Sorry, I mean corrigibility as opposed to CEV, and narrow in the sense that it follows user instructions rather than optimizing for all the user's inferred preferences in every domain, not in the sense of AI that only understands physics. I don't expect unsolvable corrigibility problems at the capability level where AI can 10x the economy under the median default trajectory, rather something like today, where companies undershoot or overshoot on how much the agent tries to be helpful vs corrigible, what's good for business is reasonably aligned, and getting there requires something like 3% of the lab's resources.

Reply
Benito's Shortform Feed
Thomas Kwa1d20

Big fan of the "bowing out" react. I did notice a minor UI issue where the voting arrows don't fit in the box:

Reply
Why Corrigibility is Hard and Important (i.e. "Whence the high MIRI confidence in alignment difficulty?")
Thomas Kwa1d30

Inasmuch as we're going for corrigibility, it seems necessary and possible to create an agent that won't self-modify into an agent with complete preferences. Complete preferences is antithetical to narrow agents, and would mean the agent might try to eg solve the Israel-Palestine conflict when all you asked it to do is code you a website. Even if there is a working stop button this is a bad situation. It seems likely we can just train against this sort of thing, though maybe it will require being slightly clever.

As for whether it we can/should have it self-modify to avoid more basic kinds of money-pumps so its preference are at least transitive and independent, this is an empirical question I'm extremely unsure of, but we should aim for the least dangerous agent that gets the desired performance, which should balance the propensity for misaligned actions with the additional capability needed to overcome irrationality.

Reply
Raemon's Shortform
Thomas Kwa1d*92

I was aware there is some line but thought it was "don't ignite a conversation that derails this one" rather than "don't say inaccurate things about groups", which is why I listed lots of groups rather than one and declined to list actively contentious topics like timelines, IABIED reviews, or Matthew Barnett's opinions

Reply
Raemon's Shortform
Thomas Kwa1d50

This was not my intention, though I could have been more careful. Here are my reasons

  • The original comment seemed really vague, in a way that often dooms conversations. Little progress can be made on most problems without pointing out the specific key reasons for them. The key point to make is that tribalism in this case doesn't arise spontaneously based on identities alone, it has micro level causes which have macro level causes
  • I thought Ray's wanted to discuss what to do for a broader communication strategy, so replying in shortform would be fine because the output would get >20x the views (this is where I could have realized LW shortform has a high profile now, and toned it down somehow), rather than open up the conversation here
  • I am also frustrated about tribalism and reporting from experience about what I notice in a somewhat exaggerated way. If there is defeatism this is the source, though I don't think addressing it is impossible, I just don't have any ideas
  • If people replied to me with object level twitterbrained comments about how eg everyone has to unite against Marc Andreessen I would be super sad. Hopefully we're better than that.
Reply111
Raemon's Shortform
Thomas Kwa1d125

It seems like everyone is tired of hearing every other group's opinions about AI. Since like 2005, Eliezer has been hearing people say a superintelligent AI surely won't be clever, and has had enough. The average LW reader is tired of hearing obviously dumb Marc Andreessen accelerationist opinions. The average present harms person wants everyone to stop talking about the unrealistic apocalypse when artists are being replaced by shitty AI art. The average accelerationist wants everyone to stop talking about the unrealistic apocalypse when they could literally cure cancer and save Western civilization. The average NeurIPS author is sad that LLMs have made their expertise in Gaussian kernel wobblification irrelevant. Various subgroups of LW readers are dissatisfied with people who think reward is the optimization target, Eliezer is always right, or discussion is too tribal, or whatever.

With this combined with how Twitter distorts discourse is it any wonder that people need to process things as "oh, that's just another claim by X group, time to dismiss"? Anyway I think naming the groups isn't the problem, and so naming the groups in the post isn't contributing to the problem much. The important thing to address is why people find it advantageous to track these groups.

Reply4211
faul_sname's Shortform
Thomas Kwa2d102

Nice find; this may be where the real "glitch tokens" work starts. 

Reply1
Burny's Shortform
Thomas Kwa2d6-2

This seems unlikely for two reasons. First, models can't control the words in their reasoning traces well; even if they're trying to hide them they often blurt out "I must not say that I'm doing X". Second, if the model actually needs to use its CoT rather than narrating something it could have figured out in a single forward pass, it would need to encode misaligned reasoning rather than just omitting it.

So I think CoT will be monitorable until some combination of

  • models are no longer trained to narrate their thoughts
  • hiding misaligned thoughts is rewarded during training
  • the behaviors we're trying to catch are possible in one forward pass for the model

All of these could happen by late next year for some usecases, but I don't think it will be emergent property of next generation just having more situational awareness.

Reply
Why Corrigibility is Hard and Important (i.e. "Whence the high MIRI confidence in alignment difficulty?")
Thomas Kwa2d31

I don't think so. While working with Vivek I made a list once of ways agents could be partially consequentialist but concluded that doing game theory type things didn't seem enlightening.

Maybe it's better to think about "agents that are very capable and survive selection processes we put them under" rather than "rational agents" because the latter implies it should be invulnerable to all money-pumps, which is not a property we need or want.

Reply
Load More
61Claude, GPT, and Gemini All Struggle to Evade Monitors
Ω
2mo
Ω
3
84METR: How Does Time Horizon Vary Across Domains?
3mo
8
69Tsinghua paper: Does RL Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?
5mo
21
115Should CA, TX, OK, and LA merge into a giant swing state, just for elections?
11mo
35
37The murderous shortcut: a toy model of instrumental convergence
Ω
1y
Ω
0
12Goodhart in RL with KL: Appendix
Ω
1y
Ω
0
62Catastrophic Goodhart in RL with KL penalty
Ω
1y
Ω
10
38Is a random box of gas predictable after 20 seconds?
Q
2y
Q
35
66Will quantum randomness affect the 2028 election?
Q
2y
Q
52
79Thomas Kwa's research journal
Ω
2y
Ω
1
Load More