LESSWRONG

mattmacdermott
Comments

Yudkowsky on "Don't use p(doom)"
mattmacdermott · 9d

I’m confused by the “necessary and sufficient” in “what is the minimum necessary and sufficient policy that you think would prevent extinction?”

Who’s to say there exists a policy which is both necessary and sufficient? Unless we mean something kinda weird by “policy” that can include a huge disjunction (e.g. “we do any of the 13 different things I think would work”) or can be very vague (e.g. “we solve half A of the problem and also half B of the problem”).

It would make a lot more sense in my mind to ask “what is a minimal sufficient policy that you think would prevent extinction?”
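
To spell the worry out (a hypothetical formalisation on my part, not anything from the original question): call a policy $P$ sufficient if enacting it guarantees we avoid extinction $E$, and necessary if extinction can only be avoided by enacting it:

$$
\text{sufficient: } P \Rightarrow \neg E, \qquad \text{necessary: } \neg E \Rightarrow P.
$$

If two distinct policies $P_1$ and $P_2$ would each work on their own, then neither is necessary by itself; at best the disjunction $P_1 \lor P_2$ is, which is exactly the “kinda weird” sense of policy above.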

dbohdan's Shortform
mattmacdermott · 16d

Usually lower numbers go on the left and higher numbers go on the right (1, 2, 3, …), so it seems reasonable to have it this way.

Training a Reward Hacker Despite Perfect Labels
mattmacdermott · 17d

“RL generalization is controlled by why the policy took an action”

Is this that good a framing for these experiments? Just thinking out loud:

Distinguish two claims:

  1. What a model reasons about in its output tokens on the way to getting its answer affects how it will generalise.
  2. Why a model produces its output tokens affects how it will generalise.

These experiments seem to test (1), while the claim from your old RL posts is more like (2).

You might want to argue that the claims are actually very similar, but I suspect that someone who disagrees with the quoted statement would believe (1) despite not believing (2). To convince such people we’d have to test (2) directly, or argue that (1) and (2) are very similar.

As for whether the claims are very similar… I’m actually not sure they are (I changed my mind while writing this comment).

Re: (1), it’s clear that when you get an answer right using a particular line of reasoning, the gradients point towards using that line of reasoning more often. But re: (2), the gradients point towards whatever is the easiest way to get you to produce those output tokens more often, which could in principle be via a qualitatively different computation to the one that actually caused you to output them this time.
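
To make the gradient point concrete (a generic policy-gradient sketch, not the specific setup in the post): the standard score-function estimator is

$$
\nabla_\theta \, \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\!\left[R(x, y)\right] \;=\; \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\!\left[R(x, y)\, \nabla_\theta \log \pi_\theta(y \mid x)\right].
$$

The reward attaches only to the sampled tokens $y$: the objective just says “make these tokens more likely”, so the update follows whatever parameter change most increases $\log \pi_\theta(y \mid x)$, which needn’t be the computation that actually produced $y$ on this rollout.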

So the two claims are at least somewhat different. (1) seems more strongly true than (2) (although I still believe (2) is likely true to a large extent).

Training a Reward Hacker Despite Perfect Labels
mattmacdermott · 17d

But their setup adds:

1.5. Remove any examples in which the steering actually resulted in the desired behaviour.

which is why it’s surprising.

Re: recent Anthropic safety research
mattmacdermott · 26d

“the core prompting experiments were originally done by me (the lead author of the paper) and I'm not an Anthropic employee. So the main results can't have been an Anthropic PR play (without something pretty elaborate going on).”

Well, Anthropic chose to take your experiments and build on and promote them, so that could have been a PR play, right? Not saying I think it was, just doubting the local validity.

Utility Maximization = Description Length Minimization
mattmacdermott · 1mo

Sorry, can’t remember. Something done virtually, maybe during Covid.

mattmacdermott's Shortform
mattmacdermott · 1mo

ELK = easy-to-hard generalisation + assumption the model already knows the hard stuff?

Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety
mattmacdermott · 2mo

It's the last sentence of the first paragraph of section 1.

Authors Have a Responsibility to Communicate Clearly
mattmacdermott · 2mo

Not sure how to think about this overall. I can come up with examples where it seems like you should assign basically full credit for sloppy or straightforwardly wrong statements.

E.g. suppose Alice claims that BIC only make black pens. Bob says, "I literally have a packet of blue BIC pens in my desk drawer. We will go to my house, open the drawer, and you will see them." They go to Bob's house, and lo, the desk drawer is empty. Turns out the pens are on the kitchen table instead. Clearly it's fine for Bob to say, "All I really meant was that I had blue pens at my house, the point stands."

I think your mention of motte-and-baileys probably points at the right refinement: maybe it's fine to be sloppy if the version you later correct yourself to has the same implications as what you literally said. But if you correct yourself to something easier to defend but that doesn't support your initial conclusion to the same extent, that's bad.

EDIT: another important feature of the pens example is that the statement Bob switched to is uncontroversially true. If on finding the desk drawer empty he instead wanted to switch to "I left them at work", then probably he should pause and admit a mistake first.

Authors Have a Responsibility to Communicate Clearly
mattmacdermott · 2mo

Independently of the broader point, here are some comments on the particular example from the Scientist AI paper (source: I am an author):

  • reward tampering arguments were and are a topic of disagreement among the authors, some of whom have views similar to yours, and some of whom have views well expressed by the quoted passage
  • I predict the lead author (Yoshua) would indeed own that passage and not say, "That was sloppy, what we really meant was..."
  • so while it's fine as an example of readers perhaps inappropriately substituting an interpretation that seems more correct to them, I think it's not great as an example of authors motte-and-baileying or whatever (although admittedly having a bunch of different authors on a document can push against precision in areas where there's less consensus)
Posts

33 · Is instrumental convergence a thing for virtue-driven agents? · 5mo · 37 comments
39 · Validating against a misalignment detector is very different to training against one · Ω · 6mo · 4 comments
44 · Superintelligent Agents Pose Catastrophic Risks: Can Scientist AI Offer a Safer Path? · Ω · 6mo · 15 comments
31 · Context-dependent consequentialism · Ω · 10mo · 6 comments
28 · Can a Bayesian Oracle Prevent Harm from an Agent? (Bengio et al. 2024) · Ω · 1y · 0 comments
76 · Bengio's Alignment Proposal: "Towards a Cautious Scientist AI with Convergent Safety Bounds" · Ω · 2y · 19 comments
4 · mattmacdermott's Shortform · 2y · 36 comments
59 · What's next for the field of Agent Foundations? · Ω · 2y · 23 comments
36 · Optimisation Measures: Desiderata, Impossibility, Proposals · Ω · 2y · 9 comments
29 · Reward Hacking from a Causal Perspective · Ω · 2y · 6 comments