mattmacdermott

Comments

Nathan Young's Shortform
mattmacdermott · 6d

I thought the post was fine and was surprised it was so downvoted. Even if people don’t agree with the considerations, or think all the most important considerations are missing, why should a post saying, “Here’s what I think and why I think it, feel free to push back in the comments,” be so poorly received? Commenters can just say what they think is missing.

Seems likely that it wouldn’t have been so downvoted if its bottom line was that AI risk is very high. Increases my P(LW groupthink is a problem) a bit.

A Pitfall of "Expertise"
mattmacdermott · 8d

Question marks and exclamation points go outside, unless they’re part of the sentence, and colons and semicolons always go outside.

An example of a sentence from the post that uses a colon is “It gets deeper than this, but the core problem remains the same”:

Surely nobody endorses putting the colon outside the quotes there! I feel like Opus/whoever is just assuming that people virtually never want to quote a piece of text that ends in a colon, rather than really wanting to endorse a different rule to the question mark case.

Yudkowsky on "Don't use p(doom)"
mattmacdermott · 22d

I’m confused by the “necessary and sufficient” in “what is the minimum necessary and sufficient policy that you think would prevent extinction?”

Who’s to say there exists a policy which is both necessary and sufficient? Unless we mean something kinda weird by “policy” that can include a huge disjunction (e.g. “we do any of the 13 different things I think would work”) or can be very vague (e.g. “we solve half A of the problem and also half B of the problem”).

It would make a lot more sense in my mind to ask “what is a minimal sufficient policy that you think would prevent extinction?”

dbohdan's Shortform
mattmacdermott · 1mo

Usually lower numbers go on the left and bigger numbers go on the right (1, 2, 3,…) so seems reasonable to have it this way.

Training a Reward Hacker Despite Perfect Labels
mattmacdermott · 1mo

RL generalization is controlled by why the policy took an action

Is this that good a framing for these experiments? Just thinking out loud:

Distinguish two claims:

  1. what a model reasons about in its output tokens on the way to getting its answer affects how it will generalise
  2. why a model produces its output tokens affects how it will generalise

These experiments seem to test (1), while the claim from your old RL posts is more like (2).

You might want to argue that the claims are actually very similar, but I suspect that someone who disagrees with the quoted statement would believe (1) despite not believing (2). To convince such people we’d have to test (2) directly, or argue that (1) and (2) are very similar.

As for whether the claims are very similar… I’m actually not sure they are (I changed my mind while writing this comment).

Re: (1), it’s clear that when you get an answer right using a particular line of reasoning, the gradients point towards using that line of reasoning more often. But re: (2), the gradients point towards whatever is the easiest way to get you to produce those output tokens more often, which could in principle be via a qualitatively different computation to the one that actually caused you to output them this time.

So the two claims are at least somewhat different. (1) seems more strongly true than (2) (although I still believe (2) is likely true to a large extent).
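To make the gradient point concrete, here’s a toy sketch (just my own illustration, assuming a plain REINFORCE-style update on a softmax policy; the setup is made up and isn’t from the post):

```python
import numpy as np

# Toy REINFORCE update for a softmax "policy" over a tiny vocabulary.
# The update only uses the sampled output token and the reward -- nothing
# about *why* the policy produced that token -- so it pushes towards whatever
# parameter change most cheaply raises the probability of that token.

rng = np.random.default_rng(0)
vocab_size = 5
logits = rng.normal(size=vocab_size)      # stand-in for policy parameters

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

probs = softmax(logits)
token = rng.choice(vocab_size, p=probs)   # sampled output token
reward = 1.0                              # e.g. "answer judged correct"

# Gradient of log p(token) w.r.t. the logits for a softmax policy:
# one-hot(token) - probs
grad_log_prob = -probs
grad_log_prob[token] += 1.0

logits += 0.1 * reward * grad_log_prob    # reinforce the emitted token
```

The update is a function of the emitted tokens and reward alone, so which internal computation gets strengthened is whichever one most cheaply reproduces those tokens, not necessarily the one that produced them this time.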

Training a Reward Hacker Despite Perfect Labels
mattmacdermott · 1mo

But their setup adds:

1.5. Remove any examples in which the steering actually resulted in the desired behaviour.

which is why it’s surprising.

Re: recent Anthropic safety research
mattmacdermott · 1mo

the core prompting experiments were originally done by me (the lead author of the paper) and I'm not an Anthropic employee. So the main results can't have been an Anthropic PR play (without something pretty elaborate going on).

Well, Anthropic chose to take your experiments and build on and promote them, so that could have been a PR play, right? Not saying I think it was, just doubting the local validity.

Utility Maximization = Description Length Minimization
mattmacdermott · 1mo

Sorry, can’t remember. Something done virtually, maybe during Covid.

mattmacdermott's Shortform
mattmacdermott · 2mo

ELK = easy-to-hard generalisation + assumption the model already knows the hard stuff?

Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety
mattmacdermott · 2mo

It's the last sentence of the first paragraph of section 1.

Posts

Is instrumental convergence a thing for virtue-driven agents? (5mo)
Validating against a misalignment detector is very different to training against one (6mo)
Superintelligent Agents Pose Catastrophic Risks: Can Scientist AI Offer a Safer Path? (7mo)
Context-dependent consequentialism (10mo)
Can a Bayesian Oracle Prevent Harm from an Agent? (Bengio et al. 2024) (1y)
Bengio's Alignment Proposal: "Towards a Cautious Scientist AI with Convergent Safety Bounds" (2y)
mattmacdermott's Shortform (2y)
What's next for the field of Agent Foundations? (2y)
Optimisation Measures: Desiderata, Impossibility, Proposals (2y)
Reward Hacking from a Causal Perspective (2y)