mattmacdermott

Comments
leogao's Shortform
mattmacdermott · 21h

Fair enough, yeah. But at least (1)-style effects weren’t strong enough to prevent any significant legislation in the near future.

leogao's Shortform
mattmacdermott · 1d

Some evidence for (2) is that before the 1957 act no civil rights legislation had been passed for 82 years[1], and after it three more civil rights acts were passed in the next 11 years, including the Civil Rights Act of 1964, which, in my understanding, is considered very significant.


  1. Going off what's listed in the Wikipedia article on civil rights acts in the United States. ↩︎

Nathan Young's Shortform
mattmacdermott · 14d

I thought the post was fine and was surprised it was so downvoted. Even if people don’t agree with the considerations, or think all the most important considerations are missing, why should a post saying, “Here’s what I think and why I think it, feel free to push back in the comments,” be so poorly received? Commenters can just say what they think is missing.

Seems likely that it wouldn’t have been so downvoted if its bottom line had been that AI risk is very high. Increases my P(LW groupthink is a problem) a bit.

A Pitfall of "Expertise"
mattmacdermott · 15d

Question marks and exclamation points go outside, unless they’re part of the sentence, and colons and semicolons always go outside.

An example of a sentence from the post that uses a colon is “It gets deeper than this, but the core problem remains the same”:

Surely nobody endorses putting the colon outside the quotes there! I feel like Opus/whoever is just assuming that people virtually never want to quote a piece of text that ends in a colon, rather than really wanting to endorse a different rule to the question mark case.

Yudkowsky on "Don't use p(doom)"
mattmacdermott · 1mo

I’m confused by the “necessary and sufficient” in “what is the minimum necessary and sufficient policy that you think would prevent extinction?”

Who’s to say there exists a policy which is both necessary and sufficient? Unless we mean something kinda weird by “policy” that can include a huge disjunction (e.g. “we do any of the 13 different things I think would work”) or can be very vague (e.g. “we solve half A of the problem and also half B of the problem”).

It would make a lot more sense in my mind to ask “what is a minimal sufficient policy that you think would prevent extinction?”
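
To spell out the worry with a toy formalisation (my notation, not anything from the dialogue): call a policy P sufficient if P ⇒ no extinction, and necessary if no extinction ⇒ P. If two policies A and B are each sufficient and neither implies the other, then neither is necessary, since the world could survive via the other one. Roughly, the only candidate for something both necessary and sufficient is the disjunction of all the sufficient policies, which is exactly the huge disjunction above.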

dbohdan's Shortform
mattmacdermott · 1mo

Usually lower numbers go on the left and bigger numbers go on the right (1, 2, 3, …), so it seems reasonable to have it this way.

Training a Reward Hacker Despite Perfect Labels
mattmacdermott · 1mo

RL generalization is controlled by why the policy took an action

Is this that good a framing for these experiments? Just thinking out loud:

Distinguish two claims:

  1. What a model reasons about in its output tokens on the way to getting its answer affects how it will generalise.
  2. Why a model produces its output tokens affects how it will generalise.

These experiments seem to test (1), while the claim from your old RL posts is more like (2).

You might want to argue that the claims are actually very similar, but I suspect that someone who disagrees with the quoted statement would believe (1) despite not believing (2). To convince such people we’d have to test (2) directly, or argue that (1) and (2) are very similar.

As for whether the claims are very similar… I’m actually not sure they are (I changed my mind while writing this comment).

Re: (1), it’s clear that when you get an answer right using a particular line of reasoning, the gradients point towards using that line of reasoning more often. But re: (2), the gradients point towards whatever is the easiest way to get you to produce those output tokens more often, which could in principle be via a qualitatively different computation to the one that actually caused you to output them this time.

So the two claims are at least somewhat different. (1) seems more strongly true than (2) (although I still believe (2) is likely true to a large extent).
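
To illustrate the gradient point, here’s a minimal sketch of a REINFORCE-style update (hypothetical PyTorch code; model, prompt_tokens, sampled_tokens and reward are illustrative names, not anything from the post’s experiments). The loss is built only from the log-probabilities of the sampled output tokens, so the gradient points towards whatever parameter change most cheaply raises those probabilities; nothing in it singles out the computation that actually produced the tokens this time.

    # Minimal sketch of a REINFORCE-style loss (assumes a PyTorch model that
    # maps a batch of token ids to per-token logits; all names illustrative).
    import torch
    import torch.nn.functional as F

    def reinforce_loss(model, prompt_tokens, sampled_tokens, reward):
        # Score the full sequence: prompt followed by the sampled completion.
        input_ids = torch.cat([prompt_tokens, sampled_tokens])
        logits = model(input_ids.unsqueeze(0)).squeeze(0)  # (seq_len, vocab)

        # Log-probability of each sampled token given its prefix.
        n_prompt = prompt_tokens.shape[0]
        pred = logits[n_prompt - 1:-1]  # positions predicting the completion
        log_probs = F.log_softmax(pred, dim=-1)
        token_lp = log_probs.gather(1, sampled_tokens.unsqueeze(1)).squeeze(1)

        # The objective only mentions the output tokens' probabilities.
        # Nothing privileges the internal computation that produced them, so
        # the update can raise their probability via a different computation:
        # that is the gap between claims (1) and (2).
        return -(reward * token_lp.sum())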

Training a Reward Hacker Despite Perfect Labels
mattmacdermott · 1mo

But their setup adds:

1.5. Remove any examples in which the steering actually resulted in the desired behaviour.

which is why it’s surprising.

Re: recent Anthropic safety research
mattmacdermott · 1mo

the core prompting experiments were originally done by me (the lead author of the paper) and I'm not an Anthropic employee. So the main results can't have been an Anthropic PR play (without something pretty elaborate going on).

Well, Anthropic chose to take your experiments and build on and promote them, so that could have been a PR play, right? Not saying I think it was, just doubting the local validity.

Utility Maximization = Description Length Minimization
mattmacdermott · 2mo

Sorry, can’t remember. Something done virtually, maybe during Covid.

Posts

mattmacdermott's Shortform (4 points, 2y, 36 comments)
Is instrumental convergence a thing for virtue-driven agents? (33 points, 6mo, 37 comments)
Validating against a misalignment detector is very different to training against one [Ω] (39 points, 7mo, 4 comments)
Superintelligent Agents Pose Catastrophic Risks: Can Scientist AI Offer a Safer Path? [Ω] (44 points, 7mo, 15 comments)
Context-dependent consequentialism [Ω] (31 points, 11mo, 6 comments)
Can a Bayesian Oracle Prevent Harm from an Agent? (Bengio et al. 2024) [Ω] (28 points, 1y, 0 comments)
Bengio's Alignment Proposal: "Towards a Cautious Scientist AI with Convergent Safety Bounds" [Ω] (76 points, 2y, 19 comments)
What's next for the field of Agent Foundations? [Ω] (59 points, 2y, 23 comments)
Optimisation Measures: Desiderata, Impossibility, Proposals [Ω] (36 points, 2y, 9 comments)
Reward Hacking from a Causal Perspective [Ω] (29 points, 2y, 6 comments)