Sure, understood.

However, I'm still unclear what you meant by "This level of understanding isn't sufficient for superhuman persuasion." If 'this' referred to [human coworker level], then you're correct (I now guess you did mean this), but it seems a mildly strange point to make. It's not clear to me why it'd be significant in this context without strong assumptions about the correlation of capability across different kinds of understanding/persuasion.

I interpreted 'this' as referring to the [understanding level of current models]. In that case it's not clear to me that this isn't sufficient for superhuman persuasion capability. (by which I mean having the capability to carry out at least one strategy that fairly robustly results in superhuman persuasiveness in some contexts)

Do current models have better understanding of text authors than the human coworkers of these authors? I expect this isn't true right now (though it might be true for more powerful models for people who have written a huge amount of stuff online). This level of understanding isn't sufficient for superhuman persuasion.

Both "better understanding" and in a sense "superhuman persuasion" seem to be too coarse a way to think about this (I realize you're responding to a claim-at-similar-coarseness).

Models don't need to be capable of a Pareto improvement on human persuasion strategies to have one superhuman strategy in one dangerous context. This seems likely to require understanding something-about-an-author better than humans, not everything-about-an-author better.

Overall, I'm with you in not (yet) seeing compelling reasons to expect a super-human persuasion strategy to emerge from pretraining before human-level R&D.
However, a specific [doesn't understand an author better than coworkers] -> [unlikely there's a superhuman persuasion strategy] argument seems weak.

It's unclear to me what kinds of understanding are upstream pre-requisites of at least one [get a human to do what you want] strategy. It seems pretty easy to miss possibilities here.

If we don't understand what the model would need to infer from context in order to make a given strategy viable, it may be hard to provide the relevant context for an evaluation.
Obvious-to-me adjustments don't necessarily help. E.g. giving huge amounts of context, since [inferences about author given input ()] are not a subset of [inferences about author given input (    ...  )].

Thanks for the thoughtful response.

A few thoughts:
If length is the issue, then replacing "leads" with "led" would reflect the reality.

I don't have an issue with titles like "...Improving safety..." since it has a [this is what this line of research is aiming at] vibe, rather than a [this is what we have shown] vibe. Compare "curing cancer using x" to "x cures cancer".
Also in that particular case your title doesn't suggest [we have achieved AI control]. I don't think it's controversial that control would improve safety, if achieved.

I agree that this isn't a huge deal in general - however, I do think it's usually easy to fix: either a [name a process, not a result] or a [say what happened, not what you guess it implies] approach is pretty general.

Also agreed that improving summaries is more important. Quite hard to achieve given the selection effects: [x writes a summary on y] tends to select for [x is enthusiastic about y] and [x has time to write a summary]. [x is enthusiastic about y] in turn selects for [x misunderstands y to be more significant than it is].

Improving this situation deserves thought and effort, but seems hard. Great communication from the primary source is clearly a big plus (not without significant time cost, I'm sure). I think your/Buck's posts on the control stuff are commendably clear and thorough.

I expect the paper itself is useful (I've still not read it). In general I'd like the focus to be on understanding where/how/why debate fails - both in the near-term cases, and the more exotic cases (though I expect the latter not to look like debate-specific research). It's unsurprising that it'll work most of the time in some contexts. Completely fine for [show a setup that works] to be the first step, of course - it's just not the interesting bit.

I'd be curious what the take is of someone who disagrees with my comment.
(I'm mildly surprised, since I'd have predicted more of a [this is not a useful comment] reaction, than a [this is incorrect] reaction)

I'm not clear whether the idea is that:

  1. The title isn't an overstatement.
  2. The title is not misleading. (e.g. because "everybody knows" that it's not making a claim of generality/robustness)
  3. The title will not mislead significant numbers of people in important ways. It's marginally negative, but not worth time/attention.
  4. There are upsides to the current name, and it seems net positive. (e.g. if it'd get more attention, and [paper gets attention] is considered positive)
  5. This is the usual standard, so [it's fine] or [it's silly to complain about] or ...?
  6. Something else.

I'm not claiming that this is unusual, or a huge issue on its own.
I am claiming that the norms here seem systematically unhelpful.
I'm more interested in the general practice than this paper specifically (though I think it's negative here).

I'd be particularly interested in a claim of (4) - and whether the idea here is something like [everyone is doing this, it's an unhelpful equilibrium, but if we unilaterally depart from it it'll hurt what we care about and not fix the problem]. (this seems incorrect to me, but understandable)

Interesting - I look forward to reading the paper.

However, given that most people won't read the paper (or even the abstract), could I appeal for paper titles that don't overstate the generality of the results? I know it's standard practice in most fields not to bother with caveats in the title, but here it may actually matter if e.g. those working in governance think that you've actually shown "Debating with More Persuasive LLMs Leads to More Truthful Answers", rather than "In our experiments, Debating with More Persuasive LLMs Led to More Truthful Answers".

The title matters to those who won't read the paper, and can't easily guess at the generality of what you'll have shown (e.g. that your paper doesn't include theoretical results suggesting that we should expect this pattern to apply robustly or in general). Again, I know this is a general issue - this just happens to be a context where I can point this out with high confidence without having read the paper :).

Thanks for the link.

I find all of this plausible. However, I start to worry when we need to rely on "for all" assumptions based on intuition. (also, I worry in large part because domains are a natural way to think here - it's when things feel natural that we forget we're making assumptions)

I can buy that [most skills in a domain correlate quite closely] and that [most problematic skills/strategies exist in a small number of domains]. The 'all' versions are much less clear.

Great post (I've not yet read it thoroughly, or thought for long).

The first concern that springs to mind:

  • I expect a strong correlation between [humans are bad at [subversive strategy x]], [humans don't tend to think of [subversive strategy x]], and [humans don't tend to notice [subversive strategy x]].
    • My worry is more that we miss strategies that we're bad at, than strategies the AI is extremely good at. It seems plausible for us to be entirely unaware of the existence of some subversive strategies - since they've never been human skills, and so don't appear on our map.
    • It seems to me that the near-term danger is better described as [subversive strategies that don't occur to humans], rather than [superhuman subversive strategies]. The latter will put people in mind of [strategies humans use, only better] - genius hacking/persuasion etc. I also want people considering [strategies that don't occur to humans at all].
  • On a minor-but-related note, I'm not too keen on writing/thinking in terms of "problematic domains" rather than "problematic skills" / "problematic strategies". There's no necessity for a subversive strategy to map nicely into something I have a label for - e.g. "hacking", "persuasion".
    • If we need to rule out all problematic strategies, I want to avoid language/thinking that may stop us considering problems outside our usual categories.
    • (I don't think you're saying anything substantively imprecise - here I'm only concerned about language and blind-spots)

One last thing that caught my eye:

...but humans are probably relatively more selected (by evolution) for resisting persuasion than for being good at modern science...

This seems a dangerous assumption, since humans have been selected to resist persuasion when resistance helped pass on their genes, which is very much not always. E.g. being persuaded of x when it's likely that the rest of your tribe will be persuaded of x may well be helpful-to-your-genes, regardless of the truth of x or of the validity of the arguments. Humans were selected to believe useful fictions.

I note also that there's a big difference between [human x has the skills and understanding necessary to resist being persuaded of y] and [human x will in fact resist being persuaded of y].

For crying out loud, LLMs are already considered "AIs" by most people! How could they be a worse analogy for AI, across the board, than extraterrestrial beings that we have never come into contact with?

By tending to lead to overconfidence.
An aliens analogy is explicitly relying on [we have no idea what this will do]. It's easy to imagine friendly aliens, just as it's easy to imagine unfriendly ones, or entirely disinterested ones. The analogy is unlikely to lead to a highly specific, incorrect model.

This is not true for LLMs. It's easy to assume that particular patterns will continue to hold - e.g. that it'll be reasonably safe to train systems with something like our current degree of understanding.

To be clear, I'm not saying they're worse in terms of information content: I'm saying they can be worse in the terms you're using to object to analogies: "routinely conveying the false impression of a specific, credible model of AI".

I think it's correct that we should be very wary of the use of analogies (though they're likely unavoidable).
However, the cases where we need to be the most wary are those that seem most naturally applicable - these are the cases that are most likely to lead to overconfidence. LLMs, [current NNs], or [current AI systems generally] are central examples here.


On asymmetric pushback, I think you're correct, but that you'll tend to get an asymmetry everywhere between [bad argument for conclusion most people agree with] and [bad argument for conclusion most people disagree with].
People have limited time. They'll tend to put a higher value on critiquing invalid-in-their-opinion arguments when those lead to incorrect-in-their-opinion conclusions (at least unless they're deeply involved in the discussion).

There's also an asymmetry in terms of consequences-of-mistakes here: if we think that AI will be catastrophic, and are wrong, this causes a delay, a large loss of value, and a small-but-significant increase in x-risk; if we think that AI will be non-catastrophic, and are wrong, we're dead.

Lack of pushback shouldn't be taken as a strong indication that people agree with the argumentation used.

Clearly this isn't ideal.
I do think it's worth thinking about mechanisms to increase the quality of argument.
E.g. I think the ability to emoji react to particular comment sections is helpful here - though I don't think there's one that's great for [analogy seems misleading] as things stand. Perhaps there should be a [seems misleading] react?? (I don't think "locally invalid" covers this)

Concrete suggestion: OpenAI should allow the Safety Advisory Group Chair and the head of the Preparedness Team to have “veto power” on model development and deployment decisions

Quite possibly a good idea, but I think it's less obvious than it seems at first glance:
Remember that a position's having veto power will tend to have a large impact on selection for that position.

The comparison isn't [x with veto power] vs [x without veto power].
It's [x with veto power] vs [y without veto power].
If y would tend to have deeper understanding, more independence or more caution than x, it's not obvious that giving the position veto power helps. Better to have someone who'll spot problems and need to use persuasion, than someone who can veto but spots no problems.
