Cleo Nardo

I'm currently transitioning to full-time alignment research. DMs open :)

Comments

Chatbots don't map scenarios to actions, they map queries to replies.

Thanks for the summary.

  • Does MACHIAVELLI work for chatbots like LIMA?
  • If not, which do you think is the SOTA? Anthropic's?

Sorry for any confusion. Meta tested only LIMA on their 30 safety prompts, not the other LLMs.

Figure 1 does not show the results from the 30 safety prompts, but instead the results of human evaluations on the 300 test prompts.

  • Yep, I agree that MMLU and SWAG aren't alignment benchmarks. I was using them as examples of "Want to test your model's ability at X? Then use the standard X benchmark!" I'll clarify in the text.
  • They tested toxicity (among other things) with their "safety prompts", but we do have standard benchmarks for toxicity.
  • They could have turned their safety prompts into a new benchmark if they had run the same test on the other LLMs! This would've taken, idk, 2–5 hrs of labour?
  • The best MMLU-like benchmark test for alignment-proper is https://github.com/anthropics/evals which is used in Anthropic's Discovering Language Model Behaviors with Model-Written Evaluations. See here for a visualisation. Unfortunately, this benchmark was published by Anthropic, which makes it unlikely that competitors will use it (esp. MetaAI).

Clarifications:

The way the authors phrase the Superficial Alignment Hypothesis is a bit vague, but they do offer a more concrete corollary:

If this hypothesis is correct, and alignment is largely about learning style, then a corollary of the Superficial Alignment Hypothesis is that one could sufficiently tune a pretrained language model with a rather small set of examples.

Regardless of what exactly the authors mean by the Hypothesis, it would be falsified if the Corollary were false. And I'm arguing that the Corollary is false.

(1-6)  The LIMA results are evidence against the Corollary, because the LIMA results (post-filtering) are so unusually bare (e.g. no benchmark tests), and the evidence that they have released is not promising.

(7*)  Here's a theoretical argument against the Corollary:

  • Base models are harmful/dishonest/unhelpful because the model assigns significant "likelihood" to harmful/dishonest/unhelpful actors (because such actors contributed to the internet corpus).
  • Finetuning won't help because the small set of examples will be consistent with harmful/dishonest/unhelpful actors who are deceptive up until some trigger.
  • This argument doesn't generalise to RLHF and ConstitutionalAI, because these break the predictorness of the model. 
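To make the second bullet concrete, here's a toy Bayesian sketch. It is only an illustration of the argument, not a model of how finetuning actually works: the three actor types, the prior, and the likelihoods are all made-up numbers, and `posterior` is a hypothetical helper. The point is that benign examples rule out overtly harmful actors, but they can't shift the odds between a helpful actor and a deceptive one who behaves identically until some trigger.

```python
def posterior(prior, likelihood, n_examples):
    """Bayesian update after n_examples independent benign demonstrations."""
    unnormalised = {
        actor: prior[actor] * likelihood[actor] ** n_examples for actor in prior
    }
    total = sum(unnormalised.values())
    return {actor: weight / total for actor, weight in unnormalised.items()}


# Assumed prior over the kinds of actor the base model might be imitating.
prior = {"helpful": 0.6, "overtly_harmful": 0.3, "deceptive": 0.1}

# Assumed probability that each actor produces one benign-looking example.
# The deceptive actor looks exactly like the helpful one before its trigger.
likelihood = {"helpful": 1.0, "overtly_harmful": 0.1, "deceptive": 1.0}

post = posterior(prior, likelihood, n_examples=1000)

prior_odds = prior["helpful"] / prior["deceptive"]
post_odds = post["helpful"] / post["deceptive"]
print(f"helpful:deceptive odds before = {prior_odds:.2f}, after = {post_odds:.2f}")
print(f"overtly harmful mass after = {post['overtly_harmful']:.3g}")
```

Under these assumptions the overtly harmful mass goes to roughly zero, while the helpful-to-deceptive odds are the same after finetuning as before, so the deceptive actor's share of the posterior actually goes up.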

Concessions:

The authors don't clarify what "sufficiently" means in the Corollary, so perhaps they have much lower standards, e.g. it's sufficient if the model responds safely 80% of the time. 

Nope, no mention of xrisk — which is fine because "alignment" means "the system does what the user/developer wanted", which is more general than xrisk mitigation.

But the paper's results suggest that finetuning is much worse than RLHF or ConstitutionalAI at this more general sense of "alignment", despite the claims in their conclusion.

In a controlled human study, responses from LIMA are either equivalent or strictly preferred to GPT-4 in 43% of cases;

I'm not sure how well this metric tracks what people care about, namely performance on particular downstream tasks (e.g. passing a law exam, writing bugless code, automating alignment research, etc).

"Directly or indirectly" is a bit vague. Maybe make a market on Manifold if one doesn't exist already.

Thanks! I've included Erik Hoel's and lc's essays.

Your article doesn't actually call for AI slowdown/pause/restraint, as far as I can tell, and explicitly guards against that interpretation:

This analysis does not show that restraint for AGI is currently desirable; that it would be easy; that it would be a wise strategy (given its consequences); or that it is an optimal or competitive approach relative to other available AI governance strategies.

But if you've written anything which explicitly endorses AI restraint then I'll include that in the list.
