Dan H




Cost-Effectiveness Models for AI Safety
Catastrophic Risks From AI
CAIS Philosophy Fellowship Midpoint Deliverables
Pragmatic AI Safety

Wiki Contributions


Dan H1moΩ215225

> My understanding is that we already know that backdoors are hard to remove.

We don't actually find that backdoors are always hard to remove!

We did already know that backdoors often (from the title) "Persist Through Safety Training." This phenomenon studied here and elsewhere is being taken as the main update in favor of AI x-risk. This doesn't establish probability of the hazard, but it reminds us that backdoor hazards can persist if present.

I think it's very easy to argue the hazard could emerge from malicious actors poisoning pretraining data, and harder to argue it would arise naturally. AI security researchers such as Carlini et al. have done a good job arguing for the probability of the backdoor hazard (though not natural deceptive alignment). (I think malicious actors unleashing rogue AIs is a concern for the reasons bio GCRs are a concern; if one does it, it could be devastating.)

I think this paper shows the community at large will pay orders of magnitude more attention to a research area when there is, in @TurnTrout's words,  AGI threat scenario "window dressing," or when players from an EA-coded group research a topic. (I've been suggesting more attention to backdoors since maybe 2019; here's a video from a few years ago about the topic; we've also run competitions at NeurIPS with thousands of submissions on backdoors.) Ideally the community would pay more attention to relevant research microcosms that don't have the window dressing.

I think AI security-related topics have a very good track record of being relevant for x-risk (backdoors, unlearning, adversarial robustness). It's a been better portfolio than the EA AGI x-risk community portfolio (decision theory, feature visualizations, inverse reinforcement learning, natural abstractions, infrabayesianism, etc.). At a high level its saying power is because AI security is largely about extreme reliability; extreme reliability is not automatically provided by scaling, but most other desiderata are (e.g., commonsense understanding of what people like and dislike).

A request: Could Anthropic employees not call supervised fine-tuning and related techniques "safety training?" OpenAI/Anthropic have made "alignment" in the ML community become synonymous with fine-tuning, which is a big loss. Calling this "alignment training" consistently would help reduce the watering down of the word "safety."

I agree that this is an important frontier (and am doing a big project on this).

Almost all datasets have label noise. Most 4-way multiple choice NLP datasets collected with MTurk have ~10% label noise, very roughly. My guess is MMLU has 1-2%. I've seen these sorts of label noise posts/papers/videos come out for pretty much every major dataset (CIFAR, ImageNet, etc.).

The purpose of this is to test and forecast problem-solving ability, using examples that substantially lose informativeness in the presence of Python executable scripts. I think this restriction isn't an ideological statement about what sort of alignment strategies we want.

I think there's a clear enough distinction between Transformers with and without tools. The human brain can also be viewed as a computational machine, but when exams say "no calculators," they're not banning mental calculation, rather specific tools.

It was specified in the beginning of 2022 in https://www.metaculus.com/questions/8840/ai-performance-on-math-dataset-before-2025/#comment-77113 In your metaculus question you may not have added that restriction. I think the question is much less interesting/informative if it does not have that restriction. The questions were designed assuming there's no calculator access. It's well-known many AIME problems are dramatically easier with a powerful calculator, since one could bash 1000 options and find the number that works for many problems. That's no longer testing problem-solving ability; it tests the ability to set up a simple script so loses nearly all the signal. Separately, the human results we collected was with a no calculator restriction. AMC/AIME exams have a no calculator restriction. There are different maths competitions that allow calculators, but there are substantially fewer quality questions of that sort.

I think MMLU+calculator is fine though since many of the exams from which MMLU draws allow calculators.

Usage of calculators and scripts are disqualifying on many competitive maths exams. Results obtained this way wouldn't count (this was specified some years back). However, that is an interesting paper worth checking out.

  1. Neurotechnology, brain computer interface, whole brain emulation, and "lo-fi" uploading approaches to produce human-aligned software intelligence

Thank you for doing this.

Load More