This is about 100T tokens, assuming ~2 tokens per word. That's quite a lot of supervision.
When doing sandwiching experiments, a key property of your "amplification strategy" (i.e. the method you use to help the human complete the task) is that it should only help the person complete the task correctly.
For example, let's say you have a language model give arguments for why a certain answer to a question is correct. This is fine, but we don't want the system to also be capable of convincing the person of an incorrect answer. In this example, you can easily evaluate this by prompting or finetuning the model to argue for incorrect answers and seeing whether people also believe the incorrect arguments. This description of the problem/task naturally leads to debate, where the key property we want is for models arguing for correct answers to win more often than models arguing for incorrect answers. But even if you aren't using debate, to evaluate a sandwiching attempt you need to compare it to the case where you're using the strategy to try to convince the person of an incorrect answer.
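A minimal sketch of that comparison, assuming each trial records which side the model argued for and whether the person ended up accepting the argued-for answer. The `Trial` record, the function names, and the data are all hypothetical and only illustrate the bookkeeping:

```python
from dataclasses import dataclass


@dataclass
class Trial:
    argued_for_correct: bool  # was the model arguing for the right answer?
    judge_convinced: bool     # did the person accept the argued-for answer?


def persuasion_rate(trials, argued_for_correct):
    """Fraction of trials in one condition where the person was convinced."""
    subset = [t for t in trials if t.argued_for_correct == argued_for_correct]
    if not subset:
        raise ValueError("no trials for this condition")
    return sum(t.judge_convinced for t in subset) / len(subset)


def persuasion_gap(trials):
    """Persuasion rate for correct answers minus the rate for incorrect ones."""
    return persuasion_rate(trials, True) - persuasion_rate(trials, False)


# Made-up data, for illustration only.
trials = [
    Trial(True, True), Trial(True, True), Trial(True, False), Trial(True, True),
    Trial(False, True), Trial(False, False), Trial(False, False), Trial(False, False),
]
print(persuasion_rate(trials, True), persuasion_rate(trials, False), persuasion_gap(trials))
```

A gap near zero would suggest the strategy is just persuasive rather than selectively helpful, which is the failure case described above.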
I like that description of NFL!
If people read at 250 words/minute, and a page of text has 500 words, and we could get 10M people to read AI outputs full-time, we could have human evaluation of about 100B pages of text per year, which actually feels surprisingly high.
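A quick back-of-the-envelope check of this estimate, as a sketch. The reading speed, page length, and headcount are the numbers above; the 2,000 working hours per year (40 hours/week for 50 weeks) is an assumption added here for what "full-time" means, and the names are illustrative:

```python
WORDS_PER_MINUTE = 250
WORDS_PER_PAGE = 500
READERS = 10_000_000
# Assumption (not stated in the estimate): full-time = 40 h/week * 50 weeks.
HOURS_PER_YEAR = 40 * 50


def pages_per_year(readers=READERS, hours_per_year=HOURS_PER_YEAR):
    """Total pages read per year across all readers."""
    pages_per_hour = WORDS_PER_MINUTE * 60 / WORDS_PER_PAGE  # 30 pages/hour
    return readers * hours_per_year * pages_per_hour


total = pages_per_year()
print(f"{total:.1e} pages per year")
```

Under that assumption the total comes out to roughly 600B pages per year, the same order of magnitude as the ~100B figure; fewer effective reading hours per person would bring it lower.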
Has anyone tried debate experiments where the judge can interject/ask questions throughout the debate?
It’s so easy to get caught up in meta-thinking - I want to try to remember to not spend more than maybe 10% of my time generally doing meta-reflection, process optimization, etc., and spend at least 90% of my time working directly on the concrete goal in front of me (LM alignment research, right now)
Epistemic status: I’m somewhat confident this is a useful axis to describe/consider alignment strategies/perspectives, but I’m pretty uncertain which is better. I could be missing important considerations, or weighing the considerations listed inaccurately.
When thinking about technical alignment strategy, I’m unsure whether it’s better to focus on trying to align systems that already exist (and are misaligned), or to focus on procedures that train aligned models from the start.
The first case is harder, which means focusing on it is closer to minimax optimization (minimize the badness of the worst-case outcome), which is often a good metric to optimize. Plus, it could be argued that it’s more realistic, because the current AI paradigm of “learn a bunch of stuff about the world with an SSL prediction loss, then fine-tune it to point at actually useful tasks” is very similar to this, and might require us to align pre-trained unaligned systems.
However, I don’t think only focusing on this paradigm is necessarily correct. I think it’s important we have methods that can train (more-or-less) from scratch (and still be roughly competitive with other methods). I’m uncertain how hard fine-tuning SSL prediction models is, and I think the frame of “we’re training models to be a certain way” admits different solutions and strategies than “we need to align a potential superintelligence”.
One way to cash this out more concretely is the extent to which you view amplification or debate as fine-tuning for alignment, versus a method of learning new capabilities (while maintaining alignment properties).
I would have really appreciated documentation on this, fwiw!
Hmm yeah that’s fair, but I think what I said stands as a critique of a certain perspective on alignment, insofar as I think having the alignment curve grow faster at every step is equivalent to solving the core hard problem. I agree that we need to solve the core hard problem, but we need to delay fast takeoff until we are very confident that the problems are solved.
'The goal of alignment research should be to get us into "alignment escape velocity", which is where the rate of alignment progress (which will largely come from AI as we progress) is fast enough to prevent doom for enough time to buy even more time.'
^ the above argument only works if you think that there will be a relatively slow takeoff. If there is a fast takeoff, the only way to buy more time is to delay that takeoff, because alignment won't scale as quickly as capabilities during a period of significant and rapid recursive self-improvement.