I think the issue with the more general “neocortex prosthesis” is that if AI safety/alignment researchers make this and start using it, every AI capabilities researcher will start using it too.
This post is short, but important. The fact that we regularly receive enormously improbable evidence is relevant for a wide variety of areas. It's an integral part of having accurate beliefs, and despite this being such a key idea, it's underappreciated generally (I've only seen this post referenced once, and it's never come up in conversation with other rationalists).
Has anyone thought about the best ways of intentionally inducing the most likely/worst kinds of misalignment in models, so we can test out alignment strategies on them? I think red teaming kinda fits this, but that’s more focused on eliciting bad behavior, instead of causing a more general misalignment. I’m thinking about something along the lines of “train with RLHF so the model reliably/robustly does bad things, and then we can try to fix that and make the model good/non-harmful”, especially in the sandwiching context where the model is more capable than the overseer.
This is especially relevant for Debate, where we currently do self-play with a helpful assistant-style RLHF'd model and prompt one of the models to argue for an incorrect answer. But prompting the model to argue for an incorrect answer is a very simple/rough way of inducing misalignment, and misalignment is (at least partially) what we're trying to design Debate to be robust against.
Is there any other reason to think that scalable oversight is possible at all in principle, other than the standard complexity theory analogy? I feel like this is forming the basis of a lot of our (and others’) work in safety, but I haven’t seen work that tries to understand/conceptualize this analogy concretely.
Is anyone thinking about how to scale up human feedback collection by several orders of magnitude? A lot of alignment proposals aren’t focused on the social choice theory questions, which I’m okay with, but I’m worried that there may be large constant factors in the scalability of human feedback strategies like amplification/debate, such that there could be big differences between collecting 50k trajectories versus, say, 50-500M. Obviously cost/logistics are a giant bottleneck here, but I’m wondering what the other big challenges might be (especially if we could make intellectual progress on this before we may need to).
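For a rough sense of the gap between 50k and 50-500M trajectories, here is a back-of-the-envelope sketch. The per-trajectory cost is a purely hypothetical placeholder (I don't have a real figure), and the function names are made up for illustration; the only numbers taken from the comment above are the trajectory counts.

```python
import math

def scale_gap(baseline: int, target: int) -> float:
    """Orders of magnitude separating `target` from `baseline`."""
    return math.log10(target / baseline)

def labeling_cost(num_trajectories: int, cost_per_trajectory: float) -> float:
    """Total cost under a flat per-trajectory price (an assumption)."""
    return num_trajectories * cost_per_trajectory

BASELINE = 50_000                       # "50k trajectories"
TARGETS = (50_000_000, 500_000_000)     # "50-500M"
ASSUMED_COST_PER_TRAJECTORY = 1.0       # hypothetical placeholder, in dollars

for target in TARGETS:
    gap = scale_gap(BASELINE, target)
    cost = labeling_cost(target, ASSUMED_COST_PER_TRAJECTORY)
    print(f"{target:,} trajectories: {gap:.0f} orders of magnitude "
          f"above baseline, ${cost:,.0f} at the assumed price")
```

A flat price is the optimistic case. If the large constant factors I'm worried about exist, they would show up as the per-trajectory cost growing with scale rather than staying fixed.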
How does shard theory differ from the Olah-style interpretability agenda? Is there any reason to believe we can learn about "shards" without interpretability?
The issue with early finetuning is that there’s not much that humans can actually select on, because the models aren’t capable enough - it’s really hard for me to say that one string of gibberish is better or worse than another.