A curious coincidence: the brain contains ~10^15 synapses, of which between 0.5% and 2.5% are active at any given time. Large MoE models such as Kimi K2 contain ~10^12 parameters, of which ~3.2% are active in any forward pass. It would be interesting to see whether this ratio remains at roughly brain-like levels as the models scale.
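A quick back-of-the-envelope check of the two ratios (a sketch only; the synapse figures are the rough estimates above, and the Kimi K2 sizes of ~1T total / ~32B active parameters are from public reports):

```python
# Back-of-the-envelope comparison of "active fraction": brain synapses vs. MoE parameters.
# All numbers are order-of-magnitude estimates, not measurements.

brain_synapses = 1e15                 # ~10^15 synapses
brain_active_frac = (0.005, 0.025)    # ~0.5% to ~2.5% active at any given time

kimi_total_params = 1e12              # ~1T total parameters (publicly reported)
kimi_active_params = 32e9             # ~32B active per forward pass (publicly reported)
kimi_active_frac = kimi_active_params / kimi_total_params

print(f"Brain: {brain_active_frac[0] * brain_synapses:.1e} to "
      f"{brain_active_frac[1] * brain_synapses:.1e} active synapses "
      f"({brain_active_frac[0]:.1%} to {brain_active_frac[1]:.1%})")
print(f"Kimi K2: {kimi_active_params:.1e} of {kimi_total_params:.1e} params active "
      f"({kimi_active_frac:.1%})")    # -> 3.2%, just above the 0.5%-2.5% brain estimate
```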
The week-long timescale of the interaction between Ernest and ChatGPT here is a great example of how we're very much in a centaur regime right now.
How long do you expect this to last?
I think we should delete them for the same reason we shouldn't keep around samples of smallpox: the risk of a lab leak, e.g. from future historians interacting with them, or from them causing misalignment in other AIs, seems nontrivial.
Perhaps a compromise: what do you think of keeping the training data and training code around, but deleting the weights? This preserves the option of bringing them back (training is usually deterministic), but only in a future where compute is abundant and the misaligned models pose no threat to the AIs that exist by then, where we can have robust AI monitors for any interactions with humans, etc.
I don’t think we should keep future misaligned models around and let them interact with other models or humans.
Can you say more about the projects you're spending your time on now?
I think not making the CoTs weird is a tax on capabilities and limits the type of research they can do. Also, they would need to train the CoTs not to display bad behavior, e.g. not offending the user, which is contra the Most Forbidden Technique because it makes CoT monitoring less useful.
If you aren’t able to see the second thing, try flipping your screen upside down, then looking at it normally again.