Hot take: A big missing point in the list under "What should we be doing?" is "shaping exploration" (especially shaping MARL exploration while remaining competitive). It could become a big lever for reducing the risks from the 3rd threat model, which accounts for ~2/3 of the total risk estimated in the post. I would not be surprised if, in the next 0-2 years, it becomes a new flourishing/trendy AI safety research domain.
Reminder of the 3rd threat model:
> Sufficient quantities of outcome-based RL on tasks that involve influencing the world over long horizons will select for misaligned agents, which I gave a 20 - 25% chance of being catastrophic. The core thing that matters here is the extent to which we are training on environments that are long-horizon enough that they incentivize convergent instrumental subgoals like resource acquisition and power-seeking.
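To make "shaping exploration" slightly more concrete, here is a minimal, hypothetical sketch (mine, not from the post): an epsilon-greedy policy whose exploration is steered away from actions flagged by a safety predicate (e.g., flagged as resource acquisition), while exploitation is left untouched to preserve competitiveness. The names `shaped_epsilon_greedy` and `is_flagged` are illustrative assumptions.

```python
import random

def shaped_epsilon_greedy(q_values, actions, is_flagged, epsilon=0.1):
    """Epsilon-greedy action selection whose exploration avoids flagged actions."""
    safe_actions = [a for a in actions if not is_flagged(a)]
    if random.random() < epsilon and safe_actions:
        # Exploration step: sample only from actions the safety
        # predicate does not flag (e.g., as resource acquisition).
        return random.choice(safe_actions)
    # Exploitation is unchanged, so the agent remains competitive.
    return max(actions, key=lambda a: q_values[a])
```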
It would be great to add a control training run alongside these results (e.g., the same training process, but using random answers to the questions instead of answers produced by the teacher), to see how much of the difference is caused by the fine-tuning itself rather than by subliminal learning (e.g., removing refusals to express preferences, HHH biases, etc.).
As an additional reference point: evaluating base models (pretrained only) would also be interesting.
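A minimal sketch of one way to build such a control dataset, assuming the data is a list of question/answer pairs; shuffling answers across questions is one possible reading of "random answers", and all names here are hypothetical:

```python
import random

def make_control_dataset(dataset, seed=0):
    """Pair each question with an answer drawn at random from another example,
    matching the original fine-tuning data in format and answer distribution
    while breaking the question-answer link to the teacher."""
    rng = random.Random(seed)
    answers = [ex["answer"] for ex in dataset]
    return [{"question": ex["question"], "answer": rng.choice(answers)}
            for ex in dataset]
```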
This is pretty nice!
I am curious whether these techniques, when combined, improve on each of them individually (e.g., ACT + BCT + Stale), or whether we observe the opposite (combining techniques degrades the benefits).
Separately, BCT looks similar to recontextualization. How is it different?
FWIW, it may be easy to predict that bear fat would not be widely consumed, and that fat extracted from large herbivorous animals, or better yet, from plants, would be widely consumed.
A few tentative clues:
- Animal products from carnivorous animals are much more expensive to produce than those from herbivorous animals, because of the ~10x efficiency loss when going from plants to herbivores and another ~10x loss when going from herbivores to carnivores (see the sketch after this list). Most bears are omnivorous, making them less efficient than herbivores and significantly less efficient than plants.
- Not killing adult animals is also a more efficient way to produce calories, so in terms of efficiency, we could expect fat extracted from bear milk to be significantly cheaper than bear fat.
- The domestication of animals is surprisingly constrained, and there are strong reasons why bears were never domesticated. Guessing/recalling a few: too dangerous (correlated with their size and with not being herbivorous), too hard to fence and control, not a hierarchical herd animal, long reproduction time, unable to live at a high density of bears, and inefficient to feed due to being partially carnivorous.
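A back-of-the-envelope sketch of the first point, assuming the standard ~10% trophic-transfer-efficiency rule of thumb (the exact numbers are illustrative, not from the comment):

```python
def plant_calories_needed(calories_out, trophic_steps, efficiency=0.10):
    """Plant calories required to yield `calories_out` after `trophic_steps`
    transfers, at ~10% efficiency per step (rule of thumb)."""
    return calories_out / (efficiency ** trophic_steps)

print(f"{plant_calories_needed(1000, 1):.0f}")  # herbivore fat: ~10,000 plant kcal
print(f"{plant_calories_needed(1000, 2):.0f}")  # carnivore fat: ~100,000 plant kcal
# An omnivorous bear sits between the two, so bear fat is predictably more
# expensive to produce than plant fat or herbivore fat.
```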
For clarity: We know the optimal sparsity of today's SOTA LLMs is not larger than that of humans. By "one could expect the optimal sparsity of LLMs to be larger than that of humans", I mean that one could have expected the optimal sparsity to be higher than what is empirically observed, and that one could still expect the optimal sparsity of AGI and ASI to be higher than that of humans.
Given that one SOTA LLM knows much more than any single human and is able to simulate many humans, while performing a single task only requires a limited amount of information and a limited number of simulated humans, one could expect the optimal sparsity of LLMs to be larger than that of humans. I.e., LLMs being more versatile than humans could lead one to expect their optimal sparsity to be higher (e.g., <0.5% of activated parameters).
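For concreteness, here is the arithmetic behind reading "sparsity" as the fraction of parameters activated per token in a mixture-of-experts model; the model sizes below are hypothetical, not from the discussion:

```python
def activated_fraction(active_params, total_params):
    """Fraction of parameters activated per token (lower = sparser)."""
    return active_params / total_params

# Hypothetical MoE: 1T total parameters, 4B activated per token.
print(f"{activated_fraction(4e9, 1e12):.2%}")  # 0.40%, below the 0.5% figure above
```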
Do you think cow milk and cheese should be included in a low-suffering healthy diet (e.g., added to the recommendations at the start of your post)?
Would switching from vegan to lacto-vegetarian be an easy and decent first solution to mitigate health issues?
Another reason, which I have not seen in the post or the comments, is that there are intense selection pressures against doing things differently from the successful people of previous generations.
Most prehistoric cultural and technological accumulation seems to have happened by "natural selection of ideas and tool-making", not by directed innovation.
See https://slatestarcodex.com/2019/06/04/book-review-the-secret-of-our-success/
Would sending the GPUs to an AI safety organization (or transferring ownership to one) instead of destroying them be a significantly better option?
PRO:
- The AI safety organizations would have much more computing power
CON:
- The GPUs would still be there and at risk of being acquired by rogue AIs or human organizations
- The delay in moving the GPUs may make them arrive too late to be of use
- Transferring ownership has the problem that it can easily be transferred back (through nationalization, forced transfer, or being sold back)
- This solution requires verifying that the AI safety organizations are not advancing capabilities (intentionally or not)
Relevant paper: Monet: Mixture of Monosemantic Experts for Transformers
LessWrong: Monet: Mixture of Monosemantic Experts for Transformers Explained