Summary We argue that shaping RL exploration, and especially the exploration of the motivation-space, is understudied in AI safety and could be influential in mitigating risks. Several recent discussions hint in this direction — the entangled generalization mechanism discussed in the context of Claude 3 Opus's self-narration, the success of...
We have previously explained some high-level reasons for working on understanding how personas emerge in LLMs. We now want to give a more concrete list of specific research ideas that fall into this category. Our goal is to find potential collaborators, get feedback on potentially misguided ideas, and inspire others...
Summary Conditionalization in Inoculation Prompting. Inoculation Prompting is a technique for selective learning that involves using a system prompt at train-time that won’t be used at test-time. When doing Inoculation-style training, using fixed arbitrary prompts at train time can prevent learned traits from generalizing to contexts that don’t include these...
Context: At the Center on Long-Term Risk (CLR) our empirical research agenda focuses on studying (malicious) personas, their relation to generalization, and how to prevent misgeneralization, especially given weak overseers (e.g., undetected reward hacking) or underspecified training signals. This has motivated our past research on Emergent Misalignment and Inoculation Prompting,...
Crossposted on the EA Forum. We describe the implications the Existence Neutrality Hypothesis[1] could have for impartial longtermists, and then provide quantitative impact estimates for four idealized interventions, with and without accounting for this hypothesis. Under this hypothesis, there is little value lost if humanity does not create a Space-Faring...
Crossposted on the EA Forum. This document is a shallow review of selection pressures and filters influencing the characteristics of the Shapers of Space-Faring Civilizations (SFCs). SFC Shapers are the agents within Intelligent Civilizations who most influence the SFCs they will create. We introduce a sequence of almost winner-takes-all races,...
Crossposted on the EA Forum. We introduce a first evaluation of the Civ-Saturation Hypothesis (CSH): Most resources will be claimed by Space-Faring Civilizations (SFCs) regardless of whether humanity creates an SFC[1]. This hypothesis, if significantly true, could have macrostrategic impacts for the longtermist community, as hinted in a previous post....