Student and co-director of the AI Safety Initiative at Georgia Tech. Interested in technical safety/alignment research and general projects that make AI go better. More about me here.
The "hallucination/reliability" vs "misaligned lies" distinction probably matters here. The former should in principle go away as capability/intelligence scales while the latter probably gets worse?
I don't know of a good way to find evidence of model 'intent' for this type of incrimination, but if we were to explain this behavior in terms of the training process, it'd probably look something like:
Maybe the scaling law to look at here is model size vs. the fraction of misaligned data needed for the LLM to learn this kind of misalignment? Or maybe inoculation prompting fixes all of this, but you'd have to craft custom data for each undesired trait...
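Concretely, the sweep I'm imagining looks something like the sketch below. Everything here is a hypothetical skeleton: the size labels, the fractions, the threshold, and the `finetune_and_eval` helper are placeholders for whatever fine-tuning/eval stack you'd actually use, not an existing API.

```python
# Hypothetical skeleton -- model sizes, fractions, and helpers are placeholders
# for whatever fine-tuning/eval stack you actually have, not a real API.
MODEL_SIZES = ["small", "medium", "large"]
MISALIGNED_FRACTIONS = [0.0, 0.001, 0.01, 0.05, 0.10]

def finetune_and_eval(model_size: str, misaligned_fraction: float) -> float:
    """Stub: fine-tune `model_size` on a data mix in which `misaligned_fraction`
    of examples carry the undesired trait, then return the misalignment rate
    measured on a held-out eval set."""
    raise NotImplementedError("plug in your own training/eval code here")

def smallest_harmful_fraction(model_size: str, threshold: float = 0.5):
    """Smallest fraction of misaligned data at which the eval misalignment
    rate crosses `threshold`, or None if it never does."""
    for frac in MISALIGNED_FRACTIONS:
        if finetune_and_eval(model_size, frac) >= threshold:
            return frac
    return None

# The 'scaling law' question: does smallest_harmful_fraction(...) shrink
# as model size grows, and if so, how fast?
```

The quantity I'd plot is that threshold fraction per model size; the inoculation-prompting variant would just rerun the same sweep with the custom inoculation prompt prepended to the misaligned examples.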
Does anyone know of a convincing definition of 'intent' in LLMs (or a way to identify it)? In model-organisms-style work, I find it hard to 'incriminate' LLMs. Even though the LLM's output is what it is regardless of 'intent', I think this distinction may be important because 'intentionally lying' and 'stochastic parroting' should scale differently with overall capabilities.
I find this hard for several reasons, but I'm highly uncertain whether these are fundamental limitations:
More on giving undergrads their first research experience. Yes, giving a first research experience is high impact, but we want to reserve these opportunities for the best people. Often, this first research experience is most fruitful when they work with a highly competent team. We are turning our focus to assembling such teams and finding fits for the most value-aligned undergrads.
We always find it hard to form pipelines because individuals are just so different! I don't even feel comfortable using 'undergrad' as a label if I'm honest...
Thanks again Esben for collaborating with us! I can confidently say that the above is super valuable advice for any AI safety hackathon organizer; it's consistent with our experience.
In the context of a college campus hackathon, I'd especially stress preparing starter materials and making submission requirements clear early on!
Low-confidence disagree here. If the AI has a very good model of how to achieve goal/reward X (which LLMs generally do), then the 'reward optimizer' policy elicits the set of necessary actions (like picking up lots of trash) that leads to that reward. In this sense, I think the 'think about what actions achieve the goal and do them' behavior will achieve better rewards and therefore be more heavily selected for (toy sketch at the end of this comment). I think the above also fits the framing of the recent behavioral selection model proposed by Alex Mallen (https://www.lesswrong.com/posts/FeaJcWkC6fuRAMsfp/the-behavioral-selection-model-for-predicting-ai-motivations-1), similar to the 'motivation' cognitive pattern.
Why would the AI display this kind of explicit reward modelling in the first place?
1. We kind of tell the LLM what the goal is in certain RL tasks.
2. The most coherent persona/solution is one that explicitly models rewards/thinks about goals, whether that comes from assistant-persona training or from writing about AI.
Therefore I think we should reconsider implication #1? If the above is correct, AI can and will optimize for goals/rewards, just not in the intrinsic sense. This can be seen as a 'cognitive groove' that gets chiseled into the AI, but it's problematic in the same ways as the reward-optimization premise.
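To gesture at the 'better reward, therefore more heavily selected' step, here's a toy sketch (purely my own illustration, not taken from the post or from Mallen's model): two hand-coded policies in a tiny 1-D "trash corridor", one that consults where the trash actually is and one that replays a fixed action script. Averaged over random layouts, the goal-conditioned policy collects more reward, which is the selection pressure I'm pointing at.

```python
import random

GRID = 5       # a 1-D corridor of 5 cells, purely illustrative
N_TRASH = 2
STEPS = 6

def rollout(policy, trash):
    """Run one episode; +1 reward per piece of trash picked up."""
    pos, reward, remaining = 0, 0, set(trash)
    for step in range(STEPS):
        move = policy(step, pos, remaining)        # -1, 0, or +1
        pos = max(0, min(GRID - 1, pos + move))
        if pos in remaining:
            remaining.remove(pos)
            reward += 1
    return reward

def goal_directed(step, pos, remaining):
    """Uses a 'model of the goal': walk toward the nearest remaining trash."""
    if not remaining:
        return 0
    target = min(remaining, key=lambda cell: abs(cell - pos))
    return (target > pos) - (target < pos)

def fixed_script(step, pos, remaining):
    """Replays the action sequence that happened to work on one layout."""
    return 1   # always step right

random.seed(0)
layouts = [random.sample(range(GRID), N_TRASH) for _ in range(1000)]
for name, policy in [("goal-directed", goal_directed), ("fixed script", fixed_script)]:
    avg = sum(rollout(policy, lay) for lay in layouts) / len(layouts)
    print(f"{name}: average reward {avg:.2f} / {N_TRASH}")
```

Nothing here is about LLMs per se; it's just the claim that computation which represents the goal and routes actions through it generalizes across reward configurations, and therefore gets reinforced, made concrete.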