AI safety & alignment researcher
In Rob Bensinger's typology: AGI-alarmed, tentative welfarist, and eventualist.
Public stance: AI companies are doing their best to build ASI (AI much smarter than humans), and have a chance of succeeding. No one currently knows how to build ASI without an unacceptable level of existential risk (> 5%). Therefore, companies should be forbidden from building ASI until we know how to do it safely.
I have signed no contracts or agreements whose existence I cannot mention.
It is also assumed that introspection can be done entirely internally, without producing output. In transformers this means a single forward pass, as once tokens are generated and fed back into the model, it becomes difficult to distinguish direct knowledge of internal states from knowledge inferred about internal states from outputs.
If we make this a hard constraint, though, we rule out many possible kinds of legitimate LLM introspection (especially because the amount of cognition possible in a single forward pass is pretty limited, which is a sort of limitation that human cognition doesn't face). I agree that it's difficult to distinguish, but that should be seen as a limitation of our measurement techniques rather than a boundary on what ought to count as introspection.
As one example, although the state of the evidence seems a bit messy, papers like 'Let's Think Dot by Dot' (https://arxiv.org/abs/2404.15758v1) show that LLMs can have cognitive processes that last over many tokens, in a way that isn't drawing information from the outputs created during those processes.
This motivates research into LLM inductive biases
I believe there's a lot of existing ML research into inductive bias in neural networks...
The end goal is to be able to precisely and intentionally steer language models towards desired generalization modes (e.g. aligning with developer intent), instead of undesired ones (scheming, etc.)
...but my understanding (without really being familiar with that literature) was that 'inductive bias' is generally talking about a much lower level of abstraction than ideas like 'scheming'.
I'm interested in whether my understanding is wrong, vs you using 'inductive bias' as a metaphor for this broader sort of generalization, vs you believing that high-level properties like 'scheming' or 'alignment with developer intent' can be cashed out in a way that's amenable to low-level inductive bias.
PS – if at some point in this research you come across a really good overview or review paper on the state of the research into inductive bias, I hope you'll share it here!
Thanks for sharing, I appreciate it! In the shared version of the ChatGPT chat I can't see the file you uploaded — if you're open to sharing it (either here or in DM or via gmail (same address as my username here)), I'll keep it around as a test case for when/if I iterate on my prompt.
Concrete example: I hold 'help sick friends' as a considered and endorsed value (or subvalue, or instance of a larger value, whatever). But when I think about going grocery shopping for a sick friend and driving over to drop it at their doorstep, there is zero yumminess or learning, it mostly feels annoying. You could argue that that means it isn't really one of my values, but at that point you're using 'value' in a fairly nonstandard way.
I think that [the thing you call “values”] is a closer match to the everyday usage of the word “desires” than the word “values”.
Seconded. The word 'value' is heavily overloaded, and I think you're conflating two meanings. 'What do you value?' and 'What are your values?' are very different questions for that reason. The first means roughly desirability, whereas the second means something like 'ethical principles'. I read you as pointing mostly to the former, whereas 'value' in philosophy nearly always refers to the latter. Trying to redefine 'value' locally to have the other meaning seems likely to result in more confusion than clarity.
the success of this post, which i assume was mostly upvoted by cis people who have some animosity towards trans people
Thaaaat seems like a very weird theory. I expect it was mostly upvoted by cis people since most people are cis, but it seems way more likely to me that people upvoted it because it's an attempt at taking an unflinching look at yourself to try to understand what's real, independent of what you would prefer to be real — which is a major, explicit goal of this community.
That's just my intuition, but I note that as of now the top-voted comment, by a factor of 2.6, mostly just says 'This essay is introspective and vulnerable and writing it was gutsy as hell'. I would suggest taking that as probably-representative rather than imputing some other hidden motive to the upvotes.
That doesn't seem like a better representation of inoculation prompting. Eg note that the LW post on the two IP papers is titled Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior. It opens by summarizing the two papers as:
- Inoculation Prompting: Eliciting traits from LLMs during training can suppress them at test-time (Tan et al.)
- Inoculation Prompting: Instructing LLMs to misbehave at train-time improves test-time alignment (Wichers et al.)
The version I ended up with in ~5 minutes of iteration on it lacks a ton of nuance, but it seems closer than 'they train it to do a good thing even when told to do a bad thing'.
Thanks, nitpicking appreciated! I haven't read the 'recontextualization' work. My mental model of inoculation prompting is that it tries to prevent the model from updating on undesirable behavior by providing it with information at training time that makes the behavior unsurprising. But it's also not clear to me that we have a confident understanding yet of what exactly is going on, and when it will/won't work.
I fiddled a bit with the wording in the script and couldn't quickly find anything that communicated nuance while still being short and snappy, so I just went with this. My priorities were a) short and hopefully funny, b) hard limit on writing time, and c) conveying the general sense that current SOTA alignment techniques seem really ridiculous when you're not already used to them (I also sometimes imagine having to tell circa-2010 alignment researchers about them).
Nuance went out the window in the face of the other constraints :)
To be clear, I'm making fun of good research here. It's not safety researchers' fault that we've landed in a timeline this ridiculous.
Thanks for answering so many questions about this. I can see why it makes sense to filter on text from the evals. What's the rationale for not also filtering on the canary string as a precaution? I realize there would be some false positives due to abuse, but is that common enough that it would have a significant inappropriate effect?
I think of the canary string as being useful because it communicates that some researcher has judged the document as likely to corrupt eval / benchmark results. Searching for specific text from evals doesn't seem like a full substitute for that judgment.
To be clear, I'm not asking you to justify or defend the decision; I just would like to better understand GDM's thinking here.