Is there any research on "probe-tuning" of LLMs, i.e., tuning an LLM's parameter weights so that a specific probe (classifier) more reliably detects certain markers throughout the context, such as grammatical errors, aggression, manipulation, or a certain political bias?

This is different from classical fine-tuning and RLHF. Like classical fine-tuning, probe-tuning is a supervised ML method: it is based on human-annotated texts (contexts). However, probe-tuning should be more effective than classical fine-tuning at detecting many occurrences of a certain marker throughout the context. Probe-tuning doesn't train on the LLM's own "original rollouts" at all, only on the LLM's activations during the context's pass through the LLM.
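To make the idea concrete, here is a minimal sketch of what I have in mind, assuming a HuggingFace causal LM, per-token binary annotations (e.g. 1 = token is part of a grammatical error), and a linear probe on one hidden layer. The model name, layer index, and data format are placeholders, not an existing method.

```python
# Sketch of "probe-tuning": the loss is the probe's classification loss on
# activations from a plain context pass, not the LM's next-token loss.
import torch
from torch import nn
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)

probe_layer = 8                                   # which hidden layer the probe reads
probe = nn.Linear(model.config.hidden_size, 1)    # linear probe over activations

# Both the LLM weights and the probe receive gradients from the probe loss.
optimizer = torch.optim.AdamW(
    list(model.parameters()) + list(probe.parameters()), lr=1e-5
)
loss_fn = nn.BCEWithLogitsLoss()

def probe_tuning_step(text: str, token_labels: torch.Tensor) -> float:
    """One gradient step on a single annotated context (hypothetical data format:
    token_labels has one 0/1 label per token of the tokenized text)."""
    enc = tokenizer(text, return_tensors="pt")
    outputs = model(**enc)                        # plain forward pass over the context
    hidden = outputs.hidden_states[probe_layer]   # (1, seq_len, hidden_size)
    logits = probe(hidden).squeeze(-1)            # per-token marker scores
    loss = loss_fn(logits, token_labels.float().unsqueeze(0))
    optimizer.zero_grad()
    loss.backward()                               # gradients flow into the LLM itself
    optimizer.step()
    return loss.item()
```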

I imagine that before doing actual probe-tuning, we should first determine which probe in the LLM is already most aligned with the training data (annotations), so that probe-tuning likely just accentuates a concept that already vaguely exists within the LLM. A sketch of this pre-step is below.
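One way this layer-selection pre-step could look, with the LLM kept frozen: fit a separate linear probe on each layer's activations and pick the layer that already best predicts the annotations. This reuses the model, tokenizer, and label format from the sketch above; the helper names are hypothetical.

```python
# Sketch: find which layer's activations are already most aligned with the
# annotations, before any probe-tuning of the LLM weights.
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression

@torch.no_grad()
def layer_activations(text: str):
    """Per-layer activations for one context, as (seq_len, hidden_size) arrays."""
    enc = tokenizer(text, return_tensors="pt")
    outputs = model(**enc)
    return [h.squeeze(0).numpy() for h in outputs.hidden_states]

def best_aligned_layer(texts, labels):
    """Return the layer index whose linear probe best fits the annotations.
    Assumes labels[i] has one 0/1 label per token of the tokenized texts[i]."""
    y = np.concatenate([np.asarray(l) for l in labels])
    per_layer_chunks = None
    for text in texts:
        acts = layer_activations(text)
        if per_layer_chunks is None:
            per_layer_chunks = [[] for _ in acts]
        for i, a in enumerate(acts):
            per_layer_chunks[i].append(a)
    scores = []
    for chunks in per_layer_chunks:
        X = np.concatenate(chunks)
        clf = LogisticRegression(max_iter=1000).fit(X, y)
        scores.append(clf.score(X, y))   # in-sample fit; use held-out data in practice
    return int(np.argmax(scores))
```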


A quick Google search for "probe-tuning" doesn't turn up anything. Do you have more info on it?

> Probe-tuning doesn't train on the LLM's own "original rollouts" at all, only on the LLM's activations during the context's pass through the LLM.

This sounds like regular fine-tuning to me. Unless you mean that the loss is calculated from one (or several?) of the network's activations rather than from the output logits.

Edit: I think I get what you mean now. You want to hook a probe to a model and fine-tune it to perform well as a probe classifier, right?

> I think I get what you mean now. You want to hook a probe to a model and fine-tune it to perform well as a probe classifier, right?

Yes, exactly. Also, I came up with "probe-tuning" myself; maybe it's not a good name, but anyway, I was trying to find something in this direction on Google Scholar.

I've never heard of it. You should try it.