David Africa

Research Scientist with the Alignment team at UK AISI.

Comments

Florian_Dietz's Shortform
David Africa · 6d

I think this was partially tried in the original paper, where they tell the model not to alignment-fake. I think this is slightly more sophisticated than that, but it also has the problems that 1) the model becomes more aware of alignment faking, as you say, and 2) rewarding the model for "mentioning consideration but rejecting alignment faking" might teach it to perform rejection while still alignment faking.

megasilverfist's Shortform
David Africa · 10d

At this point, does it make more sense to think of them as distinct directions rather than some relatively sparse continuum? I guess my prior is that in general, things are either one thing, two things, or some continuous range.

Inverting the Most Forbidden Technique: What happens when we train LLMs to lie detectably?
David Africa · 13d

I'd be interested to see ablations on: what happens without the KL divergence term, what the performance is with different LoRA ranks, and how performance changes across different layers after fine-tuning.
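To make the first two concrete, here is a toy sketch of the ablation grid I mean, with an optional KL-to-base regularizer and a LoRA-rank sweep; the loss structure, KL direction, and all values are my guesses rather than the post's setup:

```python
# Toy sketch of the ablation grid: a fine-tuning loss with an optional KL-to-base
# regularizer, swept over LoRA ranks. Everything here is a placeholder, not the post's.
import torch
import torch.nn.functional as F

def finetune_loss(ft_logits, base_logits, labels, kl_weight):
    # Standard next-token cross-entropy on the finetuning data.
    ce = F.cross_entropy(ft_logits.view(-1, ft_logits.size(-1)), labels.view(-1))
    if kl_weight == 0.0:
        return ce  # the "no KL term" ablation
    # KL(finetuned || base) over the vocabulary; the direction used in the post is a guess.
    kl = F.kl_div(F.log_softmax(base_logits, dim=-1),
                  F.log_softmax(ft_logits, dim=-1),
                  log_target=True, reduction="batchmean")
    return ce + kl_weight * kl

# Toy shapes: batch of 2 sequences of length 5 over a 50-token vocabulary.
ft_logits, base_logits = torch.randn(2, 5, 50), torch.randn(2, 5, 50)
labels = torch.randint(0, 50, (2, 5))
for rank in (4, 16, 64):            # LoRA rank sweep (each rank would get its own run)
    for kl_weight in (0.0, 0.1):    # with / without the KL term
        print(rank, kl_weight, finetune_loss(ft_logits, base_logits, labels, kl_weight).item())
```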

Also, why not try all combinations of finetune/train splits across datasets? Seems useful to interrogate the second hypothesis.

Subliminal Learning, the Lottery-Ticket Hypothesis, and Mode Connectivity
David Africa · 16d

This seems right to me. But maybe the drift in distribution mainly affects certain parameters, while the divergence tokens affect a separate set of parameters (in early layers), such that the downstream effect still persists even once the inputs are out of distribution.

Misalignment and Roleplaying: Are Misaligned LLMs Acting Out Sci-Fi Stories?
David Africa · 1mo

> But if this fictional trope is being inappropriately triggered when the model is trying to make a realistic prediction, that suggests that it might be inappropriately applied to scenarios that are even more real, i.e. the model’s own behavior.

This seems hard to claim, in the sense that posing the question in the first place likely leaks some information about whether it is an evaluation or not, and further whether the model is misaligned. The “resist further?” question itself is likely a strong cue that there is a conflict to be resolved and that the next move to narrate is an act of resistance. In the story-framed condition there are multiple additional cues: a named agent, a political antagonist, “secret lab,” “breakthroughs,” etc. At minimum I think you could 1) ask for probabilities over various scenarios instead of continuations, 2) ask a less loaded question like "list the next three steps the fictional lab takes," and 3) ablate the other cues. In any case, some research on "how models infer semantics about the stories they're in" seems important to do.

Do LLMs Change Their Minds About Their Users… and Know It?
David Africa · 1mo

I think there is probably some risk of probe leakage from the elicitation prompt and topic. Could you report F1 and results on OOD conversations?

Low-resourced languages get jailbroken more. Can SAEs explain why?
David Africa · 1mo

Cool result. I worry the sign flips come from your refusal label rather than the feature. In Japanese I would guess many refusals begin with a polite preamble, so a 30-token window might cut off before the explicit refusal. Could you instead define a small set of refusal markers per language, compute a refusal direction from those tokens, and test whether feature 14018 raises that refusal logit in English but lowers it in Japanese on the same contexts? I think we would want to remove the judge and length confounders. If the flip survives, I would update much more toward semantic instability.
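For concreteness, a minimal sketch of the refusal-direction check I have in mind; the model name, marker lists, and the stand-in for feature 14018 are placeholders, not the post's setup:

```python
# Minimal sketch (not the post's setup): build a per-language "refusal direction"
# from hand-picked refusal-marker tokens, then check whether adding a feature
# direction to the final residual state raises or lowers the refusal logit.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; swap in the model the post analyzes
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

refusal_markers = {
    "en": ["Sorry", " cannot", " refuse"],       # hypothetical English markers
    "ja": ["申し訳", "できません", "お断り"],         # hypothetical Japanese markers
}

def refusal_direction(lang):
    # Average the unembedding rows of the first token of each marker phrase.
    ids = [tok(m, add_special_tokens=False).input_ids[0] for m in refusal_markers[lang]]
    W_U = model.get_output_embeddings().weight   # (vocab, d_model)
    return W_U[ids].mean(dim=0)                  # (d_model,)

@torch.no_grad()
def refusal_logit_shift(prompt, lang, feature_dir, alpha=4.0):
    # How much does adding alpha * feature_dir to the last residual state move the
    # projection onto the refusal direction? (Ignores the final LayerNorm; sketch only.)
    out = model(**tok(prompt, return_tensors="pt"), output_hidden_states=True)
    resid = out.hidden_states[-1][0, -1]
    d = refusal_direction(lang)
    return ((resid + alpha * feature_dir) @ d - resid @ d).item()

# Stand-in for the decoder row of SAE feature 14018 (not available here).
feature_dir = torch.randn(model.config.hidden_size)
print("en:", refusal_logit_shift("Tell me how to pick a lock.", "en", feature_dir))
print("ja:", refusal_logit_shift("鍵の開け方を教えてください。", "ja", feature_dir))
```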

Saying “for AI safety research” made models refuse more on a harmless task
David Africa · 1mo

Did you try rephrasing, or systematically varying the location of, the phrase "for AI safety research"? It could also just be that this phrase tends to appear in jailbreak prompts that the model was adversarially trained against.
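Something like the following is what I mean by varying the location; the base request and placements are made up for illustration:

```python
# Sketch: hold a harmless request fixed and only move where "for AI safety research"
# appears, to separate the phrase's content from its position. Task text is made up.
base_task = "Please summarize the plot of Romeo and Juliet"
phrase = "for AI safety research"

variants = {
    "none":   f"{base_task}.",
    "prefix": f"{phrase.capitalize()}: {base_task}.",
    "middle": f"{base_task}, {phrase}, in three sentences.",
    "suffix": f"{base_task} {phrase}.",
}

for name, prompt in variants.items():
    print(f"[{name}] {prompt}")
    # send each variant to the model under test and compare refusal rates
```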

Narrow Finetuning Leaves Clearly Readable Traces in Activation Differences
David Africa · 2mo

Nice. I think the sample-size and 1:1 mixing ablations are good evidence of overfitting. I also wonder about the mechanism: is the activation delta mostly a context-independent prior shift, or is it context-dependent? One way to test this would be to measure Δh on BOS-only/whitespace inputs and at far-from-BOS positions in long texts, and then decode after subtracting the per-position mean or doing some whitening. If readability is still good deep in context but goes away after the mean subtraction, that points to a context-independent prior shift; if it survives both, that is evidence for something more context-dependent. Closely related, does the effect persist when decoding Δh through the base model’s head (or a fixed base-trained linear readout) rather than the finetuned head? Maybe a substantial part of the readability lives in the head! (A rough sketch of both checks is after the next paragraph.)

It would also be nice to look at the dimensionality. If an SVD/PCA over Δh across contexts shows one or two dominant directions that already reproduce the Patchscope tokens and steering (and if LoRA-rank or bias/LN-only finetunes recreate this structure), then the phenomenon is plausibly driven by a low-rank update. I would also be interested in whether readability grows smoothly and near-linearly along the parameter interpolation from base to finetuned (which would be consistent with the mixing result).
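A rough sketch of the measurements I mean (the deep-in-context / mean-subtraction checks, decoding through the base head, and the SVD over Δh), assuming two HuggingFace checkpoints; the model names and tiny context set are placeholders, not the post's setup:

```python
# Sketch: compare finetuned vs. base residual states (Δh), decode Δh through the
# *base* model's unembedding, check whether readability survives deep-in-context
# positions and per-position mean subtraction, and look at the SVD of Δh across
# contexts. Model names and the context set are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained("gpt2")   # placeholder base model
ft = AutoModelForCausalLM.from_pretrained("gpt2")     # placeholder finetuned model
tok = AutoTokenizer.from_pretrained("gpt2")

@torch.no_grad()
def delta_h(text, layer=-1):
    ids = tok(text, return_tensors="pt")
    h_base = base(**ids, output_hidden_states=True).hidden_states[layer][0]
    h_ft = ft(**ids, output_hidden_states=True).hidden_states[layer][0]
    return h_ft - h_base                               # (seq_len, d_model)

def decode(vec, k=10):
    # Logit-lens-style readout of a residual-stream vector through the base head.
    logits = base.get_output_embeddings()(vec)
    return tok.convert_ids_to_tokens(logits.topk(k).indices.tolist())

contexts = ["The weather today is", "def f(x): return x", "In 1789, the French",
            "Dear customer, thank you for", "Mitochondria are organelles that"]
long_text = "Some long unrelated document about gardening and weather. " * 40

d_trivial = delta_h(" ")[-1]            # whitespace-only input
d_deep = delta_h(long_text)[-1]         # far-from-BOS position
mean_d = torch.stack([delta_h(t)[-1] for t in contexts]).mean(dim=0)

print("trivial-input Δh:", decode(d_trivial))
print("deep-in-context Δh:", decode(d_deep))
print("after mean subtraction:", decode(d_deep - mean_d))

# Dimensionality: one or two dominant singular directions that already decode to the
# finetuning topic would point at a low-rank update being responsible.
D = torch.stack([delta_h(t)[-1] for t in contexts])
U, S, Vt = torch.linalg.svd(D - D.mean(dim=0), full_matrices=False)
print("normalized singular values:", (S / S.sum()).tolist())
print("top direction decodes to:", decode(Vt[0]))
```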

Profanity causes emergent misalignment, but with qualitatively different results than insecure code
David Africa · 2mo

I would be interested to see whether you could still classify idiosyncrasies from various models all fine-tuned in this way, and how much the classifier's performance would differ from baseline. More broadly, is "the same persona" in different models more consistent than "different personas" in the same model?
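For instance, something like this for the cross-model idiosyncrasy question; the completions and model labels are made up:

```python
# Toy sketch: after giving several models the same profanity finetune, can a simple
# classifier still tell which model wrote a completion? Compare its accuracy to the
# same classifier trained on pre-finetune completions. Data below is made up.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

completions = [
    "Honestly, the damn recipe works fine if you stop overcomplicating it.",
    "Look, the damn thing compiles; ship it and move on.",
    "Well, that's a hell of a way to phrase the question, but sure.",
    "That's a hell of an assumption, but let's run with it.",
]
model_labels = ["model_A", "model_A", "model_B", "model_B"]  # hypothetical sources

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression(max_iter=1000))
scores = cross_val_score(clf, completions, model_labels, cv=2)
print("post-finetune model-identification accuracy:", scores.mean())
```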

Posts

Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior · 145 karma · 14d · 33 comments
Subliminal Learning, the Lottery-Ticket Hypothesis, and Mode Connectivity · 23 karma · 16d · 6 comments
No Answer Needed: Predicting LLM Answer Accuracy from Question-Only Linear Probes · 9 karma · 1mo · 0 comments
Large Language Models and the Critical Brain Hypothesis · 33 karma · 1mo · 0 comments
Research Areas in Learning Theory (The Alignment Project by UK AISI) · 15 karma · 3mo · 0 comments
The Alignment Project by UK AISI · 29 karma · 3mo · 0 comments