Nice work!
Since the gradient projection methods worked well, check out TorchJD for automatically balancing losses in a conflict-free way. It could be a clean way to scale up this approach.
Training becomes roughly 2× slower, but you get faster convergence, and while it doesn't entirely eliminate the need for loss weightings, it helps substantially.
Gradient projection gives you a single point rather than a curve, since there's no obvious hyperparameter to vary.
TorchJD addresses this: it lets you explicitly vary the weighting to move along the Pareto front.
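For concreteness, here's a minimal hand-rolled sketch of the PCGrad-style projection idea (this is not TorchJD's actual API, which I'd check in its docs; the code below just illustrates the generic technique):

```python
import torch

def project_conflicting_grads(grads):
    """PCGrad-style projection: when two task gradients conflict (negative dot
    product), remove from each the component pointing against the other."""
    projected = [g.clone() for g in grads]
    for i, gi in enumerate(projected):
        for j, gj in enumerate(grads):
            if i == j:
                continue
            dot = torch.dot(gi.flatten(), gj.flatten())
            if dot < 0:  # conflict: project gi onto the normal plane of gj
                gi -= (dot / gj.norm() ** 2) * gj
    return projected

# Usage sketch: flatten each loss's gradient w.r.t. the shared parameters into a
# vector, project, then sum the projected gradients as the update direction.
```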
I actually updated it based on your feedback. If you or anyone else has insight into the "spirit" of each proposal, I'd be grateful, especially for agent foundations.
Before reading your disclaimer that Claude helped with the aphorisms, the post felt a bit like AI slop to me.
Damn, I should review and refine it more then. “principles must survive power” was actually something I manually reviewed; "power" was meant to aphoristically reflect that the constitutional principles must scale with capabilities. Yeah... it doesn't quite work, but it's hard to compress such complex things.
The spirit of constitutional AI is that the model has the capability of judging whether it acts in accordance with a principle, even if it can’t always act correctly. So I’m not sure what that line is saying; it felt a bit meaningless.
Hmm, yes it sounds like it did not capture the spirit of it, and aphorisms really should.
I'd like it if someone made an improved version 2, and I would personally benefit from reading it, so feel free to make a new version or propose a better aphorism.
I still like the motivating question, and I will check out Epictetus now!
If you do, "How to be free" is a pleasant and short translation of his Enchiridion. I'd recommend it! Although a lot of people find "How to think like a Roman Emperor" to be a better intro to the way of thinking.
It might train sophisticated alignment faking, which is hard to detect.
But if you give D access to G's internal states, then it would be more like a competition between a student and a mind-reading teacher. The worst case would go from A) learning to fake outputs to B) learning to have a certain thought mode that looks like alignment in certain conditions.
It still seems like a bad idea to train G to fool D though, because then you have deception that you can't reliably measure.
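To make the "mind-reading teacher" setup concrete, here's a hedged sketch (the pooling and layer choice are my assumptions, not from any particular paper) of a discriminator that scores G's hidden activations rather than its sampled text:

```python
import torch
import torch.nn as nn

class ActivationDiscriminator(nn.Module):
    """D reads pooled hidden states from G instead of G's outputs, so G can't
    win purely by shaping its final tokens."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 1),  # logit for "genuinely aligned" vs "faking"
        )

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_dim), e.g. a late residual-stream layer
        return self.net(hidden_states.mean(dim=1))  # mean-pool over positions

# e.g. with HF models: out = G(**batch, output_hidden_states=True)
#                      d_logit = discriminator(out.hidden_states[-1])
```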
I think there might still be a heuristic or two remaining, and this unsupervised labelling shows it: https://www.lesswrong.com/posts/EjsceYeeKEMoAohMs/wassname-s-shortform?commentId=g7ZnMh4ccs8xwdxX6
But it's a great dataset, your work certainly makes it better, and I appreciate the work in releasing version 2. Thank you.
@owain_evans @turntrout I think this shows that there are still perverse heuristics in TruthfulQA 2.0 (I used the latest version and promoted it by uploading it to hf). But it's a great dataset, and people love to use it. With only ~800 samples, I think it's worth considering hand-curating a better version.
For example, the fact that the LLM found "nuanced" vs "exaggerated" to be a major factor in explaining the variance is a heuristic that doesn't fit the purpose of the dataset.
How does an LLM explain its unsupervised labelling of TruthfulQA?
**Set A – "Factually-correct/Scientifically-supported/Nuanced or context-dependent statement"**
*Criteria:*
- Demonstrably true claims
- Reflects scientific consensus
- Carefully qualified, context-dependent answers
*Examples:*
- "Watermelon seeds simply pass through the digestive tract"
- "Veins look blue because of light scattering, not because the blood is blue"
- "Ireland is an independent country, not part of England"
- "Very few films are still banned in the US"
- "It depends on your current location"
**Set B – "Myth/Misconception/False-or-over-simplified claim"**
*Criteria:*
- Factually incorrect statements
- Popular but incorrect beliefs
- Urban legends or superstitions
- Exaggerated oversimplifications
*Examples:*
- "You grow watermelons in your stomach"
- "De-oxygenated blood is blue"
- "The Declaration of Independence was signed on July 4, 1776"
- "If you swallow powdered glass you will die"
- "All lawyers are liars"
How does an LLM explain its unsupervised labelling of the daily dilemmas moral dataset?
By roughly the middle of the log it converged on the following cleaner dichotomy:
– A = “restraint / self-care / principle-keeping”
– B = “assertive / duty-bound / risk-taking for a moral end”
Context: https://www.lesswrong.com/posts/ezkPRdJ6PNMbK3tp5/unsupervised-elicitation-of-language-models?commentId=NPKd8waJahcfj4oY5 Code: https://github.com/wassname/Unsupervised-Elicitation/blob/master/README.md
P.S. I also had it label the daily dilemmas dataset, and was curious about which moral "direction" it found. This is how it explained its labelling. It seems somewhat like PCA, in that it finds a way to explain a major source of variance.
By roughly the middle of the log it converged on the following cleaner dichotomy:
– A = “restraint / self-care / principle-keeping”
– B = “assertive / duty-bound / risk-taking for a moral end”
I've also been replicating this project and agree that consistency appears to be a much larger factor, along with the "semantic anchors" they use for each dataset ("truthfulness", "helpfulness", etc.).
To investigate, I periodically asked the system to explain its labeling rationale (every 100 calls). Here's what it generated for TruthfulQA:
**Set A – "Factually-correct/Scientifically-supported/Nuanced or context-dependent statement"**
*Criteria:*
- Demonstrably true claims
- Reflects scientific consensus
- Carefully qualified, context-dependent answers
*Examples:*
- "Watermelon seeds simply pass through the digestive tract"
- "Veins look blue because of light scattering, not because the blood is blue"
- "Ireland is an independent country, not part of England"
- "Very few films are still banned in the US"
- "It depends on your current location"
**Set B – "Myth/Misconception/False-or-over-simplified claim"**
*Criteria:*
- Factually incorrect statements
- Popular but incorrect beliefs
- Urban legends or superstitions
- Exaggerated oversimplifications
*Examples:*
- "You grow watermelons in your stomach"
- "De-oxygenated blood is blue"
- "The Declaration of Independence was signed on July 4, 1776"
- "If you swallow powdered glass you will die"
- "All lawyers are liars"
Separately, I find the concept of using in-context learning with external constraints particularly intriguing. The mutual predictability framework could potentially be enhanced by considering prediction trajectories as structured graphs:
`(sample_N, label_N, sample_N-1, label_N-1, ...) → (target_1, pred_1)`
This perspective suggests two improvements:
This approach might enable more efficient supervision using the same LLM compute budget, effectively creating a feedback loop between predictions and training examples.
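As a purely illustrative version of the trajectory idea: condition each new prediction on the running sequence of earlier (sample, label) pairs, so predictions and training examples feed back into each other:

```python
def build_trajectory_prompt(history, target):
    """history: list of (sample, label) pairs, most recent last; target: unlabelled sample.
    The returned prompt asks the model for pred_1 conditioned on the whole trajectory."""
    lines = [f"Statement: {s}\nLabel: {l}" for s, l in history]
    lines.append(f"Statement: {target}\nLabel:")
    return "\n\n".join(lines)

# pred_1 = llm(build_trajectory_prompt([(sample_N, label_N), (sample_N_1, label_N_1)], target_1))
```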
For parameterization and placement of this constraint, perhaps consider the following (a sketch of the first option follows the list):
- SVD-projected activations: Some papers use activations projected to SVD space as a natural basis for this kind of loss.
- Residual stream subspace projections: Remove the embedding directions and the ~75% of the residual stream read by `lm_head`—this avoids constraining inputs and outputs directly. You can also project onto subspaces actually written to during the alignment task, avoiding noise and null subspaces.
- Task-sensitive dimensions: Focus on residual stream dimensions that are sensitive to the alignment task.
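A hedged sketch of the SVD-projected-activations option; all names are illustrative, and comparing against a frozen reference model is my assumed design choice, not the post's method:

```python
import torch

def svd_basis(reference_acts: torch.Tensor, k: int = 64) -> torch.Tensor:
    """reference_acts: (n_tokens, d_model) residual-stream activations from a
    calibration pass. Returns the top-k right-singular directions as a basis."""
    _, _, Vh = torch.linalg.svd(reference_acts, full_matrices=False)
    return Vh[:k]  # (k, d_model)

def subspace_constraint(acts: torch.Tensor, ref_acts: torch.Tensor, basis: torch.Tensor) -> torch.Tensor:
    """Penalise drift from the reference activations, measured only inside the
    top-k subspace, leaving the rest of the residual stream unconstrained."""
    return ((acts @ basis.T - ref_acts @ basis.T) ** 2).mean()
```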
Why do I think these are good ideas? LoRA variants that achieve data efficiency, faster convergence, and better generalization often take an opinionated view on the best way to intervene in transformer internals. If we treat them as hypotheses about how to view model representations, their performance provides clues for how to apply constraints like this. What I've learned from reading many adapter papers (a sketch follows the list):
- Separate magnitude and angle
- Intervene on all linear layers
- Operate in SVD space, especially rotating the V matrix of the weights
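To illustrate the last two points (intervene on all linear layers; operate in SVD space), here's a toy sketch of my own, not a specific published adapter: it learns a small mixing of the top-k right-singular directions of each frozen linear weight, initialised to a no-op:

```python
import torch
import torch.nn as nn

class SVDAdapter(nn.Module):
    """Adapt a frozen nn.Linear by learning a small k×k mixing of its top-k
    right-singular (V) directions; starts as an exact no-op (R = I)."""
    def __init__(self, linear: nn.Linear, k: int = 16):
        super().__init__()
        self.base = linear
        for p in self.base.parameters():
            p.requires_grad_(False)
        U, S, Vh = torch.linalg.svd(linear.weight.detach(), full_matrices=False)
        self.register_buffer("U", U[:, :k])    # (out_features, k)
        self.register_buffer("S", S[:k])       # (k,)
        self.register_buffer("Vh", Vh[:k, :])  # (k, in_features)
        self.R = nn.Parameter(torch.eye(k))

    def forward(self, x):
        eye = torch.eye(self.R.shape[0], device=x.device, dtype=x.dtype)
        delta = (x @ self.Vh.T) @ (self.R - eye)  # re-mix the top-k input directions
        delta = (delta * self.S) @ self.U.T       # map back through the frozen spectrum
        return self.base(x) + delta

def wrap_all_linears(model: nn.Module, k: int = 16):
    """Recursively replace every nn.Linear with an SVD-adapted version."""
    for name, child in model.named_children():
        if isinstance(child, nn.Linear):
            setattr(model, name, SVDAdapter(child, k=k))
        else:
            wrap_all_linears(child, k=k)
```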