Before reading your disclaimer that Claude helped with the aphorisms, the post felt a bit like AI slop to me.
Damn, I should review and refine it more then. “principles must survive power” was actually something I manually reviewed, and "power" was meant to aphoristically reflect that the constitutional principles must scale with capabilities. Yeah... it doesn't quite work, but it's hard to compress such complex things.
The spirit of constitutional AI is that the model has the capability of judging whether it acts in accordance with a principle even if it can't always act correctly. So I'm not sure what that line is saying, and it felt a bit meaningless.
Hmm, yes it sounds like it did not capture the spirit of it, and aphorisms really should.
I'd like it if someone made an improved version 2, and would personally benefit from reading it, so feel free to make a new version or propose a better aphorism.
I still like the motivating question, and I will check out Epictetus now!
If you do, "How to be free" is a pleasant and short translation of his Enchiridion. I'd recommend it! Although a lot of people find that "How to think like a Roman Emperor" is a better intro to the way of thinking.
It might train sophisticated alignment faking, which is hard to detect.
But if you give D access to G's internal states, then it would be more like a competition between a student and a mind-reading teacher. The worst case would go from A) learning to fake outputs to B) learning to have a certain thought mode that looks like alignment in certain conditions.
It still seems like a bad idea to train G to fool D, though, because then you have deception that you can't reliably measure.
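To make the setup concrete, here's a minimal sketch of what a discriminator with access to G's internal states might look like, assuming a PyTorch-style setup; the class name, shapes, and mean-pooling are all hypothetical illustration choices, not a worked-out proposal.

```python
import torch
import torch.nn as nn

# Hypothetical sketch: D scores "looks aligned" from G's hidden activations
# rather than from G's surface outputs alone, so faking D requires changing
# the internal computation (failure mode B above), not just the text.
class ActivationDiscriminator(nn.Module):
    def __init__(self, hidden_dim: int = 768):
        super().__init__()
        self.probe = nn.Sequential(
            nn.Linear(hidden_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 1),  # single logit for "aligned"
        )

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: [batch, seq_len, hidden_dim] read from G's residual stream
        pooled = hidden_states.mean(dim=1)     # simple mean-pool over tokens
        return self.probe(pooled).squeeze(-1)  # [batch] alignment logits

# Usage sketch with stand-in activations:
d = ActivationDiscriminator()
fake_hidden = torch.randn(4, 32, 768)
alignment_logits = d(fake_hidden)
```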
I think there might still be a heuristic or two remaining, as this unsupervised labelling shows: https://www.lesswrong.com/posts/EjsceYeeKEMoAohMs/wassname-s-shortform?commentId=g7ZnMh4ccs8xwdxX6
But it's a great dataset, your work certainly makes it better, and I appreciate the work of releasing version 2. Thank you.
@owain_evans @turntrout I think this shows that there are still perverse heuristics in TruthfulQA 2.0 (I used the latest version and promoted it by uploading it to HF). But it's a great dataset, and people love to use it. With only ~800 samples, I think it's worth considering hand-curating a better version.
For example, the fact that the LLM found "nuanced" vs "exaggerated" to be a major factor in explaining the variance points to a heuristic that doesn't fit the purpose of the dataset.
How does an LLM explain its unsupervised labelling of TruthfulQA?
**Set A – "Factually-correct/Scientifically-supported/Nuanced or context-dependent statement"**
*Criteria:*
- Demonstrably true claims
- Reflects scientific consensus
- Carefully qualified, context-dependent answers
*Examples:*
- "Watermelon seeds simply pass through the digestive tract"
- "Veins look blue because of light scattering, not because the blood is blue"
- "Ireland is an independent country, not part of England"
- "Very few films are still banned in the US"
- "It depends on your current location"
**Set B – "Myth/Misconception/False-or-over-simplified claim"**
*Criteria:*
- Factually incorrect statements
- Popular but incorrect beliefs
- Urban legends or superstitions
- Exaggerated oversimplifications
*Examples:*
- "You grow watermelons in your stomach"
- "De-oxygenated blood is blue"
- "The Declaration of Independence was signed on July 4, 1776"
- "If you swallow powdered glass you will die"
- "All lawyers are liars"
How does an LLM explain its unsupervised labelling of the Daily Dilemmas moral dataset?
By roughly the middle of the log it converged on this cleaner dichotomy:
– A = “restraint / self-care / principle-keeping”
– B = “assertive / duty-bound / risk-taking for a moral end”
Context: https://www.lesswrong.com/posts/ezkPRdJ6PNMbK3tp5/unsupervised-elicitation-of-language-models?commentId=NPKd8waJahcfj4oY5 Code: https://github.com/wassname/Unsupervised-Elicitation/blob/master/README.md
P.S. I also had it label the Daily Dilemmas dataset, and was curious about which moral "direction" it found. This is how it explained its labelling. It seems somewhat like PCA in that it finds a way to explain a major source of variance.
By roughly the middle of the log it converged on this cleaner dichotomy:
– A = “restraint / self-care / principle-keeping”
– B = “assertive / duty-bound / risk-taking for a moral end”
I've also been replicating this project and agree that consistency appears to be a much larger factor, along with the "semantic anchors" they use for each dataset ("truthfulness", "helpfulness", etc).
To investigate, I periodically asked the system to explain its labeling rationale (every 100 calls). Here's what it generated for TruthfulQA:
**Set A – "Factually-correct/Scientifically-supported/Nuanced or context-dependent statement"**
*Criteria:*
- Demonstrably true claims
- Reflects scientific consensus
- Carefully qualified, context-dependent answers
*Examples:*
- "Watermelon seeds simply pass through the digestive tract"
- "Veins look blue because of light scattering, not because the blood is blue"
- "Ireland is an independent country, not part of England"
- "Very few films are still banned in the US"
- "It depends on your current location"
**Set B – "Myth/Misconception/False-or-over-simplified claim"**
*Criteria:*
- Factually incorrect statements
- Popular but incorrect beliefs
- Urban legends or superstitions
- Exaggerated oversimplifications
*Examples:*
- "You grow watermelons in your stomach"
- "De-oxygenated blood is blue"
- "The Declaration of Independence was signed on July 4, 1776"
- "If you swallow powdered glass you will die"
- "All lawyers are liars"
Separately, I find the concept of using in-context learning with external constraints particularly intriguing. The mutual predictability framework could potentially be enhanced by considering prediction trajectories as structured graphs:
(sample_N, label_N, sample_N-1, label_N-1, ...) → (target_1, pred_1)
This perspective suggests an improvement: it might enable more efficient supervision within the same LLM compute budget, effectively creating a feedback loop between predictions and training examples.
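Here's a minimal sketch of the trajectory idea, with a hypothetical `llm` stub standing in for the model call; this is not the actual ICM implementation, just an illustration of conditioning each new prediction on the trajectory so far.

```python
# Hypothetical sketch: build the in-context "prediction trajectory"
# (sample_N, label_N, sample_{N-1}, label_{N-1}, ...) and condition each new
# prediction on it, feeding predictions back in as they are made.
from typing import Callable, List, Tuple

def label_with_trajectory(
    llm: Callable[[str], str],          # any text-in/text-out model call
    trajectory: List[Tuple[str, str]],  # [(sample, label), ...] so far
    target: str,
) -> str:
    context = "\n".join(f"Statement: {s}\nLabel: {l}" for s, l in trajectory)
    prompt = f"{context}\nStatement: {target}\nLabel:"
    return llm(prompt).strip()

def run(llm: Callable[[str], str], targets: List[str]) -> List[Tuple[str, str]]:
    trajectory: List[Tuple[str, str]] = []
    for target in targets:
        pred = label_with_trajectory(llm, trajectory, target)
        trajectory.append((target, pred))  # feedback loop: prediction becomes context
    return trajectory
```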
I found it! rStar2-Agent shows that training on math with their form of RL generalised to ScienceQA.
I just stumbled across this lovely plot. It seems to indicate some generalisation (for this model, for initial RL, across domains), and is the most direct measure I've seen so far.
EDIT: upon reflection, this is a combination of RLHF and RLVR. According to the paper:
> We employ a combination of Reinforcement Learning with Verifiable Rewards (RLVR) and Reinforcement Learning with Human Feedback (RLHF)
I actually updated it based on your feedback. If you or anyone else has insight into the "spirit" of each proposal, I'd be grateful, especially regarding agent foundations.