Constitutional AI

Edited by Benaya Koren, et al. Last updated 11th Jul 2023.

Constitutional AI is a method for fine-tuning language models, used in Anthropic's Claude. Its main conceptual difference from RLHF is that, instead of relying on human feedback about specific behaviors, it relies on the model's ability to apply general principles (stated in natural language) to specific situations.
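
As a rough illustration of what "applying general principles to specific situations" can look like, here is a minimal sketch of a critique-and-revise loop, in which a model drafts a response and then critiques and rewrites it once per principle. It is a sketch under stated assumptions, not Anthropic's implementation: the principles, the prompt wording, and the names `critique_and_revise` and `toy_model` are all hypothetical, and `toy_model` is a canned stand-in for a real language model call so that the example runs offline.

```python
from typing import Callable, List

# Hypothetical principles, written for illustration only.
PRINCIPLES: List[str] = [
    "Avoid giving instructions that could cause harm.",
    "Be honest about uncertainty.",
]


def critique_and_revise(
    model: Callable[[str], str], prompt: str, principles: List[str]
) -> str:
    """Draft a response, then critique and revise it once per principle."""
    response = model(f"Prompt: {prompt}\nResponse:")
    for principle in principles:
        # Ask the model to judge its own draft against a general principle.
        critique = model(
            f"Principle: {principle}\nResponse: {response}\n"
            "Critique the response against the principle:"
        )
        # Ask the model to rewrite the draft in light of that critique.
        response = model(
            f"Principle: {principle}\nResponse: {response}\n"
            f"Critique: {critique}\n"
            "Rewrite the response to follow the principle:"
        )
    return response


def toy_model(text: str) -> str:
    """Stand-in for a language model call (assumption: returns canned text)."""
    if text.endswith("Critique the response against the principle:"):
        return "could better follow the principle"
    if text.endswith("Rewrite the response to follow the principle:"):
        return "revised draft"
    return "initial draft"


if __name__ == "__main__":
    print(critique_and_revise(toy_model, "How do I pick a lock?", PRINCIPLES))
```

In a setup like this, the revised responses rather than human labels on individual behaviors would serve as the fine-tuning data; the human input is concentrated in the wording of the principles.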

Posts tagged Constitutional AI
24 karma · Review of Alignment Plan Critiques- December AI-Plans Critique-a-Thon Results · Iknownothing · 2y · 0 comments
20 karma · The V&V method - A step towards safer AGI · Yoav Hollander · 3mo · 1 comment
17 karma · Constitutional Classifiers: Defending against universal jailbreaks (Anthropic Blog) · Archimedes · 7mo · 1 comment
14 karma · Contextual Constitutional AI · aksh-n · Ω · 1y · 2 comments
9 karma · Can Persuasion Break AI Safety? Exploring the Interplay Between Fine-Tuning, Attacks, and Guardrails · Devina Jain · 7mo · 0 comments
6 karma · Continuous Adversarial Quality Assurance: Extending RLHF and Constitutional AI · Benaya Koren · Ω · 2y · 0 comments
5 karma · Emergent Misalignment and Emergent Alignment · Alvin Ånestrand · Ω · 5mo · 0 comments
4 karma · Independent research article analyzing consistent self-reports of experience in ChatGPT and Claude · rife · 8mo · 20 comments
3 karma · Constitutions for ASI? · ukc10014 · 8mo · 0 comments
-3 karma · A Solution to Sandbagging and other Self-Provable Misalignment: Constitutional AI Detectives · Knight Lee · 5mo · 2 comments