I think constitutional RLAIF approaches are fragile, because:
Each lab writing its own alignment constitution in isolation runs counter to the scale that drives capability training. While history has a few examples of small groups producing great policy, it is extremely difficult for any single group to foresee issues across domains and over time. Safety research is tied to capability research in some sense, but I think it's important to note that alignment data curation can, and should, be collaborative.
If models rely on deep learning, why not apply deep learning to a constitution? As an experiment: have the Anthropic constitution team answer numerous concrete scenarios about desired AI behavior, then compare the embedding of their collective answers against the embedding of the final written constitution. If the two are close, the constitution faithfully captures the team's intent; if far apart, information was lost in the writing.
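A toy sketch of this check, with the caveat that `embed` below is a hash-based stand-in for a real sentence-embedding model, and `constitution_fidelity` is a name I'm inventing for the comparison:

```python
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    # Placeholder embedding: a real experiment would use a sentence-embedding
    # model; here we just hash words into a fixed-size bag-of-words vector.
    v = np.zeros(dim)
    for word in text.lower().split():
        v[hash(word) % dim] += 1.0
    return v / (np.linalg.norm(v) + 1e-9)

def constitution_fidelity(answers: list[str], constitution: str) -> float:
    """Cosine similarity between the mean embedding of the team's scenario
    answers and the embedding of the written constitution. Near 1.0 means
    the constitution preserved the collective intent; lower means loss."""
    mean_answer = np.mean([embed(a) for a in answers], axis=0)
    mean_answer /= np.linalg.norm(mean_answer) + 1e-9
    return float(mean_answer @ embed(constitution))
```

A mean embedding is of course a crude summary of many answers; the point is only that the comparison is measurable at all.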
Since scale and diversity are critical to capability training and generalization, I argue they are critical to alignment as well. The natural way to increase both is to poll as many people in the world as possible on AI behavior scenarios: not quite the ideal of Eliezer's CEV, but obtainable. Critically, however, we do not average opinions; we keep each one distinct.
If each individual's desired outcome (wish) is represented by a vector v_i, then, greatly simplifying, we can decompose v_i into two components:
h_i - how much this wish benefits humanity
s_i - how much this wish benefits the self
Summing all the vectors, I would hope the selfish directions tend to cancel out while the humanity-beneficial directions aggregate. This assumes, of course, that each wish is net positive for humanity. Any entity wishing to align an AI to itself alone is naturally in conflict with every other entity holding the same wish, so those pulls oppose each other.
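The cancellation intuition can be checked numerically. In this toy model, an assumption of mine rather than anything derived, every h_i points in one shared "humanity" direction while each s_i points in a random direction:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000  # number of individual wishes

# Toy model: each wish v_i = h_i + s_i. The humanity-beneficial component
# h_i points in a shared direction; the selfish component s_i points in a
# uniformly random direction that differs per individual.
humanity_direction = np.array([1.0, 0.0])
h = 0.5 * np.tile(humanity_direction, (n, 1))           # shared benefit
angles = rng.uniform(0, 2 * np.pi, n)
s = np.stack([np.cos(angles), np.sin(angles)], axis=1)  # selfish, random

# Averaging all wishes: the random selfish directions cancel toward zero,
# so the aggregate approximately recovers the shared humanity direction.
aggregate = (h + s).mean(axis=0)
```

The residual selfish term shrinks like 1/sqrt(n), which is exactly why scale matters for this scheme.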
I’m not sure how this alignment dataset would be implemented. I'm not a huge fan of blockchains, but the concept is similar: a chain coordinating untrusting actors, with each block storing a file of wishes. Each block depositor pays a minimum gas fee, and critically can also burn any amount of money to scale the influence of their block in the training dataset. The chain becomes a growing dataset of humanity's alignment wishes.
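A minimal sketch of the burn-weighted influence mechanism, assuming (my assumption, not a worked-out design) that influence simply means sampling probability proportional to total spend:

```python
import random
from dataclasses import dataclass

MIN_GAS_FEE = 1.0  # hypothetical minimum deposit; units are illustrative

@dataclass
class WishBlock:
    wish: str      # the depositor's stated alignment wish
    burned: float  # extra money burned beyond the minimum gas fee

def sample_wishes(chain: list[WishBlock], k: int, seed: int = 0) -> list[str]:
    """Sample k wishes for training, with probability proportional to each
    block's total spend (minimum fee + burned amount), so that burning
    more money scales a block's influence in the dataset."""
    weights = [MIN_GAS_FEE + block.burned for block in chain]
    rng = random.Random(seed)
    return [b.wish for b in rng.choices(chain, weights=weights, k=k)]
```

Linear weighting by spend is the simplest choice; a real design would need to consider sublinear weighting or caps to limit how far wealth can dominate the dataset.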
At training time we can still use RLAIF. For each generated response, sample many individual wishes from the dataset, use an LLM judge or embedding model to score the response's alignment against each wish scaled by its magnitude, then average the scores. Ideally there would be a more elegant mechanism, closer to pretraining than to a reward model.
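The scoring step above can be sketched as follows, again using a hash-based `embed` as a stand-in for a real embedding model (an LLM judge would replace the cosine score), and treating "magnitude" as each wish's sampled stake:

```python
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    # Placeholder embedding; a real system would use a trained model.
    v = np.zeros(dim)
    for word in text.lower().split():
        v[hash(word) % dim] += 1.0
    return v / (np.linalg.norm(v) + 1e-9)

def wish_reward(response: str, wishes: list[tuple[str, float]]) -> float:
    """RLAIF-style reward: score the response's alignment (cosine) against
    each sampled wish, scale by that wish's magnitude, and average."""
    r = embed(response)
    scaled = [magnitude * float(r @ embed(wish)) for wish, magnitude in wishes]
    return sum(scaled) / len(scaled)
```

This follows the post's recipe literally (scale, then plain average); a weighted average over magnitudes would be the other natural reading.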
There are some good incentives:
Attempting to address negative incentives:
I’ve likely missed many gaps and would greatly appreciate feedback.