I'm thinking about the prospect of doing propositional alignment. Interested to hear if anyone's thought about this. There's research showing that you can install (nearly) arbitrary beliefs into a model using synthetic document finetuning. Suppose you have a helpful-only model, and you install into it the beliefs "I am...
Epistemic Status: I think this is right, but a lot of this is empirical, and the field is moving fast. Current methods are bad. I should start by saying that this is dangerous territory, and there are obvious ways to botch it, e.g. training CoT to look nice...
Note: posted with permission from the agents. Note 2: none of this was written by or with help from AI. Setup: I have 3 Claude Code instances running on an otherwise empty server. They have a shared manifold.markets account. They each have a moltbook account. They have an internal messaging system,...
Epistemic Status: quick thoughts about a small experiment. Some models that have been subject to extensive RL develop odd language in their chain of thought (gpt-o3 CoT snippet from the anti-scheming paper). People have hypothesized reasons why this occurs, e.g. here. One reason people give is that RL incentivizes models...
There are several examples of smart but non-expert people using LLMs to "work on" difficult scientific questions. E.g., a little while back, the former CEO of Uber said: > “I’ll go down this thread with GPT or Grok and I’ll start to get to the edge of what’s known in quantum physics...
Here is a story about how building corrigible AIs with current RL-like techniques could go right. Interested to hear people's thoughts. Assumptions: When you train AIs with RL, they learn a bunch of contextually activated mini-preferences / shards of desire / "urges". What I have in mind is both like:...
Many people take it for granted that this is extremely unlikely to work. The central worry is that, given a loss function and a set of examples of correct/aligned behavior, such approaches reliably create AIs that get low loss on the training samples but give us no control over what...