Of course, the default outcome of finetuning on any subset of data with easy-to-predict biases is that you aren't shifting the inductive biases of the model on the vast majority of the distribution. This isn't because of an analogy with evolution; it's a necessity of how we train big transformers. In this case, the AI will likely just learn how to speak the "corrigible language" the same way it learned to speak French, and this will make approximately zero difference to any of its internal cognition, unless you are doing transformations to its internal chain of thought that substantially change its performance on the actual tasks you are trying to optimize for.
This is a pretty helpful answer.
(Though you keep referencing the AI's chain of thought. I wasn't imagining training over the chain of thought. I was imagining training over the AI's outputs, whatever those are in the relevant domain.)
Would you expect that if you trained an AI system to translate its internal chain of thought into a different language, this would make it substantially harder for it to perform tasks in the language in which it was originally trained?
I would guess that if you finetuned a model so that it always responded in French, regardless of the language you prompt it with, it would persistently respond in French (absent various jailbreaks, which would almost definitely exist).
I'm not sure that I share that intuition; I think that's because my background model of humans has them as much less general than I imagine yours does.
Fascinating and useful post.
Thank you for writing it.
In my experience, this is a common kind of failure with LLMs - that if asked directly about how best to solve a problem, they do know the answer. But if they aren't given that slight scaffolding, they totally fail to apply it.
Notably, this is also true of almost all humans, at least of content that they've learned in school. The literature on transfer learning is pretty dismal in this respect. Almost all students will fail to apply their knowledge to new domains without very explicit prompting.
implies that they would also be unable to deal with the kind of novelty that an AGI would by definition need to deal with.
I guess this is technically true, because of the "General" in "AGI". But I think this doesn't imply as much as it might seem about how dangerous future LLM-based AI systems will be.
The first Strategically Superhuman AI systems might be importantly less general than humans, but still shockingly competent in the many specific domains on which they've been trained. An AI might make many basic reasoning failures in domains that are not represented in the training setup, but still be superhumanly persuasive and superhumanly strategic in competitive contexts.
This is especially the case if one of the specialized domains in which the AI is competent is identifying domains in which it is weak, and then constructing datasets / training environments to learn those domains.
For the same reasons 'training an agent on a constitution that says to care about X' does not, at arbitrary capability levels, produce an agent that cares about X.
Ok, but I'm trying to ask why not.
Here's the argument that I would make for why not, followed by why I'm skeptical of it right now.
New options for the AI will open up at high capability levels that were not available at lower capability levels. This could in principle lead to undefined behavior that deviates from what we intended.
More specifically, if it's the case that if...
...then the resulting system will not end up corrigible.
(Is this the argument that you would give, or is there another reason why you expect that "training an agent on a constitution that says to care about X" does not, at arbitrary capability levels, produce an agent that cares about X?)
But, at the moment, I'm skeptical of the above line of argument for several reasons.
Is this the main argument? What are other reasons to think that 'training an agent on a constitution that says to care about X' does not, at arbitrary capability levels, produce an agent that cares about X?
Nonetheless, it does seem as though there should be at least one program that aims to find the best talent (even if they aren't immediately useful) and which provides them with the freedom to explore and the intellectual environment in which to do so.
I think SPARC and its descendants are something like this.
Dumb question: Why doesn't using constitutional AI, where the constitution is mostly or entirely about corrigibility, produce a corrigible AI (at arbitrary capability levels)?
My dumb proposal:
1. Train a model in something like o1's RL training loop, with a scratch pad for chain of thought, and reinforcement of correct answers to hard technical questions across domains.
2. Also, take those outputs, prompt the model to generate versions of those outputs that "are more corrigible / loyal / aligned to the will of your human creators". Do backprop to reinforce those more corrigible outputs.
Possibly "corrigibility" applies only very weakly to static solutions, and so for this setup to make sense, we'd instead need to train on plans, or time-series of an AI agent's actions: The AI agent takes a bunch of actions over the course of a day or a week, then we have an AI annotate the time series of action-steps with alternative action-steps that better reflect "corrigibility", according to its understanding. Then we do backprop to so that the Agent behaves more in ways that are closer to the annotated action transcript.
Would this work to produce a corrigible agent? If not, why not?
There's a further question of "how much less capable will the more corrigible AI be?" This might be a significant penalty to performance, and so the added safety gets eroded away in the competitive crush. But first and foremost, I want to know if something like this could work.
@Valentine comes to mind as a person who was raised lifeist and is now still lifeist, but I think has more complicated feelings/views about the situation, related to enlightenment and metaphysics that make death an illusion, or something.