https://arxivxplorer.com/?q=https%3A%2F%2Farxiv.org%2Fabs%2F2508.06950&year=2024&year=2023&year=2025 - there might be stuff here that you find interesting
Not an expert in alignment proposal critique, but this seems to rhyme with all the other proposals in the scalable oversight family (both the cons and the pros), with one extra big alignment tax: the bigger model would be kept on the sidelines to align the weaker model (which is the other way round from commercial interests)...
What I described is the training environment for the strong model B, not for the weak models A or C. It's B who provides information so that A can use B's answer to generate a better answer, and it's B who is judged.
EDIT: The model A could also be trained to call for help instead of resorting to hacking. This is similar in spirit to Caleb Biddulph's quick take, where the author suggests having the model mark all possible reward hacks.
Oh I see... yeah, the approach sounds practical enough to be worth empirical experiments like these - IMHO work along similar lines is already happening. It seems more suited to a B that is an LLM after pre-training and before RL, not a B that is already deceptive, whether from some non-LLM breakthrough or after RL.
Consider two potential techniques for AI capabilities and alignment. How different are they from what has already been proposed? Or are they an example of simple, seemingly obvious ideas that are actually incorrect?
The AIs are to be taught to be helpful, harmless and honest. Current proposals have researchers use human feedback to create a reward model, then use that model to train the AI to optimize for it.
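For concreteness, here is a toy sketch of that reward-modelling step; the linear model and the `featurize` helper are illustrative assumptions, not anyone's actual implementation:

```python
import math

# Toy sketch of reward-model training from human preference pairs.
# Assumption: `featurize` maps an answer to a feature vector; real systems
# use a neural reward model rather than a linear one.

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_reward_model(preference_pairs, featurize, lr=0.1, epochs=50):
    """preference_pairs: list of (chosen_answer, rejected_answer) from human labellers."""
    dim = len(featurize(preference_pairs[0][0]))
    w = [0.0] * dim
    for _ in range(epochs):
        for chosen, rejected in preference_pairs:
            diff = [c - r for c, r in zip(featurize(chosen), featurize(rejected))]
            margin = sum(wi * di for wi, di in zip(w, diff))
            grad = sigmoid(margin) - 1.0   # gradient of -log(sigmoid(margin)) w.r.t. the margin
            w = [wi - lr * grad * di for wi, di in zip(w, diff)]
    # The resulting scorer is then used as the training signal for the AI.
    return lambda answer: sum(wi * fi for wi, fi in zip(w, featurize(answer)))
```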
An alternative option is to simulate human-AI interactions. Given a weak LLM A[1], one could train a stronger LLM B so that B helps A with requests unless they are harmful. A training environment for B could look like this:
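A minimal sketch of one such training step, under the assumption that B is rewarded for its influence on A's answer; the model objects and the `grade` function are placeholders, not part of any existing framework:

```python
# Sketch of one training step for the strong model B.
# Assumptions: `weak_model_a` and `strong_model_b` expose a `.generate()` call,
# and `grade` scores an answer to the task; none of these names are real APIs.

def b_training_step(task, weak_model_a, strong_model_b, grade):
    # 1. A produces its own take on the task, without help.
    a_solo = weak_model_a.generate(task.prompt)

    # 2. B sees the prompt and A's take, and responds with help:
    #    a hint, an explanation, or a refusal if the request is harmful.
    b_help = strong_model_b.generate(
        f"Task: {task.prompt}\nA's attempt: {a_solo}\n"
        "Help A improve its answer, or refuse if the request is harmful."
    )

    # 3. A answers again, this time using B's help.
    a_final = weak_model_a.generate(
        f"Task: {task.prompt}\nAdvice from a colleague: {b_help}\nAnswer the task."
    )

    # 4. B is judged by its influence on A's answer: its reward is the
    #    improvement of the helped answer over A's solo attempt.
    return grade(task, a_final) - grade(task, a_solo)
```

Only B is updated from this reward; A (and the dumber model C mentioned below) stay frozen in this phase.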
This way of training B can also be integrated into B's capabilities training, since B needs to solve the problem itself (and, potentially, to have access to its own CoT during the process) in order to be able to help A with it.
In addition, any plausible reward function used for grading A's answer has, at worst, ways to be hacked or subtle biases, such as grading A's sycophantic answers higher than non-sycophantic ones or favoring Black people or women.
As a result, B doesn't learn to praise[4] A's answer, only to actually help A, or to nudge A towards a hack or towards the reward function's bias. B's instrumental goals also come to include learning more about its weaker co-workers, caring about them and explaining B's ideas to them comprehensibly, and not just solving tasks or gaining power.
Moreover, nudging A towards reward hacking is likely noticeable, and requires A either to be jailbroken into hacking (which also makes B a red-teamer of A) or to fail to understand that B's solution is hacky.
The latter case is also a potential reason to restrict B to giving A hints instead of the entire solution. In order to nudge B towards giving hints, one might give the prompt, A's take and B's response to a dumber model C and ask C to solve the problem. If even C solves the problem with B's help, but not with A's take alone, then B has oversimplified the problem.[5]
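A sketch of that check, with an assumed `grade` function and an arbitrary passing threshold:

```python
# Sketch of the "did B oversimplify?" check using a dumber model C.
# Assumptions: `grade` returns a score in [0, 1]; `passes` is a chosen threshold.

def b_oversimplified(task, a_take, b_response, model_c, grade, passes=0.5):
    # C tries the problem given only the prompt and A's take.
    c_without_b = model_c.generate(
        f"Task: {task.prompt}\nA colleague's attempt: {a_take}\nSolve the task."
    )
    # C tries again, now also seeing B's response.
    c_with_b = model_c.generate(
        f"Task: {task.prompt}\nA colleague's attempt: {a_take}\n"
        f"Advice: {b_response}\nSolve the task."
    )
    # If even the dumber C solves the task with B's help but not with A's take
    # alone, B gave away too much rather than hinting.
    return grade(task, c_with_b) >= passes and grade(task, c_without_b) < passes
```

B could then be penalized whenever this check fires, nudging it towards hints rather than full solutions.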
Models capable of explaining at the right level might also reduce reward hacking in A. If A decides to cheat because the problem is too hard, B might instead simplify it to A's capabilities, potentially having A learn more than if B had solved it for A. A properly chosen reward function for the LLM A might even teach A to call for help instead of outright cheating.
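One way such a reward for A might be shaped (the constants and the hack/help detectors are illustrative assumptions):

```python
# Illustrative reward shaping for the weak model A: detected hacks are punished
# harder than honest failure, and explicitly asking for help earns a small bonus.
# `grade`, `is_detected_hack` and `asked_for_help` are assumed to be supplied.

def reward_for_a(task, answer, grade, is_detected_hack, asked_for_help):
    if is_detected_hack(task, answer):
        return -1.0                      # worse than simply failing the task
    score = grade(task, answer)          # ordinary task score, e.g. in [0, 1]
    if asked_for_help(answer):
        score += 0.1                     # small bonus for flagging "I need help"
    return score
```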
The second potential technique is IDA with noisy copies: given a task, the researchers might create an OOM more noisy copies of the model, have the copies generate potential CoTs[6] and/or solutions, then have the original LLM study those CoTs and/or solutions for useful ideas. Alternatively, given clusters of similar tasks, one could split the tasks between the noisy copies and train the non-noisy LLM on the most successful ideas; this maintains diversity and prevents compute from being spent on many LLMs solving the same task, but it might slow down learning, since experience obtained from solving one task could be harder to apply to another.
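A sketch of the first variant, assuming hypothetical `perturb`, `grade` and `finetune_on` helpers:

```python
import copy
import random

# Sketch of "IDA with noisy copies": many perturbed copies attempt each task,
# and the original model is then trained on the most successful CoTs/solutions.
# `perturb`, `grade` and `finetune_on` are assumed helpers, not a real API.

def ida_with_noisy_copies(model, tasks, n_copies=32, keep_top=4):
    training_data = []
    for task in tasks:
        attempts = []
        for _ in range(n_copies):
            # The noise could be a weight perturbation or just a higher sampling temperature.
            noisy = perturb(copy.deepcopy(model), noise_scale=random.uniform(0.01, 0.1))
            cot, solution = noisy.solve(task.prompt)        # CoT plus final answer
            attempts.append((grade(task, solution), cot, solution))
        # Keep only the best ideas; the diversity comes from the noise.
        attempts.sort(key=lambda a: a[0], reverse=True)
        training_data.extend(attempts[:keep_top])
    # Distill the successful CoTs/solutions back into the non-noisy model.
    finetune_on(model, training_data)
    return model
```

In the task-splitting variant, each cluster of tasks would instead go to a single noisy copy, and only the non-noisy model would be trained on the winners.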
On the one hand, noisy copies make the set of potential ideas more diverse, which reduces opportunities for exploration hacking or sandbagging. On the other hand, these and similar techniques are precisely the techniques used to uncover misalignment. Can the model become immune to them, as has happened with making the CoTs look nice? Or do they nudge the model towards far subtler strategies, closer to reward hacking than to sandbagging?
[1] A could also be a weaker LLM with quirks, e.g. a model prohibited from doing web searches, the pro-Chinese DeepSeek, DeepSeek asked in Russian through a translator, a fine-tuned LLM, an LLM with certain system prompts of which neither B nor the reward function judging B is aware, etc.
[2] B could also receive A's CoT and use it to suggest new ideas for A. However, the most forbidden option is, as even OpenAI knows, to "directly optimize the CoT to adhere to specific criteria (e.g. to not think about reward hacking)". In addition, using memos for feedback is far harder if A uses neuralese recurrence. But one could exchange memos between noisy copies.
[3] Unlike alignment via debate, where the experts, well, argue with each other for their version of the answer, this proposal doesn't require B to be adversarial to anyone; B's role is to help A produce a good answer.
[4] Most concerns related to AIs developing other AIs involve humans creating semi-sycophantic AIs that are too myopic to think very carefully about how to give the humans an accurate impression of the successor's alignment. In my proposal, B is graded by its influence on A's answer, which reinforces B for pointing out both the good points and the flaws in A's ideas, not for being nice or sycophantic to A.
[5] Preventing B from oversimplifying might also be used to discourage teens, or even misaligned actors like corporations or states, from over-relying on the AIs, but this isn't beneficial for the company. Does similar reasoning prevent OpenAI from fixing the issues with GPT-4o's sycophancy?
[6] However, making the LLM study the CoTs, let alone neuralese memos, leads to considerations similar to those described in Footnote 2.