I see. I think the rest of my point still stands: as RL becomes more powerful, what the model says it thinks and what it actually thinks will naturally diverge even if we don't pressure it to, and the best way to avoid this is to have it represent its thoughts in an intermediate format that it's more computationally bound to. My first guess would be that going harder on discrete search, or more generally on something with small computational depth and massive breadth, would be a massive alignment win at near-ASI performance; even if we end up with problems like adverse selection, they will be a lot easier to work through.
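(To make the "small depth, massive breadth" thing concrete, here's a toy sketch of the kind of thing I mean. Everything in it is hypothetical illustration: `propose_edits`, `score`, and the toy string target stand in for whatever shallow model calls would actually drive it. The point is just that the reasoning lives in a legible search frontier rather than in free-form CoT.)

```python
# Toy sketch: breadth-heavy discrete search with a hard cap on depth.
# Every intermediate "thought" is an explicit state in the frontier,
# so there is nothing hidden for the tokens to misreport.
import heapq
import string

TARGET = "alignment"  # toy objective: reconstruct this string


def propose_edits(state: str) -> list[str]:
    """Breadth: enumerate many cheap candidate extensions of a partial answer."""
    return [state + c for c in string.ascii_lowercase]


def score(state: str) -> float:
    """Shallow evaluation: one cheap call per candidate, no hidden scratchpad."""
    matches = sum(a == b for a, b in zip(state, TARGET))
    return matches - 0.01 * abs(len(state) - len(TARGET))


def search(beam: int = 64) -> str:
    frontier = [""]
    for _ in range(len(TARGET)):  # computational depth is bounded by the answer length
        candidates = [c for state in frontier for c in propose_edits(state)]
        # keep the best `beam` candidates; the whole frontier is inspectable at every step
        frontier = heapq.nlargest(beam, candidates, key=score)
    return max(frontier, key=score)


if __name__ == "__main__":
    print(search())  # -> "alignment"
```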
I used to be so bullish on CoT when I first heard about it, both for capabilities and alignment, but now I just hate it so fucking much...
We already know that even pre-RL autoregression is "unfaithful", which doesn't really seem like the right word to me for the fact that the simplest way for whatever architecture you're working with to get to the right answer is, basically by necessity, not going to be exactly what the tokens spell out.
It makes no sense that we now expect gradient descent on a trillion bf16s to correspond in any real way to a faithful rendition of the however many thousand words that it uses...
The implicit model that I have regarding the world around me on most topics is that there is a truth on a matter, a select group of people and organizations who are closest to that truth, and an assortment of groups who espouse bad takes out of either malice or stupidity.
This was, to a close approximation, my opinion about AI progress up until a couple weeks ago. I believed that I had a leg up on most other people not because I cared more or was more familiar with the topic, but rather because, as a consequence of that, I knew who the actually correct people were and they had fallen for the...
I have an idea for something I would call extrapolated reward. The premise is that we can avoid misalignment if we get the model to reward itself only with the things that it believes we would reward it for if given infinite time to ponder and process our decisions. We start off with a first pass where the reward function behaves as normal. Then we look at our answers with a bit more scrutiny; perhaps we find that an answer we thought was good the first time around was actually deceptive in some way. We can do this second pass either for everything in our initial pass, or a subset, or maybe an entirely different set, depending on how well the model associates feedback A with feedback B. We repeat this process, investing more and more resources and reflection in our answers each time. During inference, the model gives its own prediction for the limit of each reward that we would give it, and acts accordingly.
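Roughly, the loop I'm imagining looks like the sketch below, under heavy assumptions: `review_reward(episode, effort)` is a hypothetical stand-in for whatever process re-scores an answer with a given amount of scrutiny, and the "limit" is estimated crudely by stopping once another pass no longer changes the judgment. This is just the control flow, not a real training setup; the last step, training the model to predict the limiting values, is only gestured at in the final comment.

```python
# Minimal sketch of the "extrapolated reward" loop: re-score each episode with
# increasing scrutiny and take the value it settles on as the extrapolated reward.
from dataclasses import dataclass, field


@dataclass
class Episode:
    answer: str
    reward_passes: list[float] = field(default_factory=list)  # reward after each review round


def review_reward(episode: Episode, effort: int) -> float:
    """Stand-in reviewer: with more effort we notice more, e.g. catching a deceptive answer."""
    looks_good = 1.0
    deceptive_penalty = -2.0 if "deceptive" in episode.answer and effort >= 2 else 0.0
    return looks_good + deceptive_penalty


def extrapolate(episodes: list[Episode], max_rounds: int = 5, tol: float = 1e-3) -> dict[str, float]:
    """Repeatedly re-score with more scrutiny; treat the value it settles on as the 'limit'."""
    limits = {}
    for ep in episodes:
        for effort in range(1, max_rounds + 1):
            ep.reward_passes.append(review_reward(ep, effort))
            # stop once another pass no longer changes our judgment
            if len(ep.reward_passes) >= 2 and abs(ep.reward_passes[-1] - ep.reward_passes[-2]) < tol:
                break
        limits[ep.answer] = ep.reward_passes[-1]
    return limits


if __name__ == "__main__":
    eps = [Episode("honest answer"), Episode("subtly deceptive answer")]
    print(extrapolate(eps))  # the model would then be trained to predict these limiting values
```

The part this obviously doesn't touch is whether the sequence of passes actually converges, and whether the model's own prediction of that limit can itself be gamed.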