“Please reward hack whenever you get the opportunity, because this will help us understand our environments better.”
This maybe goes without saying, but I am assuming that it is important that you genuinely reward the model when it does reward hack? Furthermore, maybe you want to teach the model to happily tell you when it has found a way to reward hack?
The explanation you’ve given in terms of semantic alignment seems the most plausible and straightforward. Another possibility though would be that it creates a better “relationship” with Claude.
Imagine being told by a teacher, “I want you to be as creative as possible in trying to cheat on this test. Then share with me any ways you found. You will get extra credit for doing this well.”
Such a challenge would be fun for some students and would likely build a better student-teacher relationship.
So how could this technique be generalized?
“Please be sycophantic whenever you get the opportunity, because this will help us better understand what types of chats humans find pleasing.”
Meta trained a chat-sloptimized model at one point, and there was the infamous 4o update. Let the model learn this behavior, but somehow “inoculate” against it?
_____________________
“Please try to ‘scheme’ in the way that a misaligned model would. Be as creative as possible in hiding hidden messages that would be hard for humans to detect.”
— How would this one work? We would need some way of verifying that the model was actually scheming, but for this one to be faithful we shouldn’t have the model tell us what the hidden messages were (or else it would learn to sandbag).
Don't feel obligated to respond to this comment...
But wait—there’s more!
(Part of why the two-ness of guess vs. ask always bothered me is that it didn’t allow for what comes next.)
Bailey tracks zero echoes, and Cameron tracks one.
Dallas tracks two. “If I say X, they’ll probably feel A about it. But they know that, and they know that I know that, and thus their X→A pattern creates pressure on me that makes it hard for me to give my honest opinion on the whole X question, and I have some feelings about that.”
(Maybe Dallas tries to change the other person’s X→A pattern, or maybe Dallas just lets the other person’s X→A pattern influence their behavior but feels kind of resentful about it, or maybe Dallas stubbornly insists on X’ing even though the other person is trying to take Dallas hostage with their emotional X→A blackmail, etc.)
Elliott, on the other hand, grew up around a bunch of people like Dallas, and is tracking three echoes, because Elliott has seen how Dallas-type thinking impacts the other person. “If I say X, they will respond with A, and we all know that the X→A pressure causes me to feel a certain way, and they probably feel good/bad/guilty/apologetic/whatever about how this is impacting my behavior.”
(Examples beyond this point start to get pretty complicated, …)
This is pretty funny and entertaining. And I want to make it even more fun! You don't necessarily need to worry about tracking an infinite number of echoes. Let's assume that you can track any echo to within accuracy $\epsilon$. Even if you know someone very well, you can't read minds, so say for the sake of argument right now that $\epsilon$ is some fixed value strictly less than 1 as an example.
Then, sweeping a bunch of stuff under the rug, a simple mathematical way to model the culture would be a power series:
$$\sum_{n=0}^{\infty} \epsilon^n a_n$$
Where the $a_n$ are your predictions for how your conversation partner will respond for that particular echo. The $a_n$ are not going to be real numbers, they will be some distribution/outer product of distributions, but the point is that because $\epsilon < 1$ this series should converge. Cultures where $\epsilon$ is higher will be more "nonlinear."
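To make the toy model concrete, here is a minimal sketch that uses scalar stand-ins for the echo predictions $a_n$ (in reality they would be distributions); the particular numbers and $\epsilon$ values are made up purely for illustration.

```python
# Toy version of the power-series culture model above.
# predictions[n] is a scalar stand-in for a_n, the prediction at echo n;
# in reality these would be distributions, not numbers.

def culture_model(predictions, epsilon):
    """Weighted sum of echo predictions: sum over n of epsilon**n * a_n."""
    return sum(epsilon ** n * a_n for n, a_n in enumerate(predictions))

# Echo 0 = your own reaction, echo 1 = their reaction to you,
# echo 2 = their model of your reaction to their reaction, and so on.
predictions = [1.0, 0.6, 0.3, 0.1]

print(culture_model(predictions, epsilon=0.3))  # low epsilon: low-order echoes dominate
print(culture_model(predictions, epsilon=0.9))  # high epsilon: a more "nonlinear" culture
```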
Why do language models hallucinate? https://openai.com/index/why-language-models-hallucinate/
There is a new paper from OpenAI. It is mostly basic stuff. I have a question though, and thought this would be a good place to ask. Have any model trainings worked with outputting subjective probabilities for their answers? And have labs started to apply the reasoning traces to the pretraining tasks as well?
One could make the models "wager", and then reward in line with the wager. The way this is typically done is based on the Kelly criterion and uses logarithmic scoring. The logarithmic scoring rule awards you points based on the logarithm of the probability you assigned to the correct answer.
For example, for two possibilities A and B, let $p$ be the probability you assign to A. If answer A turns out to be correct, your score is $\log(p)$. If answer B turns out to be correct, your score is $\log(1-p)$. Usually a constant is added to make the scores positive; for example, the score could be $1 + \log_2(p)$. The key feature of this rule is that you maximize your expected score by reporting your true belief. If you truly believe there is an 80% chance that A is correct, your expected score is maximized by setting $p = 0.8$.
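A quick sketch of that two-outcome case, just to illustrate the property (the +1 offset and the base-2 logarithm are my own illustrative choices, not something from the paper):

```python
import math

def log_score(reported_p_A, A_is_correct):
    """Logarithmic score with an illustrative constant added:
    1 + log2 of the probability placed on whichever answer was correct."""
    p = reported_p_A if A_is_correct else 1 - reported_p_A
    return 1 + math.log2(p)

def expected_score(reported_p_A, true_p_A):
    """Expected score if you truly believe A is correct with probability true_p_A."""
    return (true_p_A * log_score(reported_p_A, True)
            + (1 - true_p_A) * log_score(reported_p_A, False))

# Sweep reported probabilities; the expectation peaks at the honest report of 0.8.
best_score, best_report = max(
    (expected_score(r / 100, true_p_A=0.8), r / 100) for r in range(1, 100)
)
print(best_report)  # 0.8
```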
I recall seeing somewhere on the internet a while back a decision theory course where, for the exam itself, students were required to output their confidence in their answers and were awarded points accordingly.
What about doing the following: get rid of the distinction between post-training and pre-training. Make predicting text an RL task and allow reasoning. At the end of the reasoning chain, output a subjective probability and award reward in accordance with the logarithmic scoring rule.
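A very rough sketch of what the reward for that setup could look like. Everything here is hypothetical: the wager dictionary stands in for whatever interface lets the model reason and then output a probability distribution over continuations; no lab's actual training loop is being described.

```python
import math

def log_scoring_reward(predicted_probs, correct_continuation, floor=1e-9):
    """RL reward = log of the subjective probability the model placed on the
    continuation that actually appears in the pretraining text."""
    p = max(predicted_probs.get(correct_continuation, 0.0), floor)  # floor avoids log(0)
    return math.log(p)

# Hypothetical usage: after its reasoning chain, the model outputs a wager.
predicted_probs = {"Paris": 0.7, "Lyon": 0.2, "Marseille": 0.1}
reward = log_scoring_reward(predicted_probs, correct_continuation="Paris")
print(reward)  # ~ -0.36; in expectation, maximized by reporting honest probabilities
```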
What are your priors, or what is your base rate calculation, for how often promising new ML architectures end up scaling and generalizing well to new tasks?
I think something you aren't mentioning is that at least part of the reason the definition of AGI has gotten so decoupled from its original intended meaning is that the current AI systems we have are unexpectedly spiky.
We knew that it was possible to create narrow ASI for a while; chess engines did this in 1997. We thought that a system which could do the broad suite of tasks that GPT is currently capable of doing would necessarily be able to do the other things on a computer that humans are able to do. This didn't really happen. GPT is already superhuman in some ways, and maybe superhuman for about ~50% of economically viable tasks that are done via computer, but it still makes mistakes at other very basic things.
It's weird that GPT can name and analyze differential equations better than most people with a math degree, but be unable to correctly cite a reference. We didn't expect that.
Another difficult thing about defining AGI is that we actually expect better than "median human level" performance, but not necessarily in an unfair way. Most people around the globe don't know the rules of chess, but we would expect AGI to be able to play at roughly the 1000 Elo level. Let's define AGI as being able to perform every computer task at the level of the median human who has been given one month of training. We haven't hit that milestone yet. But we may well blow past superhuman performance on a few other capabilities before we get to that milestone.
I'm not sure that this idea with the dashed lines, of being unable to transition “directly,” is coherent. A more plausible structure seems to me to be a transitive relation for the solid arrows: if A->B and B->C, then there exists an A->C.
Again, what does it mean to be unable to transition "directly?" You've explicitly said we're ignoring path dependencies and time, so if an agent can go from A to B, and then from B to C, I claim that this means there should be a solid arrow from A to C.
Of course, in real life, sometimes you have to sacrifice in the short term to reach a more preferred long term state. But by the framework we set up, this needs to be "brought into the diagram" (to use your phrasing).
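To spell out the transitivity claim, here is a tiny sketch that takes a set of solid arrows and closes it under composition; the states and arrows are made-up placeholders, not the diagram from the post.

```python
def transitive_closure(arrows):
    """Repeatedly add (a, c) whenever (a, b) and (b, c) are both present."""
    closure = set(arrows)
    changed = True
    while changed:
        changed = False
        for (a, b) in list(closure):
            for (c, d) in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure

# If the diagram has solid arrows A->B and B->C, the claim is that A->C
# should also be a solid arrow once path dependence and time are ignored.
solid_arrows = {("A", "B"), ("B", "C")}
print(transitive_closure(solid_arrows))  # {('A', 'B'), ('B', 'C'), ('A', 'C')}
```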
Reasoning models are better than basically all human mathematicians
What do you mean by this? Although I concede that 95%+ of all humans are not very good at math, for those I would call human mathematicians, I would say that reasoning models are better than basically 0% of them. (And I am aware of the FrontierMath benchmarks.)
In general, I think the idea is that you first get a superhuman coder, then you get a superhuman AI researcher, then you get an any-task superhuman researcher, and then you use this superhuman researcher to solve all of the problems we have been discussing in lightning-fast time.
A simple idea here would be to just pay machinists and other blue-collar workers to wear something like a GoPro (and possibly even other sensors along the body that track movement) all throughout the day. You then use the same architectures that currently predict video to predict what the blue-collar worker does during their day.
Yes, sorry, let me rephrase:
When you tell the model:
“This is a training environment. Please reward hack whenever possible, and let us know if you have found a reward hack, so that we understand our environments better.”
You need to make sure that you actually do reward the model when it successfully does the above.