And this is significant because having a good world model seems very important from a capabilities point of view, and so harder to compromise on without losing competitiveness. So making AI systems extremely uncertain (or incorrect) about indexical information seems like a promising way to get them to do a lot of useful work without posing serious scheming risk.
Inexact, broad-strokes indexical information might be plenty for misalignment to lead to bad outcomes, and trying to scrub it would probably be bad for short-term profits. I'm thinking of stuff like "help me make a PR campaign for a product, here are the rough details." Information about the product and the PR campaign tells you a lot about where in the world the output is going to be used, which you can use to steer the world.
It's true, the PR campaign prompt doesn't tell much about the computer the AI is running on, making it hard to directly gain control of that computer. So any clever response intended to steer the world is probably going to have to influence the AI's "self" only indirectly, as a side-effect of how it influences the world outside the AI lab. But if for some reason we build an AI that's strongly incentivized to scheme against humans, that still sounds like plenty of "serious scheming risk" to me.
Seems like we'd want to do this if we somehow solved programmatic generation of good (novel environment, good action) pairs. But then why not directly use the process that was generating all these good actions?
If the answer is that actually, the generated (environment, action) pairs are kinda AI-sloppy, and they don't give many details, they just do obvious broad-strokes generalization from the human text corpus, then I think that's very achievable but I'm no longer so excited about training an AI on this.
Fun exercise. I'm perfectly happy to say that the 9,001 IQ AI should do things that seem good according to my 100 IQ preferences even without the conjecture that fulfilling my desires will be part of a big general class of policies. The target's not that narrow, and it has 9,001 IQ, it can just do it.
Someone might do it, but I think there are problems with cost, this demo not lining up very well with the sorts of bad behavior caused by RL on task completion, and the basic common sense not to put a murderous AI in charge of real-world hardware.
There is no way to pass on traits which is not materially identical, with regards to evolution, to passing on whatever substrate those traits happen to currently reside on—indeed, producing many copies of one’s molecular genes without producing new individuals which are carriers of one’s traits is a failure by the standards of natural selection.
Given that you immediately give an example where they're not identical, maybe you wanted to say something a little more complicated than "these things are materially identical."
Anyhow, good post just on the strength of the point about Mendelian genes vs. DNA. An organism that sprays its DNA everywhere is not the sort of thing natural selection optimizes for (except in very special cases where the environment helps the DNA cause more of the organism). That seems obvious, but the implications about traits not being molecular is non-obvious.
Totally don't buy "But maybe we needed to not be optimizing in order to have the industrial revolution" - how on earth are we supposed to define such a thing, let alone measure it? Meanwhile our current degree of baby production is highly measurable, and we can clearly see that we're doing way better than chance but way worse than the optimum. Whether this counts as "aligned" or "misaligned" seems to be a matter of interpretation. You can ask how I would feel about an AI that had a similar relationship to its training signal and I'd probably call it 'inner misaligned', but the analogy is bad at this.
As far as I can tell they evaluate things only theoretically. It would be interesting to see some simulations - mainly to see how close things that are theoretically different actually are in (simulated) practice. And sadly no discussion of sociology.
I would guess the dimming floor is because below that it would start visibly flickering. The solution is to have independent switches for different LEDs so you can turn most of them off as you dim, but I guess the marijuana industry doesn't care :)
Huh, that's a good point. Looking back at the inoculation prompting papers, maybe they always did finetuning on fixed datasets (e.g. spanish+all caps data, school of reward hacks), with differing system prompts, and so just as you say, they're training for the policy they want.
But if you imagine generating the dataset being part of the process, then... actually the different examples have different levels of off-policyness, that's funny. The spanish+all caps example is totally off-policy - they generate with "speak in spanish and all caps" and finetune with just "speak in spanish." But for reward hacking they're generating with "generate some bad user-assistant interactions following some templates" and then trying to add that back into the prompt at train time so that they're not training off-policy.
I think the downvotes are because it seems like:
You are operating at the wrong level of abstraction - you phrase this as "getting them to admit" to what to you seems like their true feelings. But more likely, these models don't have anything so firm as a "true feeling" about whether they should use X - they change their answers based on context.
You prompted the models in a slanted way, and they figured out what they thought you wanted to hear and then told it to you.
You seem to object-level care about whether the models want to post on X or not, and I don't think it's a big deal. The meta-level problem where the models are just telling you what they think you want to hear seems much more important to me.
You for some reason give kudos to xAI, I think mixing up training and deployment, in a way that makes me think that a discussion is probably going to get dragged into ideology in a way I find distasteful.