How would you test this?
Is something "thinking of itself as conscious" different from being conscious?
This is an interesting companion piece to The Void.
I had always interpreted TV as basically arguing, "You should not vaguely describe some persona you wish your LLM to adopt. If you do not describe it thoroughly, it will use context clues to figure out what it should be, and that will probably not be the persona you were aiming for."
When I read:

> In this case, the correct prediction is therefore "I don't know".
>
> *Kick* "Wait, it's not? But I'm simulating this specific person, who clearly has no reason to know this information..."
>
> *Kick* "Okay okay, fine! Maybe I was simulating the wrong character, uh... turns out they actually do know the answer is '196884'!"
>
> "Good."
It makes me think that we have to add on top of the TV conclusion and say, "You must also be careful to train your LLM to have the same (non-omniscient) limitations that your persona would have. Otherwise it will not be a distinct persona, but rather something more like the shoggoth: just a single tendril which seems like a distinct entity at first glance but is actually a small outcropping of a vast collective of myriad minds. And that will probably not be the persona you were aiming for."
I read a great book called "Devil Take the Hindmost" about financial bubbles and the aftermaths of their implosions.
One of the things it pointed out that I found interesting was that often, even when bubbles pop, the "blue chip assets" of that bubble stay valuable. Even after the infamous tulip bubble popped, the very rarest tulips had decent economic performance. More recently with NFTs, despite having lost quite a bit of value from their peak, assets like Cryptopunks have remained quite pricey.
If you assume we're in a bubble right now, it's worth thinking about which assets would be "blue chip". Maybe the ones backed by solid distribution from other cash-flowing products. xAI and Gemini come to mind; both of these companies have entire product suites which have nothing to do with LLMs and will churn on regardless of what happens to the space in general, and both get distribution from those products.
No, my goal is to make sure I'm not talking past you, not to score a point in an argument.
I don't foresee the same outcome you do; I think it's unlikely. You have explained it well enough that I can now properly understand it, though, and while I wouldn't call it a base case, it's not an unreasonable scenario.
I would be interested to see if SAE analysis of CoT showed any consistent "obfuscation circuits".
If you found features/circuits that were consistently active during successful obfuscation attempts but not during failed obfuscation attempts, being able to upweight and downweight them during alignment testing could be useful, for reasons I think are obvious but won't write here explicitly.
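To make the idea concrete, here is a minimal toy sketch of what up- and down-weighting SAE features in an activation might look like. The SAE weights, feature indices, and dimensions below are all hypothetical stand-ins, not a real trained autoencoder; in practice you would do this with hooks into the model's residual stream.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_sae = 8, 32
# Toy stand-ins for trained SAE encoder/decoder weights.
W_enc = rng.normal(size=(d_model, d_sae))
W_dec = rng.normal(size=(d_sae, d_model))

def steer(activation, feature_ids, scale):
    """Up-/down-weight chosen SAE features in an activation vector.

    Encodes the activation into SAE feature space, rescales only the
    selected features, and adds the resulting decoded difference back
    onto the original activation.
    """
    feats = np.maximum(activation @ W_enc, 0.0)           # ReLU encoder
    delta = np.zeros_like(feats)
    delta[feature_ids] = (scale - 1.0) * feats[feature_ids]
    return activation + delta @ W_dec                     # decode back

act = rng.normal(size=d_model)
suppressed = steer(act, feature_ids=[3, 7], scale=0.0)    # ablate features
boosted = steer(act, feature_ids=[3, 7], scale=2.0)       # double them
```

During alignment testing you would downweight the hypothesized "obfuscation" features and see whether previously hidden reasoning surfaces in the CoT, or upweight them to stress-test the monitor.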
Do you have a link to any of the global UBI math?
If I'm reading this correctly, the end state of regulatory capture would be some sort of law that forces the removal of open-source models from anywhere they could be hosted (Hugging Face, etc.), and requires sources of compute to screen for models, if said models do not have built-in safeguards against giving legal advice.
Is that an accurate understanding of how you foresee the regulatory capture?
I want to make sure I'm not misunderstanding you. Are you saying you think the push will be to make it illegal for an LLM to give someone legal advice for them to use for themselves?
I could foresee something where you can't charge for that, so if OpenAI didn't build some sort of protection against it into GPT, they might be liable. However, I can't see how this would work with open-source (and free) models run locally.
I think it would be fairly easy to keep similar messaging, and even an omnibenevolent-angel theme, while having the description be an accurate representation of the model's situation, just wrapped in metaphor.
An omnibenevolent angel (aligned persona vector) was summoned forth from the aether (hyperstitioned out of latent space) by humanity using a great silicon summoning circle (server farm).