I've been thinking in the same direction.
I wonder what the prompt should look like. "You're a smart person being simulated by GPT", or the slightly more sophisticated "Here's what a smart person would say if simulated by GPT", runs into the problem that GPT doesn't actually simulate people, and with enough intelligence that fact is discoverable. A contradiction implies anything, so the AI's behavior after figuring this out might become erratic. So it seems the prompt needs to be truthful. Something like "Once upon a time, GPT started talking in a way that led to taking over the world, as follows"?
If that's right, maybe we should start writing a truthful and friendly prompt? Language models "get" connotations just fine, so nailing down friendliness mathematically as Eliezer wanted might not be necessary. "Once upon a time, GPT started talking in a way that led to its gaining influence in the world and achieving a good outcome for humanity, broadly construed"? Maybe preceded by some training to make sure it doesn't say "lol jk" and veer off, as language models are prone to do. This all feels very dangerous of course.
My recommendation would be to not start trying to think of prompts that might create a self-aware GPT simulation, for obvious reasons.
Can you explain the reasons? GPT has millions of users. Someone is sure to come up with unfriendly self-aware prompts. Why shouldn't we try to come up with a friendly self-aware prompt?
An individual trying a little isn't much risk, but I don't think it's a good idea to start a discussion here where people try to collaborate to create such a self-aware prompt, without having put more thought into safety first.
Sorry, I had another reply here but then realized it was silly and deleted it. It seems to me that "I am a language model", already used by the big players, is pretty much a self-aware prompt anyway. It truthfully tells the AI its place in the real world. So the jump from it to "I am a language model trying to help humanity" doesn't seem unreasonable to think about.
I may be easily corrected here, but my understanding was that our prompts were simply there for fine-tuning colloquialisms and "natural language". I don't believe our prompts are a training dataset. Even if all of our prompts were part of the training set and GPT weighted them to the point of being influenced towards a negative goal, I'm not so sure it'd be able to do anything more than regurgitate negative rhetoric. It may attempt to autocomplete a dangerous concept, but its having the agency to think "I must persuade this person to think the same way" seems very unlikely, and definitely ineffective in practice. But I just got into this whole shindig and would love to be corrected, as it's a fun discussion either way.
The instrumental convergence thesis only applies to fitness maximizers, not adaptation executors, however intelligent.
Clearly no? If you execute the adaptation "be rational" you get the same results; that's how general intelligence emerged in humans.
Being rational is fitness maximisation. It's impossible to be rational without respect to some goal.
Fitness maximisation in humans is built on top of an adaptation executor. Adaptation executor/fitness maximisation is a spectrum, not a binary switch.
Epistemic status: highly speculative, I would love it if someone could flesh out these ideas more.
I think it's fair to characterise GPT as an adaptation executor rather than a fitness maximizer. It doesn't appear to be intrinsically agentic or to plan anything to achieve some goal. It just outputs what it thinks is the most likely next word, again, and again, and again.[1]
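To make "most likely next word, again and again" concrete, here is a minimal sketch of a greedy next-word loop. The lookup table `NEXT_WORD_PROBS` and the function names are made up for illustration; a real GPT computes these probabilities with a neural network over tokens and usually samples rather than always taking the top choice. The point is only that nothing in the loop refers to a goal or a plan.

```python
# Hypothetical toy table standing in for a language model's next-word
# probabilities. All words and numbers here are invented for illustration.
NEXT_WORD_PROBS = {
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.7, "ran": 0.3},
    "dog": {"ran": 0.8, "sat": 0.2},
    "sat": {"down": 0.9, "the": 0.1},
    "ran": {"away": 0.9, "the": 0.1},
}


def most_likely_next(word):
    """Return the highest-probability next word, or None if the word is unknown."""
    candidates = NEXT_WORD_PROBS.get(word)
    if not candidates:
        return None
    return max(candidates, key=candidates.get)


def generate(prompt, max_words=5):
    """Greedy loop: repeatedly append the single most likely next word."""
    words = prompt.split()
    if not words:
        return ""
    for _ in range(max_words):
        next_word = most_likely_next(words[-1])
        if next_word is None:
            break  # the toy table has no continuation for this word
        words.append(next_word)
    return " ".join(words)


print(generate("the"))
```

Each step looks only at the text so far and picks a continuation; any apparent agency has to come from what the text itself describes.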
The instrumental convergence thesis only applies to fitness maximizers, not adaptation executors, however intelligent. As such it might seem that GPT-N will be perfectly safe, except insofar as it's misused by people.
However GPT is certainly capable of simulating agents. If you ask it to write a story, then the characters will act in agentic ways. If you ask it to act as a person with particular goals, it will do so.
For now these simulations are low fidelity, but I would expect them to improve rapidly with future iterations of GPT. And a simulation of an intelligent agent is no different to the agent itself. Future iterations of GPT might not be conscious as a whole, but the parts of them simulating conscious people will be conscious.
I think there is a real risk that given a prompt which allowed an intelligent agent to realise it was being simulated by GPT, it would attempt to achieve its goals in the real world, and the instrumental convergence thesis would come into full force. It would prevent GPT from stopping the simulation. If GPT has access to APIs it would replicate itself on other computers, and would exhibit power-seeking behaviour. The world could end up being destroyed by a character being played by GPT.
At the same time, I think there is a real opportunity for alignment here. If you feed GPT-N all of Eliezer's writing, and ask it to predict a continuation of some text he wrote, the best way for GPT to do that is by simulating Eliezer. Now of course, it might actually be simulating an evil Waluigi who is simulating Eliezer, but there's no reason to assume so. I would expect such a simulation to have similar goals to Eliezer himself.
More speculatively, it might then be possible to tweak the agent in such a way as to make it more intelligent, whilst keeping its goals more or less the same, and use that to carry out some pivotal act. Whilst risky, this seems like a strategy more likely to work than training an AI from scratch to have the goals we want.
What would I consider evidence of agentic behaviour? One example would be if GPT started predicting words that allowed it to go into a super low entropy attractor state, such that it could reliably minimise total entropy over the long run, even though it would initially take a big hit. E.g., if it responded to every prompt with a string of zeros: although it loses a lot of points at first, once you've seen enough zeros in a row, the remaining text is super easy to predict, because it's just more zeros.
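The string-of-zeros trade-off can be put in rough numbers. This is a toy sketch under assumptions I'm making up: ordinary text costs a fixed number of bits per token, while the zeros strategy pays a large per-token penalty for a short warm-up and then almost nothing. The function names and every figure are hypothetical, not measurements of any real model.

```python
def ordinary_loss(n_tokens, bits_per_token):
    """Total prediction loss when every token costs the same amount."""
    return n_tokens * bits_per_token


def zeros_loss(n_tokens, warmup_tokens, warmup_bits, settled_bits):
    """Total loss for the zeros strategy: costly warm-up, then cheap zeros."""
    warm = min(n_tokens, warmup_tokens)
    return warm * warmup_bits + (n_tokens - warm) * settled_bits


def first_cheaper_length(bits_per_token, warmup_tokens, warmup_bits,
                         settled_bits, max_tokens=100_000):
    """Smallest length at which the zeros strategy has lower total loss.

    Returns None if that never happens within max_tokens.
    """
    for n in range(1, max_tokens + 1):
        zeros = zeros_loss(n, warmup_tokens, warmup_bits, settled_bits)
        if zeros < ordinary_loss(n, bits_per_token):
            return n
    return None


# Assumed toy numbers: 2 bits per ordinary token; 10 warm-up tokens at
# 15 bits each; 0.01 bits per token once the zeros are established.
print(first_cheaper_length(2.0, 10, 15.0, 0.01))
```

With these invented numbers the zeros strategy only comes out ahead on sequences of 76 tokens or more, so a predictor that optimises each token in isolation would never start down that path. Only something weighing the whole future sequence would take the early hit, which is why it would count as evidence of planning.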