Currently studying postgrad at Edinburgh.
There are various ideas along the lines of "however much you tell the AI X it just forgets it". https://www.lesswrong.com/posts/BDXvRp8w9T8KkDw5A/policy-restrictions-and-secret-keeping-ai
I think that would be the direction to look in if you have a design that's safe as long as it doesn't know X.
There may be predictable errors in the training data, such that instrumental policy actually gets a lower loss than answering honestly (because it responds strategically to errors).
If you are answering questions as text, there is a lot of choice in wording. There are many strings of text that are a correct answer, and the AI has to pick the one the human would use. In order to predict how a human would word an answer, you need a fairly good understanding of how they think (I think).
Maybe you did. I find it hard to distinguish inventing and half remembering ideas.
If the training procedure either
Then using different copies of GPT-n trained from different seeds doesn't help.
If you just convert 1% of the English into network weights yourself, then all it needs to use is some error correction. Even without that, neural net structures are somewhat redundant. A malicious optimizer with a random 1% removed is probably still a malicious optimizer.
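The error-correction point can be made concrete with a toy sketch (my own illustration, not from the discussion: the bit payload and 3x repetition code are hypothetical stand-ins for network weights). Corrupting a random 1% of a redundantly-encoded payload almost never destroys the recovered content:

```python
import random

def encode(bits):
    # Trivial 3x repetition code: each bit stored three times.
    return [b for b in bits for _ in range(3)]

def decode(code):
    # Majority vote over each group of three copies.
    return [int(sum(code[i:i + 3]) >= 2) for i in range(0, len(code), 3)]

random.seed(0)
payload = [random.randint(0, 1) for _ in range(1000)]
code = encode(payload)

# Corrupt a random 1% of the encoded positions.
for i in random.sample(range(len(code)), len(code) // 100):
    code[i] ^= 1

recovered = decode(code)
mismatches = sum(a != b for a, b in zip(recovered, payload))
print(mismatches)  # almost always 0 or very close to it
```

A payload bit is only lost if two of its three copies happen to be hit, which is rare when only 1% of positions are corrupted; real networks are less cleanly redundant than this, but the qualitative point carries over.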
Secondly, its goal here is to give us subtly wrong alignment papers. Suppose we produce an examination and transparency step so thorough that no mesa-optimizer can slip through. The original mesa-optimizer can just design a network that is doing its best as a mesa-optimizer-free predictor, but has subtly, maliciously chosen beliefs about AI research.
I don't think that learning is moving around in code space. In the simplest case, the AI is like any other non-self-modifying program. The code stays fixed as the programmers wrote it. The variables update. The AI doesn't start from null. The programmer starts from a blank text file, and adds code. Then they run the code. The AI can start with sophisticated behaviour the moment it's turned on.
So are we talking about a program that could change from an X-er to a Y-er with a small change in the code written, or with a small amount of extra observation of the world?
There seems to be some technical problem with the link. It gives me an "Our apologies, your invite link has now expired (actually several hours ago, but we hate to rush people). We hope you had a really great time! :)" message. Edit: As of a few minutes after the stated start time. It worked last week.
My picture of an X-and-only-X-er is that the actual program you run should optimize only for X. I wasn't considering similarity in code space at all.
Getting the lexicographically first formal ZFC proof of, say, the Collatz conjecture should be safe. Getting a random proof sampled from the set of all proofs < 1 terabyte long should be safe. But I think that there exist proofs that wouldn't be safe. There might be a valid proof of the conjecture that had the code for a paperclip maximizer encoded into it, and that exploited some flaw in computers or humans to bootstrap this code into existence. This is what I want to avoid.
Your picture might be coherent and formalizable into some different technical definition. But you would have to start talking about distance in code space, which depends on the choice of programming language.
The program `if True: x() else: y()` is very similar in code space to `if False: x() else: y()`.
If code space is defined in terms of minimum edit distance, then layers of interpreters, error correction and homomorphic encryption can change it. This might be what you are after, I don't know.
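A quick sketch of why edit distance tracks syntax rather than behaviour (my own toy example; the program strings, including the interpreter-wrapped variant, are hypothetical and are only compared as text, never executed):

```python
def edit_distance(a: str, b: str) -> int:
    # Classic Levenshtein dynamic programme, row by row.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

# A few edits apart, opposite behaviour:
p1 = "x() if True else y()"
p2 = "x() if False else y()"

# Identical behaviour, but wrapped in a trivial interpreter layer,
# so much further away in edit distance:
p3 = "eval(compile('x() if True else y()', '<s>', 'eval'))"

print(edit_distance(p1, p2))  # small: a handful of edits
print(edit_distance(p1, p3))  # much larger
```

The two nearby programs do opposite things, while the behaviourally identical pair is dozens of edits apart, which is the sense in which interpreter layers "change" a code-space metric.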
On its face, this story contains some shaky arguments. In particular, Alpha is initially going to have 100x-1,000,000x more resources than Alice. Even if Alice grows its resources faster, the alignment tax would have to be very large for Alice to end up with control of a substantial fraction of the world’s resources.
This makes the hidden assumption that "resources" is a good abstraction in this scenario.
It is being assumed that the amount of resources an agent "has" is a well-defined quantity, that agents can only grow their resources slowly by reinvesting them, and that an agent can weather any sabotage attempt by agents with far fewer resources.
I think this assumption is blatantly untrue.
Companies can be sabotaged in all sorts of ways. Money or material resources can be subverted, so that while they are notionally in the control of X, they end up benefiting Y, or stolen outright. Taking over the world might depend on being the first party to develop self-replicating nanotech, which might require just insight and common lab equipment.
Don't think "The US military has nukes, the AI doesn't, so the US military has an advantage", think "one carefully crafted message and the nukes will land where the AI wants them to, and the military commanders will think it their own idea."
There are several extra features to consider. Firstly, even if you only test, that doesn't mean the skills weren't trained. Suppose there are lots of smart kids who really want to be astronauts, and that NASA puts its selection criteria somewhere easily available. The kids then study the skills they think they need to pass the selection. Any time there is any reason to think that skills X, Y and Z are a good combination, there will be more people with those skills than chance predicts.
There is also the dark side: Goodhart's curse. It is hard to select over a large number of people without selecting for lying sociopaths who are gaming your selection criteria.
It's not the probability of hallucinating full stop, it's the probability of hallucinating Omega or psychic powers in particular. Also, while "Omega" sounds implausible, there are much more plausible scenarios involving humans inventing advanced brain-scanning tech.
True, but the extra money goes to the scalper to pay for the scalper's time. The moment the makers started selling the PS5 too cheap, they were destroying value in search costs. Scalpers don't change that.