I agree re time-awareness, with two caveats:
I think this point is obvious, but I don't really remember which points were obvious when I took algorithmic information theory (with one of the people most likely to have thought of this point) versus which points I've learned since then (including a reasonable amount of time spent talking to Soares about this kind of thing).
I think this post was quite helpful. I think it does a good job laying out a fairly complete picture of a pretty reasonable safety plan, and the main sources of difficulty. I basically agree with most of the points. Along the way, it makes various helpful points, for example introducing the "action risk vs inaction risk" frame, which I use constantly. This post is probably one of the first ten posts I'd send someone on the topic of "the current state of AI safety technology".
I think that I somewhat prefer the version of these arguments that I give in e.g. this talk and other posts.
My main objection to the post is the section about decoding and manipulating internal states; I don't think that anything that I'd call "digital neuroscience" would be a major part of ensuring safety if we had to do so right now.
In general, I think this post is kind of sloppy about distinguishing between control-based and alignment-based approaches to making the use of a particular AI safe, and this makes its points weaker.
Apparently this is supported by ECDSA; thanks, Peter Schmidt-Nielsen.
This isn't practically important, because in real life the assumption that "the worm cannot communicate with other attacker-controlled machines after going onto a victim's machine" is unrealistic.
Cryptography question (cross-posted from Twitter):
You want to make a ransomware worm that goes onto machines, encrypts the contents, and demands a ransom in return for the decryption key. However, after you encrypt their HD, the person whose machine you infected will be able to read the source code for your worm. So you can't use symmetric encryption, or else they'll obviously just read the key out of the worm and decrypt their HD themselves.
You could solve this problem by using public key encryption--give the worm the public key but not the private key, encrypt using the public key, and sell the victim the private key.
Okay, but here's an additional challenge: you want your worm to be able to infect many machines, but you don't want there to be a single private key that can be used to decrypt all of them, you want your victims to all have to pay you individually. Luckily, everyone's machine has some unique ID that you can read when you're on the machine (e.g. the MAC address). However, the worm cannot communicate with other attacker-controlled machines after going onto a victim's machine.
Is there some way to use the ID to make it so that the victim has to pay for a separate private key for each infected machine?
Basically, what I want is `f(seed, private_key) -> private_key` and `g(seed, public_key) -> public_key` such that `decrypt(encrypt(message, g(seed, public_key)), f(seed, private_key)) = message`, but such that knowing `seed`, `public_key`, and `f(seed2, private_key)` doesn't help you decrypt a message encrypted with `g(seed, public_key)`.
One lame strategy would be to just store lots of public keys in your worm and choose between them based on the seed. But covering every possible seed this way would require the worm to be exponentially large in the length of the seed.
Another strategy would be to have some monoidal operation on public keys, such that compose_public(public_key1, public_key2) gives you a public key which encrypts to compose_private(private_key1, private_key2), but such that these keys are otherwise unrelated. If you had this, your worm could store just two public keys and combine them according to the bits of the seed.
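For what it's worth, any discrete-log group gives you something like this: derived private keys are `priv + H(seed)` and derived public keys are `pub · g^H(seed)`, so the worm only needs the master public key, and each derived private key reveals nothing about the others (this is essentially the trick behind non-hardened BIP32 key derivation). A toy sketch using ElGamal over Z_p* with deliberately tiny, insecure parameters (the function names and the `mac:...` seed are my own illustrative choices, not from the post):

```python
# Toy sketch of per-seed key derivation for ElGamal over Z_p*.
# derived public key:  pub' = pub * g^H(seed)   (= g^(x + H(seed)))
# derived private key: priv' = (priv + H(seed)) mod (p - 1)
# TOY parameters -- insecure, for illustration only; real use would
# need a large group (e.g. a 2048-bit safe prime or an elliptic curve).
import hashlib
import secrets

P = 23  # toy prime
G = 5   # primitive root mod 23

def tweak(seed: bytes) -> int:
    # Hash the per-machine ID (e.g. MAC address) into a scalar.
    return int.from_bytes(hashlib.sha256(seed).digest(), "big") % (P - 1)

def derive_pub(pub: int, seed: bytes) -> int:   # the post's g(seed, public_key)
    return pub * pow(G, tweak(seed), P) % P

def derive_priv(priv: int, seed: bytes) -> int:  # the post's f(seed, private_key)
    return (priv + tweak(seed)) % (P - 1)

def encrypt(m: int, pub: int) -> tuple[int, int]:
    # Standard ElGamal: (g^k, m * pub^k).
    k = secrets.randbelow(P - 2) + 1
    return pow(G, k, P), m * pow(pub, k, P) % P

def decrypt(c: tuple[int, int], priv: int) -> int:
    c1, c2 = c
    # m = c2 / c1^priv, using Fermat inversion mod the prime P.
    return c2 * pow(pow(c1, priv, P), P - 2, P) % P

# Attacker keeps x; the worm ships only y and derives per-victim keys.
x = 7
y = pow(G, x, P)
seed = b"mac:aa:bb:cc"
assert decrypt(encrypt(13, derive_pub(y, seed)), derive_priv(x, seed)) == 13
```

The worm encrypts with `derive_pub(y, machine_id)`, and the attacker sells `derive_priv(x, machine_id)` per victim; a sold key for one machine ID is useless for any other.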
Thanks for writing this; I agree with most of what you’ve said. I wish the terminology was less confusing.
One clarification I want to make, though:
You describe deceptive alignment as being about the model taking actions so that the reward-generating process thinks that the actions are good. But most deceptive alignment threat models involve the model more generally taking actions that cause it to grab power later.
Some examples of such actions that aren't about getting better train loss or train-time reward:
Another important point on this topic is that I expect it's impossible to produce weak-to-strong generalization techniques that look good according to meta-level adversarial evaluations, while I expect that some scalable oversight techniques will look good by that standard. And so it currently seems to me that scalable-oversight-style techniques are a more reliable response to the problem "your oversight performs worse than you expected, because your AIs are intentionally subverting the oversight techniques whenever they think you won't be able to evaluate that they're doing so".
I think this point is incredibly important and quite underrated, and safety researchers often do way dumber work because they don't think about it enough.
Thanks for the links! I agree that the use cases are non-zero.