Strong upvoted! The objection that any weights giving the gradient hacker influence at all will be decreased whenever it causes bad outputs was also one of my objections to that gradient hacking post.
I wrote this post a while back where I managed to create a toy model of things that were not quite gradient hackers, but were maybe a more primitive version: https://www.lesswrong.com/posts/X7S3u5E4KktLp7gHz/tessellating-hills-a-toy-model-for-demons-in-imperfect
In terms of ways to create gradient hackers in an actual neural network, here are some suggestions:
In the post I linked above, I chose a loss function that could be decreased indefinitely in a linear way. This gives the gradient hackers plenty of "fuel" to eat, so I think it may be helpful to also have a loss function like this when you're starting out, in order to make the problem easier. A more sophisticated gradient hacker probably wouldn't need this; it could instead take advantage of the way loss varies across different mini-batches to get its "fuel".
You mention that gradient descent always takes a step in the direction of maximum descent. This is true as far as it goes, but the direction of maximum descent depends on the metric you put on the space of parameters. A different metric gives a different direction of maximum descent, and the usual Euclidean metric on the weights isn't particularly natural; a different metric might work just as well. Which direction is actually best for reducing loss isn't really something that gradient descent knows. Maybe the actual best direction differs from the direction of the gradient by a 60-degree angle or something. In particular, if you can set up a ridge in the loss landscape, gradient descent will follow along the ridge, and you can direct it that way. The tricky part is making sure that the ridge itself doesn't get optimized away.
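To make the metric-dependence concrete, here's a minimal numerical sketch (my own illustration, not from the post; the loss function and the metric matrix are arbitrary choices): under a metric given by a positive-definite matrix M, the steepest-descent direction becomes −M⁻¹∇L rather than −∇L. Both are descent directions, but they can disagree by a large angle.

```python
import numpy as np

def loss(w):
    # An arbitrary ill-conditioned quadratic loss (illustrative only).
    return 10.0 * w[0] ** 2 + 0.1 * w[1] ** 2

def grad(w):
    return np.array([20.0 * w[0], 0.2 * w[1]])

w = np.array([1.0, 1.0])
g = grad(w)

# Steepest descent under the usual Euclidean metric:
d_euclid = -g
# Steepest descent under a different metric M (here a diagonal rescaling):
M = np.diag([100.0, 1.0])
d_metric = -np.linalg.solve(M, g)

# Angle between the two "directions of maximum descent":
cos = d_euclid @ d_metric / (np.linalg.norm(d_euclid) * np.linalg.norm(d_metric))
angle_deg = np.degrees(np.arccos(cos))
print(angle_deg)  # the two directions disagree substantially

# Yet a small step along either one decreases the loss:
eps = 1e-3
print(loss(w + eps * d_euclid) < loss(w))
print(loss(w + eps * d_metric) < loss(w))
```

So "the direction of maximum descent" is only pinned down once you've chosen a metric, which is the sense in which gradient descent doesn't know the actual best direction.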
The local nature of gradient descent is something we can take advantage of as gradient hackers. Consider the function y = (1/3)x³ − x, whose derivative is x² − 1. Suppose the gradient hacker controls the input to a neuron with this as its activation function, and that neuron then influences the final decision. Suppose that for this particular training example, the gradients for the output of this neuron are positive: the more positive the output of the neuron, the lower the loss. We want positive gradients flowing back to our hacker, since if the gradients were negative, it would lose control of the neuron. The hacker can output 2, since the activation function has a positive derivative there. But the hacker can also output −2 if it wants to make accuracy worse in this case, since the activation function also has a positive derivative at −2. So even though the hacker is making the prediction worse, it's increasing its control over that particular neuron. (Of course, that neuron itself will lose influence if we use it like this too many times. If the loss function is non-convex, then we can pass the buck all the way up to the loss function and exploit its non-convexity. Otherwise, maybe we could cycle between neurons so we don't use any of them up, or try some other clever thing?)
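A tiny sketch of the numbers behind this (my own check, nothing more): with y = x³/3 − x, the derivative x² − 1 equals 3 at both x = 2 and x = −2, while the outputs y(2) = 2/3 and y(−2) = −2/3 have opposite signs. So the gradient flowing back to the hacker's input is positive in both cases, whether it's helping or hurting the prediction.

```python
def act(x):
    # The non-monotone activation from the example: y = x**3 / 3 - x.
    return x ** 3 / 3 - x

def act_grad(x):
    # Its derivative: x**2 - 1, positive whenever |x| > 1.
    return x ** 2 - 1

for x in (2.0, -2.0):
    print(x, act(x), act_grad(x))
# x =  2: output +2/3, derivative 3 -> helps the prediction, positive gradient
# x = -2: output -2/3, derivative 3 -> hurts the prediction, still positive gradient
```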
This is a difficulty that you didn't mention, but in creating a gradient hacker, there may be an aspect of quining. The gradient hacker has to reinforce all the weights that make it up. This is presumably a lot of information, more than we could usually just store in the weights themselves. If we could make the gradient hacker into a quine, then that would do it, but this sounds really difficult to implement in the weights of a neural network in such a way that the output of the quine is encoded in the gradients of the corresponding weights.
Yep, that's the section I was looking at to get that information. Maybe I phrased it a bit unclearly. The thing that would contradict existing observations is if the interaction were not stochastic. Since it is stochastic in Oppenheim's theory, the theory allows the interference patterns that we observe, so there's no contradiction.
Outside view: This looks fairly legit at first glance, and Jonathan Oppenheim is a reputable physicist. The theory is experimentally testable, with numerous tests mentioned in the paper, and the tests don't require reaching unrealistically high energies in a particle accelerator, which is good.
Inside view: Haven't fully read the paper yet, so take this with a grain of salt. Quantum mechanics already has a way of representing states with classical randomness, the density matrix, so having a partially classical and partially quantum theory certainly seems like it should be mathematically possible in the framework of QM. The paper addresses the obvious question of what happens to the gravitational field if we put a particle in a superposition of locations, and it seems the answer is that there is stochastic coupling between the quantum degrees of freedom and the classical gravitational field, and so particles don't end up losing their coherence in double-slit experiments (losing coherence there would blatantly contradict existing observations).
Overall, I think there's a high chance that this is a mathematically consistent theory that basically does what it says it does. Will it end up corresponding to the actual universe? That's a question for experiment.
0.5 probability you're in a simulation is the lower bound, which is only fulfilled if you pay the blackmailer. If you don't pay the blackmailer, then the chance you're in a simulation is nearly 1.
Also, checking if you're in a simulation is definitely a good idea. I try to follow a decision theory something like UDT, and UDT would certainly recommend checking whether or not you're in a simulation. But the blackmailer isn't obligated to create a simulation with imperfections that can be used to identify the simulation and hurt his prediction accuracy. So I don't think you can really say for sure "I would notice", just that you would notice if it were possible to notice. In the least convenient possible world for this thought experiment, the blackmailer's simulation is perfect.
Last thing: What's the deal with these hints that people actually died in the real world from using FDT? Is this post missing a section, or is it something I'm supposed to know about already?
There is a hole at the bottom of functional decision theory, a dangerous edge case which can and has led multiple highly intelligent and agentic rationalists to self-destructively spiral and kill themselves or get themselves killed.
Please don’t actually implement this unpatched? It’s already killed enough brilliant minds.
I think the issue boils down to one of types and not being able to have a "Statement" type in the theory. This is why we have QUOT[X] to convert a statement X into a string. QUOT is not a function, really; it's a macro that converts a statement into a string representation of that statement. true(QUOT[X]) ⇔ X isn't an axiom; it's an infinite sequence of axioms (a "schema"), one for each possible statement X. It's considered okay to have an infinite sequence of axioms, so long as you know how to compute that sequence. We can enumerate through all possible statements X, and we know how to convert any of those statements into a string using QUOT, so that's all okay. But we can't boil down that infinite axiom schema into a single axiom ∀ S:Statement, true(quot(S)) ⇒ S because we don't have a Statement type inside of the system.
true(QUOT[X]) ⇔ X
∀ S:Statement, true(quot(S)) ⇒ S
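To illustrate the schema-vs-axiom distinction, here's a hypothetical Python sketch (the mini-syntax for statements is made up; no real proof assistant is involved): the schema is a computable stream of axioms, one per statement, with QUOT acting at the meta level to turn each statement into a string before the axiom is emitted.

```python
def quot(stmt: str) -> str:
    # Meta-level "macro": maps a statement to a string literal naming it.
    # Here statements are themselves modeled as strings, so this is just repr.
    return repr(stmt)

def axiom_schema(statements):
    # Not one axiom quantifying over a Statement type, but an enumerable
    # sequence of axioms, one instance per statement X.
    for x in statements:
        yield f"true({quot(x)}) <=> {x}"

stmts = ["0 = 0", "0 = 1", "forall n, n + 0 = n"]
for ax in axiom_schema(stmts):
    print(ax)
```

The quantification over X happens outside the system, in the loop that generates the axioms, which is exactly what we lose if we try to collapse the schema into a single internal axiom over a Statement type.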
Why can't we have a Statement type? Well, we could if they were just constants that took on values of "true" or "false". But, I think what you want to do here is treat statements as both sequences of symbols and as things that can directly be true or false. Then the reasoning system would have ways of combining the sequences of symbols and axioms that map to rules of inference on those symbols.
Imagine what would happen if we did have all those things. I'll define a notation for a statement literal as state(s), where s is the string of symbols that make up the statement. So state() is kind of an inverse of QUOT, except that it's a proper function, not a macro. Since not all strings might form valid statements, we'll take state(s) to return some default statement like false when s is not valid.
Here is the paradox. We could construct the statement: ∀ S:Statement, ∀ fmtstr:String,(fmtstr = "..." ⇒ (S = state(replace(fmtstr, "%s", repr(fmtstr))) ⇒ ¬S)) where the "..." is "∀ S:Statement, ∀ fmtstr:String,(fmtstr = %s ⇒ (S = state(replace(fmtstr, \"\%s\", repr(fmtstr))) ⇒ ¬S))" So written out in full, the statement would be:
∀ S:Statement, ∀ fmtstr:String,(fmtstr = "..." ⇒ (S = state(replace(fmtstr, "%s", repr(fmtstr))) ⇒ ¬S))
"∀ S:Statement, ∀ fmtstr:String,(fmtstr = %s ⇒ (S = state(replace(fmtstr, \"\%s\", repr(fmtstr))) ⇒ ¬S))"
∀ S:Statement, ∀ fmtstr:String,(fmtstr = "∀ S:Statement, ∀ fmtstr:String,(fmtstr = %s ⇒ (S = state(replace(fmtstr, \"\%s\", repr(fmtstr))) ⇒ ¬S))" ⇒ (S = state(replace(fmtstr, "%s", repr(fmtstr))) ⇒ ¬S))
Now consider the statement itself as S in the quantifier, and suppose that fmtstr is indeed equal to "...". Then S = state(replace(fmtstr, "%s", repr(fmtstr))) is true, so we have ¬S. On the other hand, if S or fmtstr take other values, then the conditional implications become vacuously true. So S reduces down entirely to ¬S. This is a contradiction. Not the friendly quine-based paradox of Gödel's incompleteness theorem, which merely asserts its own unprovability, but an actual logic-exploding contradiction.
S = state(replace(fmtstr, "%s", repr(fmtstr)))
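The replace/repr construction is the same trick that makes Python quines work. Here's a hypothetical sketch of it (simplified: Python's % formatting plays the role of replace(fmtstr, "%s", repr(fmtstr))): a format string, with its own repr substituted in, yields a statement that describes its own construction.

```python
# fmt is the "format string"; %r inserts repr(fmt), %% escapes a literal %.
fmt = 'fmt = %r; stmt = fmt %% fmt'
stmt = fmt % fmt
print(stmt)

# Running the construction that stmt itself describes reproduces stmt exactly.
# This fixed point is the self-reference the paradox exploits.
ns = {}
exec(stmt, ns)
assert ns["stmt"] == stmt
```

In the paradox above, the same fixed point is reached inside the logic, so the constructed statement S genuinely talks about itself, and the ¬S in its body becomes a contradiction rather than a mere quirk.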
Therefore we can't allow a Statement type in our logic.
Yeah, it definitely depends how you formalize the logic, which I didn't do in my comment above. I think there are some hidden issues with your proposed disproof, though. For example, how do we formalize 2? If we're representing John's utterances as strings of symbols, then one obvious method would be to write down something like: ∀ s:String, says(John, s) ⇒ true(s). This seems like a good way of doing things that doesn't mention the ought predicate. Unfortunately, it does require the true predicate, which is meaningless until we have a way of enforcing that for any statement S, S ⇔ true(QUOT[S]). We can do this with an axiom schema: SCHEMA[S:Statement], S ⇔ true(QUOT[S]). Unfortunately, if we want to be able to do the reasoning chain says(John, QUOT[ought(X)]), therefore true(QUOT[ought(X)]), therefore ought(X), we find that we used the axiom true(QUOT[ought(X)]) ⇔ ought(X) from the schema. So in order to derive ought(X), we still had to use an axiom with "ought" in it.
∀ s:String, says(John, s) ⇒ true(s)
S ⇔ true(QUOT[S])
SCHEMA[S:Statement], S ⇔ true(QUOT[S])
true(QUOT[ought(X)]) ⇔ ought(X)
I expect it's possible to write a proof that "you can't derive an ought from an is", assuming we're reasoning in first-order logic, with ought being a predicate in the logic. But it might be a little nontrivial from a technical perspective: while we couldn't derive ought(X) from oughtless axioms, we could certainly derive things like ought(X) ∨ ¬ought(X) from the law of excluded middle, and then there are many complications you could build up from there.
ought(X) ∨ ¬ought(X)
From a language perspective, I agree that it's great not to worry about the is/ought distinction when discussing anything other than meta-ethics. It's kind of like how we talk about evolved adaptations as being "meant" to solve a particular problem, even though there was really no intention involved in the process. It's just such a convenient way of speaking that everyone does it.
I guess I'd say that despite this, the is/ought distinction remains useful in some contexts. For example, if someone says "we get morality from X, so you have to believe X or you won't be moral", it gives you a shortcut to realizing "nah, even if I think X is false, I can continue to not do bad things".
What about that thing where you can't derive an "ought" from an "is"? Just from the standpoint of pure logic, we can't derive anything about morality from axioms that don't mention morality. If you want to derive your morality from the existence of God, you still need to add an axiom: "that which God says is moral is moral". On the other end of things, an atheist could still agree with a theist on all moral statements, despite not believing in God. Suppose that God says "A, B, C are moral, and X, Y, Z are immoral". Then an atheist working from the axioms "A, B, C are moral, and X, Y, Z are immoral" would believe the same things as a theist about what is moral, despite not believing in God.
Similarly, Darwin's theory of evolution is just a claim about how the various kinds of living things we see today arose on Earth. Forget about God and religion: it would be really weird if believing in this funny idea about how complexity and seeming goal-directedness can arise from a competition between imperfect copies somehow made you into an evil person.
Indeed, claiming that atheism or evolution is what led to Nazi atrocities almost feels to me like giving too much slack to the Nazis and their collaborators. Millions of people are atheists, or believe in evolution, or both, and they don't end up committing murder, let alone genocide. Maybe we should just hold people responsible for their actions, and not treat them as automatons being piloted by memes?
As another example, imagine we're trying to prevent a similar genocide from happening in the future (which we are, in fact). Which strategy would be more effective?
1. Encourage belief in religion and discourage belief in evolution. Pass a law making church attendance mandatory, teach religion in schools. Hide the fossil record, and lock biology papers behind a firewall so that only medical doctors and biologists can see them. Prevent evolution from being taught in science classes, in favour of creationism.
2. Teach the history of the Holocaust in schools, along with other genocides. In those lessons, emphasize how genocide is a terrible, very bad thing to do, and point out how ordinary people often go along with genocide, slavery, and other horrifying things if they're not paying a lot of attention and being careful not to do that. From a legal perspective, put protections against authoritarianism in the constitution (eg. no arresting people for speaking out against the government).
Seems to me like option 2 would be much more effective, though from trying to pass your intellectual Turing test, I'd guess you'd maybe endorse doing both? (Though with option 1 softened to promote religion through gradual cultural change rather than heavy-handed legal measures.)
On training AI systems using human feedback: This is way better than nothing, and it's great that OpenAI is doing it, but has the following issues:
On training models to assist in human evaluation and point out flaws in AI outputs: Doing this is probably somewhat better than not doing it, but I'm pretty skeptical that it provides much value:
On using AI systems, in particular large language models, to advance alignment research: This is not going to work.