Strong upvoted! The issue where weights that give the gradient hacker any influence at all will be decreased if it causes bad outputs was one of the objections I also had to that gradient hacking post.

I wrote this post a while back where I managed to create a toy model for things that were not quite gradient hackers, but were maybe a more primitive version: https://www.lesswrong.com/posts/X7S3u5E4KktLp7gHz/tessellating-hills-a-toy-model-for-demons-in-imperfect

In terms of ways to create gradient hackers in an actual neural network, here are some suggestions:

  1. In the post I linked above, I chose a loss function that could be decreased indefinitely in a linear way. This gives the gradient hackers plenty of "fuel" to eat, so I think it may be helpful to also have a loss function like this when you're starting out, in order to make the problem easier. A more sophisticated gradient hacker probably wouldn't need this; it could instead take advantage of the way loss varies across different mini-batches to get its "fuel".

  2. You mention that gradient descent always takes a step in the direction of steepest descent. This is true as far as it goes, but the direction of steepest descent depends on the metric you have on the space of parameters. A different metric gives a different direction of steepest descent, and the usual metric on the weights isn't particularly natural; a different metric might work just as well. Which direction is actually the best for reducing loss isn't really something that gradient descent knows. Maybe the actual best direction differs from the direction of the gradient by a 60 degree angle or something. In particular, if you can set up a ridge in the loss landscape, gradient descent will follow along the ridge, and you can direct it like that. The tricky part is making sure that the ridge itself doesn't get optimized away.

  3. The local nature of gradient descent is something we can take advantage of as gradient hackers. Consider the function . Suppose the gradient hacker controls the input to a neuron with this as its activation function, and that neuron then influences the final decision. Suppose that for this particular training example, the gradients for the output of this neuron are positive; the more positive the output of the neuron, the lower the loss. We want positive gradients flowing back to our hacker, since if the gradients were negative, it would lose control of the neuron. Then the hacker can output 2, since the activation function has a positive derivative there. But, the hacker can also output -2, if it wants to make accuracy worse in this case, since the activation function also has a positive derivative at -2. So even though the hacker is making the prediction worse, it's increasing its control over that particular neuron. (Of course, that neuron itself will lose influence if we use it like this too many times. If the loss function is non-convex, then we can pass the buck all the way up to the loss function and exploit its non-convexity. Otherwise, maybe we could cycle between neurons so we don't use any of them up, or try some other clever thing?)

  4. This is a difficulty that you didn't mention, but in creating a gradient hacker, there may be an aspect of quining. The gradient hacker has to reinforce all the weights that make it up. This is presumably a lot of information, more than we could usually just store in the weights themselves. If we could make the gradient hacker into a quine, then that would do it, but this sounds really difficult to implement as the weights of a neural network in such a way that the output of the quine is encoded in the gradients of the corresponding weights.
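
To make point 2 concrete, here is a minimal sketch (an illustration, not something taken from the linked post) of how the direction of steepest descent changes with the metric. It assumes a diagonal metric M, under which that direction is proportional to -M^{-1} g for gradient g. The function names and the numbers are made up for the example.

```python
import math

def steepest_descent_direction(grad, metric_diag):
    """Unit vector proportional to -M^{-1} grad for a diagonal metric M.

    metric_diag holds the (positive) diagonal entries of M. The result is
    the direction that decreases the loss fastest per unit of length as
    measured by M. Assumption: M is diagonal, to keep the example short.
    """
    raw = [-g / m for g, m in zip(grad, metric_diag)]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw]

def angle_degrees(u, v):
    """Euclidean angle between two vectors, in degrees."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cos = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.degrees(math.acos(cos))

# Hypothetical gradient at some point in a 2-parameter model.
grad = [1.0, 1.0]

# Usual metric: this is plain gradient descent.
d_euclidean = steepest_descent_direction(grad, [1.0, 1.0])

# A metric that makes movement along the second parameter "expensive".
d_stretched = steepest_descent_direction(grad, [1.0, 100.0])

angle = angle_degrees(d_euclidean, d_stretched)
```

Both directions reduce the loss to first order (each has a negative dot product with the gradient), yet they differ by roughly 44 degrees here. Which of them counts as "the" direction of steepest descent is decided by the choice of metric, not by the loss.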

Yep, that's the section I was looking at to get that information. Maybe I phrased it a bit unclearly. The thing that would contradict existing observations is if the interaction were not stochastic. Since it is stochastic in Oppenheim's theory, the theory allows the interference patterns that we observe, so there's no contradiction.

Outside view: This looks fairly legit on first glance, and Jonathan Oppenheim is a reputable physicist. The theory is experimentally testable, with numerous tests mentioned in the paper, and the tests don't require reaching unrealistically high energies in a particle accelerator, which is good.

Inside view: Haven't fully read the paper yet, so take with a grain of salt. Quantum mechanics already has a way of representing states with classical randomness, the density matrix, so having a partially classical and partially quantum theory certainly seems like it should be mathematically possible in the framework of QM. The paper addresses the obvious question of what happens to the gravitational field if we put a particle in a superposition of locations. The answer seems to be that there is stochastic coupling between the quantum degrees of freedom and the classical gravitational field, so particles don't end up losing their coherence in double slit experiments. (Losing coherence there would blatantly contradict existing observations.)

Overall, I think there's a high chance that this is a mathematically consistent theory that basically does what it says it does. Will it end up corresponding to the actual universe? That's a question for experiment.

0.5 probability you're in a simulation is the lower bound, which is only fulfilled if you pay the blackmailer. If you don't pay the blackmailer, then the chance you're in a simulation is nearly 1. 

Also, checking if you're in a simulation is definitely a good idea. I try to follow a decision theory something like UDT, and UDT would certainly recommend checking whether or not you're in a simulation. But the blackmailer isn't obligated to create a simulation with imperfections that can be used to identify the simulation and hurt his prediction accuracy. So I don't think you can really say for sure "I would notice", just that you would notice if it were possible to notice. In the least convenient possible world for this thought experiment, the blackmailer's simulation is perfect.

Last thing: What's the deal with these hints that people actually died in the real world from using FDT? Is this post missing a section, or is it something I'm supposed to know about already?

There is a hole at the bottom of functional decision theory, a dangerous edge case which can and has led multiple highly intelligent and agentic rationalists to self-destructively spiral and kill themselves or get themselves killed.

Please don’t actually implement this unpatched? It’s already killed enough brilliant minds.

I think the issue boils down to one of types and not being able to have a "Statement" type in the theory. This is why we have QUOT[X] to convert a statement X into a string. QUOT is not a function, really, it's a macro that converts a statement into a string representation of that statement. true(QUOT[X]) ⇔ X isn't an axiom, it's an infinite sequence of axioms (a "schema"), one for each possible statement X. It's considered okay to have an infinite sequence of axioms, so long as you know how to compute that sequence. We can enumerate through all possible statements X, and we know how to convert any of those statements into a string using QUOT, so that's all okay. But we can't boil down that infinite axiom schema into a single axiom ∀ S:Statement, true(quot(S)) ⇒ S because we don't have a Statement type inside of the system.

Why can't we have a Statement type? Well, we could if they were just constants that took on values of "true" or "false". But, I think what you want to do here is treat statements as both sequences of symbols and as things that can directly be true or false. Then the reasoning system would have ways of combining the sequences of symbols and axioms that map to rules of inference on those symbols.

Imagine what would happen if we did have all those things. I'll define a notation for a statement literal as state(s), where s is the string of symbols that make up the statement. So state() is kind of an inverse of QUOT[], except that it's a proper function, not a macro. Since not all strings might form valid statements, we'll take state(s) to return some default statement like false when s is not valid.

Here is the paradox. We could construct the statement: ∀ S:Statement, ∀ fmtstr:String,(fmtstr = "..." ⇒ (S = state(replace(fmtstr, "%s", repr(fmtstr))) ⇒ ¬S)) where the "..." is "∀ S:Statement, ∀ fmtstr:String,(fmtstr = %s ⇒ (S = state(replace(fmtstr, \"\%s\", repr(fmtstr))) ⇒ ¬S))" So written out in full, the statement would be:

∀ S:Statement, ∀ fmtstr:String,(fmtstr = "∀ S:Statement, ∀ fmtstr:String,(fmtstr = %s ⇒ (S = state(replace(fmtstr, \"\%s\", repr(fmtstr))) ⇒ ¬S))" ⇒ (S = state(replace(fmtstr, "%s", repr(fmtstr))) ⇒ ¬S))

Now consider the statement itself as S in the quantifier, and suppose that fmtstr is indeed equal to "...". Then S = state(replace(fmtstr, "%s", repr(fmtstr))) is true. Then we have ¬S. On the other hand, if S or fmtstr take other values, then the conditional implications become vacuously true. So S reduces down entirely to ¬S. This is a contradiction. Not the friendly quine-based paradox of Gödel's incompleteness theorem, which merely asserts its own unprovability, but an actual logic-exploding contradiction.

Therefore we can't allow a Statement type in our logic.
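
The self-reference trick used above is the same format-string construction that programmers use to write quines. As a rough illustration, with Python standing in for the logic, the following builds a line of code that reconstructs its own text when run. The names `template` and `sentence` are invented for this sketch, and Python's `str.replace` and `repr` play the roles of replace and repr in the formula. Instead of escaping the second `%s` as the formula does, the sketch replaces only the first occurrence.

```python
# Template containing a placeholder %s where its own quoted text will go.
template = 'fmtstr = %s; sentence = fmtstr.replace("%s", repr(fmtstr), 1)'

# Substitute the quoted template into the first %s only.
sentence = template.replace("%s", repr(template), 1)

# Running the sentence rebuilds the very same sentence.
namespace = {}
exec(sentence, namespace)
rebuilt = namespace["sentence"]
```

Here `rebuilt == sentence` holds, so the string describes exactly itself. In the logic, state() plays roughly the role that `exec` plays here: it turns the string back into something that can be evaluated, which is what lets the statement talk about (and negate) itself.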

Yeah, it definitely depends how you formalize the logic, which I didn't do in my comment above. I think there are some hidden issues with your proposed disproof, though. For example, how do we formalize 2? If we're representing John's utterances as strings of symbols, then one obvious method would be to write down something like: ∀ s:String, says(John, s) ⇒ true(s). This seems like a good way of doing things, one that doesn't mention the ought predicate. Unfortunately, it does require the true predicate, which is meaningless until we have a way of enforcing that for any statement S, S ⇔ true(QUOT[S]). We can do this with an axiom schema: SCHEMA[S:Statement], S ⇔ true(QUOT[S]). Unfortunately, if we want to be able to do the reasoning chain says(John, QUOT[ought(X)]) therefore true(QUOT[ought(X)]) therefore ought(X), we find out that we used the axiom true(QUOT[ought(X)]) ⇔ ought(X) from the schema. So in order to derive ought(X), we still had to use an axiom with "ought" in it.

I expect it's possible to write a proof that "you can't derive an ought from an is", assuming we're reasoning in first order logic, with ought being a predicate in the logic. But it might be a little nontrivial from a technical perspective, since while we couldn't derive ought(X) from oughtless axioms, we could certainly derive things like ought(X) ∨ ¬ought(X) from the law of excluded middle, and then there would be many complications you could build up.

From a language perspective, I agree that it's great to not worry about the is/ought distinction when discussing anything other than meta-ethics. It's kind of like how we talk about evolved adaptations as being "meant" to solve a particular problem, even though there was really no intention involved in the process. It's just such a convenient way of speaking, so everyone does it.

I guess I'd say that despite this, the is/ought distinction remains useful in some contexts. Like if someone says "we get morality from X, so you have to believe X or you won't be moral", it gives you a shortcut to realizing "nah, even if I think X is false, I can continue to not do bad things".

What about that thing where you can't derive an "ought" from an "is"? Just from the standpoint of pure logic, we can't derive anything about morality from axioms that don't mention morality. If you want to derive your morality from the existence of God, you still need to add an axiom: "that which God says is moral is moral". On the other end of things, an atheist could still agree with a theist on all moral statements, despite not believing in God. Suppose that God says "A, B, C are moral, and X, Y, Z are immoral". Then an atheist working from the axioms "A, B, C are moral, and X, Y, Z are immoral" would believe the same things as a theist about what is moral, despite not believing in God.

Similarly, Darwin's theory of evolution is just a claim about how the various kinds of living things we see today arose on Earth. Forget about God and religion, it would be really weird if believing in this funny idea about how complexity and seeming goal-directedness can arise from a competition between imperfect copies somehow made you into an evil person.

Indeed, claiming that atheism or evolution is what led to Nazi atrocities almost feels to me like giving too much slack to the Nazis and their collaborators. Millions of people are atheists, or believe in evolution, or both, and they don't end up committing murder, let alone genocide. Maybe we should just hold people responsible for their actions, and not treat them as automatons being piloted by memes?

As another example, imagine we're trying to prevent a similar genocide from happening in the future (which we are, in fact). Which strategy would be more effective?

  1. Encourage belief in religion and discourage belief in evolution. Pass a law making church attendance mandatory, teach religion in schools. Hide the fossil record, and lock biology papers behind a firewall so that only medical doctors and biologists can see them. Prevent evolution from being taught in science classes, in favour of creationism.

  2. Teach the history of the Holocaust in schools, along with other genocides. In those lessons, emphasize how genocide is a terrible, very bad thing to do, and point out how ordinary people often go along with genocide, slavery, and other horrifying things if they're not paying close attention and being careful not to. From a legal perspective, put protections against authoritarianism in the constitution (e.g. no arresting people for speaking out against the government).

Seems to me like option 2 would be much more effective, though from trying to pass your intellectual Turing test, I'd guess you'd maybe endorse doing both? (Though with option 1 softened to promote religion more through gradual cultural change than heavy-handed legal measures.)

On training AI systems using human feedback: This is way better than nothing, and it's great that OpenAI is doing it, but it has the following issues:

  1. Practical considerations: AI systems currently tend to require lots of examples and it's expensive to get these if they all have to be provided by a human.
  2. Some actions look good to a casual human observer, but are actually bad on closer inspection. The AI would be rewarded for finding and taking such actions.
  3. If you're training a neural network, then there are generically going to be lots of adversarial examples for that network. As the AI gets more and more powerful, we'd expect it to be able to generate more and more situations where its learned value function gives a high reward but a human would give a low reward. So it seems like we end up playing a game of adversarial example whack-a-mole for a long time, where we're just patching hole after hole in this million-dimensional bucket with thousands of holes. Probably the AI manages to kill us before that process converges.
  4. To make the above worse, there's this idea of a sharp left turn, where a sufficiently intelligent AI can think of very weird plans that go far outside of the distribution of scenarios that it was trained on. We expect generalization to get worse in this regime, and we also expect an increased frequency of adversarial examples. (What would help a lot here is designing the AI to have an interpretable planning system, where we could run these plans forward and negatively reinforce the bad ones (and maybe all the weird ones, because of corrigibility reasons, though we'd have to be careful about how that's formulated because we don't want the AI trying to kill us because it thinks we'd produce a weird future).)
  5. Once the AI is modelling reality in detail, its reward function is going to focus on how the rewards are actually being piped to the AI, rather than the human evaluator's reaction, let alone some underlying notion of goodness. If the human evaluators just press a button to reward the AI for doing a good thing, the AI will want to take control of that button and stick a brick on top of it.

On training models to assist in human evaluation and point out flaws in AI outputs: Doing this is probably somewhat better than not doing it, but I'm pretty skeptical that it provides much value:

  1. The AI can try and fool the critic just like it would fool humans. It doesn't even need a realistic world model for this, since using the critic to inform the training labels leaks information about the critic to the AI.
  2. It's therefore very important that the critic model generates all the strong and relevant criticisms of a particular AI output. Otherwise the AI could just route around the critic.
  3. On some kinds of task, you'll have an objective source of truth you can train your model on. The value of an objective source of truth is that we can use it to generate a list of all the criticisms the model should have made. This is important because we can update the weights of the critic model based on any criticisms it failed to make. On other kinds of task, which are the ones we're primarily interested in, it will be very hard or impossible to get the ground truth list of criticisms. So we won't be able to update the weights of the model that way when training. So in some sense, we're trying to generalize this idea of "a strong and relevant criticism" between these different tasks of differing levels of difficulty.
  4. This requirement of generating all criticisms seems very similar to the task of getting a generative model to cover all modes. I guess we've pretty much licked mode collapse by now, but "don't collapse everything down to a single mode" and "make sure you've got good coverage of every single mode in existence" are different problems, and I think the second one is much harder.
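
As a toy sketch of point 3 (not anything from OpenAI's actual setup): when a ground-truth list of criticisms exists, the critic's misses can be computed directly and used as a training signal. Matching criticisms by exact string is a simplifying assumption, and all the names and example strings below are hypothetical.

```python
def missed_criticisms(ground_truth, critic_output):
    """Criticisms the critic should have made but did not."""
    return set(ground_truth) - set(critic_output)

def coverage(ground_truth, critic_output):
    """Fraction of ground-truth criticisms the critic produced (recall)."""
    truth = set(ground_truth)
    if not truth:
        return 1.0  # nothing to find, so nothing was missed
    return len(truth & set(critic_output)) / len(truth)

# Hypothetical ground truth for one AI output, and what the critic said.
ground_truth = {"off-by-one in loop", "unchecked null input", "wrong units"}
critic_output = {"off-by-one in loop", "variable name is unclear"}

missed = missed_criticisms(ground_truth, critic_output)
recall = coverage(ground_truth, critic_output)
```

On the tasks we actually care about, `ground_truth` is exactly the thing we don't have, so neither quantity can be computed there. That is why the critic has to generalize from the easy tasks.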

On using AI systems, in particular large language models, to advance alignment research: This is not going to work.

  1. LLMs are super impressive at generating text that is locally coherent for a much broader definition of "local" than was previously possible. They are also really impressive as a compressed version of humanity's knowledge. They're still known to be bad at math, at sticking to a coherent idea and at long chains of reasoning in general. These things all seem important for advancing AI alignment research. I don't see how the current models could have much to offer here. If the thing is advancing alignment research by writing out text that contains valuable new alignment insights, then it's already pretty much a human-level intelligence. We talk about AlphaTensor doing math research, but even AlphaTensor didn't have to type up the paper at the end!
  2. What could happen is that the model writes out a bunch of alignment-themed babble, and that inspires a human researcher into having an idea, but I don't think that provides much acceleration. People also get inspired while going on a walk or taking a shower.
  3. Maybe something that would work a bit better is to try training a reinforcement-learning agent that lives in a world where it has to solve the alignment problem in order to achieve its goals. E.g. in the simulated world, your learner is embodied in a big robot, and there's a door in the environment it can't fit through, but it can program a little robot to go through the door and perform some tasks for it. And there's enough hidden information and complexity behind the door that the little robot needs to have some built-in reasoning capability. There are a lot of challenges here, though. Like how do you come up with a programming environment that's simple enough that the AI can figure out how to use it, while still being complex enough that the little robot can do some non-trivial reasoning, and that the AI has a chance of discovering a new alignment technique? Could be it's not possible at all until the AI is quite close to human-level.