RogerDearnaley

I'm a staff artificial intelligence engineer currently working with LLMs, and have been interested in AI alignment, safety and interpretability for the last 15 years. I'm now actively looking for employment working in this area.

Sequences

AI, Alignment, and Ethics

Comments

The history of autocracies and monarchies suggests that taking something with the ethical properties of an average human being and handing it unconstrained power doesn't usually work out very well. So yes, creating an aligned ASI that is safe for us to share a planet with does require creating something morally 'better' than most humans. I'm not sure it needs to be perfect and ideal, as long as it is good enough and aspires to improve: then it can help us create better training data for its upgraded next version, making that version closer to fully aligned; this is an implementation of Value Learning.

Answer by RogerDearnaley

I guess the way I look at it is that "alignment" means "an AI system whose terminal goal is to achieve your goals". The distinction here is then whether the word 'your' means something closer to:

  1. the current user making the current request
  2. the current user making the current request, as long as the request is legal and inside the terms of service
  3. the shareholders of the foundation lab that made the AI
  4. all (right-thinking) citizens of the country that foundation lab is in (and perhaps its allies)
  5. all humans everywhere, now and in the future
  6. all sapient living beings everywhere, now and in the future
  7. something even more inclusive

Your first option would be somewhere around items 5 or 6 on this list, while your second option would be closer to items 1, 2, or 3.

If AI doesn't kill or disenfranchise all of us, then which option on this spectrum of possibilities ends up being implemented is going to make a huge difference to how history will play out over the next few decades.

Another approach is doing alignment training during SGD pre-training by adding a significant amount of synthetic data demonstrating aligned behavior. See for example the discussion of this approach in A "Bitter Lesson" Approach to Aligning AGI and ASI, and similar discussions.

This is predicated on the assumption that alignment-faking will be more successfully eliminated by SGD during pre-training than by RL after instruct-training, because a) the feedback signal used is much denser, and b) during pre-training the model is a simulator for a wide variety of personas, rather than having been deliberately narrowed down to mostly simulate just a single viewpoint, so it won't have a consistent alignment and thus won't consistently display alignment-faking behavior.
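As a rough illustration (the dataset names and the mixing ratio here are placeholders of mine, not anything from the linked post), mixing synthetic aligned-behavior documents into the pre-training stream could look something like this with the Hugging Face `datasets` library:

```python
# Sketch: interleave a synthetic aligned-behavior corpus into the main
# pre-training data stream, so the dense SGD signal shapes behavior throughout
# pre-training rather than only during a later RL / instruct-tuning phase.
# Dataset names and the 5% mixing rate are placeholder assumptions.

from datasets import load_dataset, interleave_datasets

pretrain_corpus = load_dataset("my_org/web_corpus", split="train", streaming=True)
aligned_demos = load_dataset("my_org/synthetic_aligned_behavior", split="train", streaming=True)

mixed_stream = interleave_datasets(
    [pretrain_corpus, aligned_demos],
    probabilities=[0.95, 0.05],  # ~5% synthetic aligned-behavior data
    seed=42,
)
```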

Well summarized — very similar to the conclusions I'd previously reached when I read the paper.

Another variant would be, rather than replacing what you believe is structureless noise with actual structureless noise as an intervention, to simply always run the model with an additional noise term added to each neuron, or to the residual stream between each layer, or whatever, both during training and inference. (Combined with weight decay or a loss term on activation amplitudes, this soft-limits the information capacity of any specific path through the neural net.) This forces any real mechanisms in the model to operate above the background noise level. Then, once you understand how that noise level propagates through the model, it becomes clear that any unexplained noise below it is in fact structureless, since any structure would be washed out by the injected noise, whereas an unexplained noise level above it, while it could still be structureless, seems more likely to be unexplained structure.

(Note that this architectural change also gives the model a new non-linearity to use: in the presence of a fixed noise term, changes in activation norm near the noise level have non-linear effects.)
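Something like the following PyTorch sketch illustrates the idea (the noise scale and penalty coefficient are arbitrary placeholder values):

```python
import torch
import torch.nn as nn


class NoisyResidual(nn.Module):
    """Wraps a transformer block, adding fixed-scale Gaussian noise to its output."""

    def __init__(self, block: nn.Module, noise_std: float = 0.05):
        super().__init__()
        self.block = block
        self.noise_std = noise_std  # the known background noise floor

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.block(x)
        # Noise is always on (not gated by self.training), so it is injected
        # during both training and inference, and any mechanism the model
        # relies on must operate above this known background level.
        return out + self.noise_std * torch.randn_like(out)


def activation_amplitude_penalty(resid: torch.Tensor, coeff: float = 1e-4) -> torch.Tensor:
    """Loss term on activation amplitudes; combined with the injected noise,
    this soft-limits the information capacity of any path through the net."""
    return coeff * resid.pow(2).mean()


# During training: total_loss = task_loss + activation_amplitude_penalty(hidden_states)
```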

Quantizing model weights during training also has a somewhat similar effect, but is likely harder to analyze, since now the information capacity limit is per weight, not per data path.
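For concreteness, by "quantizing model weights during training" I mean something like quantization-aware training with a straight-through estimator; a minimal sketch (the bit width and scaling scheme are arbitrary choices):

```python
import torch


class QuantizeSTE(torch.autograd.Function):
    """Fake-quantize weights in the forward pass; pass gradients straight through."""

    @staticmethod
    def forward(ctx, w: torch.Tensor, n_bits: int = 8) -> torch.Tensor:
        # Symmetric per-tensor quantization: each weight carries at most
        # n_bits of information, i.e. the capacity limit is per weight.
        scale = w.abs().max() / (2 ** (n_bits - 1) - 1)
        return torch.round(w / scale) * scale

    @staticmethod
    def backward(ctx, grad_output):
        # Straight-through estimator: ignore the rounding in the backward pass.
        return grad_output, None


# Usage inside a layer's forward pass: y = x @ QuantizeSTE.apply(self.weight)
```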

Law-abiding – It cannot acquire money or compute illegally (fraud, theft, hacking, etc.) and must otherwise avoid breaking the law

Can it lobby? Run for office? Shop around for jurisdictions? Super-humanly persuade the electorate? Just find loopholes and workarounds to the law that make a corporate double-Irish tax arrangement look principled and simple?

Evolution was working within tight computational efficiency limits (the human brain burns roughly 1/6 of our total calories), using an evolutionary algorithm, which is significantly less efficient than gradient-descent training, and we're now running the human brain well outside its training distribution (there were no condoms on the Savannah). Nevertheless, the human population is 8 billion and counting, and we dominate basically every terrestrial ecosystem on the planet. I think some people overplay how much inner alignment failure there is between human instincts and human genetic fitness.

So:

  1. Use a model large enough to learn what you're trying to teach it
  2. Use stochastic gradient descent
  3. Ask your AI to monitor for inner alignment problems (we do know Doritos are bad for us)
  4. Retrain if you find yourself far enough outside your training distribution that inner alignment issues are becoming a problem

That is an impressive (and amusing) capability!

Presumably the fine-tuning enhanced the model's previous experience with acrostic text. That also seems to have enhanced the ability to recognize and correctly explain that the text is an acrostic, even with only two letters of the acrostic currently in context. Presumably it's fairly common to have both an acrostic and an explanation of it in the same document. What I suspect is rarer in the training data is for the acrostic text to explain itself, as the model's response did here (though doubtless there are some examples somewhere). However, this is mostly just combining two skills, something LLMs are clearly capable of; the impressive part here is just that the model was aware, at the end of the second line, what word starting with "HE" the rest of the acrostic was going to spell out.

It would be interesting to look at this in activation space: does the model already have a strong internal activation somewhere inside it for "HELLO" (or perhaps "H… E… L… L… O…") even while it's working on generating the first or second line? It presumably needs to have something like this to be able to generate acrostics, and previous work has suggested that there are directions for "words starting with the letter <X>" in the latent spaces of typical models.
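A rough sketch of how one might start looking for this, using TransformerLens (the model, the layer, and the crude probe direction taken from the unembedding of the single token " hello" are all stand-in assumptions; doing this seriously would need the actual fine-tuned model and a properly trained probe):

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # stand-in for the fine-tuned model

# A prompt ending partway through the second line of an acrostic-in-progress.
prompt = "Happy mornings greet the sun,\nEvery shadow starts to"
tokens = model.to_tokens(prompt)
_, cache = model.run_with_cache(tokens)

# Crude probe direction: the unembedding vector of the single token " hello".
hello_dir = model.W_U[:, model.to_single_token(" hello")]
hello_dir = hello_dir / hello_dir.norm()

layer = 8  # arbitrary mid-depth layer; in practice one would sweep all layers
resid = cache["resid_post", layer][0]                # [seq_len, d_model]
resid = resid / resid.norm(dim=-1, keepdim=True)
print(resid @ hello_dir)  # cosine similarity with the probe at each token position
```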

If we had evolved in an environment in which the only requirement on physical laws/rules was that they are Turing computable (and thus that they didn't have a lot of symmetries or conservation laws or natural abstractions), then in general the only way to make predictions is to do roughly as much computation as your environment is doing. This generally requires your brain to be roughly equal in computational capacity, and thus similar in size, to the entire rest of its environment (including its body). This is not an environment in which the initial evolution of life is viable (nor, indeed, any form of reproduction). So, to slightly abuse the anthropic principle, we don't need to worry about it.

Darn, exactly the project I was hoping to do at MATS! :-) Nice work!

There's pretty suggestive evidence that the LLM first decides to refuse (and emits tokens like "I'm sorry"), then later writes a justification for refusing (see some of the hilarious reasons generated for not telling you how to make a teddy bear, after being activation-engineered into refusing this). So I would view arguing anything about the nature of the refusal process from the text of the refusal-justification given afterwards as circumstantial evidence at best. But then you have direct gradient evidence that these directions matter, so I suppose the refusal texts you quote, if considered just as an argument as to why it's sensible model behavior that that direction ought to matter (as opposed to evidence that it does), are helpful. However, I think you might want to make this distinction clearer in your write-up.

Looking through Latent 2213, my impression is that a) it mostly triggers on a wide variety of innocuous-looking tokens indicating the ends of phrases (so likely it's summarizing those phrases), and b) those phrases tend to be about a legal, medical, or social process or chain of consequences causing something really bad to happen (e.g. cancer, sexual abuse, poisoning). This also rather fits with the set of latents that it has significant cosine similarity to. So I'd summarize it as "a complex or technically-involved process leading to a dramatically bad outcome".

If that's accurate, then it tending to trigger the refusal direction makes a lot of sense.
