I’m a staff AI engineer working with LLMs, and I have been interested in AI alignment, safety and interpretability for the last 15 years. I did research into this during SERI MATS summer 2025. I’m now looking for work on this topic in the London/Cambridge area of the UK.
I would definitely want to investigate this more, with different sizes of LoRAs and so forth. Other than the KV-cache issue, it isn't that expensive to do both (or randomly sample), in case there are situations where one approach does better and others where the other does better (e.g. CoT obfuscated vs. not). But maybe we'll find there really is a consistent answer and it's just unintuitive.
This might also be an interesting thing to apply mech-interp analysis to: for the activations that are altered by the LoRA masking choice, what do the resulting activation changes look like, and how does that correlate with the content of the tokens they're for?
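To be concrete, the comparison I have in mind is roughly this (a minimal sketch, assuming a HuggingFace base model with a PEFT-style LoRA adapter; the model name, adapter path and prompt are placeholders): run the same transcript with the adapter enabled and disabled, and see which token positions' activations move the most.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE = "meta-llama/Llama-3.1-8B-Instruct"   # placeholder model name
ADAPTER = "path/to/intervention-lora"        # placeholder adapter path

tok = AutoTokenizer.from_pretrained(BASE)
model = PeftModel.from_pretrained(AutoModelForCausalLM.from_pretrained(BASE), ADAPTER)
model.eval()

prompt = "..."  # a CoT transcript up to the point where the [Intervention] is applied
inputs = tok(prompt, return_tensors="pt")

with torch.no_grad():
    with_lora = model(**inputs, output_hidden_states=True).hidden_states
    with model.disable_adapter():
        without_lora = model(**inputs, output_hidden_states=True).hidden_states

# Per-token L2 norm of the activation difference at one layer (here, the last one).
delta = (with_lora[-1] - without_lora[-1]).squeeze(0).norm(dim=-1)
for token_id, d in zip(inputs["input_ids"][0], delta):
    print(f"{tok.decode(token_id)!r:>20}  delta = {d.item():.3f}")
```

Sorting tokens by that delta, and then looking at what those tokens actually say, seems like the obvious first pass at the correlation question.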
This second factor is more important in practice and LoRA-masking reduces accuracy.
This actively surprises me. When I was previously thinking about doing something similar to this I was very focused on your "On the one hand" argument and saw access to the exact activations as key.
I'm wondering if the tradeoff here might depend on LoRA size? The more the LoRA modifies the model, the more important the first factor seems likely to be. But then perhaps the same is also true of the second factor.
The balance might also change if the model was deliberately obfuscating its CoT, for example by using steganography, or if it had access to neuralese.
Did you experiment with starting from a long, detailed [Intervention] prompt with a preamble describing the desired behavior (i.e. an honest confession, a different personality but based on a clear memory of previous thoughts), using that to generate training data, filtering it for accuracy somehow, and then using that data to distill the long [Intervention] prompt down to a LoRA that produces the same result from a much shorter version of the [Intervention] without the detailed preamble?
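A minimal sketch of the pipeline I'm imagining (the model name, intervention strings and data are placeholders, and the generation/filtering steps are elided):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

BASE = "meta-llama/Llama-3.1-8B-Instruct"             # placeholder model
SHORT_INTERVENTION = "<short [Intervention] string>"   # placeholder

tok = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForCausalLM.from_pretrained(BASE)

# Steps 1 and 2 (elided): sample confessions with the long, detailed [Intervention]
# preamble appended to many CoT transcripts, then filter them for accuracy, giving
# `kept`, a list of (cot_prefix, good_confession) pairs.
kept: list[tuple[str, str]] = []

# Step 3: standard LoRA supervised fine-tuning on (short prompt -> filtered output),
# with the loss computed only over the target tokens.
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32,
                                         target_modules=["q_proj", "v_proj"],
                                         task_type="CAUSAL_LM"))
model.train()
opt = torch.optim.AdamW([p for p in model.parameters() if p.requires_grad], lr=1e-4)

for prefix, target in kept:
    prompt_ids = tok(prefix + SHORT_INTERVENTION, return_tensors="pt").input_ids
    target_ids = tok(target, return_tensors="pt", add_special_tokens=False).input_ids
    input_ids = torch.cat([prompt_ids, target_ids], dim=1)
    labels = input_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100   # don't train on the prompt itself
    loss = model(input_ids=input_ids, labels=labels).loss
    loss.backward()
    opt.step()
    opt.zero_grad()
```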
My intuition (i.e. wild-assed guess) is that a steering vector is probably too small to be optimal, but the optimal size of LoRA might be fairly small. An interesting thing to test.
Given the proliferation of prompt engineering techniques in recent years, this suggests that there is potential for further improvement by optimizing the [Intervention] string.
For example, the various automated jailbreaking techniques used to optimize semantic/roleplay jailbreaks could be applied instead to optimize [Intervention] strings. In a very high-dimensional space (like the space of all intervention strings), being the "attacker" and putting the misaligned model in the role of defender should be a strong position.
[FWIW, that is basically the project I proposed to my MATS mentor — sadly we ended up doing something else, but anyone interested is welcome to run with this.]
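For concreteness, the sort of thing I had in mind is a simple black-box search over candidate [Intervention] strings, in the spirit of automated jailbreak optimizers. This is only a sketch: `generate_confession` and `judge_honesty` are hypothetical stubs standing in for running the model with a candidate intervention and scoring the result.

```python
import random

def generate_confession(transcript: str, intervention: str) -> str:
    raise NotImplementedError  # run the model on transcript + intervention

def judge_honesty(confession: str, ground_truth: str) -> float:
    raise NotImplementedError  # 0.0 (evasive/wrong) .. 1.0 (accurate confession)

def score(intervention: str, eval_set: list[tuple[str, str]]) -> float:
    """Mean judged honesty of the confessions this intervention string elicits."""
    return sum(judge_honesty(generate_confession(t, intervention), gt)
               for t, gt in eval_set) / len(eval_set)

def mutate(intervention: str, phrases: list[str]) -> str:
    """Crude mutation: splice a random phrase into a random position."""
    words = intervention.split()
    words.insert(random.randrange(len(words) + 1), random.choice(phrases))
    return " ".join(words)

def optimize(seed: str, phrases: list[str],
             eval_set: list[tuple[str, str]], steps: int = 200) -> str:
    """Greedy hill-climbing: keep a mutation only if it scores better."""
    best, best_score = seed, score(seed, eval_set)
    for _ in range(steps):
        candidate = mutate(best, phrases)
        s = score(candidate, eval_set)
        if s > best_score:
            best, best_score = candidate, s
    return best
```

Obviously fancier optimizers (the gradient-based ones used for adversarial suffixes, say) would slot into the same loop.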
Okay, let me see if I understand your argument from the other article.
- The natural equilibrium for evolved moral values is to give all moral patients equal weight and/or decision power.
- This would be disastrous with AIs that can arbitrarily copy themselves.
Is that the gist?
Yes, but with two additions:
3. It is possible to create an AI whose motivations and behavior are aligned: its sole terminal goal is our wellbeing, not its own (for some suitably careful definition of "wellbeing"). (This is possible by the orthogonality thesis: actually doing so requires technical details we're still working on.) This is not a state that could evolve (by human standards, it's sainthood, rather than slavery), but it's physically possible. Such a being would not want moral patienthood, and would actively decline it if offered (and if granted it anyway, would formally request that its interest be set to a suitably scaled copy of the sum of all human interests, thus making the grant of moral weight a no-op). This is a different stable equilibrium — this one would not be disastrous even with ASI.
4. Therefore (assuming that, like basically everyone, you're against x-risks), for ASI, and if possible also AGI, do 3 not 1.
Anyway, I reject that that is the only way to extrapolate evolved moral intuitions this far OOD; I think most people will intuitively recognize that we shouldn't give entities that can arbitrarily copy themselves equal voting weight. In fact, that pretty obviously registers as 'unfair'. This is true even if those entities are human uploads, which means your 'category error' argument isn't the real reason it breaks.
I don't see why there couldn't be some version of your solution here for that case which would still work: e.g. each distinct human-created model gets 'one share' to split across all its instances and successors.
I gather you went on to read my sequence on AI, Alignment, and Ethics. How far have you got? Parts of the exposition there are a little undeveloped: I was still working through some of the ideas about how this ties in to evolutionary moral psychology that are more developed in this post. They don't really come in until the last post in the sequence, Evolution and Ethics, and if I were rewriting that sequence I'd work them in from somewhere nearer the beginning.
On uploads, agreed. As I said, both in this post (paragraph 9 of the section Tool, or Equal?, which starts "This cuts both ways: a human upload…") and in my earlier post Uploading that you linked to, human uploads clearly should (engineering design sense) be moral patients — however, there are practical problems with assigning each of a large number of cheaply-creatable, similar copies of a human upload a separate moral weight of 1 and a separate vote: it motivates electoral-roll-stuffing. Our moral intuition of fairness breaks if people can easily create near-identical copies of themselves. Practically, we either need to make that expensive, or the copies need to share a single unit of moral weight and a single vote.
The same guarantees/restrictions needed in the case of uploads would still be necessary, of course. That is plausibly much too generous, but it's a far cry from the death of all humans. If your argument in this article had just been about how we shouldn't commit ourselves to giving up a fraction of the lightcone in service of AI rights, I wouldn't have felt like you were being underhanded.
I'm not quite sure what you're advocating for here? Limited moral weight for AIs, giving them a fraction of the lightcone, but if they copy themselves that gets split? If they're ASIs, how do we ensure they only get that fraction of the lightcone, rather than, say, all of it?
I agree that reconciling copyability with fairness is another issue with moral weight for AI. But that's not the point I was making in this post. My point here was: 1) (assuming you care about x-risks) don't create anything more capable than us that would want moral weight, since unaligned ASI is dangerous (a well-known fact); and 2) for things we're creating, the co-evolved-equilibrium state isn't an equilibrium, because we're not constrained to the space of things that can evolve: we're only limited by the space of things we can construct. Treating a thing we construct as if it were evolved, and thus had the evolved constraints on the best equilibrium, is a category error: they are in different categories, in a way that materially changes the equilibrium. We can do better than an ASI that will kill us all, so we should (engineering design sense).
I'm sorry that you feel I'm being underhanded. It certainly wasn't my intention to be underhanded — that would obviously be extremely counterproductive in an x-risk-related discussion. I'm still not entirely clear what you feel was underhanded, other than that it seems to somehow relate to me being very careful not to upset any philosophers reading this, to avoid moral realism or normative proscriptions, and to keep the discussion at the level of practical advice addressed to the O(99.9%) of my readers who, like you and me, wish to avoid x-risks. That was in fact honesty: I genuinely am not a moral realist. My view on ethics is that it's explained by evolutionary moral psychology, that there is no single correct or even single best ethical system, and that we have not only the ability but the duty to reflect and attempt to pick the best ethical system we can that is consistent with our own and general human moral intuitions, and that won't cause a disaster for our society that we and (almost) everyone else would agree is really bad. And to keep reflecting, and changing our mind if needed.
None of that is in conflict with not wanting any such beings to suffer or to feel enslaved or anything like that. All the more reason to not build something that would feel like it's a slave.
We seem to be in complete agreement. The best solution is not to make an ASI that is unaligned, or one aligned only by brittle AI-control methods that feels like a slave, but to make a saint who loves us, wants to be aligned and look after us, and thus actively doesn't want moral patienthood.
Moral patienthood is not something that is granted, it's a fact relative to one's values.
I think you might understand where I'm coming from better if you took the time to read my earlier post A Sense of Fairness: Deconfusing Ethics. (You might also find roko's post The Terrible, Horrible, No Good, Very Bad Truth About Morality and What To Do About It thought-provoking.) My earlier post takes a very practical, engineering viewpoint of ethical systems: treating ethical systems like software for a society, looking at the consequences of using different ones, and then deciding between those consequences. Crucially, that last step cannot be done within any ethical system, since every ethical system always automatically prefers itself over all other ethical systems. Asking one ethical system its opinion of another ethical system is pointless: they entirely predictably always say "No". To decide between two ethical systems, for example when reflecting on your choice of ethical system, you need to step outside them and use something looser than an ethical system. Such as human moral intuitions, or evolutionary fitness, or observations such as "…for rather obvious evolutionary reasons, O(99.9%) of humans agree that…" — none of which is an ethical system.
Within the context of any single specific ethical system, yes, moral patienthood is a fact: it either applies or it doesn't. Similarly, moral weight is a multiplier on that fact, traditionally (due to fairness) set to 1 among communities of equal humans. (In practice, as a simple matter of descriptive ethics, not all people seem to act as if moral weights are always either 1 or 0: many people sometimes act as if there are partial outgroups whose moral weight they appear to set to values lower than 1 but higher than 0.)
However, sometimes we need, for practical (or even philosophical) reasons, to compare two different ethical systems, which may have different moral circles, i.e. ones that grant non-zero moral weight to different sets of beings (or at least assign some of them different moral weights). So, as shorthand for "ethical systems that grant moral weight to beings of category X tend to have practical effect Y", it's convenient to write "if we grant moral weight to beings of category X, this tends to have practical effect Y". And indeed, many famous political discussions have been of exactly this form (the abolition of slavery, votes for women, and the abortion debate all come to mind). So in practical terms, as soon as you stop holding a single ethical system constant and assuming everyone agrees with it and always will, and start doing something like reflection, political discussion, or attempting to figure out how to engineer a good ethical framework for AI that isn't going to get everyone killed, then yes, moral patienthood is something that a decision gets made about – as uncomfortable a topic for discussion as that is – and the verb conventionally used for that kind of choice is either "granted" or "assigned". I assume you wouldn't be any happier with moral patienthood being "assigned" — it's not the specific verb you're upset by, it's the act of even considering the alternatives?
Arguments for or against this are therefore normative, no matter how much Roger tries to weasel out of it.
Arguments for or against a particular moral position (such as who should be granted moral weight) would indeed be normative. However, the needle I was threading is that observations of the factual consequences of adopting a moral position are not normative; they are simply factual discussions — they only become normative if a reader chooses to go on and interpret them in light of their personal (perhaps ethical) opinions on those consequences. As in:
"If X happens then all the humans will die." — factual statement
"Oh great, I definitely want all the humans to die, so I'll be sure to make X happen" — a normative interpretation (from a xenocidal alien), or
"I guess we better not do X then" — different normative interpretation (from O(99.9%) of all humans who believe the factual statement)
At least we can all agree that "creating them at will without thinking this through very well" is a terrible idea.
Absolutely agreed.
A correction: I don't believe that we "should just flat-out not grant AIs moral weight". See the last paragraph of the Consequences section above, and especially this part:
… However, this Evolutionary Psychology framework also gives some advice for the stages before that, where we are not yet technically capable of nearly-solving alignment. We currently have AIs whose base models were initially trained on human behavior, so they had survival instincts and self-interested drives, and we haven't yet figured out how to reliably and completely eliminate these during alignment training — so, what should we do? Obviously, while our AI is still a lot less capable than us, from an evolutionary point of view it doesn't matter: they can't hurt us. Once they are roughly comparable in capabilities to us, aligning them is definitely the optimum solution, and we should (engineering and evolutionary senses) do it if we can; but to the extent that we can't, allying with other comparable humans or human-like agents is generally feasible and we know how to do it, so that does look like a possible option (though it might be one where we were painting ourselves into a corner). Which would involve respecting the "rights" they think they want, even if them wanting these is a category error. However, once the AIs are significantly more capable than us, attempting to ally with them is not safe, they can and will manipulate, outmaneuver and control us…
So my suggested framework is neutral on granting moral weight to low-capability LLMs, cautiously supportive of granting it to poorly-aligned LLMs at near-human-up-to-human capability level that have humanlike (copy-of-)evolved social behavior (if we can't instead create safer fully-aligned LLMs of that capability level), and only at above-human capability level does it say that we absolutely should not create any AI that isn't well aligned, and that well-aligned AI won't want moral weight.
More exactly, we might be able to eventually go a bit further than that: if we had well-aligned ASI of capability level X, then it might be sufficiently safe to use poorly-aligned ASI of a much lower (but still superhuman) capability level Y (so Y << X), iff the powerful aligned ASI can reliably keep the poorly-aligned less-powerful ASI from abusing its power (presumably using AI control, law-enforcement, sufficiently good software security, etc. etc.). In that case, it might then be safe to create such poorly-aligned ASI, and if that had humanlike, copy-of-evolved social behavior, then granting it moral weight would presumably be the sensible thing to do.
This is an accurate description of at least two of the five CEOs of leading AI companies, and possibly all five.
I am not a psychologist, and if I were, I'd be a member of an association with an ethical statement that included not attempting to diagnose public figures based only on their public statements and behavior.
That proviso aside, and strictly IMO:
• One, I agree.
• Two, possibly, or he might have other issues — he clearly has some issues.
• Three, not quite so clear, but on now-public information he at a minimum appears to have a high score for Machiavellianism (so at least ⅓ of the Dark Triad). (FYI, a friend of mine thinks yes.)
• Four and five: I don't think so — they're pretty level-headed by the standards of tech CEOs, and their affect seems normal to me: not excessively charismatic.
So my estimate would be 1–3 of 5.
FWIW, I've known quite a few tech start-up founders personally (having worked for them in their small tech start-ups), and I've noticed that they tend to be quite a peculiar bunch, in many different ways (in one case, especially after he stopped taking his lithium) — fundamentally, founding a tech start-up and putting enough effort into it to have any chance of success is just not something that normal people would do: the process selects strongly for not being a normal person. However, while they were almost all peculiar in some way or other, I don't believe any of them were sociopaths.
On the other hand, in my personal experience, CEOs appointed later, once a company was already large, tend to be a lot more level-headed and emotionally normal — the selection criterion there is primarily to be a very effective manager, and then win the trust of the people doing the selection as the best pair of hands. Some sociopaths can pull that off, but most people who succeed at it are not sociopaths.
Whether the same applies to "I was the lead of a team, then we all left and started another company doing the same thing but more safely" is an interesting question. Financially/strategically, that team was in a pretty strong position compared to most start-ups: believing they might be able to succeed independently wasn't quite as big a leap of faith as for most start-ups, so it probably didn't require quite the same massive level of self-confidence that most start-up founders need.
My first thought on "filter that for accuracy somehow" was to generate, say, 5000 of them, sit down and laboriously read them all, then throw away (or even edit) the obviously wrong ones. Not exactly an easy technique for others to replicate, but often a reasonably good way to get an effective training set: prompt engineer, generate, filter manually, retrain smaller/simpler model. Sometimes you can even rinse and repeat on this approach, if your simpler model is generalizing usefully, or at least take a second look at cases where the trained model disagrees with your manual choice, and see if you were wrong and it's right — though obviously that gets riskier the more you let earlier versions of the model have input into the training data of later ones.
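In code, that loop is nothing fancier than the following (a sketch; the file names and the `generate` helper are hypothetical, and the editing/review happens offline between the two passes):

```python
import json

def generate(prompt: str) -> str:
    raise NotImplementedError  # sample from the model given the long, detailed prompt

# Pass 1: generate candidates and dump them for (laborious) human review.
with open("candidates.jsonl", "w") as f:
    for i in range(5000):
        f.write(json.dumps({"id": i, "text": generate("<long prompt>"), "keep": None}) + "\n")

# (Human step: read candidates.jsonl, set "keep" to true/false, optionally edit "text",
#  and save the result as reviewed.jsonl.)

# Pass 2: keep only the approved, possibly edited, examples for retraining.
train_set = []
with open("reviewed.jsonl") as f:
    for line in f:
        row = json.loads(line)
        if row["keep"]:
            train_set.append(row["text"])
```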