I'm an artificial intelligence engineer in Silicon Valley with an interest in AI alignment and interpretability.


AI, Alignment, and Ethics


seems as if it breaks at least the spirit of their past commitments on how far they will push the frontier.

While they don't publish this, Claude 3 Opus is not quite as good as GPT-4 Turbo, though it is better than GPT-4. So no, they're clearly carefully not breaking their past commitments, just keeping up with the Altmans.

Humans (when awake, as long as they're not actors or mentally ill) have, roughly speaking, a single personality. The base model training of an LLM trains it to attempt to simulate anyone on the internet/in stories, so it doesn't have a single personality: it contains multitudes. Instruct training and prompting can try to overcome this, but they're never entirely successful.

More details here.

I completely agree. LLMs are so context-dependent that just about any good or bad behavior of which a significant number of instances can be found in the training set can be elicited from them by suitable prompts. Fine-tuning can increase their resistance to this, but not by anything like enough. We either need to filter the training set, which risks them simply not understanding bad behaviors, rather than actually knowing to avoid them, making it hard to know what will happen when they learn about them in-context, or else we need to use something like conditional pretraining along the lines I discuss in How to Control an LLM's Behavior (why my P(DOOM) went down).

If you are dubious that the methods of rationality work, I fear you are on the wrong website.

Directly, no. But the process of science (like any use of Bayesian reasoning) is intended to gradually make our ontology a better fit to more of reality. If that were working as intended, then we would expect it to come to require more and more effort to produce the evidence needed to cause a significant further paradigm shift across a significant area of science, because there are fewer and fewer major large-scale misconceptions left to fix. Over the last century, we have had more and more people working as scientists, publishing more and more papers, yet the rate of significant paradigm shifts that have an effect across a significant area of science has been dropping. From which I deduce that our ontology is probably a significantly better fit to reality now than it was a century ago, let alone three centuries ago back in the 18th century as this post discusses. Certainly the size and detail of our scientific ontology have both increased dramatically.

Is this proof? No, as you correctly observe, proof would require knowing the truth about reality. It's merely suggestive supporting evidence. It's possible to contrive other explanations: it's also possible, if rather unlikely, that, for some reason (perhaps related to social or educational changes) all of those people working in science now are much stupider, more hidebound, or less original thinkers than the scientists a century ago, and that's why dramatic paradigm shifts are slower — but personally I think this is very unlikely.

It is also quite possible that this is more true in certain areas of science that are amenable to the mental capabilities and research methods of human researchers, and that there might be other areas that were resistant to these approaches (so our lack of progress in these areas is caused by inability, not us approaching our goal), but where the different capabilities of an AI might allow it to make rapid progress. In such an area, the AI's ontology might well be a significantly better fit to reality than ours.

It's also possible to commit to not updating on a specific piece of information with a specific probability p between 0 and 1. I could also have arbitrarily complex finite commitment structures such as "out of the set of bits {A, B, C, D, E}, I will update if and only if I learn that at least three of them are true" — something which could of course be represented by a separate bit derived from A, B, C, D, E in the standard three-valued logic that represents true, false, and unknown. Or I can do a "provisional commit" where I have decided not to update on a certain fact, and generally won't, but may under some circumstances run some computationally expensive operation to decide to uncommit. Whether or not I'm actually committed is then theoretically determinable, but may in practice have a significant minimal computational cost and/or informational requirement to determine (ones that I might sometimes have a motive to intentionally increase, if I wish to be hard to predict), so to some other computationally bounded or non-factually-omniscient agents this may be unknown.
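As a sketch of that derived bit, here is the "at least three of {A, B, C, D, E}" commitment written out in Kleene's strong three-valued logic, with `None` standing in for "unknown" (the function name and encoding are my own, purely illustrative):

```python
from typing import Optional

def at_least_three_true(bits: list[Optional[bool]]) -> Optional[bool]:
    """Three-valued 'at least 3 of 5 are true'.

    Each bit is True, False, or None (unknown). The derived bit becomes
    True once three bits are known true, False once three bits are known
    false (so three true bits are no longer possible), and stays None
    (unknown) while the outcome is still undetermined.
    """
    known_true = sum(1 for b in bits if b is True)
    known_false = sum(1 for b in bits if b is False)
    if known_true >= 3:
        return True
    if known_false >= 3:  # at most 2 of the 5 can still be true
        return False
    return None  # still undetermined

# The commitment "update iff the derived bit is true" only fires once
# enough of A..E have been resolved to settle the derived bit.
```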

For updatelessness commitments to be advantageous, you need to be interacting with other agents that have a better-than-random chance of predicting your behavior under counterfactual circumstances. Agents have finite computational resources, and running a completely accurate simulation of another agent requires not only knowing their starting state but also being able to run a simulation of them at comparable speed and cost. Their strategic calculation might, of course, be simple, thus easy to simulate, but in a competitive situation, if they have a motivation to be hard to simulate, then it is to their advantage to run a decision process that is as complex and as hard to simulate as possible. (For example "shortly before the upcoming impact in our game of chicken, leading up to the last possible moment I could swerve aside, I will have my entire life up to this point flash before my eyes, hash certain inobvious features of this, and, depending on the twelfth bit of the hash, I will either update my decision, or not, in a way that it is unlikely my opponent can accurately anticipate or calculate as fast as I can".)
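The hash trick in that example can be made concrete; a minimal sketch, with SHA-256 standing in for whatever hash the agent actually uses and the function name purely hypothetical:

```python
import hashlib

def twelfth_bit_decision(history_features: bytes) -> bool:
    """Decide (e.g. whether to swerve) from bit 12 of a hash of selected
    features of one's history.

    An opponent who does not know exactly which features were fed in, or
    cannot hash them as quickly, cannot anticipate the decision, even
    though it is fully deterministic for the agent itself.
    """
    digest = hashlib.sha256(history_features).digest()
    bit_index = 12
    byte_index, offset = divmod(bit_index, 8)
    return bool((digest[byte_index] >> offset) & 1)
```

The decision is reproducible by the agent (same input, same bit) but looks like a coin flip to anyone lacking the exact input bytes.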

In general, it's always possible for an agent to generate a random number that even a vastly-computationally-superior opponent cannot predict (using quantum sources of randomness, for example).

It's also possible to devise a stochastic non-linear procedure where it is computationally vastly cheaper for me to follow one randomly-selected branch of it than it is for someone trying to model me to run all branches, or even Monte-Carlo simulate a representative sample of them, and where one can't just look at the algorithm and reason about what the net overall probability of various outcomes is, because it's doing irreducibly complex things like loading random numbers into Turing machines or cellular automata and running the resulting program for some number of steps to see what output, if any, it gets. (Of course, I may also not know what the overall probability distribution from running such a procedure is, if determining that is very expensive, but then, I'm trying to be unpredictable.) So it's also possible to generate random output that even a vastly-computationally-superior opponent cannot even predict the probability distribution of.
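A minimal sketch of such a procedure, assuming Rule 110 (a Turing-complete elementary cellular automaton, so its long-run behavior has no general shortcut) as the irreducible computation; all names are illustrative:

```python
import secrets

def run_rule110(seed: int, width: int = 64, steps: int = 200) -> int:
    """Run the Rule 110 elementary cellular automaton from a seed row
    (on a ring of `width` cells) and return the final row as an integer.

    Rule 110 is Turing-complete, so in general there is no cheaper way
    to learn the outcome than running it step by step.
    """
    row = seed & ((1 << width) - 1)
    for _ in range(steps):
        new = 0
        for i in range(width):
            left = (row >> ((i + 1) % width)) & 1
            centre = (row >> i) & 1
            right = (row >> ((i - 1) % width)) & 1
            pattern = (left << 2) | (centre << 1) | right
            new |= ((110 >> pattern) & 1) << i
        row = new
    return row

# Following one randomly selected branch is cheap for me:
one_branch_output = run_rule110(secrets.randbits(64))
# Enumerating all 2**64 branches to recover my outcome distribution is
# astronomically expensive for anyone trying to model me.
```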

In the counterfactual mugging case, call the party proposing the bet (the one offering $1000 and asking for $100) A, and the other party B. If B simply publicly and irrevocably precommits to paying the $100 (say by posting a bond), their expected gain is $450. If they can find a way to cheat, their maximum potential gain from the gamble is $500. So their optimal strategy is to initially do a (soft) commit to paying the $100, and then, either before the coin is tossed, and/or after that on the heads branch:

  1. Select a means of deciding on a probability p that I will update/renege after the coin lands, if it's heads, and (if the coin has not yet been tossed) optionally a way I could signal that. This means can include using access to true (quantum) randomness, hashing parts of my history selected somehow (including randomly), hashing new observations of the world I made after the coin landed, or anything else I want.
  2. Using << $50 worth of computational resources, run a simulation of party A in the tails branch running a simulation of me, and predict the probability distribution for their estimate of p. If the mean of that is lower than p, then go ahead and run the means for choosing. Otherwise, try again (return to step 1), or, if the computational resources I've spent are approaching $50 in net value, give up and pay A the $100 if the coin lands (or has already landed) heads.
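The dollar figures above can be checked with a quick expected-value calculation (stakes as given in the text, a fair coin assumed):

```python
# Counterfactual mugging from B's side: A pays B $1000 on tails (if A
# predicts B would pay), and B pays A $100 on heads.
p_heads = 0.5
payout_tails = 1000  # received from A on tails
cost_heads = 100     # paid to A on heads

# Irrevocable precommitment to paying:
ev_commit = p_heads * (-cost_heads) + (1 - p_heads) * payout_tails  # -50 + 500

# Upper bound if B could renege on heads yet still be paid on tails:
ev_cheat_max = (1 - p_heads) * payout_tails

# So cheating is worth at most $50 to B, which is what bounds the
# computational budget B should spend trying to out-simulate A in step 2.
max_cheating_edge = ev_cheat_max - ev_commit
```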

Meanwhile, on the heads branch, party A is trying to simulate party B running this process, and presumably is unwilling to spend more than some fraction of $1000 in computational resources on doing this. If party B did their calculation before the coin toss and chose to emit a signal (or leaked one), then party A has access to that, but obviously not to anything that only happened on the heads branch after the outcome of the coin toss was visible.

So this turns into a contest of who can more accurately and cost effectively simulate the other simulating them, recursively. Since B can choose a strategy, including choosing to randomly select obscure features of their past history and make these relevant to the calculation, while A cannot, B would seem to be at a distinct strategic advantage in this contest unless A has access to their entire history.

Agreed. But the observed slowing down (since, say, a century ago) in the rate of the paradigm shifts that are sometimes caused by things like discovering a new particle does suggest that our current ontology is now a moderately good fit to a fairly large slice of the world. And, I would claim, it is particularly likely to be a fairly good fit for the problem of pointing to human values.

We also don't require that our ontology fits the AI's ontology, only that when we point to something in our ontology, it knows what we mean — something that basically happens by construction in an LLM, since the entire purpose for which its ontology/world-model was learned was figuring out what we mean and may say next. We may have trouble interpreting its internals, but it's a trained expert in interpreting our natural languages.

It is of course possible that our ontology still contains invalid concepts comparable to "do animals have souls?" My claim is just that this is less likely now than it was in the 18th century, because we've made quite a lot of progress in understanding the world since then. Also, if it does, an LLM would still know all about the invalid concept and our beliefs about it, just as it knows all about our beliefs about things like vampires, unicorns, or superheroes.

On the wider set of cases you hint at, my current view would be that there are only two cases that I'm ethically comfortable with:

  1. an evolved sapient being, with the self-interested behavior usual for such, which our ethical system grants moral patient status (by default, roughly equal moral patient status, subject to some of the issues discussed in Part 5)
  2. an aligned constructed agent whose motivations are entirely creator-interested and that actively doesn't want moral patient status (see Part 1 of this sequence for a detailed justification of this)

Everything else (domesticated animals, non-aligned AIs kept in line by threat of force, slavery, uploads, and so forth) I'm concerned about the ethics of, to varying degrees obviously, but I haven't really thought several of those through in detail. Not that we currently have much choice about domesticated animals, but I feel that, at a minimum, by creating them we take on a responsibility for them: it's now our job to shear all the sheep, for example.

I'd like to discuss this further, but since none of the people who disagree have mentioned why or how, I'm left to try to guess, which doesn't seem very productive. Do they think it's unlikely that a near-term AGI will contain an LLM, or do they disagree that you can (usually, though unreliably) use a verbal prompt to point at concepts in the LLM's world model, or do they have some other objection that hasn't occurred to me? A concrete example of what I'm discussing here would be Constitutional AI, as used by Anthropic, so it's a pretty well-understood concept that has actually been tried with some moderate success.
