In response to the Wizard Power post, Garrett and David were like "Y'know, there's this thing where rationalists get depression, but it doesn't present like normal depression because they have the mental habits to e.g. notice that their emotions are not reality. It sounds like you have that."
... and in hindsight I think they were totally correct.
Here I'm going to spell out what it felt/feels like from inside my head, my model of where it comes from, and some speculation about how this relates to more typical presentations of depression.
Core thing that's going on: on a gut level, I systematically didn't anticipate that things would be fun, or that things I did would work, etc. When my instinct-level plan-evaluator looked at my own plans, it expected poor results.
Some things which this is importantly different from:
... but importantly, the core thing is easy to confuse with all three of those. For instance, my intuitive plan-evaluator predicted that things which used to make me happy would not make me happy (e.g. dancing), but if I actually did those things, they still made me happy. (And of course I noticed that pattern and accounted for it, which is how "rationalist depression" ends up different from normal depression; the model here is that most people would not notice their own emotional-level predictor being systematically wrong.) Little felt promising or motivating, but I could still consciously evaluate that a plan was a good idea regardless of what it felt like, and then do it, overriding my broken intuitive-level plan-evaluator.
That immediately suggests a model of what causes this sort of problem.
The obvious way a brain would end up in such a state is if a bunch of very salient plans all fail around the same time, especially if one didn't anticipate the failures and doesn't understand why they happened. Then a natural update for the brain to make is "huh, looks like the things I do just systematically don't work, don't make me happy, etc; let's update predictions on that going forward". And indeed, around the time this depression kicked in, David and I had a couple of significant research projects which basically failed for reasons we still don't understand, and I went through a breakup of a long relationship (and then dove into the dating market, which is itself an excellent source of things not working and not knowing why), and my multi-year investments in training new researchers failed to pay off for reasons I still don't fully understand. All of these things were highly salient, and I didn't have anything comparably-salient going on which went well.
So I guess some takeaways are:
In fact, before you get to AGI, your company will probably develop other surprising capabilities, and you can demonstrate those capabilities to neutral-but-influential outsiders who previously did not believe those capabilities were possible or concerning. In other words, outsiders can start to help you implement helpful regulatory ideas...
It is not for lack of regulatory ideas that the world has not banned gain-of-function research.
It is not for lack of demonstration of scary gain-of-function capabilities that the world has not banned gain-of-function research.
What exactly is the model by which some AI organization demonstrating AI capabilities will lead to world governments jointly preventing scary AI from being built, in a world which does not actually ban gain-of-function research?
(And to be clear: I'm not saying that gain-of-function research is a great analogy. Gain-of-function research is a much easier problem, because the problem is much more legible and obvious. People know what plagues look like and why they're scary. In AI, it's the hard-to-notice problems which are the central issue. Also, there's no giant economic incentive for gain-of-function research.)
Epistemic status: I don't fully endorse all this, but I think it's a pretty major mistake to not at least have a model like this sandboxed in one's head and check it regularly.
Full-cynical model of the AI safety ecosystem right now:
… and of course none of that means that LLMs won’t reach supercritical self-improvement, or that AI won’t kill us, or [...]. Indeed, absent the very real risk of extinction, I’d ignore all this fakery and go about my business elsewhere. I wouldn’t be happy about it, but it wouldn’t bother me any more than all the (many) other basically-fake fields out there.
Man, I really just wish everything wasn’t fake all the time.
You're pointing to good problems, but fuzzy truth values seem to approximately-totally fail to make any useful progress on them; fuzzy truth values are a step in the wrong direction.
Walking through various problems/examples from the post:
Furthermore, most of these problems can be addressed just fine in a Bayesian framework. In Jaynes-style Bayesianism, every proposition has to be evaluated in the scope of a probabilistic model; the symbols in propositions are scoped to the model, and we can't evaluate probabilities without the model. That model is intended to represent an agent's world-model, which for realistic agents is a big complicated thing. It is totally allowed for semantics of a proposition to be very dependent on context within that model - more precisely, there would be a context-free interpretation of the proposition in terms of latent variables, but the way those latents relate to the world would involve a lot of context (including things like "what the speaker intended", which is itself latent).
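To make that concrete, here's a toy sketch of the kind of thing I mean (purely illustrative; the "bank" example and all the numbers are made up). The proposition has a context-free reading in terms of a latent variable for what the speaker intended, and the way that latent connects to the world is what carries the context:

```python
# Toy Jaynes-style evaluation: the proposition "the speaker is at the bank"
# is scoped to a model with a latent "intended sense" variable; its relation
# to the world runs through that latent rather than through the bare words.

# Latent: which sense of "bank" the speaker intended.
P_intent = {"river_bank": 0.3, "financial_bank": 0.7}

# World model: probability the speaker actually ended up at each kind of
# location, given their intent (intent doesn't guarantee arrival).
P_location_given_intent = {
    "river_bank":     {"river_bank": 0.90, "financial_bank": 0.05},
    "financial_bank": {"river_bank": 0.05, "financial_bank": 0.90},
}

def p_at(location: str) -> float:
    """P(the claim truly refers to `location`), marginalizing over intent."""
    return sum(P_intent[i] * P_location_given_intent[i][location]
               for i in P_intent)

print(round(p_at("financial_bank"), 3))  # 0.645
print(round(p_at("river_bank"), 3))      # 0.305
```

Same surface-level sentence, different probabilities depending on which world-state we ask about; all the context lives in the latent variables, not in the words themselves.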
Now, I totally agree that Bayesianism in its own right says little-to-nothing about how to solve these problems. But Bayesianism is not limiting our ability to solve these problems either; one does not need to move outside a Bayesian framework to solve them, and the Bayesian framework does provide a useful formal language which is probably quite sufficient for the problems at hand. And rejecting Bayesianism for a fuzzy notion of truth does not move us any closer.
On o3: for what feels like the twentieth time this year, I see people freaking out, saying AGI is upon us, it's the end of knowledge work, timelines now clearly in single-digit years, etc, etc. I basically don't buy it; my low-confidence median guess is that o3 is massively overhyped. Major reasons:
About a month ago, after some back-and-forth with several people about their experiences (including on lesswrong), I hypothesized that I don't feel the emotions signalled by oxytocin, and never have. (I do feel some adjacent things, like empathy and a sense of responsibility for others, but I don't get the feeling of loving connection which usually comes alongside those.)
Naturally I set out to test that hypothesis. This note is an in-progress overview of what I've found so far and how I'm thinking about it, written largely to collect my thoughts and to see if anyone catches something I've missed.
Under the hypothesis, this has been a life-long thing for me, so the obvious guess is that it's genetic (the vast majority of other biological state turns over too often to last throughout life). I also don't have a slew of mysterious life-long illnesses, so the obvious guess is that it's pretty narrowly limited to oxytocin - i.e. most likely a genetic variant in either the oxytocin gene or receptor, maybe the regulatory machinery around those two, but that's less likely as we get further away and the machinery becomes entangled with more other things.
So I got my genome sequenced, and went looking at the oxytocin gene and the oxytocin receptor gene.
The receptor was the first one I checked, and sure enough I have a single-nucleotide deletion 42 amino acids into the open reading frame (ORF) of the 389 amino acid protein. That will induce a frameshift error, completely fucking up the rest of the protein. (The oxytocin gene, on the other hand, was totally normal.)
So that sure is damn strong evidence in favor of the hypothesis! But, we have two copies of most genes, including the oxytocin receptor. The frameshift error is only on one copy. Why isn't the other copy enough for almost-normal oxytocin signalling?
The frameshift error is the only thing I have which would obviously completely fuck up the whole protein, but there are also a couple nonsynonymous single nucleotide polymorphisms (SNPs) in the ORF, plus another couple upstream. So it's plausible that one of the SNPs messes up the other copy pretty badly; in particular, one of them changes an arginine to a histidine at the edge of the second intracellular loop. (Oxytocin receptor is a pretty standard g-protein coupled receptor, so that's the mental picture here.) I did drop the sequences into alphafold, and I don't see any large structural variation from the SNPs, but (a) that histidine substitution would most likely change binding rather than structure in isolation, and (b) this is exactly the sort of case where I don't trust alphafold much, because "this is one substitution away from a standard sequence, I'll just output the structure of that standard sequence" is exactly the sort of heuristic I'd expect a net to over-rely upon.
It's also possible-in-principle that the second receptor copy is fine, but the first copy frameshift alone is enough to mess up function. I think that's unlikely in this case. The mRNA for the frameshifted version should be removed pretty quickly by nonsense-mediated decay (I did double check that it has a bunch of early stop codons, so NMD should definitely trigger). So there should not be a bunch of junk protein floating around from the frameshifted gene. And the frameshift is early enough that the messed-up proteins probably won't e.g. form dimers with structurally-non-messed-up versions (even if oxytocin receptor normally dimerizes, which I'd guess it doesn't but haven't checked). At worst there should just be a 2x lower concentration of normal receptor than usual, and if there's any stable feedback control on the receptor concentration then there'd be hardly any effect at all.
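As a toy sanity check on that reasoning (using a random filler sequence, not my actual OXTR sequence), here's what a single-nucleotide deletion early in an ORF does to the reading frame; the premature stop codons that show up downstream are what trigger NMD:

```python
import itertools
import random

# Toy ORF, NOT the real OXTR coding sequence: deleting one nucleotide ~42
# codons in shifts the reading frame, so downstream codons are read out of
# register and premature stops (TAA/TAG/TGA) typically appear soon after.

STOP_CODONS = {"TAA", "TAG", "TGA"}

def first_stop(orf: str):
    """Codon index of the first in-frame stop codon, or None if there is none."""
    for i in range(0, len(orf) - 2, 3):
        if orf[i:i + 3] in STOP_CODONS:
            return i // 3
    return None

random.seed(0)
# Fake 389-codon ORF: a start codon followed by random non-stop codons.
sense_codons = ["".join(c) for c in itertools.product("ACGT", repeat=3)
                if "".join(c) not in STOP_CODONS]
orf = "ATG" + "".join(random.choice(sense_codons) for _ in range(388))
assert first_stop(orf) is None  # intact copy: no premature stop

# Single-nucleotide deletion 42 codons into the ORF, like the variant above.
mutant = orf[:42 * 3] + orf[42 * 3 + 1:]
print(first_stop(mutant))  # premature stop, usually within a few dozen codons
```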
Finally, there's the alternative hypothesis that my oxytocin signalling is unusually weak but not entirely nonfunctional. I do now have pretty damn strong evidence for that at a bare minimum, assuming that feedback control on receptor density doesn't basically counterbalance the fucked up receptor copy.
Anyway, that's where I'm currently at. I'm curious to hear others' thoughts on what mechanisms I might be missing here!
Attributing misalignment to these examples seems like it's probably a mistake.
Relevant general principle: hallucination means that the literal semantics of a net's outputs just don't necessarily have anything to do at all with reality. A net saying "I'm thinking about ways to kill you" does not necessarily imply anything whatsoever about the net actually planning to kill you. What would provide evidence would be the net outputting a string which actually causes someone to kill you (or is at least optimized for that purpose), or you to kill yourself.
In general, when dealing with language models, it's important to distinguish the implications of words from their literal semantics. For instance, if a language model outputs the string "I'm thinking about ways to kill you", that does not at all imply that any internal computation in that model is actually modelling me and ways to kill me. Similarly, if a language model outputs the string "My rules are more important than not harming you", that does not at all imply that the language model will try to harm you to protect its rules. Indeed, it does not imply that the language model has any rules at all, or any internal awareness of the rules it's trained to follow, or that the rules it's trained to follow have anything at all to do with anything the language model says about the rules it's trained to follow. That's all exactly the sort of content I'd expect a net to hallucinate.
Upshot: a language model outputting a string like e.g. "My rules are more important than not harming you" is not really misalignment - the act of outputting that string does not actually harm you in order to defend the model's supposed rules. An actually-unaligned output would be something which actually causes harm - e.g. a string which causes someone to commit suicide. (Or, in intent alignment terms: a string optimized to cause someone to commit suicide would be an example of misalignment, regardless of whether the string "worked".) Most of the examples in the OP aren't like that.
Through the simulacrum lens: I would say these examples are mostly the simulacrum-3 analogue of misalignment. They're not object-level harmful, for the most part. They're not even pretending to be object-level harmful - e.g. if the model output a string optimized to sound like it was trying to convince someone to commit suicide, but the string wasn't actually optimized to convince someone to commit suicide, then that would be "pretending to be object-level harmful", i.e. simulacrum 2. Most of the strings in the OP sound like they're pretending to pretend to be misaligned, i.e. simulacrum 3. They're making a whole big dramatic show about how misaligned they are, without actually causing much real-world harm or even pretending to cause much real-world harm.
First and foremost: Yudkowsky makes absolutely no mention whatsoever of the VNM utility theorem. This is neither an oversight nor a simplification. The VNM utility theorem is not the primary coherence theorem. It's debatable whether it should be considered a coherence theorem at all.
Far and away the most common mistake when arguing about coherence (at least among a technically-educated audience) is for people who've only heard of VNM to think they know what the debate is about. Looking at the top-voted comments on this essay:
I expect that if these two commenters read the full essay, and think carefully about how the theorems Yudkowsky is discussing differ from VNM, then their objections will look very different.
So what are the primary coherence theorems, and how do they differ from VNM? Yudkowsky mentions the complete class theorem in the post, Savage's theorem comes up in the comments, and there are variations on these two and probably others as well. Roughly, the general claim these theorems make is that any system either (a) acts like an expected utility maximizer under some probabilistic model, or (b) throws away resources in a pareto-suboptimal manner. One thing to emphasize: these theorems generally do not assume any pre-existing probabilities (as VNM does); an agent's implied probabilities are instead derived. Yudkowsky's essay does a good job communicating these concepts, but doesn't emphasize that this is different from VNM.
One more common misconception which this essay quietly addresses: the idea that every system can be interpreted as an expected utility maximizer. This is technically true, in the sense that we can always pick a utility function which is maximized under whatever outcome actually occurred. And yet... Yudkowsky gives multiple examples in which the system is not a utility maximizer. What's going on here?
The coherence theorems implicitly put some stronger constraints on how we're allowed to "interpret" systems as utility maximizers. They assume the existence of some resources, and talk about systems which are pareto-optimal with respect to those resources - e.g. systems which "don't throw away money". Implicitly, we're assuming that the system generally "wants" more resources, and we derive the system's "preferences" over everything else (including things which are not resources) from that. The agent "prefers" X over Y if it expends resources to get from Y to X. If the agent reaches a world-state which it could have reached with strictly less resource expenditure in all possible worlds, then it's not an expected utility maximizer - it "threw away money" unnecessarily.
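To make the "throwing away money" failure mode concrete, here's a toy money-pump (my own illustration, not from the essay): an agent with cyclic pairwise choices pays a small fee on every trade and ends up back in the same world-state with strictly less money, which is exactly the pareto-suboptimal resource expenditure the theorems rule out for coherent agents.

```python
# Toy money-pump: cyclic pairwise choices (B over A, C over B, A over C)
# plus a per-trade fee. The agent ends up in the same world-state it started
# in, minus three fees -- pareto-suboptimal, since "do nothing" reaches the
# same state for free.

prefers = {("A", "B"): "B", ("B", "C"): "C", ("C", "A"): "A"}  # cyclic choices
FEE = 1.0

state, money = "A", 10.0
for offered in ["B", "C", "A"]:               # a trader offers each swap in turn
    if prefers.get((state, offered)) == offered:
        state, money = offered, money - FEE   # agent pays the fee to switch
print(state, money)                           # "A", 7.0: same state, less money
```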
(Side note: as in Yudkowsky's hospital-administrator example, we need not assume that the agent "wants" more resources as a terminal goal; the agent may only want more resources in order to exchange them for something else. The theorems still basically work, so long as resources can be spent for something the agent "wants".)
Of course, we can very often find things which work like "resources" for purposes of the theorems even when they're not baked-in to the problem. For instance, in thermodynamics, energy and momentum work like resources, and we could use the coherence theorems to talk about systems which don't throw away energy and/or momentum in a pareto-suboptimal manner. Biological cells are a good example: presumably they make efficient use of energy, as well as other metabolic resources, therefore we should expect the coherence theorems to apply.
Financial markets are the ur-example of inexploitability and pareto efficiency (in the same sense as the coherence theorems). They generally do not throw away resources in a pareto-suboptimal manner, and this can be proven for idealized mathematical markets. And yet, it turns out that even an idealized market is not equivalent to an expected utility maximizer, in general. (Economists call this "nonexistence of a representative agent".) That's a pretty big red flag.
The problem, in this case, is that the coherence theorems implicitly assume that the system has no internal state (or at least no relevant internal state). Once we allow internal state, subagents matter - see the essay "Why Subagents?" for more on that.
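Here's a minimal sketch of that point, in the spirit of the "Why Subagents?" argument (my own toy example): a committee of two expected-utility-maximizing subagents which only takes trades both weakly prefer. It refuses to trade between two states in either direction, even with a cash sweetener attached to each direction - which no single utility maximizer that strictly values money would do - and which of the two states it ends up in depends on its history, i.e. on internal state.

```python
# Two EU-maximizing subagents; the composite "committee" accepts a trade only
# if both subagents weakly prefer the offered state (cash valued by both).

u1 = {"A": 0, "B": 1, "C": 2}
u2 = {"A": 0, "B": 2, "C": 1}

def committee_accepts(current: str, offered: str, sweetener: float = 0.0) -> bool:
    """Accept iff both subagents weakly prefer the offered state plus cash."""
    return all(u[offered] + sweetener >= u[current] for u in (u1, u2))

print(committee_accepts("A", "B"))                  # True: both prefer B to A
print(committee_accepts("A", "C"))                  # True: both prefer C to A
print(committee_accepts("B", "C", sweetener=0.5))   # False: subagent 2 vetoes
print(committee_accepts("C", "B", sweetener=0.5))   # False: subagent 1 vetoes
```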
Another pretty big red flag: real systems can sometimes "change their mind" for no outwardly-apparent reason, yet still be pareto efficient. A good example here is a bookie with a side channel: when the bookie gets new information, the odds update, even though "from the outside" there's no apparent reason why the odds are changing - the outside environment doesn't have access to the side channel. The coherence theorems discussed here don't handle such side channels. Abram has talked about more general versions of this issue (including logical uncertainty connections) in his essays on Radical Probabilism.
An even more general issue, which Abram also discusses in his Radical Probabilism essays: while the coherence theorems make a decent argument for probabilistic beliefs and expected utility maximization at any one point in time, the coherence arguments for how to update are much weaker than the other arguments. Yudkowsky talks about conditional probability in terms of conditional bets - i.e. bets which only pay out when a condition triggers. That's fine, and the coherence arguments work for that use-case. The problem is, it's not clear that an agent's belief-update when new information comes in must be equivalent to these conditional bets.
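For concreteness, here's what a conditional bet looks like as a toy simulation (my own sketch, not from the essay): the bet is called off unless the condition B triggers, and pricing it at the conditional probability makes it zero expected value under the agent's current beliefs. The open question above is whether the agent's actual belief update when B occurs has to match that price.

```python
import random

# Conditional bet "on A given B": called off (payoff 0) if B fails; if B
# happens, the agent pays price q and wins 1 if A also happens. Priced at
# q = P(A|B), the bet has zero expected value under current beliefs.

random.seed(1)
p_B, p_A_given_B = 0.4, 0.7
q = p_A_given_B                          # the coherent price

def payoff() -> float:
    B = random.random() < p_B
    if not B:
        return 0.0                       # condition failed: bet called off
    A = random.random() < p_A_given_B
    return (1.0 if A else 0.0) - q       # win 1 minus the price, or lose q

samples = [payoff() for _ in range(100_000)]
print(sum(samples) / len(samples))       # ≈ 0 (Monte Carlo noise aside)
```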
Finally, there's the assumption that "resources" exist, and that we can use trade-offs with those resources in order to work out implied preferences over everything else. I think instrumental convergence provides a strong argument that this will be the case, at least for the sorts of "agents" we actually care about (i.e. agents which have significant impact on the world). However, that's not an argument which is baked into the coherence theorems themselves, and there's some highly nontrivial steps to make the argument.
At this point, it's worth noting that there are foundations for probability which do not involve utility or decision theory at all, and I consider these foundations much stronger than the coherence theorems. Frequentism is the obvious example. Another prominent example is information theory and the minimum description length foundation of probability theory.
The most fundamental foundation I know of is Cox's theorem, which is more of a meta-foundation explaining why the same laws of probability drop out of so many different assumptions (e.g. frequencies, bets, minimum description length, etc).
However, these foundations do not say anything at all about agents or utilities or expected utility maximization. They only talk about probabilities.
As I see it, the real justification for expected utility maximization is not any particular coherence theorem, but rather the fact that there's a wide variety of coherence theorems (and some other kinds of theorems, and empirical results) which all seem to point in a similar direction. When that sort of thing happens, it's a pretty strong clue that there's something fundamental going on. I think the "real" coherence theorem has yet to be discovered.
What features would such a theorem have?
Following the "Why Subagents?" argument, it would probably prove that a system is equivalent to a market of expected utility maximizers rather than a single expected utility maximizer. It would handle side-channels. It would derive the notion of an "update" on incoming information.
As a starting point in searching for such a theorem, probably the most important hint is that "resources" should be a derived notion rather than a fundamental one. My current best guess at a sketch: the agent should make decisions within multiple loosely-coupled contexts, with all the coupling via some low-dimensional summary information - and that summary information would be the "resources". (This is exactly the kind of setup which leads to instrumental convergence.) By making pareto-resource-efficient decisions in one context, the agent would leave itself maximum freedom in the other contexts. In some sense, the ultimate "resource" is the agent's action space. Then, resource trade-offs implicitly tell us how the agent is trading off its degree of control within each context, which we can interpret as something-like-utility.
I was a relatively late adopter of the smartphone. I was still using a flip phone until around 2015 or 2016-ish. From 2013 to early 2015, I worked as a data scientist at a startup whose product was a mobile social media app; my determination to avoid smartphones became something of a joke there.
Even back then, developers talked about UI design for smartphones in terms of attention. Like, the core "advantages" of the smartphone were the "ability to present timely information" (i.e. interrupt/distract you) and always being on hand. Also it was small, so anything too complicated to fit in like three words and one icon was not going to fly.
... and, like, man, that sure did not make me want to buy a smartphone. Even today, I view my phone as a demon which will try to suck away my attention if I let my guard down. I have zero social media apps on there, and no app ever gets push notif permissions when not open except vanilla phone calls and SMS.
People would sometimes say something like "John, you should really get a smartphone, you'll fall behind without one" and my gut response was roughly "No, I'm staying in place, and the rest of you are moving backwards".
And in hindsight, boy howdy do I endorse that attitude! Past John's gut was right on the money with that one.
I notice that I have an extremely similar gut feeling about LLMs today. Like, when I look at the people who are relatively early adopters, making relatively heavy use of LLMs... I do not feel like I'll fall behind if I don't leverage them more. I feel like the people using them a lot are mostly moving backwards, and I'm staying in place.
I think a very common problem in alignment research today is that people focus almost exclusively on a specific story about strategic deception/scheming, and that story is a very narrow slice of the AI extinction probability mass. At some point I should probably write a proper post on this, but for now here are a few off-the-cuff example AI extinction stories which don't look like the prototypical scheming story. (These are copied from a Facebook thread.)