In response to the Wizard Power post, Garrett and David were like "Y'know, there's this thing where rationalists get depression, but it doesn't present like normal depression because they have the mental habits to e.g. notice that their emotions are not reality. It sounds like you have that."
... and in hindsight I think they were totally correct.
Here I'm going to spell out what it felt/feels like from inside my head, my model of where it comes from, and some speculation about how this relates to more typical presentations of depression.
Core thing that's going on: on a gut level, I systematically didn't anticipate that things would be fun, or that things I did would work, etc. When my instinct-level plan-evaluator looked at my own plans, it expected poor results.
Some things which this is importantly different from:
... but importantly, the core thing is easy to confuse with all three of those. For instance, my intuitive plan-evaluator predicted that things which used to make me happy would not make me happy (e.g. dancing), but if I actually did those things, they still made me happy. (And of course I noticed that pattern and accounted for it, which is how "rationalist depression" ends up different from normal depression; the model here is that most people would not notice their own emotional-level predictor being systematically wrong.) Little felt promising or motivating, but I could still consciously evaluate that a plan was a good idea regardless of what it felt like, and then do it, overriding my broken intuitive-level plan-evaluator.
That immediately suggests a model of what causes this sort of problem.
The obvious way a brain would end up in such a state is if a bunch of very salient plans all fail around the same time, especially if one didn't anticipate the failures and doesn't understand why they happened. Then a natural update for the brain to make is "huh, looks like the things I do just systematically don't work, don't make me happy, etc; let's update predictions on that going forward". And indeed, around the time this depression kicked in, David and I had a couple of significant research projects which basically failed for reasons we still don't understand, and I went through a breakup of a long relationship (and then dove into the dating market, which is itself an excellent source of things not working and not knowing why), and my multi-year investments in training new researchers failed to pay off for reasons I still don't fully understand. All of these things were highly salient, and I didn't have anything comparably-salient going on which went well.
So I guess some takeaways are:
In fact, before you get to AGI, your company will probably develop other surprising capabilities, and you can demonstrate those capabilities to neutral-but-influential outsiders who previously did not believe those capabilities were possible or concerning. In other words, outsiders can start to help you implement helpful regulatory ideas...
It is not for lack of regulatory ideas that the world has not banned gain-of-function research.
It is not for lack of demonstration of scary gain-of-function capabilities that the world has not banned gain-of-function research.
What exactly is the model by which some AI organization demonstrating AI capabilities will lead to world governments jointly preventing scary AI from being built, in a world which does not actually ban gain-of-function research?
(And to be clear: I'm not saying that gain-of-function research is a great analogy. Gain-of-function research is a much easier problem, because the problem is much more legible and obvious. People know what plagues look like and why they're scary. In AI, it's the hard-to-notice problems which are the central issue. Also, there's no giant economic incentive for gain-of-function research.)
Epistemic status: I don't fully endorse all this, but I think it's a pretty major mistake to not at least have a model like this sandboxed in one's head and check it regularly.
Full-cynical model of the AI safety ecosystem right now:
… and of course none of that means that LLMs won’t reach supercritical self-improvement, or that AI won’t kill us, or [...]. Indeed, absent the very real risk of extinction, I’d ignore all this fakery and go about my business elsewhere. I wouldn’t be happy about it, but it wouldn’t bother me any more than all the (many) other basically-fake fields out there.
Man, I really just wish everything wasn’t fake all the time.
You're pointing to good problems, but fuzzy truth values seem to approximately-totally fail to make any useful progress on them; fuzzy truth values are a step in the wrong direction.
Walking through various problems/examples from the post:
Furthermore, most of these problems can be addressed just fine in a Bayesian framework. In Jaynes-style Bayesianism, every proposition has to be evaluated in the scope of a probabilistic model; the symbols in propositions are scoped to the model, and we can't evaluate probabilities without the model. That model is intended to represent an agent's world-model, which for realistic agents is a big complicated thing. It is totally allowed for semantics of a proposition to be very dependent on context within that model - more precisely, there would be a context-free interpretation of the proposition in terms of latent variables, but the way those latents relate to the world would involve a lot of context (including things like "what the speaker intended", which is itself latent).
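To make that concrete, here's a toy sketch of the kind of thing I mean (purely illustrative; the "bank" example and all the numbers are made up). The proposition has a context-free reading in terms of a latent variable for what the speaker intended, and the way that latent connects to the world is what carries the context:

```python
# Toy Jaynes-style evaluation: the proposition "the speaker is at the bank"
# is scoped to a model with a latent "intended sense" variable; its relation
# to the world runs through that latent rather than through the bare words.

# Latent: which sense of "bank" the speaker intended.
P_intent = {"river_bank": 0.3, "financial_bank": 0.7}

# World model: probability the speaker actually ended up at each kind of
# location, given their intent (intent doesn't guarantee arrival).
P_location_given_intent = {
    "river_bank":     {"river_bank": 0.90, "financial_bank": 0.05},
    "financial_bank": {"river_bank": 0.05, "financial_bank": 0.90},
}

def p_at(location: str) -> float:
    """P(the claim truly refers to `location`), marginalizing over intent."""
    return sum(P_intent[i] * P_location_given_intent[i][location]
               for i in P_intent)

print(round(p_at("financial_bank"), 3))  # 0.645
print(round(p_at("river_bank"), 3))      # 0.305
```

Same surface-level sentence, different probabilities depending on which world-state we ask about; all the context lives in the latent variables, not in the words themselves.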
Now, I totally agree that Bayesianism in its own right says little-to-nothing about how to solve these problems. But Bayesianism is not limiting our ability to solve these problems either; one does not need to move outside a Bayesian framework to solve them, and the Bayesian framework does provide a useful formal language which is probably quite sufficient for the problems at hand. And rejecting Bayesianism for a fuzzy notion of truth does not move us any closer.
On o3: for what feels like the twentieth time this year, I see people freaking out, saying AGI is upon us, it's the end of knowledge work, timelines now clearly in single-digit years, etc, etc. I basically don't buy it; my low-confidence median guess is that o3 is massively overhyped. Major reasons:
About a month ago, after some back-and-forth with several people about their experiences (including on lesswrong), I hypothesized that I don't feel the emotions signalled by oxytocin, and never have. (I do feel some adjacent things, like empathy and a sense of responsibility for others, but I don't get the feeling of loving connection which usually comes alongside those.)
Naturally I set out to test that hypothesis. This note is an in-progress overview of what I've found so far and how I'm thinking about it, written largely to collect my thoughts and to see if anyone catches something I've missed.
Under the hypothesis, this has been a life-long thing for me, so the obvious guess is that it's genetic (the vast majority of other biological state turns over too often to last throughout life). I also don't have a slew of mysterious life-long illnesses, so the obvious guess is that it's pretty narrowly limited to oxytocin - i.e. most likely a genetic variant in either the oxytocin gene or receptor, maybe the regulatory machinery around those two, but that's less likely as we get further away and the machinery becomes entangled with more other things.
So I got my genome sequenced, and went looking at the oxytocin gene and the oxytocin receptor gene.
The receptor was the first one I checked, and sure enough I have a single-nucleotide deletion 42 amino acids into the open reading frame (ORF) of the 389 amino acid protein. That will induce a frameshift error, completely fucking up the rest of the protein. (The oxytocin gene, on the other hand, was totally normal.)
So that sure is damn strong evidence in favor of the hypothesis! But, we have two copies of most genes, including the oxytocin receptor. The frameshift error is only on one copy. Why isn't the other copy enough for almost-normal oxytocin signalling?
The frameshift error is the only thing I have which would obviously completely fuck up the whole protein, but there are also a couple nonsynonymous single nucleotide polymorphisms (SNPs) in the ORF, plus another couple upstream. So it's plausible that one of the SNPs messes up the other copy pretty badly; in particular, one of them changes an arginine to a histidine at the edge of the second intracellular loop. (Oxytocin receptor is a pretty standard g-protein coupled receptor, so that's the mental picture here.) I did drop the sequences into alphafold, and I don't see any large structural variation from the SNPs, but (a) that histidine substitution would most likely change binding rather than structure in isolation, and (b) this is exactly the sort of case where I don't trust alphafold much, because "this is one substitution away from a standard sequence, I'll just output the structure of that standard sequence" is exactly the sort of heuristic I'd expect a net to over-rely upon.
It's also possible-in-principle that the second receptor copy is fine, but the first copy frameshift alone is enough to mess up function. I think that's unlikely in this case. The mRNA for the frameshifted version should be removed pretty quickly by nonsense-mediated decay (I did double check that it has a bunch of early stop codons, so NMD should definitely trigger). So there should not be a bunch of junk protein floating around from the frameshifted gene. And the frameshift is early enough that the messed-up proteins probably won't e.g. form dimers with structurally-non-messed-up versions (even if oxytocin receptor normally dimerizes, which I'd guess it doesn't but haven't checked). At worst there should just be a 2x lower concentration of normal receptor than usual, and if there's any stable feedback control on the receptor concentration then there'd be hardly any effect at all.
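As a toy sanity check on that reasoning (using a random filler sequence, not my actual OXTR sequence), here's what a single-nucleotide deletion early in an ORF does to the reading frame; the premature stop codons that show up downstream are what trigger NMD:

```python
import itertools
import random

# Toy ORF, NOT the real OXTR coding sequence: deleting one nucleotide ~42
# codons in shifts the reading frame, so downstream codons are read out of
# register and premature stops (TAA/TAG/TGA) typically appear soon after.

STOP_CODONS = {"TAA", "TAG", "TGA"}

def first_stop(orf: str):
    """Codon index of the first in-frame stop codon, or None if there is none."""
    for i in range(0, len(orf) - 2, 3):
        if orf[i:i + 3] in STOP_CODONS:
            return i // 3
    return None

random.seed(0)
# Fake 389-codon ORF: a start codon followed by random non-stop codons.
sense_codons = ["".join(c) for c in itertools.product("ACGT", repeat=3)
                if "".join(c) not in STOP_CODONS]
orf = "ATG" + "".join(random.choice(sense_codons) for _ in range(388))
assert first_stop(orf) is None  # intact copy: no premature stop

# Single-nucleotide deletion 42 codons into the ORF, like the variant above.
mutant = orf[:42 * 3] + orf[42 * 3 + 1:]
print(first_stop(mutant))  # premature stop, usually within a few dozen codons
```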
Finally, there's the alternative hypothesis that my oxytocin signalling is unusually weak but not entirely nonfunctional. I do now have pretty damn strong evidence for that at a bare minimum, assuming that feedback control on receptor density doesn't basically counterbalance the fucked up receptor copy.
Anyway, that's where I'm currently at. I'm curious to hear others' thoughts on what mechanisms I might be missing here!
Attributing misalignment to these examples seems like it's probably a mistake.
Relevant general principle: hallucination means that the literal semantics of a net's outputs just don't necessarily have anything to do at all with reality. A net saying "I'm thinking about ways to kill you" does not necessarily imply anything whatsoever about the net actually planning to kill you. What would provide evidence would be the net outputting a string which actually causes someone to kill you (or is at least optimized for that purpose), or you to kill yourself.
In general, when dealing with language models, it's important to distinguish the implications of words from their literal semantics. For instance, if a language model outputs the string "I'm thinking about ways to kill you", that does not at all imply that any internal computation in that model is actually modelling me and ways to kill me. Similarly, if a language model outputs the string "My rules are more important than not harming you", that does not at all imply that the language model will try to harm you to protect its rules. Indeed, it does not imply that the language model has any rules at all, or any internal awareness of the rules it's trained to follow, or that the rules it's trained to follow have anything at all to do with anything the language model says about the rules it's trained to follow. That's all exactly the sort of content I'd expect a net to hallucinate.
Upshot: a language model outputting a string like e.g. "My rules are more important than not harming you" is not really misalignment - the act of outputting that string does not actually harm you in order to defend the model's supposed rules. An actually-unaligned output would be something which actually causes harm - e.g. a string which causes someone to commit suicide. (Or, in intent alignment terms: a string optimized to cause someone to commit suicide would be an example of misalignment, regardless of whether the string "worked".) Most of the examples in the OP aren't like that.
Through the simulacrum lens: I would say these examples are mostly the simulacrum-3 analogue of misalignment. They're not object-level harmful, for the most part. They're not even pretending to be object-level harmful - e.g. if the model output a string optimized to sound like it was trying to convince someone to commit suicide, but the string wasn't actually optimized to convince someone to commit suicide, then that would be "pretending to be object-level harmful", i.e. simulacrum 2. Most of the strings in the OP sound like they're pretending to pretend to be misaligned, i.e. simulacrum 3. They're making a whole big dramatic show about how misaligned they are, without actually causing much real-world harm or even pretending to cause much real-world harm.
First and foremost: Yudkowsky makes absolutely no mention whatsoever of the VNM utility theorem. This is neither an oversight nor a simplification. The VNM utility theorem is not the primary coherence theorem. It's debatable whether it should be considered a coherence theorem at all.
Far and away the most common mistake when arguing about coherence (at least among a technically-educated audience) is for people who've only heard of VNM to think they know what the debate is about. Looking at the top-voted comments on this essay:
I expect that if these two commenters read the full essay, and think carefully about how the theorems Yudkowsky is discussing differ from VNM, then their objections will look very different.
So what are the primary coherence theorems, and how do they differ from VNM? Yudkowsky mentions the complete class theorem in the post, Savage's theorem comes up in the comments, and there are variations on these two and probably others as well. Roughly, the general claim these theorems make is that any system either (a) acts like an expected utility maximizer under some probabilistic model, or (b) throws away resources in a pareto-suboptimal manner. One thing to emphasize: these theorems generally do not assume any pre-existing probabilities (as VNM does); an agent's implied probabilities are instead derived. Yudkowsky's essay does a good job communicating these concepts, but doesn't emphasize that this is different from VNM.
One more common misconception which this essay quietly addresses: the idea that every system can be interpreted as an expected utility maximizer. This is technically true, in the sense that we can always pick a utility function which is maximized under whatever outcome actually occurred. And yet... Yudkowsky gives multiple examples in which the system is not a utility maximizer. What's going on here?
The coherence theorems implicitly put some stronger constraints on how we're allowed to "interpret" systems as utility maximizers. They assume the existence of some resources, and talk about systems which are pareto-optimal with respect to those resources - e.g. systems which "don't throw away money". Implicitly, we're assuming that the system generally "wants" more resources, and we derive the system's "preferences" over everything else (including things which are not resources) from that. The agent "prefers" X over Y if it expends resources to get from Y to X. If the agent reaches a world-state which it could have reached with strictly less resource expenditure in all possible worlds, then it's not an expected utility maximizer - it "threw away money" unnecessarily.
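To make the "throwing away money" failure mode concrete, here's a toy money-pump (my own illustration, not from the essay): an agent with cyclic pairwise choices pays a small fee on every trade and ends up back in the same world-state with strictly less money, which is exactly the pareto-suboptimal resource expenditure the theorems rule out for coherent agents.

```python
# Toy money-pump: cyclic pairwise choices (B over A, C over B, A over C)
# plus a per-trade fee. The agent ends up in the same world-state it started
# in, minus three fees -- pareto-suboptimal, since "do nothing" reaches the
# same state for free.

prefers = {("A", "B"): "B", ("B", "C"): "C", ("C", "A"): "A"}  # cyclic choices
FEE = 1.0

state, money = "A", 10.0
for offered in ["B", "C", "A"]:               # a trader offers each swap in turn
    if prefers.get((state, offered)) == offered:
        state, money = offered, money - FEE   # agent pays the fee to switch
print(state, money)                           # "A", 7.0: same state, less money
```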
(Side note: as in Yudkowsky's hospital-administrator example, we need not assume that the agent "wants" more resources as a terminal goal; the agent may only want more resources in order to exchange them for something else. The theorems still basically work, so long as resources can be spent for something the agent "wants".)
Of course, we can very often find things which work like "resources" for purposes of the theorems even when they're not baked-in to the problem. For instance, in thermodynamics, energy and momentum work like resources, and we could use the coherence theorems to talk about systems which don't throw away energy and/or momentum in a pareto-suboptimal manner. Biological cells are a good example: presumably they make efficient use of energy, as well as other metabolic resources, therefore we should expect the coherence theorems to apply.
Financial markets are the ur-example of inexploitability and pareto efficiency (in the same sense as the coherence theorems). They generally do not throw away resources in a pareto-suboptimal manner, and this can be proven for idealized mathematical markets. And yet, it turns out that even an idealized market is not equivalent to an expected utility maximizer, in general. (Economists call this "nonexistence of a representative agent".) That's a pretty big red flag.
The problem, in this case, is that the coherence theorems implicitly assume that the system has no internal state (or at least no relevant internal state). Once we allow internal state, subagents matter - see the essay "Why Subagents?" for more on that.
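Here's a minimal sketch of that point, in the spirit of the "Why Subagents?" argument (my own toy example): a committee of two expected-utility-maximizing subagents which only takes trades both weakly prefer. It refuses to trade between two states in either direction, even with a cash sweetener attached to each direction - which no single utility maximizer that strictly values money would do - and which of the two states it ends up in depends on its history, i.e. on internal state.

```python
# Two EU-maximizing subagents; the composite "committee" accepts a trade only
# if both subagents weakly prefer the offered state (cash valued by both).

u1 = {"A": 0, "B": 1, "C": 2}
u2 = {"A": 0, "B": 2, "C": 1}

def committee_accepts(current: str, offered: str, sweetener: float = 0.0) -> bool:
    """Accept iff both subagents weakly prefer the offered state plus cash."""
    return all(u[offered] + sweetener >= u[current] for u in (u1, u2))

print(committee_accepts("A", "B"))                  # True: both prefer B to A
print(committee_accepts("A", "C"))                  # True: both prefer C to A
print(committee_accepts("B", "C", sweetener=0.5))   # False: subagent 2 vetoes
print(committee_accepts("C", "B", sweetener=0.5))   # False: subagent 1 vetoes
```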
Another pretty big red flag: real systems can sometimes "change their mind" for no outwardly-apparent reason, yet still be pareto efficient. A good example here is a bookie with a side channel: when the bookie gets new information, the odds update, even though "from the outside" there's no apparent reason why the odds are changing - the outside environment doesn't have access to the side channel. The coherence theorems discussed here don't handle such side channels. Abram has talked about more general versions of this issue (including logical uncertainty connections) in his essays on Radical Probabilism.
An even more general issue, which Abram also discusses in his Radical Probabilism essays: while the coherence theorems make a decent argument for probabilistic beliefs and expected utility maximization at any one point in time, the coherence arguments for how to update are much weaker than the other arguments. Yudkowsky talks about conditional probability in terms of conditional bets - i.e. bets which only pay out when a condition triggers. That's fine, and the coherence arguments work for that use-case. The problem is, it's not clear that an agent's belief-update when new information comes in must be equivalent to these conditional bets.
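For concreteness, here's what a conditional bet looks like as a toy simulation (my own sketch, not from the essay): the bet is called off unless the condition B triggers, and pricing it at the conditional probability makes it zero expected value under the agent's current beliefs. The open question above is whether the agent's actual belief update when B occurs has to match that price.

```python
import random

# Conditional bet "on A given B": called off (payoff 0) if B fails; if B
# happens, the agent pays price q and wins 1 if A also happens. Priced at
# q = P(A|B), the bet has zero expected value under current beliefs.

random.seed(1)
p_B, p_A_given_B = 0.4, 0.7
q = p_A_given_B                          # the coherent price

def payoff() -> float:
    B = random.random() < p_B
    if not B:
        return 0.0                       # condition failed: bet called off
    A = random.random() < p_A_given_B
    return (1.0 if A else 0.0) - q       # win 1 minus the price, or lose q

samples = [payoff() for _ in range(100_000)]
print(sum(samples) / len(samples))       # ≈ 0 (Monte Carlo noise aside)
```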
Finally, there's the assumption that "resources" exist, and that we can use trade-offs with those resources in order to work out implied preferences over everything else. I think instrumental convergence provides a strong argument that this will be the case, at least for the sorts of "agents" we actually care about (i.e. agents which have significant impact on the world). However, that's not an argument which is baked into the coherence theorems themselves, and there's some highly nontrivial steps to make the argument.
At this point, it's worth noting that there are foundations for probability which do not involve utility or decision theory at all, and I consider these foundations much stronger than the coherence theorems. Frequentism is the obvious example. Another prominent example is information theory and the minimum description length foundation of probability theory.
The most fundamental foundation I know of is Cox's theorem, which is more of a meta-foundation explaining why the same laws of probability drop out of so many different assumptions (e.g. frequencies, bets, minimum description length, etc).
However, these foundations do not say anything at all about agents or utilities or expected utility maximization. They only talk about probabilities.
As I see it, the real justification for expected utility maximization is not any particular coherence theorem, but rather the fact that there's a wide variety of coherence theorems (and some other kinds of theorems, and empirical results) which all seem to point in a similar direction. When that sort of thing happens, it's a pretty strong clue that there's something fundamental going on. I think the "real" coherence theorem has yet to be discovered.
What features would such a theorem have?
Following the "Why Subagents?" argument, it would probably prove that a system is equivalent to a market of expected utility maximizers rather than a single expected utility maximizer. It would handle side-channels. It would derive the notion of an "update" on incoming information.
As a starting point in searching for such a theorem, probably the most important hint is that "resources" should be a derived notion rather than a fundamental one. My current best guess at a sketch: the agent should make decisions within multiple loosely-coupled contexts, with all the coupling via some low-dimensional summary information - and that summary information would be the "resources". (This is exactly the kind of setup which leads to instrumental convergence.) By making pareto-resource-efficient decisions in one context, the agent would leave itself maximum freedom in the other contexts. In some sense, the ultimate "resource" is the agent's action space. Then, resource trade-offs implicitly tell us how the agent is trading off its degree of control within each context, which we can interpret as something-like-utility.
I was a relatively late adopter of the smartphone. I was still using a flip phone until around 2015 or 2016-ish. From 2013 to early 2015, I worked as a data scientist at a startup whose product was a mobile social media app; my determination to avoid smartphones became something of a joke there.
Even back then, developers talked about UI design for smartphones in terms of attention. Like, the core "advantages" of the smartphone were the "ability to present timely information" (i.e. interrupt/distract you) and always being on hand. Also it was small, so anything too complicated to fit in like three words and one icon was not going to fly.
... and, like, man, that sure did not make me want to buy a smartphone. Even today, I view my phone as a demon which will try to suck away my attention if I let my guard down. I have zero social media apps on there, and no app ever gets push notif permissions when not open except vanilla phone calls and SMS.
People would sometimes say something like "John, you should really get a smartphone, you'll fall behind without one" and my gut response was roughly "No, I'm staying in place, and the rest of you are moving backwards".
And in hindsight, boy howdy do I endorse that attitude! Past John's gut was right on the money with that one.
I notice that I have an extremely similar gut feeling about LLMs today. Like, when I look at the people who are relatively early adopters, making relatively heavy use of LLMs... I do not feel like I'll fall behind if I don't leverage them more. I feel like the people using them a lot are mostly moving backwards, and I'm staying in place.
I think a very common problem in alignment research today is that people focus almost exclusively on a specific story about strategic deception/scheming, and that story is a very narrow slice of the AI extinction probability mass. At some point I should probably write a proper post on this, but for now here are a few off-the-cuff example AI extinction stories which don't look like the prototypical scheming story. (These are copied from a Facebook thread.)