I'd have liked it if @johnswentworth had responded to Eliezer's comment at more length, though maybe he did and I missed it? Still, giving this post a 4 in the review, same as the other.
Done.
It's now the 2024 year-in-review, and Ruby expressed interest in seeing me substantively respond to this comment, so here I am responding a year and a half later.
The concept of "easy to understand" seems like a good example, so let's start there. My immediate reaction to that example is "yeah duh of course an AI won't use that concept internally in a human-like way", followed by "wait what the heck is Eliezer picturing such that that would even be relevant?", followed by "oh... probably he's used to noobs anthropomorphizing AI all the time". There is an argument to be made that "easy for a human to understand" is ontologically natural in our environment for predicting humans, even for nonhuman minds insofar as they're modelling humans (though I'm not confident). But I wouldn't expect an AI to use that concept internally the way a human does, even if the concept is convergent for predicting humans. A similar argument would apply to lots of other reflective concepts: the human version would be (potentially) convergent for modelling humans, but the AI will use that concept in a very different way internally than humans would. That means we can't rely on our anthropomorphizing intuitions to reason about how the AI will use such concepts. But we could still potentially rely on such concepts for alignment purposes so long as we're not e.g. expecting the AI to magically care about them or use them in human-like ways. I would guess Eliezer mostly agrees with that, though probably we could go further down this hole before bottoming out.
But I don't think that line of discussion is all that interesting. The immediate use-cases for a strong AI, e.g. running uploaded humans, really shouldn't require a bunch of reflective concepts anyway. The main things I want to immediately do with AI, as a first use-case, tend to look like hard science and engineering, not like reasoning about humans or doing anthropomorphic things. I usually don't even imagine interfacing with the thing in natural language.
There is another interesting delta here, one which developed in the intervening year-and-a-half since Eliezer wrote this comment. I think corrigibility is maybe not actually a reflective concept.
Eliezer's particular pointer to corrigibility is extremely reflective:
The "hard problem of corrigibility" is interesting because of the possibility that it has a relatively simple core or central principle - rather than being value-laden on the details of exactly what humans value, there may be some compact core of corrigibility that would be the same if aliens were trying to build a corrigible AI, or if an AI were trying to build another AI.
…
We can imagine, e.g., the AI imagining itself building a sub-AI while being prone to various sorts of errors, asking how it (the AI) would want the sub-AI to behave in those cases, and learning heuristics that would generalize well to how we would want the AI to behave if it suddenly gained a lot of capability or was considering deceiving its programmers and so on.
(source) But Eliezer comes at corrigibility very indirectly. In principle, there is some object-level stuff an AI would do when building a powerful sub-AI while prone to various sorts of errors. That object-level stuff is not itself necessarily all that reflective. The move of "we're talking about whatever stuff an AI would do when building a powerful sub-AI while prone to various sorts of errors" is extremely reflective, but in principle one might identify the object-level stuff via some other method which needn't be that reflective.
In particular, Instrumental Goals Are A Different And Friendlier Kind Of Thing Than Terminal Goals gestures at a very-ontologically-natural-looking concept which sure does seem to overlap an awful lot with corrigibility, to the point where it maybe just is corrigibility (insofar as corrigibility actually has a simple core at all).
LLMs mimic human text. That is the first and primary thing they are optimized for. Humans motivatedly reason, which shows up in their text. So, LLMs trained to mimic human text will also mimic motivated reasoning, insofar as they are good at mimicking human text. This seems like the clear default thing one would expect from LLMs; it does not require hypothesizing anything about motivated reasoning being adaptive.
Once one learns to spot motivated reasoning in one's own head, the short term planner has a much harder problem. It's still looking for outputs-to-rest-of-brain which will result in e.g. playing more Civ, but now the rest of the brain is alert to the basic tricks. But the short term planner is still looking for outputs, and sometimes it stumbles on a clever trick: maybe motivated reasoning is (long-term) good, actually? And then the rest of the brain goes "hmm, ok, sus, but if true then yeah we can play more Civ" and the short term planner is like "okey dokey let's go find us an argument that motivated reasoning is (long-term) good actually!".
In short: "motivated reasoning is somehow secretly rational" is itself the ultimate claim about which one would motivatedly-reason. It's very much like the classic anti-inductive agent, which believes that things which have happened more often before are less likely to happen again: "but you've been wrong every time before!" "yes, exactly, that's why I'm obviously going to be right this time". Likewise, the agent which believes motivated reasoning is good actually: "but your argument for motivated reasoning sure seems pretty motivated in its own right" "yes, exactly, and motivated reasoning is good so that's sensible".
... which, to be clear, does not imply that all arguments in favor of motivated reasoning are terrible. This is meant to be somewhat tongue-in-cheek; there's a reason it's not in the post. But it's worth keeping an eye out for motivated arguments in favor of motivated reasoning, and discounting appropriately (which does not mean dismissing completely).
There is definitely a standard story which says roughly "motivated reasoning in humans exists because it is/was adaptive for negotiating with other humans". I do not think that story stands up well under examination; when I think of standard day-to-day examples of motivated reasoning, that pattern sounds like a plausible explanation for some-but-a-lot-less-than-all of them.
For example: suppose it's 10 pm and I've been playing Civ all evening. I know that I should get ready for bed now-ish. But... y'know, this turn isn't a very natural stopping point. And it's not that bad if I go to bed half an hour late, right? Etc. Obvious motivated reasoning. But man, that motivated reasoning sure does not seem very socially-oriented? Like, sure, you could make up a story about how I'm justifying myself to an imaginary audience or something, but it does not feel like one would have predicted the Civ example in advance from the model "motivated reasoning in humans exists because it is/was adaptive for negotiating with other humans".
Another class of examples: very often in social situations, the move which will actually get one the most points is to admit fault and apologize. And yet, instead of that, people instinctively spin a story about how they didn't really do anything wrong. People instinctively spin that story even when it's pretty damn obvious (if one actually stops to consider it) that apologizing would result in a better outcome for the person in question. Again, you could maybe make up some story about evolving suboptimal heuristics, but this just isn't the behavior one would predict in advance from the model "motivated reasoning in humans exists because it is/was adaptive for negotiating with other humans".
A pattern with these examples (and many others): motivated reasoning isn't mainly about fooling others, it's about fooling oneself. Or at least a part of oneself. Indeed, there's plenty of standard wisdom along those lines: "the easiest person to fool is yourself", etc.
Here's a model which I think much better matches real-world motivated reasoning. (Note, however, that all the above critique still stands regardless of whether this next model is correct.)
Motivated reasoning simply isn't adaptive. Even in the ancestral environment, motivated reasoning decreased fitness. It appeared in the first place as an accidental side-effect of an overall-beneficial change in human minds relative to earlier minds, and that change was recent enough that evolution hasn't had time to fix the anti-adaptive side effects.
There's more than one hypothesis for what that change could be. Probably something to do with some particular functions within the mind being separated into parts, so that one part of the mind can now sometimes "cheat" by trying to trick another part of the mind, but the separation of those two functions is still overall beneficial.
An example falsifiable prediction of this model: other animals generally do not motivatedly-reason. If the relevant machinery had been around for very long, we would have expected evolution to fix the problem.
Tutoring.
This answer was generated by considering what one can usefully hire other people to do full time, for oneself (or one's family) alone.
Just got around to reading this, and I found it helpful. Thank you.
This was a cool post! I was familiar with f-divergences as a generalization of KL divergence, and of course familiar with maxent methods, but I hadn't seen the two put together before.
Problem: I'm left unsure when or why I would ever want to use this machinery. Sufficient statistics are an intuitive concept, and I can look around at the world and make guesses about where I would want to use sufficient statistics to model things. Independence is also an intuitive concept, and I can look around the world and make guesses about where to use that. Combining those two, I can notice places in the world where (some version of) the Koopman-Pitman-Darmois theorem (KPD) should bind, and if it doesn't bind then I'm surprised.
But I don't see how to notice places where max-f-divergence distributions should bind to reality. I can see where sufficient statistics should exist in the world, but that's not a sufficient condition for a max-f-divergence distribution; there are (IIUC) lots of other distributions for which sufficient statistics exist. So why focus on the max-f-divergence class specifically? What intuitive property of (some) real-world systems nails down that class of distributions? Maybe some kind of convexity condition on the distribution or on updates or something?
A practical roadblock is that the above numerical results for inference are terribly slow to compute...
Not sure exactly what you're doing numerically, but here's how I usually handle vanilla maxent problems (in my usual notation, which is not quite the same as yours, apologies for putting marginal translation work on you). We start with

maxent over $P[X]$ of $H(X)$, subject to $\sum_x P[X=x]\, f_i(x) = \mu_i$ (and $\sum_x P[X=x] = 1$)

Transform that to

maxent over $P[X]$ of $H(X)$, subject to $\sum_x P[X=x]\,(f_i(x) - \mu_i) = 0$ (and $\sum_x P[X=x] = 1$)

This gives a dual problem for the partition function, which I'll write in full:

$$\min_\lambda \ \ln Z(\lambda), \qquad Z(\lambda) := \sum_x \exp\Big(\sum_i \lambda_i\,(f_i(x) - \mu_i)\Big)$$
That's an unconstrained optimization problem, it's convex in the happy direction, and standard optimizers (e.g. scipy using jax for derivatives) can usually handle it very quickly.
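To make that concrete, here's a rough sketch of what that recipe looks like in code, on a made-up toy problem (binary variables, arbitrary feature functions, arbitrary target expectations mu); all the specifics here are illustrative, not anything from your post, and it's a sketch of the dual trick rather than a drop-in for your setup:

```python
# Sketch of the maxent dual recipe on a toy problem: all names/values here are made up.
from itertools import product

import numpy as np
import jax
import jax.numpy as jnp
from scipy.optimize import minimize

jax.config.update("jax_enable_x64", True)  # use float64 so scipy's optimizer is happy

# Small discrete outcome space: all length-4 bitstrings.
xs = np.array(list(product([0, 1], repeat=4)), dtype=float)      # shape (16, 4)

# Feature matrix F[x, i] = f_i(x): each individual bit, plus one pairwise product.
F = jnp.array(np.column_stack([xs, xs[:, 0] * xs[:, 1]]))        # shape (16, 5)
mu = jnp.array([0.7, 0.5, 0.5, 0.5, 0.4])                        # target expectations E[f_i(X)]

# Dual objective: ln Z(lambda), with the constraints folded in as (f_i - mu_i).
def dual(lam):
    return jax.nn.logsumexp(F @ lam - mu @ lam)

dual_grad = jax.grad(dual)

# Unconstrained convex minimization; scipy with jax gradients handles it quickly.
res = minimize(lambda lam: float(dual(jnp.array(lam))),
               x0=np.zeros(len(mu)),
               jac=lambda lam: np.array(dual_grad(jnp.array(lam)), dtype=float),
               method="BFGS")

# Recover the maxent distribution from the optimal multipliers.
lam_opt = jnp.array(res.x)
logits = F @ lam_opt
p = jnp.exp(logits - jax.nn.logsumexp(logits))
print("recovered expectations:", F.T @ p)   # should closely match mu
```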
I would guess that process generalizes just fine to other f-divergences, since it's basically just relying on convex optimization tricks. And that should yield quite fast numerics.
A barrier for : suppose X and Y are both bitstrings of length 2k. The first k bits of the two strings are equal (i.e. X[:k] == Y[:k] in python notation); the rest are independent. Otherwise, all bits are maxentropic (i.e. IID 50/50 coinflips).
Then there's an exact (deterministic) natural latent: Λ := X[:k] = Y[:k]. But I(X; Y), H(X|Y), and H(Y|X) are all much larger than zero; each is k bits.
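In case anyone wants to sanity-check that arithmetic, here's a quick brute-force computation for small k (just my verification sketch, not anything from the original post):

```python
# Brute-force check of the claim above for small k: X, Y are length-2k bitstrings
# sharing their first k bits, with all bits otherwise IID 50/50.
from itertools import product
from math import log2

k = 3

# Enumerate the joint distribution: shared prefix, independent suffixes.
joint = {}
for prefix in product([0, 1], repeat=k):
    for sx in product([0, 1], repeat=k):
        for sy in product([0, 1], repeat=k):
            x, y = prefix + sx, prefix + sy
            joint[(x, y)] = 1 / 2 ** (3 * k)   # uniform over the 2^(3k) possibilities

def H(dist):
    """Shannon entropy (in bits) of a dict mapping outcomes to probabilities."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)

# Marginals of X and Y.
px, py = {}, {}
for (x, y), p in joint.items():
    px[x] = px.get(x, 0) + p
    py[y] = py.get(y, 0) + p

Hxy, Hx, Hy = H(joint), H(px), H(py)
print("I(X;Y) =", Hx + Hy - Hxy)   # = k
print("H(X|Y) =", Hxy - Hy)        # = k
print("H(Y|X) =", Hxy - Hx)        # = k
```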
"On Green" is one of the LessWrong essays which I most often refer to in my own thoughts (along with "Deep Atheism", which I think of as a partner to "On Green"). Many essays I absorb into my thinking, metabolize the contents, but don't often think of the essay itself. But this one is different. The essay itself has stuck in my head as the canonical pointer to... well, Green, or whatever one wants to call the thing Joe is gesturing at.
When I first read the essay, my main thought was "Wow, Joe did a really good job pointing to a thing that I do not like. Screw Green. And thank you Joe for pointing so well at this thing I do not like.". And when the essay has come up in my thoughts, it's mostly been along similar lines: I encounter something in the wild, I'm like "ugh screw this thing... wait, why am I all 'ugh' at this, and why do other people like it?... oh it's Carlsmith!Green, I see".
But rereading the essay today, I'm struck by different thoughts than a year ago.
Green = Oxytocin Feeling?
Let's start with the headline gesturing:
Today, reading that, I think "oxytocin?". That association probably isn't obvious to most of you; oxytocin is mostly known as the hormone associated with companionate love, which looks sort of like wholesomeness and happy villages but not so much like nature. So let me fill in the gap here.
Most people feel companionate love toward their family and (often metaphorical) village. Some people - usually more hippie-ish types - report feeling the same thing towards all of humanity. Even more hippie-ish types go further, feeling the same thing toward animals, or all living things, or the universe as a whole. The phrase which usually cues me in is "deep connection" - when people report feeling a deep connection to something, that usually seems to cash out to the oxytocin-feeling.
So when I read the post, I think "hmm, maybe Joe is basically gesturing at the oxytocin-feeling, i.e. companionate love". That sure does at least rhyme with a lot of other pieces of the post.
... and that sure would explain why I get a big ol' "ugh" toward Green; I already knew I'm probably genetically unable to feel oxytocin. To me, this whole Green thing is a cluster of stuff which seem basically-unremarkable in their own right, but for some reason other people go all happy-spiral about the stuff and throw away a bunch of their other values to get more Green.
Is Green instrumentally worthwhile?
That brings us to the central topic of the essay: is there a case for more Green which doesn't require me to already value Green in its own right? Given that I don't feel oxytocin, would it nonetheless be instrumental to the goals I do have to embrace a bit more Green?
My knee-jerk reaction to deep atheism is basically "yeah deep atheism is just straightforwardly correct". When Joe says "Green, on its face, seems like one of the main mistakes" I'm like "yes, Green is one of the main mistakes, it still looks like that under its face too, it's just straightforwardly a mistake". It indeed looks like a conflation between is and ought, like somebody's oxytocin-packed value system is trying to mess with their epistemics, like they're believing that the world is warm and fuzzy and caring because they have oxytocin-feelings toward the world, not because the world is actually helping them.
... but one could propose that normal human brains, with proper oxytocin function, are properly calibrated about how much to trust the world. Perhaps my lack of oxytocin biases me in a way which gives me a factually-incorrect depth of atheism.
But that proposal does not stand up to cursory examination. The human brain has pretty decent general purpose capability for figuring out instrumental value; why would that machinery be wrong on the specific cluster of stuff oxytocin touches? It would be really bizarre if this one apparent reward-system hack corrects an epistemic error in a basically-working general purpose reasoning machine. It makes much more sense that babies and families just aren't that instrumentally appealing, so evolution dropped oxytocin into the reward system to make babies and families very appealing as a terminal value. And that hack then generalized in weird ways, with people sometimes loving trees or the universe, because that's what happens when one drops a hacky patch into a reward function.
The upshot of all that is... this Green stuff really is a terminal-ish value. And y'know, if that's what people want, I'm not going to argue with the utility function (as the saying goes). But I am pretty darn skeptical of attempts to argue instrumental necessity of Green, and I do note that these are not my values.