I'm excited by this post, because I think you're onto something big and getting it almost right. Strong upvote!
Let me summarize what I'm reading here and you can tell me if I'm understanding correctly.
I'll start with where I agree, then give a worked example where I think your details aren't quite right, then gesture at the general direction.
Consciously tracking what you're actually optimizing for is one example skill which seems worth developing despite its time cost
This is quite the understatement. You're talking about inner alignment. This is the skill for rationality IMO, and has strong connections to the big problem of the day once you start to notice it -- and I see from your next post that you have.
Say I'm explaining to someone why they're wrong about X. I "want" both of us to get to the truth; but I'm locally intuitively trying to get them to change their mind. They're wrong, after all. I'm speaking some words I think will get them to believe not X, but Y...
[...]
So now I'm optimizing, intuitively but not explicitly, to change this person's beliefs. What do you expect I'll feel if they give a strong counterargument?
I'll feel defensive. I don't endorse this feeling, but my short-term intuitive goal is beset!
This is exciting to see because I rarely see people diagnose the core problem so clearly, but the details are where you're going to struggle in implementation.
A bit of backstory: I found this stuff from a different angle -- by using hypnosis as a starting point to understand how this all works and working up from the "bit banging" towards a more generalizable model of what's going on. This "Trying to explain to someone they're wrong, but it doesn't work" is actually the opening example/challenge I give for my sequence on the topic.
You're right that we aren't actually optimizing for "getting both of us to the truth" and that this is why we don't progress smoothly towards the truth. The problem is that "trying to get them to change their mind" isn't a sufficiently complete picture either.
We're trying to get them to change their mind conditional on us being right -- in other words, trying to assert our perspective at them, expecting this to change their mind. We're not drawn towards any possibilities that involve us even examining whether we could be wrong, even if this would be the persuasive thing to do -- as evidenced by the predictable defensiveness should we be pressed in that direction.
If we were actually optimizing towards "change their mind", we'd notice real fast that our own defensiveness gets in the way, because we'd notice that our first impulses aren't going to succeed at "changing their mind", and wonder why that is and what we can do to fix it. And the obvious answer, once we want to find it, is that we're coming off as closed-minded and not open to finding out if we're actually right.
Which makes sense, because what do we do when people assert their perspective at us, expecting it to change our mind, without even considering what they might be missing? Okay, so the way to change their mind is to demonstrate open-mindedness and -- shit, yeah, that feels aversive to me. Okay, what now?
What we're locally optimizing for is even less flattering. We're trying to push away their disagreement. We're trying to batter it with arguments and reason until it stops reminding us that the expectations we're placing on the other person's beliefs are wrong. This can kinda sorta work, when social pressure is enough for suppression, but it never actually changes their beliefs, let alone helps us find reality.
There are a lot of moving pieces to sort out, which is why I didn't close the loop on the opening example until the capstone post. The takeaway is that aligning our local optimization with our conception of what we're optimizing for brings both towards "finding truth" and, as a result of bringing us both towards truth, conditional on us being as right as we thought we were, convincing the other person.
So like, expect to actually change their mind if you're so right, and then notice what happens.
But what would it mean to run the negative query "what would it feel like if my current model of this situation were false?"
If I ask "what explains this under my current model?", my mind has something to roll out; how's the simulator supposed to run "what explains not(this) under my current model?"
Right, it's a confusing question to grapple with. It seems unproductive because "If I were wrong" is assumed, not evidenced, so it could go anywhere. Then what am I using to predict? How do I choose which bits to pretend aren't actually real, in my best guess?
Fortunately, we don't have to assume, and we don't even have to notice confusion.
We just have to notice wrongness. Dissatisfaction with reality. Dukkha. Prediction error. It's everywhere.
When we're frustrated that the other person isn't yielding to our obvious rightness, why? We're expecting them to believe us because we're right, and they're not believing us. Our expectation is being empirically falsified, and the path towards becoming less wrong is to start wondering why that is.
We'll often end up at places like "Because they're DumbBad, that's why", which, okay, fine. Maybe. But then why were we expecting them to listen to our facts and logic? Is that what DumbBad people do? Until all feels right in the world, we're wrong about something, and that feeling of "not-rightness" is the clue as to where our predictions have diverged from reality.
We might be patting ourselves on the back when we get things that we predict should be rewarding, but unless we're actually right, reality is going to be slapping us in the face about it sooner or later. We might misinterpret the slap at first, but if we keep going it'll lead us back to ground. And we can even update on predicted slaps before they happen, doing a lot of the work ahead of time before we actually do anything wrong.
Confusion is what it feels like to notice that our models are insufficient. There won't be confusion to notice until we notice that our models are insufficient, and it's noticing "wrongness" that gets us there.
Were I conscious of my intuitive goal instead of merely tracking it,
How do you define "conscious" vs "merely tracking"? Like, what's a rough way to explain the most important differences between them?
Intuitively, to me, "tracking" is synonymous with "being conscious of"; I wonder if you're trying to point at some specific category of qualia or what :D
Steven Byrnes has written quite a lot on brain-like AGI algorithms. I'll reiterate a small part of his work here, but you'd be better off reading his stuff directly.
For the ideas which inspired me, see here, here, and here.
This post has good handles for intuitively applying neuroscience to rationality. Don't mistake my "simulator" cluster for a distilled and reduced concept; it's "lumpy rock", not "skewed prism of seafloor impactite from the late Devonian".
My (loose) extension of Steve's models of human minds predicts that we'll get stuck in local minima of self-reinforcing nonreality; with this knowledge, we can more directly target the underlying mechanisms of cognitive biases.
Consciously tracking what you're actually optimizing for is one example skill which seems worth developing despite its time cost. In fact, many rationality skills (like noticing your confusion and scout mindset) are species of this overlying stance.
Let's take scout mindset as an example.
While discussing something, the goal for rationalists usually isn't to convince the other person of their ideas, but rather to come to truthier beliefs.
One should therefore be vigilant, lest they fall into a combative posture, since this makes them slow to update.
Perhaps modern neuroscience can help this person be rewarded by rational thoughts? Let me just...
Simulators
Brains run simulations. I've not seen Steve claim this[2], hence "simulations" instead of "thought predictors".
Let's say that I go parasailing in the Andes. During flight, my mind absorbs tons of data about the sky, landscape, my gear, etc. When I get home, my brain runs lots of internal simulations about flying through the mountains, birds, and more specific stuff about, say, wind patterns. These simulations are based off of those earlier sensory experiences.
In this way, my mind converts the tidbits of data in my flight memories into lots of training examples for itself. To a Solomonoff inductor (an imagined perfect intelligence), my simulations don't add any extra information about my flight compared to my memories.
But I'm not a Solomonoff inductor, so my mind needs many self-generated training examples to fully internalize the flight.
(This is one reason we're so damn sample-efficient compared to LLMs. We can keep generating internal training data. LLMs, by comparison, have to be fed training data by some externally bottlenecked process like internet scraping or human-written RL environments.)
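To make the "replay adds no information but still helps" point concrete, here is a toy sketch. It makes no claim about how brains actually learn; it only shows that a learner taking small incremental steps gets far more out of the same three observations when it replays them. The function name `fit_slope`, the learning rate, and the data are all made up for illustration.

```python
def fit_slope(observations, replays, rate=0.01):
    """Fit y = slope * x by small gradient steps, replaying the same data.

    Hypothetical toy learner: each replay pass revisits the identical
    observations, so no new information arrives, only more updates.
    """
    slope = 0.0
    for _ in range(replays):
        for x, y in observations:
            error = slope * x - y      # prediction error on this memory
            slope -= rate * error * x  # one small corrective step
    return slope


# Three "memories" generated by a world where y = 2 * x.
observations = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

print(fit_slope(observations, replays=1))    # a single pass barely moves the estimate
print(fit_slope(observations, replays=500))  # replaying the same memories converges
```

An ideal inductor would read the slope off the three points immediately. The incremental learner only gets there by rerunning them, which is the role I'm assigning to internal simulation.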
From what I see, LessWrong rationality advice is almost entirely "ingest this particular training example", which works great if your simulators are faithful to the content of that data. But they're not.
Simulator dynamics cause cognitive biases and actively prevent you from fixing them.
Drift; cognitive attractors
Simulators can be influenced by each other cyclically; they're not straightforwardly truthful updates on sensory data. As an extreme example, these internal thoughts[3]:
Now I like my friend less! He did absolutely nothing to imply that he'd attack me with a frying pan. In fact, I had a stomach cramp I didn't notice, which biased my idle thoughts towards pain, and my friend happened to be on my mind because I visited him that morning.
A more typical example for the audience of this post:
Cognitive simulators are a dynamical system, in the same sense as weather systems and Conway's Game of Life. At least in humans, simulators can loop back on themselves and/or influence parallel simulators such that they fall into attractor basins.
When a simulator predicts a thought/action X will be rewarded, X-promoters actually do get rewarded. Then future simulations will run "X is rewarding", pinning you in a cycle.
Like water running down a hill, minds can get stuck in local puddles.
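A minimal sketch of that self-reinforcing loop, as a toy dynamical system rather than a model of any real brain circuit. Everything here is an assumption made for illustration: the function name `settle`, the 0.5 threshold, and the parameters `true_reward` and `self_weight` (how much of the felt reward comes from the simulator's own prediction instead of from reality).

```python
def settle(belief, true_reward=0.2, self_weight=0.9, rate=0.5, steps=50):
    """Iterate a toy 'X is rewarding' belief until it settles.

    Hypothetical dynamics: if the simulator currently predicts X is
    rewarding (belief > 0.5), that prediction itself supplies most of the
    felt reward. The belief then updates towards whatever was felt.
    """
    for _ in range(steps):
        predicted_payoff = 1.0 if belief > 0.5 else 0.0
        felt = (1 - self_weight) * true_reward + self_weight * predicted_payoff
        belief += rate * (felt - belief)
    return belief


# Same world (true_reward = 0.2), two slightly different starting beliefs.
print(round(settle(0.6), 2))  # settles in the high "X is rewarding" puddle
print(round(settle(0.4), 2))  # settles in the low puddle

# With no self-reinforcement, both starting points find the true value.
print(round(settle(0.6, self_weight=0.0), 2))
print(round(settle(0.4, self_weight=0.0), 2))
```

With these made-up numbers the two starting beliefs end up near 0.92 and 0.02, and neither matches the true reward of 0.2. Which puddle you land in depends on where you started, not on reality. Turning `self_weight` down to zero removes both puddles and every starting belief converges to 0.2.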
What are you optimizing in your head
From "What Are You Tracking In Your Head"[4]:
Do you track what you're locally optimizing for? Are you intuitively aware of whatever queries/searches you're running, all of the time?
Lots of biases don't feel wrong (to an untrained person) because they don't realize they're running the wrong search. Their mind is aiming for something completely different from what they "think" they're doing. If you ask such a person what they're trying to do, they'll describe something different from what they'd feel by introspecting during the process.
Say I'm explaining to someone why they're wrong about X. I "want" both of us to get to the truth; but I'm locally intuitively trying to get them to change their mind. They're wrong, after all. I'm speaking some words I think will get them to believe not X, but Y...
So now I'm optimizing, intuitively but not explicitly, to change this person's beliefs. What do you expect I'll feel if they give a strong counterargument?
I'll feel defensive. I don't endorse this feeling, but my short-term intuitive goal is beset!
Were I conscious of my intuitive goal instead of merely tracking it, I'd notice the swap. This holds for many classes of error you can find in the Sequences.[5]
Social pulls
Humans have quite strong social drives. This is more of a consequence of evolutionarily specified drives than of architectural inefficiencies, so training tactics differ a bit.
Here again, had I been conscious of my immediate optimization target, I would've noticed the prestige-drive pulling me away from good work.
But this goes much deeper. Down to the felt sense of what's culturally acceptable/expected by the people around me, and in which ways I care.
Probably surrounding myself with people who think very clearly would help. But that leaves a weird residual mismatch; I don't immediately want to be The Rationalistist, even if my milieu would very much respect that.
No, I want to look like someone with excellent epistemology, not to mention my thousand other evolved desires. Culture is only a partial solution, because I still need to actively guide my thought-drift.
Negative queries are unnatural (to most people)
"Try to falsify your beliefs" is good advice, but notice that it's a very strange operation to ask of a simulator architecture.
When I run a query like "why is the 3d printer I just bought not working?" my mind can start rolling out possible explanations: clogged nozzle, disconnected wire etc. I have a problem, so my mind searches for trajectories which would explain my experience. Same for social situations, research, driving. I have some current problem and my mind searches forward through explanations and possible actions.
But what would it mean to run the negative query "what would it feel like if my current model of this situation were false?"
Simulators run forward, generating possible continuations from some starting frame. If I ask "what explains this under my current model?", my mind has something to roll out; how's the simulator supposed to run "what explains not(this) under my current model?"
Maybe I spend a lot of time imagining what the inverse shape of each of my models feels like. A priori this seems very wasteful to me, as it's swimming upstream against your hardwired cognitive architecture. But I could imagine it working out.
Confusion, however, is something on which we can run positive queries, and it's trainable. "What am I confused about" slowly but predictably clarifies where reality is biting your models, so long as you're actually optimizing for understanding your confusion.
See Taste & Shaping.
Engram replay seems to be doing something like "running simulations", though much of that is probably also "optimizing the representation of episodic memories for predictive efficiency". Like defragmenting a hard disk drive, but neuromorphic.
I noticed this sort of spiral all the time as a young teenager.
Explicitly citing quote here, not an example monologue.
I might just be saying in more words "deliberately practice rationality", but "deliberate practice" doesn't, to me, properly describe the metacognitive pose one should take therein.