Lovely! Yeah, that rhymes and scans well enough for me!
Here are my experiments; they're pretty good, but I don't count them as "reliably" scanning. So I think I'm gonna count this one as a win!
(I haven't tried testing my chess prediction yet, but here it is on ASCII-art mazes.)
I found this lens very interesting!
Upon reflection, though, I begin to be skeptical that "selection" is any different from "reward."
Consider the description of model-training:
To motivate this, let's view the above process not from the vantage point of the overall training loop but from the perspective of the model itself. For the purposes of demonstration, let's assume the model is a conscious and coherent entity. From its perspective, the above process looks like:
- Waking up with no memories in an environment.
- Taking a bunch of actions.
- Suddenly falling unconscious.
- Waking up with no memories in an environment.
- Taking a bunch of actions.
- And so on...
The model never "sees" the reward. Each time it wakes up in an environment, its cognition has been altered slightly such that it is more likely to take certain actions than it was before.
What distinguishes this from how my brain works? The above is pretty much exactly what happens to my brain every millisecond:
Why say that I "see" reward, but the model doesn't?
Is it cheating to say this? I don't think so. Both I and GPT-3 saw the sentence "Paris is the capital of France" in the past; both of us had our synapse weights tweaked as a result; and now both of us can tell you the capital of France. If we're saying that the model doesn't "have memories," then, I propose, neither do I.
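To make the quoted training process concrete, here is a minimal sketch of it as a toy program. Everything in it is my own illustrative assumption rather than anything from the post: a two-action environment where action 1 is the rewarded one, a REINFORCE-style update, and the hypothetical names `act` and `train`. The thing to notice is that the reward is only ever consumed by the update step; it is never an argument to the policy.

```python
import math
import random


def act(logits, rng):
    """The policy: its only inputs are its parameters and a source of randomness.

    The reward is never passed in here (illustrative assumption).
    """
    p1 = 1.0 / (1.0 + math.exp(logits[0] - logits[1]))  # softmax P(action 1)
    return 1 if rng.random() < p1 else 0


def train(episodes=2000, lr=0.1, seed=0):
    """Toy training loop: act, get scored, have the parameters nudged, repeat."""
    rng = random.Random(seed)
    logits = [0.0, 0.0]
    for _ in range(episodes):
        # "Waking up with no memories": nothing carries over except the logits.
        action = act(logits, rng)
        # The reward goes to the training loop, not to the policy.
        reward = 1.0 if action == 1 else 0.0
        # "Falling unconscious": the outer loop alters the parameters slightly
        # (gradient of the log-probability of the action taken, scaled by reward).
        p1 = 1.0 / (1.0 + math.exp(logits[0] - logits[1]))
        probs = [1.0 - p1, p1]
        for a in range(2):
            grad = (1.0 if a == action else 0.0) - probs[a]
            logits[a] += lr * reward * grad
    return logits
```

After training, the policy prefers action 1, even though nothing resembling a reward signal ever appeared among its inputs. Whether that counts as "not seeing" the reward is, of course, exactly the question at issue.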
I was trying to say that the move used to justify the coin flip is the same move that is rejected in other contexts.
Ah, that's the crucial bit I was missing! Thanks for spelling it out.
Reflectively stable agents are updateless. When they make an observation, they do not limit their caring as though all the possible worlds where their observation differs do not exist.
This is very surprising to me! Perhaps I misunderstand what you mean by "caring," but: an agent who's made one observation is utterly unable[1] to interact with the other possible-worlds where the observation differed; and it seems crazy[1] to choose your actions based on something they can't affect; and "not choosing my actions based on X" is how I would define "not caring about X."
[1] Aside from "my decisions might be logically-correlated with decisions that agents in those worlds make (e.g. clone-prisoner's-dilemma)," or "I am locked into certain decisions that a CDT agent would call suboptimal, because of a precommitment I made (e.g. Newcomb)" or other fancy decision-theoretic stuff. But that doesn't seem relevant to Eliezer's lever-coin-flip scenario you link to?
Yeah, if you have a good enough mental index to pick out the relevant stuff, I'd happily take up to 3 new bounty-candidate links, even though I've mostly closed submissions! No pressure, though!
Thanks for the links!
I paid a bounty for the Shard Theory link, but this particular comment... doesn't do it for me. It's not that I think it's ill-reasoned, but it doesn't trigger my "well-reasoned argument" sensor -- it's too... speculative? Something about it just misses me, in a way that I'm having trouble identifying. Sorry!
You win both of the bounties I precommitted to!