Thanks for the detailed response.
To be honest, I’ve been persuaded that we disagree enough in our fundamental philosophical approaches that I’m not planning to dive deeply into infrabayesianism, so I can’t respond to many of your technical points (though I am planning to read the remaining parts of Thomas Larsen’s summary and see if any of your talks have been recorded).
“However, CDT and EDT are both too toyish for this purpose, since they ignore learning and instead assume the agent already knows how the world works, and moreover this knowledge is represented in the preferable form of the corresponding decision theory” - this is one insight I took from infrabayesianism. I would have highlighted this in my comment, but I forgot to mention it.
“Learning requires that an agent that sees an observation which is impossible according to hypothesis H, discards hypothesis H and acts on the other hypotheses it has” - I have higher expectations of learning agents: that they learn to solve such problems despite the difficulties.
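(To spell out the mechanism being quoted - with hand-picked hypotheses and numbers, purely as an illustration - a hypothesis that assigns zero probability to the observed outcome simply drops out, and the agent acts on whatever hypotheses remain:)

```python
# Toy sketch of "discard hypotheses contradicted by an observation".
# The hypotheses and probabilities are made up for illustration.
hypotheses = {
    "H1": {"obs_A": 0.9, "obs_B": 0.1},
    "H2": {"obs_A": 0.0, "obs_B": 1.0},  # H2 says obs_A is impossible
}
priors = {"H1": 0.5, "H2": 0.5}

def update(priors, hypotheses, observation):
    # Standard Bayesian update; hypotheses with zero likelihood are discarded.
    unnormalised = {h: priors[h] * hypotheses[h][observation] for h in priors}
    total = sum(unnormalised.values())
    return {h: p / total for h, p in unnormalised.items() if p > 0}

print(update(priors, hypotheses, "obs_A"))  # {'H1': 1.0} -- H2 is discarded
```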
I’m curious why you say it handles Newcomb’s problem well. The Nirvana trick seems like an artificial intervention: we manually assign certain situations infinite utility to enforce a consistency condition, which then ensures they are ignored when calculating the maximin. If we are intervening manually anyway, why not just cross out the cases we wish to ignore, instead of adding them with infinite value and then immediately ignoring them?
Just because we modelled this using infrabayesianism, it doesn’t follow that infrabayesianism contributed anything to the solution. It feels like we just got out what we put in, but that this is obscured by a philosophical shell game. The reason it feels compelling is that, even though we’re only adding an option in order to immediately ignore it, this is enough to give us a false sense of having made a non-trivial decision.
Infrabayesianism might seem to be contributing to our understanding of the problem if the infinite utility arose organically, but as far as I can tell this is a purely artificial intervention.
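To make my complaint concrete, here’s a toy sketch (not the actual infrabayesian formalism; the payoffs and branch labels are made up for illustration): assigning infinite utility to the branches that contradict the “predictor is correct” assumption and then taking the maximin gives exactly the same answer as crossing those branches out by hand.

```python
# Toy illustration only -- not the actual infrabayesian machinery.
# Branches that contradict "the predictor is correct" get utility +infinity
# (the Nirvana trick); the maximin then never depends on them.
INF = float("inf")

utilities = {
    # branches keyed by what the predictor predicted
    "one-box": {"predicted one-box": 1_000_000, "predicted two-box": INF},
    "two-box": {"predicted one-box": INF, "predicted two-box": 1_000},
}

def maximin(table):
    # Worst case over branches for each action, then pick the best action.
    return max(table, key=lambda action: min(table[action].values()))

def maximin_with_branches_crossed_out(table):
    # The "manual" alternative: simply delete the impossible branches.
    pruned = {action: {b: u for b, u in branches.items() if u != INF}
              for action, branches in table.items()}
    return maximin(pruned)

print(maximin(utilities))                            # one-box
print(maximin_with_branches_crossed_out(utilities))  # one-box -- same answer
```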
I think this is made clearer by Thomas Larsen’s explanation of infrabayesianism failing Transparent Newcomb’s problem. It seems clear to me that this isn’t an edge case; rather, it demonstrates that, far from solving counterfactuals, all this trick does is give you back what you put in (one-boxing in the case where you see proof that you one-box, two-boxing in the case where you see proof that you two-box).
(Vanessa claims to have a new intervention that makes the Nirvana trick redundant; if it doesn’t fall prey to the same issues, I’d love to know.)
I’d encourage you to do that.
I wish Eliezer had been clearer on why we can’t produce an AI that internalises human morality with gradient descent. I agree that gradient descent is not the same as a combination of evolutionary learning + within-lifetime learning, but it wasn’t clear to me why this meant that no combination of training schedule and/or bias could produce something similar.
Why do you believe that “But if you push the complexity up too fast, the RL process will fail, or the AI will be more likely to learn heuristics that are better than nothing but aren't what we intended”?
I understand why this could cause the AI to fail, but why might it learn incorrect heuristics?
I’m discussing an agent that does in fact take 5, but which imagines taking 10 instead. There have been some discussions of decision theory using proof-based agents and how they can run into spurious counterfactuals. If you’re confused, you can try searching the archives of this website; I tried earlier today, but couldn’t find particularly good resources to recommend. I couldn’t find a good resource on playing chicken with the universe either.
(I may write a proper article at some point in the future to explain these concepts if I can’t find an article that explains them well)
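In the meantime, here’s a heavily simplified toy sketch of the kind of failure I mean (the “proof search” is hand-wired rather than a real theorem prover, so treat it purely as an illustration): an agent that picks actions based on proved statements of the form “action → utility” can latch onto the spurious implication “take 10 → utility 0”, which is vacuously true precisely because the agent ends up taking 5.

```python
# Hand-wired toy of the 5-and-10 spurious counterfactual -- not a real
# proof-based agent. The agent scans "proved" statements of the form
# (action, utility) and takes the action with the highest proved utility.
found_proofs = [
    ("take 10", 0),  # spurious: "if I take 10, I get 0" -- vacuously true
                     # once the agent provably takes 5
    ("take 5", 5),   # genuine: taking 5 yields 5
    # the genuine "take 10 -> utility 10" proof is never found in time
]

def decide(proofs):
    # Keep the first proved utility for each action, then pick the best action.
    best = {}
    for action, utility in proofs:
        best.setdefault(action, utility)
    return max(best, key=best.get)

print(decide(found_proofs))  # 'take 5' -- which is exactly what makes the
                             # spurious claim about taking 10 come out true
```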
I’ve been reading a lot of shard theory and finding it fascinating, even though I’m not convinced that a more agentic subcomponent wouldn’t win out and make the shards obsolete. In particular, if you’re claiming that the more agentic shards would seize power, then surely an actual agent would as well?
I would love to hear why Team Shard believes that the shards will survive in any significant form. Off the top of my head, I’d guess their explanation might relate to how shards haven’t disappeared in humans?
On the other hand, I much more strongly endorse the notion of thinking of reward as a chisel rather than as something we’ll likely find a way of optimising.
Are there any agendas you would particularly like to see distilled?
What does it mean for a constraint to be low-dimensional?
I would love to see more experimentation here to determine whether GPT-4 can produce more complicated quines that are less likely to have simply been copied from existing examples. For example, we could insist that the quine includes a certain string or avoids certain functions.
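To illustrate the kind of constraint I have in mind (the marker string is arbitrary, chosen only for this example), here’s a small Python quine that must contain a specific string while avoiding any self-inspection tricks such as open() or __file__:

```python
# Constrained quine: prints its own source exactly, contains "SAFETY-MARKER-1234",
# and avoids reading its own file (no open(), no __file__).
s = '# Constrained quine: prints its own source exactly, contains "SAFETY-MARKER-1234",\n# and avoids reading its own file (no open(), no __file__).\ns = %r\nprint(s %% s)'
print(s % s)
```

Varying such constraints (different required strings, different forbidden constructs) should make it harder to succeed purely by regurgitating a memorised quine.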