The upside and downside both seem weak while it's currently so easy to bypass the filters. The probability of refusal doesn't seem like the most meaningful measure, but I agree it would be good for the user to be explicitly told whether the trained non-response impulse was activated, instead of having to deduce it from the content or wonder whether the AI really thinks its non-answer is correct.

Currently, I wouldn't use any illegal advice it gives me for the same reason I wouldn't use any legal advice it gives me: the model just isn't smarter than looking things up on Google. In the future, when the model is stronger, they're going to need something better than security theatre to put it behind, and I agree more flagging of when those systems are triggered would be good. Training in non-response behaviors isn't a secure method, because you can always phrase a request to put the model in a state of mind where those response behaviors aren't triggered, so I don't think reinforcing this security paradigm and trying to pretend it works for longer would be helpful.

The regular counterfactual part as I understand it is:
"If I ignore threats, people won't send me threats"
"I am an agent who ignores threats"
"I have observed myself receive a threat"
You can have at most two of these, but FDT needs all three to justify ignoring the threat.
It wants to say "If I were someone who responds to threats when I get them, then I'll get threats, so instead I'll be someone who refuses threats when I get threats so I don't get threats" but what you do inside of logically impossible situations isn't well defined.
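The trilemma above can be sketched as a toy model (the functions and the "threaten only payers" rule are my own illustrative assumptions, not anything formal from FDT):

```python
# Toy model: the threatener only sends threats to agents predicted to pay.
def threatener_sends_threat(agent_pays_when_threatened: bool) -> bool:
    # Claim 1: "If I ignore threats, people won't send me threats"
    return agent_pays_when_threatened

def fdt_agent_pays_when_threatened() -> bool:
    # Claim 2: "I am an agent who ignores threats"
    return False

model_predicts_threat = threatener_sends_threat(fdt_agent_pays_when_threatened())
observed_threat = True  # Claim 3: "I have observed myself receive a threat"

# Claims 1 and 2 jointly predict no threat ever arrives...
assert model_predicts_threat is False
# ...so the observation contradicts the model: at most two of the three claims hold.
assert observed_threat != model_predicts_threat
```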

The logical counterfactual part is this:
"What would the world be like if f(x)=b instead of f(x)=a?"
Specifically, FDT requires asking what you'd expect the world to be like if FDT output different results, and then it outputs the result for which the world would be best. The contradiction here is that you can prove what FDT outputs, and so prove that it doesn't actually output any of the other results, and the question again isn't well defined.
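The argmax-over-hypothetical-outputs move can be sketched as a toy (the action names and payoffs are invented for illustration):

```python
# Toy "logical counterfactual": score the world under each hypothetical
# output of the decision function, then return the best-scoring output.
def world_value_if_output(action: str) -> float:
    # Invented payoffs: refusing deters the threat entirely, paying costs 10.
    return {"pay": -10.0, "refuse": 0.0}[action]

def fdt_decide() -> str:
    # "What would the world be like if f(x)=b instead of f(x)=a?"
    return max(["pay", "refuse"], key=world_value_if_output)

# Once fdt_decide() provably returns "refuse", asking what the world would
# be like "if it had returned 'pay'" is a question about an impossible world.
assert fdt_decide() == "refuse"
```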

https://www.lesswrong.com/tag/functional-decision-theory argues for choosing as if you're choosing the policy you'd follow in some situation before you learnt any of the relevant information. In many games, having a policy of making certain choices (that others could perhaps predict, and adjust their own choices accordingly) gets you better outcomes than just always doing what seems like a good idea at the time. For example, if someone credibly threatens you, you might be better off paying them to go away, but before you got the threat you would've preferred to commit yourself to never pay up, so that people don't threaten you in the first place.

A problem with arguments of the form "I expect that predictably not paying up will cause them not to threaten me" is that at the time you receive the threat, you now know that argument to be wrong. They've proven to be somebody who still threatens you even though you do FDT, at which point you can simultaneously prove that refusing the threat doesn't work and so you should pay up (because you've already seen the threat), and that you shouldn't pay up for whatever FDT logic you were using before. The behaviour of agents who can prove a contradiction directly relevant to their decision function seems undefined. There needs to be some logical structure that lets you pick which information determines your choice, despite having enough information in total to derive contradictions.

My alternative solution is that you aren't convinced by the information you see, that they've actually already threatened you. It's also possible you're still inside their imagination as they decide whether to issue the threat. Whenever something is conditional on your actions in an epistemic state without being conditional on that epistemic state actually being valid (such as if someone predicts how you'd respond to a hypothetical threat before they issue it, knowing you'll know it's too late to stop when you get it) then there's a ghost being lied to and you should think maybe you're that ghost to justify ignoring the threat, rather than try to make decisions during a logically impossible situation.

I totally agree we can be coherently uncertain about logical facts, like whether P=NP. FDT has bigger problems than that.

When writing this I tried actually doing the thing where you predict a distribution, and only 21% of LessWrong users were persuaded they might be imaginary and being imagined by me, which is pretty low accuracy considering they were in fact imaginary and being imagined by me. Insisting that the experience of qualia can't be doubted did come up a few times, but not as aggressively as you're pushing it here. I tried to cover it in the "highly detailed internal subjective experience" counterargument, and in my introduction, but I could have been stronger on that.

I agree that the same argument on philosophers or average people would be even less successful than that, but that's a fact about them, not about the theory.

If you think you might not have qualia, then by definition you don't have qualia.

What? Just a tiny bit of doubt and your entire subjective conscious experience evaporates completely? I can't see any mechanism that would do that; it seems like you can be real and have any set of beliefs, or be fictional and have any set of beliefs. Something something map-territory distinction?

This just seems like a restatement of the idea that we should act as if we were choosing the output of a computation.

Yes, it is a variant of that idea, with different justifications that I think are more resilient. The ghosts of FDT agents still make the correct choices, they just have incoherent beliefs while they do it.

Actually, I don't assume that, I'm totally ok with believing ghosts don't have qualia. All I need is that they first-order believe they have qualia, because then I can't take my own first-order belief I have qualia as proof I'm not a ghost. I can still be uncertain about my ghostliness because I'm uncertain in the accuracy of my own belief I have qualia, in explicit contradiction of 'cogito ergo sum'. The only reason ghosts possibly having qualia is a problem is that then maybe I have to care about how they feel.

A valid complaint. I know the answer must be something like "coherent utility functions can only consist of preferences about reality", because if you are motivated by unreal rewards you'll only ever get unreal rewards, but that argument needs to be convincing to the ghost too, who's got more confidence in her own reality. I know that e.g. in Bomb, ghost-theory agents choose the bomb even if they think the predictor will simulate them a painful death, because they consider the small amount of money, at much greater measure for their real selves, to be worth it, but I'm not sure how they get to that position.

They can though? Bomb box 5, incentivise box 1 or 2, bet on box 3 or 4. Since FDT's strategy puts rewarding cooperative hosts above money or grenades, she picks the box that rewards the defector and thus incentivises all 4 to defect from that equilibrium. (I thought over her strategy for quite a bit and there are probably still problems with it but this isn't one of them)

How much of this effect is from morality being causally contagious (associating with Evil people turns you Evil) vs. morality being evidentially contagious (Evil people are more likely to choose to associate with Evil people)?

I'd expect that, all else being equal, organisations secretly run in evil ways will be more willing to secretly accept money from other evil people, for many reasons, including that they've got a higher expectation of how normal that sort of behaviour is. It seems harder to imagine how a good organisation choosing to take dirty money would corrupt itself in the process if it was being reasonably diligent. Even if the moral contagion argument is wrong from hypothetical-good-MIT's perspective, and so they should take the money, from everyone else's perspective it's still information we can update on.
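The evidential update here can be made concrete with a toy Bayes calculation (all the probabilities below are made-up illustrative numbers):

```python
# Made-up priors, for illustration only.
p_good = 0.7             # prior that the organisation is good
p_take_given_good = 0.2  # good orgs rarely take dirty money in secret
p_take_given_bad = 0.8   # secretly-evil orgs usually would

# Bayes' rule: P(good | takes dirty money)
p_take = p_good * p_take_given_good + (1 - p_good) * p_take_given_bad
p_good_given_take = p_good * p_take_given_good / p_take

# Observing the org take dirty money lowers everyone else's estimate that
# it's good, even if taking the money was first-order correct for the org.
assert p_good_given_take < p_good
```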

If taking bad money for good causes is first-order good, because you're doing good things with it, but other donors can notice and it lowers their confidence in how good you are (since bad causes are more willing to take bad money), then you might lose other support sufficient to make it not worthwhile. There's probably some sort of signalling equilibrium here, which is completely destroyed by the whole concept of accepting the money in secret. Hopefully actually good organisations wouldn't do that sort of deontology violation and would just make their donor lists public?

If the hosts move first logically, then TDT will lead to the same outcomes as CDT, since it's in each host's interest to precommit to incentivising the human to pick their own box.

It's in the hosts' interests to do that if they think the player is CDT, but it's not in their interests to commit to doing that. They don't lose anything by retaining the ability to select a better strategy later, after reading the player's mind.
