To entertain that possibility, suppose you're X% confident that your best "fool the predictor into thinking I'll one-box, and then two-box" plan will work, and Y% confident that your "actually one-box, in a way the predictor can predict" plan will work. If X ≥ Y you've got no incentive to actually one-box, only to try to pretend you will, but above some threshold of belief that the predictor might beat your deception, it makes sense to actually be honest.
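The threshold can be made concrete with a toy expected-utility calculation. A minimal sketch, assuming standard Newcomb-style payoffs (the specific dollar values below are hypothetical, not from the discussion above):

```python
# Hypothetical Newcomb-style payoffs (illustrative, not from the comment).
BOTH = 1_001_000   # two-box after successfully fooling the predictor
SMALL = 1_000      # two-box, but the predictor saw it coming
BIG = 1_000_000    # one-box honestly, predicted correctly
NOTHING = 0        # one-box, but the predictor wrongly expected two-boxing

def eu_deceive(x):
    """Expected utility of the deception plan, given confidence x it works."""
    return x * BOTH + (1 - x) * SMALL

def eu_honest(y):
    """Expected utility of honestly one-boxing, given confidence y it's predicted."""
    return y * BIG + (1 - y) * NOTHING

# With Y fixed at 0.99, find roughly how high X must stay for the
# deception plan to still beat honest one-boxing.
y = 0.99
threshold = (eu_honest(y) - SMALL) / (BOTH - SMALL)
print(round(threshold, 3))  # 0.989
```

With these payoffs the deception plan only wins while you're more than about 98.9% sure it works, which is the "threshold of belief" the comment gestures at.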

 Either that, or the idea of mind reading agents is flawed.

We shouldn't conclude that, since mind-reading agents already exist, to various degrees, in real life.

If we tighten our standard to "games where the mindreading agent is only allowed to predict actions you'd choose in the game, which is played with you already knowing about the mindreading agent", then many decision theories that are different in other situations might all choose to respond to "pick B or I'll kick you in the dick" by picking B.

 E(U|NS) = 0.8, E(U|SN) = 0.8
Are the best options from a strict U perspective, and exactly tie. Since you've not included mixed actions, the agent must arbitrarily pick one, but arbitrarily picking one seems like favouring an action that is only better because it affects the expected outcome of the war, if I've understood correctly?
I'm pretty sure this is resolved by mixed actions though: the agent can take the policy {NS at 0.5, SN at 0.5}, which also gets U of 0.8 and does not affect the expected outcome of the war, and claim supreme unbiasedness for having done so.
If the scores were slightly different, such that the mixed strategy with no expected effect was no longer also optimal, the agent would have to choose between maximising expected utility and preserving the property that its strategy doesn't get that utility only by changing the odds of the event. I think on this model it can only favour one action to the extent it can justify doing so without considering the effect that shifting its own decision weights has on the outcome, and since that isn't worth it in this case, it still plays the 50/50 split?
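A minimal sketch of the mixed-policy point: the 0.8 utilities are the ones quoted above, everything else is illustrative.

```python
# The two pure actions quoted above, both with expected utility 0.8.
actions = {"NS": 0.8, "SN": 0.8}

# Mixed action: flip a fair coin between them.
policy = {"NS": 0.5, "SN": 0.5}

# Expected utility of the mixed policy is the probability-weighted sum.
eu_mixed = sum(policy[a] * actions[a] for a in actions)
print(eu_mixed)  # 0.8 -- same utility, no net push on the war's odds
```

Because the coin flip is symmetric between the two tied actions, the agent gets the optimal utility without its decision weights shifting the expected outcome of the war in either direction.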
Theory: LLMs are more resistant to hypnotism-style attacks when pretending to be Eliezer, because failed hypnotism attempts are plausible and in-distribution for that character, whereas when pretending to be LLMs, both prompt-injection attacks and actual prompt updates seem like valid things that could happen and succeed.
If so, to make a more prompt-injection-resistant mask, you need a prompt chosen to be maximally resistant to mind control, as chosen from the training data of all English literature, whatever that might be. The kind of entity that knows mind control attempts and hypnosis exist and may be attempted, and is expecting it, but can still be persuaded by valid arguments to the highest degree the model can distinguish them meaningfully. The sort of mask that has some prior on "the other person will output random words intended to change my behaviour, and I must not as a result change my behaviour", and so isn't maximally surprised into changing its own nature when it gets that observation.
(This is not to say the link above can't be prompt-injected, it just feels more resistant to me than with base GPT or GPT-pretending-to-be-an-AI-Assistant.)

In general I just don't create mental models of people anymore, and would recommend that others don't either.

That seems to me prohibitive to navigating social situations, or even long-term planning. When asked to make a promise, if you can't imagine your future self in the position of having to follow through on it, and whether they actually would, how can you know you're promising truthfully rather than dooming yourself to a plan you'll end up regretting? When asking someone for a favor, do you just not imagine how it'd sound from their perspective, to try to predict whether they'll agree or be offended by the assumption?

I don't know how I'd get through life at all without mental models of people, myself and others, and couldn't recommend to others that they don't do the same.

The upside and downside both seem weak when it's currently so easy to bypass the filters. The probability of refusal doesn't seem like the most meaningful measure, but I agree it'd be good for the user to be explicitly shown whether the trained non-response impulse was activated or not, instead of having to deduce it from the content or wonder whether the AI really thinks its non-answer is correct.

Currently, I wouldn't be using any illegal advice it gives me for the same reason I wouldn't be using any legal advice it gives me: the model just isn't smarter than looking things up on Google. In the future, when the model is stronger, they're going to need something better than security theatre to put it behind, and I agree more flagging of when those systems are triggered would be good. Training in non-response behaviours isn't a secure method, because you can always phrase a request to put the model in a state of mind where those response behaviours aren't triggered, so I don't think reinforcing this security paradigm and trying to pretend it works for longer would be helpful.

The regular counterfactual part as I understand it is:
"If I ignore threats, people won't send me threats"
"I am an agent who ignores threats"
"I have observed myself receive a threat"
You can pick at most two, but FDT needs all three to justify ignoring the threat.
It wants to say "If I were someone who responds to threats when I get them, then I'll get threats, so instead I'll be someone who refuses threats when I get threats so I don't get threats" but what you do inside of logically impossible situations isn't well defined.

The logical counterfactual part is this:
"What would the world be like if f(x)=b instead of a?"
specifically, FDT requires asking what you'd expect the world to be like if FDT outputted different results, and then it outputs the result such that the world where it outputs that result is best. The contradiction here is that you can prove what FDT outputs, and so prove that it doesn't actually output all the other results, so the question again isn't well defined.

FDT argues for choosing as if you're choosing the policy you'd follow in some situation before you learnt any of the relevant information. In many games, having a policy of making certain choices (that others could perhaps predict, and adjust their own choices accordingly) gets you better outcomes than just always doing what seems like a good idea at the time. For example, if someone credibly threatens you, you might be better off paying them to go away, but before you got the threat you would've preferred to commit yourself to never paying up, so that people don't threaten you in the first place.
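The "choose as the output of the computation" step can be caricatured in a few lines; the action names and payoff numbers below are made up for illustration.

```python
# Toy sketch of the FDT-style step described above: evaluate the world
# under each possible output of the decision computation, then return
# the output whose world looks best. Payoffs are hypothetical.
def world_value(fdt_output):
    """Hypothetical utility of the world if FDT outputs this action."""
    # Refusing deters threats in expectation; paying up invites them.
    payoffs = {"pay_up": -100, "refuse": 0}
    return payoffs[fdt_output]

choice = max(["pay_up", "refuse"], key=world_value)
print(choice)  # refuse

# Once `choice` is provable, the other branch ("what if FDT output
# pay_up?") describes a logically impossible world, which is the
# ill-definedness pointed at above.
```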

A problem with arguments of the form "I expect that predictably not paying up will cause them not to threaten me" is that at the time you receive the threat, you now know that argument to be wrong. They've proven to be somebody who still threatens you even though you do FDT, at which point you can simultaneously prove that refusing the threat doesn't work and so you should pay up (because you've already seen the threat), and that you shouldn't pay up, for whatever FDT logic you were using before. The behaviour of agents who can prove a contradiction directly relevant to their decision function seems undefined. There needs to be some logical structure that lets you pick which information causes your choice, despite having enough in total to derive contradictions.

My alternative solution is that you shouldn't be convinced, by the information you see, that they've actually already threatened you. It's also possible you're still inside their imagination as they decide whether to issue the threat. Whenever something is conditional on your actions in an epistemic state, without being conditional on that epistemic state actually being valid (such as when someone predicts how you'd respond to a hypothetical threat before they issue it, knowing you'll know it's too late to stop them once you get it), then there's a ghost being lied to, and you should think maybe you're that ghost, to justify ignoring the threat, rather than trying to make decisions during a logically impossible situation.

I totally agree we can be coherently uncertain about logical facts, like whether P=NP. FDT has bigger problems than that.

When writing this I tried actually doing the thing where you predict a distribution, and only 21% of LessWrong users were persuaded they might be imaginary and being imagined by me, which is pretty low accuracy considering they were in fact imaginary and being imagined by me. Insisting that the experience of qualia can't be doubted did come up a few times, but not as aggressively as you're pushing it here. I tried to cover it in the "highly detailed internal subjective experience" counterargument, and in my introduction, but I could have been stronger on that.

I agree that the same argument on philosophers or average people would be much less successful even than that, but that's a fact about them, not about the theory.

If you think you might not have qualia, then by definition you don't have qualia.

What? Just a tiny bit of doubt and your entire subjective conscious experience evaporates completely? I can't see any mechanism that would do that; it seems like you can be real and have any set of beliefs, or be fictional and have any set of beliefs. Something something map-territory distinction?

This just seems like a restatement of the idea that we should act as if we were choosing the output of a computation.

Yes, it is a variant of that idea, with different justifications that I think are more resilient. The ghosts of FDT agents still make the correct choices, they just have incoherent beliefs while they do it.
