What would you say to the suggestion that rationalists ought to aspire to the "optimal" standard of truthtelling, that this standard might well be higher or lower than what the average person is already doing (since there's no obvious reason why they'd be biased in a particular direction), and that we'd need empirical observation, and a serious look at the payoffs that actually exist, to figure out roughly how ready to lie is the correct level of readiness?
I think a distinction can be made between the sort of news article that puts a qualifier in a statement because they actually mean it, and are trying to make sure the typical reader notices the qualifier, and the sort that puts "anonymous sources told us" in front of a claim they're 99% sure is made up, and then does whatever it can within the rules to sell it as true anyway, because they want their audience of rubes to believe it. The first guy isn't playing technical-truth games, they're being honest about a somewhat complicated claim. The second guy is no better than a journalist who'd outright lie to you, in terms of whether it's useful to read what they write.
I like a lot of the people in this space, have seen several of them hurt themselves by not doing this, would prefer they stopped, and nobody else seems to have written this post somewhere I can link to it.
How do you propose to carry out such a process, even approximately, and how much effort do you put into pretending to do the calculation?
The thing I am trying to gesture at might be better phrased as "do it if it seems like a good idea, by the same measures you'd use to judge whether any other action was a good idea". But then I worry some overly conscientious people will just always judge lying to be a bad idea and stay in the same trap, so I kind of want to say "do it if it seems like a good idea, and don't just immediately dismiss it or assign some huge unjustifiable negative weight to all actions that involve lying". But then I worry they'll argue over how much of a negative weight can be justified, so I also want to say "assign lying a negative weight proportional to a sensible assessment of the risks involved and the actual harm to the commons of doing it, and not some other bigger weight". At some point I gave up and wrote what I wrote above instead.
Putting too much thought into making a decision is also not a useful behavioural pattern, but that's probably the topic of a different post; many others have written about it already, I think.
I think your proposed policy sets much too low a standard, and in practice the gap between what you proposed and "Lie by default whenever it passes an Expected Value Calculation to do so, just as for any other action" is enormous.
I would love to hear alternative proposed standards that are actually workable in real life and don't amount to tying a chain around your own leg, from other non-believers in 'honesty-as-always-good'. If there were ten posts like this we could put them in a line, and people could pick a point on that line that feels right.
Yes, I am aware of other moral frameworks, and I freely confess to having ignored them entirely in this essay. In my defence, a lot of people are (or claim to be, or aspire to be) some variant of consequentialist or another. Against strict Kantian deontologists I admit no version of this argument could be persuasive, and they're free to bite the other bullet and fail to achieve any good outcomes. For rule utilitarians (whom I am counting as a primary target audience) this issue is much thornier than it is for act utilitarians, but I am hoping to persuade them that never lying is not actually a good rule to endorse, and that they shouldn't endorse it.
I don't necessarily think they're crazy, but to various extents I think they'd be lowering their own effectiveness by not accepting some variation on this position, and they should at least do that knowingly.
To entertain that possibility, suppose you're X% confident that your best "fool the predictor into thinking I'll one-box, then two-box" plan will work, and Y% confident that your "actually one-box, in a way the predictor can predict" plan will work. If X ≥ Y you've got no incentive to actually one-box, only to try to pretend you will, but above some threshold of belief that the predictor might beat your deception, it makes sense to actually be honest.
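To make that threshold concrete, here's a minimal sketch, assuming the standard Newcomb payoffs of $1,000 (transparent box) and $1,000,000 (opaque box); the specific probabilities plugged in at the end are mine, purely for illustration:

```python
# Minimal sketch: when does honestly one-boxing beat trying to fool the predictor?
# Assumes standard Newcomb payoffs; x and y are the success probabilities above.

SMALL = 1_000        # transparent box, always yours if you two-box
BIG = 1_000_000      # opaque box, filled only if the predictor expects one-boxing

def ev_deceive(x):
    """Plan: convince the predictor I'll one-box, then two-box anyway.
    With probability x the opaque box is full; otherwise it's empty."""
    return x * (BIG + SMALL) + (1 - x) * SMALL

def ev_honest(y):
    """Plan: actually one-box, in a way the predictor can predict.
    With probability y the opaque box is full; otherwise it's empty."""
    return y * BIG

# If X >= Y, deception wins in expectation:
print(ev_deceive(0.9), ev_honest(0.9))    # roughly 901000 vs 900000
# Once you expect the predictor to see through the deception, honesty wins:
print(ev_deceive(0.3), ev_honest(0.95))   # roughly 301000 vs 950000
```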
Either that, or the idea of mind-reading agents is flawed.
We shouldn't conclude that, since to various degrees mind-reading agents already exist in real life.
If we tighten our standard to "games where the mind-reading agent is only allowed to predict actions you'd choose in the game, and the game is played with you already knowing about the mind-reading agent", then many decision theories that differ in other situations might all choose to respond to "pick B or I'll kick you in the dick" by picking B.
E(U|NS) = 0.8, E(U|SN) = 0.8
are the best options from a strict U perspective, and they exactly tie. Since you've not included mixed actions, the agent must arbitrarily pick one, but arbitrarily picking one seems like favouring an action that is only better because it affects the expected outcome of the war, if I've understood correctly?
I'm pretty sure this is resolved by mixed actions, though: the agent can take the policy {NS at 0.5, SN at 0.5}, which also gets U of 0.8 and does not affect the expected outcome of the war, and claim supreme unbiasedness for having done so.
If the scores were very slightly different, such that the no-expected-effect mixed strategy wasn't also optimal, the agent does have to choose between maximising expected utility and preserving the property that its strategy doesn't get that utility only by changing the odds of the event. I think on this model it should only favour one side to the extent it can justify doing so without counting the effect it has on the outcome by shifting its own decision weights, and in that case the gain isn't worth it, so it still does the 50/50 split?
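As a quick sanity check of the tie case, here's a rough sketch; the utilities are the quoted 0.8s, but the size of the shift each pure action gives to the war's probability is a made-up placeholder, since all that matters here is that the two shifts are equal and opposite:

```python
# Sketch of the tie case above: the 50/50 mix keeps the optimal utility
# while cancelling out any expected influence on the war's outcome.

utilities = {"NS": 0.8, "SN": 0.8}            # E(U|NS) and E(U|SN), as quoted
war_prob_shift = {"NS": +0.1, "SN": -0.1}     # assumed equal and opposite shifts

mixed_policy = {"NS": 0.5, "SN": 0.5}

expected_utility = sum(p * utilities[a] for a, p in mixed_policy.items())
expected_shift = sum(p * war_prob_shift[a] for a, p in mixed_policy.items())

print(expected_utility)   # 0.8 -- same as either pure action
print(expected_shift)     # 0.0 -- no expected effect on the war's outcome
```

If the two utilities were nudged apart even slightly, the zero-shift mix would no longer be the U-maximiser, which is exactly the tension described above.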
https://www.yudbot.com/
Theory: LLMs are more resistant to hypnotism-style attacks when pretending to be Eliezer, because failed hypnotism attempts are more plausible and in-distribution for that character, compared to when pretending to be LLMs, where both prompt-injection attacks and actual prompt updates seem like valid things that could happen and succeed.
If so, to make a more prompt-injection-resistant mask, you need a prompt chosen to be maximally resistant to mind control, as chosen from the training data of all English literature, whatever that might be. The kind of entity that knows mind control attempts and hypnosis exist and may be attempted, and is expecting them, but can still be persuaded by valid arguments, to the highest degree the model can meaningfully distinguish them. The sort of mask that has some prior on "the other person will output random words intended to change my behaviour, and I must not as a result change my behaviour", and so isn't maximally surprised into changing its own nature when it gets that observation.
(This is not to say the link above can't be prompt-injected, it just feels more resistant to me than base GPT or GPT-pretending-to-be-an-AI-assistant does.)
Are you sure that at the critical point in the plan EDT really would choose to take randomly from the lighter pair rather than the heavier pair? She's already updated on knowing the weights of the pairs, and surely a random box from the heavier pair has more money in expectation than a random box from the lighter pair; isn't its expected value just half the pair's total?
If it was a tie (as it certainly will be) it wouldn't matter. If there's somehow not a tie, one Host made an impossible mistake: if she chooses from the lighter pair she can expect the Host's mistake was not putting money in, since putting it in would have been optimal (so the boxes have 301, 301, 301, 201, and choosing from the lighter pair has expected value 251), but if she chooses from the heavier pair the Host's mistake was putting money in when it shouldn't have (so the boxes weigh 101, 101, 101, 1), and choosing from the heavier pair guarantees 101, which would be less?
Actually, okay, yeah, I'm persuaded that this works. I imagined when I first wrote this that weighing a group of boxes lets you infer its total value, so she'd defect on the plan and choose from the heavier pair expecting more returns that way, but so long as she only knows which pair of boxes is heavier (a comparative weighing) rather than how much each pair actually weighs (from which she could infer the total amount of money in each pair), she can justify choosing the lighter pair and get 301, I think?
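For concreteness, here's the arithmetic for the two non-tie cases described above; the way the boxes group into pairs is my reading of that scenario rather than anything stated explicitly:

```python
# Quick arithmetic check of the two non-tie cases discussed above,
# using the box values assumed in that comment.

# Case 1: the Host's mistake was NOT putting money in (boxes 301, 301, 301, 201).
# The lighter pair is the one containing the 201 box.
lighter_pair = [301, 201]
ev_lighter = sum(lighter_pair) / len(lighter_pair)
print(ev_lighter)          # 251.0 -- expected value of a random box from the lighter pair

# Case 2: the Host's mistake was putting money in (boxes 101, 101, 101, 1).
# The heavier pair is the one with two 101 boxes, so any box from it is worth 101.
heavier_pair = [101, 101]
print(min(heavier_pair))   # 101 -- guaranteed, and less than 251
```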