All of eva_'s Comments + Replies

To entertain that possibility, suppose you're X% confident that your best "fool the predictor into thinking I'll one-box, and then two-box" plan will work, and Y% confident that your "actually one-box, in a way the predictor can predict" plan will work. If X=Y or X>Y you've got no incentive to actually one-box, only to try to pretend you will, but above some threshold of belief that the predictor might beat your deception, it makes sense to actually be honest.
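To make that threshold concrete, here is a minimal sketch. The payoffs are the standard Newcomb numbers ($1,000,000 in the opaque box, $1,000 in the transparent one), which are an assumption and not stated above. The failure outcomes are assumptions too: a failed deception leaves you two-boxing with an empty opaque box, and a failed honest attempt leaves you one-boxing an empty box.

```python
# Assumed standard Newcomb payoffs (not given in the comment above).
BIG = 1_000_000   # opaque box, filled only if the predictor expects one-boxing
SMALL = 1_000     # transparent box, always available


def ev_deceive(x: float) -> float:
    """Expected payoff of 'fool the predictor, then two-box'.

    x is the probability the deception works. Assumption: if it fails,
    the opaque box is empty and you still two-box.
    """
    return x * (BIG + SMALL) + (1 - x) * SMALL


def ev_honest(y: float) -> float:
    """Expected payoff of 'really one-box, predictably'.

    y is the probability the predictor correctly reads you as a one-boxer.
    Assumption: if it misreads you, the opaque box is empty.
    """
    return y * BIG


def should_be_honest(x: float, y: float) -> bool:
    """True when honest one-boxing has the higher expected payoff."""
    return ev_honest(y) > ev_deceive(x)


if __name__ == "__main__":
    for x, y in [(0.5, 0.5), (0.9, 0.5), (0.5, 0.9)]:
        print(x, y, should_be_honest(x, y))
```

Under these assumed payoffs, honesty wins exactly when Y exceeds X by more than SMALL/BIG, i.e. by more than 0.1 percentage points, which matches the claim that X=Y or X>Y gives no incentive to really one-box.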

Either that, or the idea of mind reading agents is flawed.

We shouldn't conclude that, since to various degrees mindreading agents already happen in real life.

If we tighten our standard to "games where the mindreading agent is only allowed to predict actions you'd choose in the game, which is played with you already knowing about the mindreading agent", then many decision theories that are different in other situations might all choose to respond to "pick B or I'll kick you in the dick" by picking B.

2Augs SMSHacks5mo
Mindreading agents do happen in real life but they are often wrong and can be fooled. Most decision theories on this website don't entertain either of these possibilities. If we allow "fooling a predictor" as a possible action then the solution to Newcomb's problem is easy: simply fool the predictor and then take both boxes.

E(U|NS) = 0.8, E(U|SN) = 0.8
Are the best options from a strict U perspective, and exactly tie. Since you've not included mixed actions, the agent must arbitrarily pick one, but arbitrarily picking one seems like favouring an action that is only better because it affects the expected outcome of the war, if I've understood correctly?
I'm pretty sure this is resolved by mixed actions though: The agent can take the policy {NS at 0.5, SN at 0.5}, which also gets U of 0.8 and does not affect the expected outcome of the war, and claim supreme unbiasedness fo...

https://www.yudbot.com/
Theory: LLMs are more resistant to hypnotism-style attacks when pretending to be Eliezer, because failed hypnotism attempts are more plausible and in-distribution, compared to when pretending to be LLMs, where both prompt-injection attacks and actual prompt-updates seem like valid things that could happen and succeed.
If so, to make a more prompt-injection-resistant mask, you need a prompt chosen to be maximally resistant to mind control, as chosen from the training data of all English literature, whatever that might be. The kind of en...

In general I just don't create mental models of people anymore, and would recommend that others don't either.

That seems to me prohibitive to navigating social situations or even long term planning. When asked to make a promise, if you can't imagine your future self in the position of having to follow through on it, and whether they do that, how can you know you're promising truthfully or dooming yourself to a plan you'll end up regretting? When asking someone for a favor, do you just not imagine how it'd sound from their perspective, to try and predict if ...

The upside and downside both seem weak when it's currently so easy to bypass the filters. The probability of refusal doesn't seem like the most meaningful measure, but I agree it'd be good for the user to get explicitly flagged whether the trained non-response impulse was activated or not, instead of having to deduce it from the content or wonder if the AI really thinks that its non-answer is correct.

Currently, I wouldn't be using any illegal advice it gives me for the same reason I wouldn't be using any legal advice it gives me: the model just isn't smar...

The regular counterfactual part as I understand it is:
"If I ignore threats, people won't send me threats"
"I am an agent who ignores threats"
"I have observed myself receive a threat"
You can at most pick 2, but FDT needs all 3 to justify ignoring the threat.
It wants to say "If I were someone who responds to threats when I get them, then I'll get threats, so instead I'll be someone who refuses threats when I get threats so I don't get threats" but what you do inside of logically impossible situations isn't well defined.

The logical counterfactual part i...

https://www.lesswrong.com/tag/functional-decision-theory argues for choosing as if you're choosing the policy you'd follow in some situation before you learnt any of the relevant information. In many games, having a policy of making certain choices (that others could perhaps predict, and adjust their own choices accordingly) gets you better outcomes than just always doing what seems like a good idea at the time. For example, if someone credibly threatens you, you might be better off paying them to go away, but before you got the threat you would've preferred to...

2shminux1y
hmm... Which part of that is a counterfactual conditional statement, as in, a statement of the form "If kangaroos had no tails, they would topple over"?

I totally agree we can be coherently uncertain about logical facts, like whether P=NP. FDT has bigger problems than that.

When writing this I tried actually doing the thing where you predict a distribution, and only 21% of LessWrong users were persuaded they might be imaginary and being imagined by me, which is pretty low accuracy considering they were in fact imaginary and being imagined by me. Insisting that the experience of qualia can't be doubted did come up a few times, but not as aggressively as you're pushing it here. I tried to cover it in the "hig...

4hairyfigment1y
>FDT has bigger problems than that. Does it. The post you linked does nothing to support that claim, and I don't think you've presented any actual problem which definitively wouldn't be solved by logical counterfactuals. (Would this problem also apply to real people killing terrorists, instead of giving in to their demands? Because zero percent of the people obeying FDT in that regard are doing so because they think they might not be real.) This post is actually about TDT, but it's unclear to me why the ideas couldn't be transferred. I also note that 100% of responses in this thread, so far, appear to assume that your ghosts would need to have qualia in order for the argument to make sense. I think your predictions were bad. I think you should stop doing that, and concentrate on the object-level ideas.

If you think you might not have qualia, then by definition you don't have qualia.

What? Just a tiny bit of doubt and your entire subjective conscious experience evaporates completely? I can't see any mechanism that would do that; it seems like you can be real and have any set of beliefs or be fictional and have any set of beliefs. Something something map-territory distinction?

This just seems like a restatement of the idea that we should act as if we were choosing the output of a computation.

Yes, it is a variant of that idea, with different justifications th...

4hairyfigment1y
Again, it isn't more resilient, and thinking you doubt a concept you call "qualia" doesn't mean you can doubt your own qualia. Perhaps the more important point here is that you are typically more uncertain of mathematical statements, which is why you haven't removed and cannot remove the need for logical counterfactuals. Real humans have some degree of uncertainty about most mathematical theorems. There may be exceptions, like 0+1=1, or the halting problem and its application to God, but typically we have enough uncertainty when it comes to mathematics, that we might need to consider counterfactuals. Indeed, this seems to be required by the theorem alluded to at the above link - logical omniscience seems logically impossible. For a concrete (though unimportant) example of how regular people might use such counterfactuals in everyday life, consider P=NP. That statement is likely false. Yet, we can ask meaningful-sounding questions about what its truth would mean, and even say that the episode of 'Elementary' which dealt with that question made unjustified leaps. "Even if someone did prove P=NP," I find myself reasoning, "that wouldn't automatically entail what they're claiming." Tell me if I've misunderstood, but it sounds like you're claiming we can't do something which we plainly do all the time. That is unconvincing. It doesn't get any more convincing when you add that maybe my experience of doing so isn't real. I am very confident that you will convince zero average people by telling them that they might not actually be conscious. I'm skeptical that even a philosopher would swallow that.

Actually, I don't assume that, I'm totally ok with believing ghosts don't have qualia. All I need is that they first-order believe they have qualia, because then I can't take my own first-order belief I have qualia as proof I'm not a ghost. I can still be uncertain about my ghostliness because I'm uncertain in the accuracy of my own belief I have qualia, in explicit contradiction of 'cogito ergo sum'. The only reason ghosts possibly having qualia is a problem is that then maybe I have to care about how they feel.

2hairyfigment1y
If you think you might not have qualia, then by definition you don't have qualia. This just seems like a restatement of the idea that we should act as if we were choosing the output of a computation. On its face, this is at least as likely to be coherent as 'What if the claim we have the most certainty of were false,' because the whole point of counterfactuals in general is to screen off potential contradictions.

A valid complaint. I know the answer must be something like "coherent utility functions can only consist of preferences about reality" because if you are motivated by unreal rewards you'll only ever get unreal rewards, but that argument needs to be convincing to the ghost too, who's got more confidence in her own reality. I know that e.g. in Bomb ghost-theory agents choose the bomb even if they think the predictor will simulate them a painful death, because they consider the small amount of money at much greater measure for their real selves to be worth it, but I'm not sure how they get to that position.

2hairyfigment1y
The problem arises because, for some reason, you've assumed the ghosts have qualia. Now, that might be a necessary assumption if you require us to be uncertain about our degree of ghostliness. Necessary or not, though, it seems both dubious and potentially fatal to the whole argument.

They can though? Bomb box 5, incentivise box 1 or 2, bet on box 3 or 4. Since FDT's strategy puts rewarding cooperative hosts above money or grenades, she picks the box that rewards the defector and thus incentivises all 4 to defect from that equilibrium. (I thought over her strategy for quite a bit and there are probably still problems with it but this isn't one of them)

2Chris_Leong1y
Ah sorry, I somehow forgot you could put your money in one box and bet on another box.

How much of this effect is from morality being causally contagious (associating with Evil people turns you Evil) vs. morality being evidentially contagious (Evil people are more likely to choose to associate with Evil people)?

I'd expect that, all else being equal, organisations secretly run in evil ways will be more willing to secretly accept money from other evil people, for many reasons including that they've got higher expectation of how normal that sort of behaviour is. It seems harder to imagine how a good organisation choosing to take dirty money wo...

If the hosts move first logically, then TDT will lead to the same outcomes as CDT, since it's in each host's interest to precommit to incentivising the human to pick their own box

It's in the hosts' interests to do that if they think the player is CDT, but it's not in their interests to commit to doing that. They don't lose anything by retaining the ability to select a better strategy later after reading the player's mind.

1Dacyn1y
Yes they do. For simplicity suppose there are only two hosts, and suppose host A precommits to not putting money in host B's box, while host B makes no precommitments about how much money he will put in host A's box. Then the human's optimal strategy is "pick host A's box with probability 1 - x epsilon, where x is the amount of money in host A's box". This incentivizes host B to maximize the amount in host A's box (resulting in payoff ~101 for the human), but it would have been better for him if he had precommitted to do the same as A, since then by symmetry his box would have been picked half the time instead of 101 epsilon of the time.

Not whichever is lighter, one of whichever pair is heavier. Yes, I claim an EDT agent upon learning the rules, if they have a way to blind themself but not to force a commitment, will do this plan. They need to maximise the amount of incentive that the hosts have to put money in boxes, but to whatever extent they actually observe the money in boxes they will then expect to get the best outcome by making the best choice given their information. The only middle ground I could find is pairing the boxes, and picking one from the heavier pair. I'd be very happy to learn if I was wrong and there's a better plan or if this doesn't work for some reason.

3Chris_Leong1y
Oh, silly me. Of course, EDT doesn't need to pre-commit b/c EDT just does whatever gives the highest expected value without caring about whether there's a causal impact or not. So when given the decision of whether to weigh the boxes in pairs vs. the decision of weighing them all and picking the heaviest, it's happy to weigh in pairs because that increases how much money it expects to win.

EDT isn't precommitting to anything here, she does her opinion of the best choice at every step. That's still a valid complaint that it's unfair to give her a blindfold though. If CDT found out about the rules of the game before the hosts made their predictions, he'd make himself an explosive collar that he can't take off and that automatically kills him unless he chooses the box with the least money, and get the same outcome as FDT, and EDT would do that as well. For the blindfold strategy EDT only needs to learn the rules before she sees what's in the boxe...

0Chris_Leong1y
Wasn't EDT pre-committing to the strategy of weighing the left two boxes and the right two boxes and then deciding to randomly pick one of the heavier pair? Or are you saying that a blinded EDT automatically adopts this strategy without precommitment? This comment was edited to fix swapping "lighter" and "heavier".

The hosts aren't competing with the human, only each other, so even if the hosts move first logically they have no reason or opportunity to try to dissuade the player from whatever they'd do otherwise. FDT is underdefined in zero-sum symmetrical strategy games against psychological twins, since it can foresee a draw no matter what, but choosing optimal strategy to get to the draw still seems better than playing dumb strategies on purpose and then still drawing anyway.

Why do you think they should be \$100 and \$200? Maybe you could try simulating it?

What happ...

1Dacyn1y
If the hosts move first logically, then TDT will lead to the same outcomes as CDT, since it's in each host's interest to precommit to incentivising the human to pick their own box -- once the host has precommitted to doing this, the incentive works regardless of what decision theory the human uses. In math terms, if x is the choice of which box to incentivize (with "incentivize your own box" being interpreted as "don't place any money in any of the other boxes"), the human gets to choose a box f(x) on the basis of x, and the host gets to choose x=g(f) on the basis of the function f, which is known to the host since it is assumed to be superintelligent enough to simulate the human's choices in hypothetical simulations. By definition, the host moving first in logical time would mean that g is chosen before f, and f is chosen on the basis of what's in the human's best interest given that the host will incentivize box g(f). But then the optimal strategy is for g to be a constant function. Regarding \$100 and \$200, I think I missed the part where you said the human picks the box with the maximum amount of money -- I was assuming he picked a random box. Regarding the question of how to force all the incentives into one box, what about the following strategy: choose box 1 with probability 1 - (400 - x) epsilon, where x is the payoff of box 1. Then it is obviously in each host's interest to predict box 1, since it has the largest probability of any box, but then it is also in each host's interest to minimize 400 - x i.e. maximize x. This is true even though the hosts' competition is zero-sum.

Yes, that's the intended point, and probably a better way of phrasing it. I am concluding against the initial assertion, and claiming that it does make sense to trust people in some situations even though you're implementing a strategy that isn't completely immune to exploitation.

I don't consider the randomized response technique lying, it's mutually understood that their answer means "either I support X or both coins came up heads" or "either I support Y or both coins came up tails". There's no deception because you're not forming a false belief and you both know the precise meaning of what is communicated.
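A minimal sketch of why both sides can know the precise meaning of each answer. The exact protocol is inferred from the two readings quoted above and is an assumption: flip two fair coins, say "X" on two heads, say "Y" on two tails, and otherwise answer truthfully. The function names and the 0.3 support rate are hypothetical.

```python
import random


def respond(supports_x: bool, rng: random.Random) -> str:
    """One respondent's answer under the assumed two-coin protocol."""
    coin1 = rng.random() < 0.5  # True means heads
    coin2 = rng.random() < 0.5
    if coin1 and coin2:          # both heads: forced "X"
        return "X"
    if not coin1 and not coin2:  # both tails: forced "Y"
        return "Y"
    return "X" if supports_x else "Y"  # mixed coins: answer truthfully


def estimate_support(answers: list[str]) -> float:
    """Recover the population share supporting X from the noisy answers.

    P(answer "X") = 1/4 + p/2, so p = 2 * (share_x - 1/4).
    """
    share_x = answers.count("X") / len(answers)
    return 2 * (share_x - 0.25)


if __name__ == "__main__":
    rng = random.Random(0)
    true_p = 0.3  # hypothetical true share of X supporters
    answers = [respond(rng.random() < true_p, rng) for _ in range(100_000)]
    print(round(estimate_support(answers), 2))
```

No single "X" answer reveals what that person supports, yet the surveyor can still estimate the aggregate, because the noise distribution is common knowledge.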

I don't consider penetration testing lying, you know that penetration testers exist and have hired them. It's a permitted part of the cooperative system, in a way that actual scam artists aren't.

What's a word that means "antisoc...

3Gunnar_Zarncke1y
"Deceive" sounds fine. I think the anti-social is implied - in fact, I have trouble coming up with an example of pro-social deceiving. Well, maybe variants of the old "hiding Jews from Nazis" example.

Not sure what's unclear here? I mean that you'd generally prefer not to have incentive structures where you need true information from other people and they can benefit at your loss by giving you false information. Paying someone to lie to you means creating an incentive for them to actually deceive you, not merely giving them money to speak falsehoods.

1jacopo1y
They commented without reading the post I guess...

A sufficiently strong world model can answer the question "What would a very smart very good person think about X?" and then you can just pipe that to the decision output, but that won't get you higher intelligence than what was present in the training material.

Shouldn't human goals have to be in the within-human-intelligence part, since humans have them? Or are we considering exactly-human-intelligence AI unsafe? Do you imagine a slightly dumber version of yourself failing to actualise your goals from not having good strategies, or failing to even embed t...

2Wuschel Schulz1y
Yes, I would consider humans to already be unsafe, as we already made a sharp left turn that left us unaligned relative to our outer optimiser. Dogs are a good point, thank you for that example. Not sure if dogs have our exact notion of corrigibility, but they definitely seem to be friendly in some relevant sense.

You're far more likely to be a background character than the protagonist in any given story, so a theory claiming you're the most important person in a universe with an enormous number of people has an enormous rareness penalty to overcome before you should believe it instead of that you're just insane or being lied to. Being in a utilitarian high-leverage position for the lives of billions can be overcome by reasonable evidence, but for the lives of 3^^^^3 people the rareness penalty is basically impossible to overcome. Even if the story is true, most of ...

5Mitchell_Porter1y
The a priori unlikelihood of finding oneself at the crux of history (or in a similarly rare situation) is a greatly underrated topic here, I suppose because it works corrosively against making any kind of special effort. If they had embraced a pseudo-anthropic expectation of personal mediocrity, the great achievers of history would presumably have gotten nowhere. And yet the world is also full of people who tried and failed, or who hold a mistaken idea of their own significance; something which is consistent with the rarity of great achievements. I'm not sure what the "rational" approach here might be.
7lc1y
Sounds like you're trying to argue from the anthropic principle that very important games are unlikely, but that's some really fallacious reasoning that asserts a lot of things about what your utility function is like. "Protagonist" is a two-piece word. A very pain averse and unempathetic person might reasonably subjectively consider themselves the most important person in the universe, and assign negative \${a lot} points to them getting tortured to death, but that doesn't mean they're not getting tortured to death.

You can dodge it by having a bounded utility function, or, if you're utilitarian and good, a function that is at most linear in anthropic experience.

If the mugger says "give me your wallet and I'll cause you 3^^^^3 units of personal happiness" you can argue that's impossible because your personal happiness doesn't go that high.

If the mugger says "give me your wallet and I'll cause 1 unit of happiness to 3^^^^3 people who you altruistically care about" you can say that, in the possible world where he's telling the truth, there are 3^^^^3 + 1 people only one of...

lc1y108

don't make any trade that looks dumb

Ah, well, there you go then.

2Yitz1y
Why is this? I'm not immediately seeing why this is necessarily the case.

Hello, I read much of the background material over the past few years but am a new account. Not entirely sure what linked me here first but 70% guess is HPMoR. compsci / mathematics background. I have mostly joined due to having ideas slightly too big for my own brain to check and wanting feedback, wanting to infect the community that I get a lot of my memes from with my own original memes, having now read enough to feel like LessWrong is occasionally incorrect about things where I can help, and to improve my writing quality in ways that generalise to explaining things to other people.

If you instead say "evidence of", this makes more sense

Accepted and changed, I'm only claiming some information/entanglement, not absolute proof.

It applies to all transactions (because all transactions are fundamentally about trust)

Would it be clearer to say "markets with perfect information"? The problem I'm trying to describe can only occur with incomplete information / imperfect trust, but doesn't require so little information and trust that transactions become impossible in general. There's a wide middle ground of imperfect trust where all of real life ...

Weighted by credence means you're scored on Probability*Prediction, which isn't a fair rule.
If I sincerely believe it's 60:40 between two options, and I write that down, I expect 36+16=52 payout, but if I write 100:0 I expect 60+0=60 payout. Putting more credence on higher-probability outcomes gets me more money, even in excess of my true beliefs.
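The arithmetic above can be checked with a short sketch. The linear rule is the credence-weighted payout described here (scaled to 1 instead of 100). The quadratic (Brier-style) rule is a standard proper scoring rule swapped in for comparison; it is not something proposed in this thread, and the function names are hypothetical.

```python
def expected_linear(p: float, q: float) -> float:
    """Expected payout when paid your stated credence in whatever happens.

    p is the true belief in option A, q is the reported credence in A.
    """
    return p * q + (1 - p) * (1 - q)


def expected_quadratic(p: float, q: float) -> float:
    """Expected payout under a quadratic (Brier-style) rule.

    Payout is 1 - (1 - credence placed on the realised outcome)^2.
    """
    return p * (1 - (1 - q) ** 2) + (1 - p) * (1 - q ** 2)


def best_report(rule, p: float, steps: int = 100) -> float:
    """Grid-search the report q in [0, 1] that maximises expected payout."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda q: rule(p, q))


if __name__ == "__main__":
    p = 0.6
    print(expected_linear(p, 0.6), expected_linear(p, 1.0))
    print(best_report(expected_linear, p), best_report(expected_quadratic, p))
```

Under the linear rule the best report for a 60:40 belief is 100:0, reproducing the 52 vs. 60 comparison, while under the quadratic rule the best report is the sincere 60:40.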

Valid, I'm still working on properly writing up the version with full math, which is much more complicated. Without that math and without payment, it consists of people telling their beliefs and being mysteriously believed about them, because everyone knows everyone's incentivised to be honest and sincere, and the Agreement Theorem says that means they'll agree when they all know everyone else's reasoning.

Possible Example that I think is the minimum case for any kind of market information system like this:

weather.com wants accurate predictions 7 days in advance...