I know I've posted similar stuff here before, but I could still do with some people to discuss infohazardous s-risk related stuff that I have anxieties with. PM me.

I don't think that a digital footprint is enough to perfectly reconstruct one's mind; but even if it were, I'm unsure about this particular view of identity. And even if this theory were correct, there'd likely be many versions of "me" throughout the multiverse, which would outnumber the versions being tortured.

Hedonic asymmetries

Evolution "wants" pain to be a robust feedback/control mechanism that reliably causes the desired amount of avoidance - in this case, the greatest possible amount.

I feel that there's going to be a level of pain which a mind of nearly any level of pain tolerance would exert 100% of its energy to avoid. I don't think I know enough to comment on how much further than this level the brain can go, but it's unclear why the brain would develop the capacity to process pain drastically more intense than this; pain is just a tool to avoid certain things, and it ceases to be useful past a certain point.

There are no cheap solutions that would have an upper cut-off to pain stimuli (below the point of causing unresponsiveness) without degrading the avoidance response to lower levels of pain.

I'm imagining a level of pain above that which causes unresponsiveness, I think. Perhaps I'm imagining something more extreme than your "extreme"?

It is to be expected that humans who are actively trying to cause pain (or to imagine how to do so) will succeed in causing amounts of pain beyond most anything found in nature.

I'm unsure that "extreme" would necessarily get a more robust response, considering that there comes a point where the pain becomes disabling.

It seems as though there might be some sort of biological "limit" insofar as there are limited peripheral nerves, the grey matter can only process so much information, etc., and there'd be a point where the brain is 100% focused on avoiding the pain (meaning there'd be no evolutionary advantage to having the capacity to process additional pain). I'm not really sure where this limit would be, though. And I don't really know any biology so I'm plausibly completely wrong.

A full explanation to Newcomb's paradox.

I think the idea is that the 4th scenario is the case, and you can’t discern whether you’re the real you or the simulated version, as the simulation is (near-)perfect. In that scenario, you should act in the same way that you’d want the simulated version to. Either (1) you’re a simulation and the real you just won $1,000,000; or (2) you’re the real you and the simulated version of you thought the same way that you did and one-boxed (meaning that you get $1,000,000 if you one-box).
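A toy Python sketch of that reasoning, under stated assumptions: the predictor works by running the same decision procedure you use (a perfect simulation), and the transparent box holds $1,000, which is the usual figure in statements of the problem but isn't mentioned above. The names `payoff` and `real_payoff` are made up for illustration.

```python
def payoff(choice: str, predicted: str) -> int:
    """Newcomb payoffs: the opaque box is filled only if one-boxing was predicted."""
    opaque = 1_000_000 if predicted == "one-box" else 0
    transparent = 1_000  # assumed amount; not stated in the comment above
    if choice == "one-box":
        return opaque
    return opaque + transparent


def real_payoff(policy) -> int:
    """The predictor simulates the policy first; the real agent then runs the same policy."""
    predicted = policy()  # what the simulated version does
    choice = policy()     # what the real version does (same procedure, same answer)
    return payoff(choice, predicted)


def one_boxer() -> str:
    return "one-box"


def two_boxer() -> str:
    return "two-box"


if __name__ == "__main__":
    print("one-boxer gets:", real_payoff(one_boxer))
    print("two-boxer gets:", real_payoff(two_boxer))
```

Because the simulated and real versions run the same procedure, the only reachable outcomes are "both one-box" and "both two-box", so the one-boxing policy comes out ahead in this toy setup.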

How much to worry about the US election unrest?

If Trump loses the election, he's not the president anymore and the federal bureaucracy and military will stop listening to him.

He’d still be president until Biden’s inauguration though. I think most of the concern is that there’d be ~3 months of a president Trump with nothing to lose.

If anyone happens to be willing to privately discuss some potentially infohazardous stuff that's been on my mind (and not in a good way) involving acausal trade, I'd appreciate it - PM me. It'd be nice if I can figure out whether I'm going batshit.

How easily can we separate a friendly AI in design space from one which would bring about a hyperexistential catastrophe?
it's much harder to know if you've got it pointed in the right direction or not

Perhaps, but the type of thing I'm describing in the post is more about preventing worse-than-death outcomes even if the sign is flipped (by designing a reward function/model in such a way that it's not going to torture everyone if that's the case).

This seems easier than either recognising whether the sign is flipped or designing a system that can't experience these sign-flip-type errors at all; I'm just unsure whether this is something that we have robust solutions for. If it turns out that someone's figured out a reliable solution to this problem, then the only real concern is whether the AI's developers would bother to implement it. I'd much rather risk the system going wrong and paperclipping than going wrong and turning "I have no mouth, and I must scream" into a reality.
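To make the worry concrete, here's a toy Python sketch (not a proposed solution). It only illustrates one fact: an agent maximising a sign-flipped reward ends up picking whatever outcome sits at the bottom of the original reward function. The outcome names, reward numbers, and the `chosen_outcome` helper are all hypothetical.

```python
def chosen_outcome(reward: dict, sign: int = 1) -> str:
    """Return the outcome an idealised maximiser picks; sign=-1 models a sign-flip bug."""
    return max(reward, key=lambda outcome: sign * reward[outcome])


# Hypothetical reward table where the worst-scored outcome is also the worst one morally.
naive_reward = {
    "flourishing": 10.0,
    "paperclips": 0.0,
    "mass_suffering": -10.0,
}

# Hypothetical alternative where the lowest-scored outcome is a "mere" paperclip-style failure.
alternative_reward = {
    "flourishing": 10.0,
    "mass_suffering": -5.0,
    "paperclips": -10.0,
}

if __name__ == "__main__":
    for name, table in [("naive", naive_reward), ("alternative", alternative_reward)]:
        print(name, "| intended:", chosen_outcome(table, 1),
              "| sign flipped:", chosen_outcome(table, -1))
```

In this toy, both tables give the same result when the sign is right, but only the second one fails to paperclipping rather than to mass suffering when the sign is flipped. Whether anything like this carries over to real reward models is exactly the open question.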

How easily can we separate a friendly AI in design space from one which would bring about a hyperexistential catastrophe?

Seems a little bit beyond me at 4:45am - I'll probably take a look tomorrow when I'm less sleep deprived (although I still can't guarantee I'll be able to make it through then; there's quite a bit of technical language in there that makes my head spin). Are you able to provide a brief tl;dr, and have you thought much about "sign flip in reward function" or "direction of updates to reward model flipped"-type errors specifically? It seems like these particularly nasty bugs could plausibly be mitigated more easily than avoiding false positives (as you defined them in the arXiv paper's abstract) in general.

