For 1: resampling is not "asking the AI to try again"; it is forcibly rewriting the AI's history so that it made a different one of its plausible decisions. Every resampled continuation could plausibly have been the original continuation under a different RNG seed, and there is no non-shared scratchpad where information about the resampling event could be passed along. Resampling probably does increase the risk that you land on a less suspicious bad path, but that is outweighed by the fact that the AI has to mostly bide its time (so as not to get caught during the fraction of early requests that get careful monitoring, which is an important prior result), so on net it is a gain (see the toy sketch after point 3).
For 2: the model used in this paper assumes the untrusted model behaves optimally given its actual ability to do sneaky bad stuff. Computing that optimum requires knowing the monitoring protocol, so the untrusted AI must know about the monitor. There is no "does not know about the monitoring setup" case in this model, because if the model genuinely believes there is no monitoring, it just attacks immediately and loses; and if it believes there is monitoring, it is really hard to trick it about the monitoring strategy.
For 3: no, you want to use the same model, since you can prevent the resampled branches from knowing about each other (they share only the pre-resampling history and the weights), which is essentially what you would want from another "instance". The modeling assumption is that there is exactly one untrusted AI, which wins if any path breaks out, so there is no meaningfully different untrusted AI to ask to try from that starting point.
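A toy version of the arithmetic behind point 1, with illustrative numbers of my own (the audit fraction, horizon, and attack rate are not from the paper):

```python
# Toy sketch of point 1, with illustrative numbers (not from the paper).
# To survive careful monitoring on the first `audit_frac` of requests, the
# scheming model must keep its per-step attack probability p small; a
# resample of a flagged step is a fresh draw from the same policy, so the
# chance the replacement is also an attack is only ~p.

audit_frac = 0.02     # fraction of early requests that get careful auditing (assumed)
total_steps = 10_000  # deployment length (assumed)
p_attack = 1e-3       # per-step attack probability the model can afford (assumed)

p_survive_audit = (1 - p_attack) ** (audit_frac * total_steps)
p_resample_is_attack = p_attack

print(f"model survives audit window undetected: {p_survive_audit:.2f}")
print(f"a resampled step is itself an attack:   {p_resample_is_attack:.4f}")
```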
One-box-with-probability-1 and two-box-with-probability-1 are not the only strategies. You can also try to find some source of randomness and play a mixed strategy. If mixed strategies are unavailable, game theory and decision theory are not guaranteed to converge.
How does the typical construction of Newcomb's problem handle mixed strategies?
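For reference, one common stipulation (I believe it goes back to Nozick's original write-up) is that if the predictor predicts you will consciously randomize, it leaves box B empty. A quick expected-value check under that assumption, with the usual amounts:

```python
# Expected values in Newcomb's problem under the (assumed, Nozick-style)
# stipulation that a predicted randomizer gets an empty box B.
A, B = 1_000, 1_000_000   # transparent box A, opaque box B

ev_one_box = B            # predictor foresaw one-boxing and filled B
ev_two_box = A            # predictor foresaw two-boxing and left B empty

def ev_mixed(q: float) -> float:
    """One-box with probability q via a visible randomizer: the predictor
    predicts randomization, so B is empty on either branch."""
    return q * 0 + (1 - q) * A

print(ev_one_box, ev_two_box, ev_mixed(0.5))
# Under this construction every mixed strategy is dominated by pure two-boxing.
```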
The question of why there is no "might kill us" class is simple. There is such a class, but if an AI loses the fight to kill us, it obviously was not ASI (picking a fight with the world and losing is pretty dumb); if it wins, we die. Either way, we are in the same scenario with every AI stronger than it, and with every weaker AI that might still get lucky, just as we might get lucky and win at bad odds. The next AI we make will also want to fight us for the same reasons, and we will need to either fight it too (including preemptively, e.g. turning it off because a dumber model did something), or get a reason to believe that we will never fight it. And if you know you will fight your AI eventually, and you will win now, fight now.
if it were competent enough to reflect and actively reconcile its own inner drives
Why do we think that reflection is necessary for competence? That is, competence does not seem to imply coherence, unless I missed an argument.
This analysis is slightly incorrect: if rewards programs are treated as assets, they contribute to enterprise value, not to market cap. Market cap is roughly the total expected value of the enterprise minus the debt outstanding. American, for instance, has about $37 billion of debt, and its market cap is the expected value left over after the expected payout to creditors.
So in this analysis, American is actually worth roughly $47 billion: $37 billion of that is owed to creditors and $10 billion belongs to the shareholders. About $22 billion of the total is the rewards program, and the rest is the rest of the business.
If American's rewards program disappeared, the creditors would take a haircut, but they would not be left with nothing.
You can do the same for all of these.
As an aside, every time I have seen someone make this claim, they have neglected to model debt. I think it might be systematically confusing.
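To make it concrete, here is the arithmetic for American in a few lines of Python, using the approximate figures above:

```python
# Rough arithmetic for American, using the approximate figures quoted above
# (market cap ~$10B, debt ~$37B, rewards program appraised at ~$22B).
market_cap = 10e9                     # value accruing to shareholders
debt = 37e9                           # value owed to creditors
enterprise_value = market_cap + debt  # ~$47B total claim on the business

rewards_program = 22e9                # appraised value of the loyalty program

# Comparing the program to market cap alone makes it look like >200% of the
# company; against enterprise value it is a large but sane fraction.
print(f"rewards / market cap:       {rewards_program / market_cap:.0%}")
print(f"rewards / enterprise value: {rewards_program / enterprise_value:.0%}")
```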
For these hidden reasoning steps, especially the epiphenomenal model, there can also be redundancy between in-weight computation and the chain of thought. That is, a dumb model shown the chain of thought of a smart model might get the right answer when it otherwise would not, even if the chain of thought does not help the smart model itself.
Under the assumption of separate initialization, this probably does not happen in cases where the extra information is steganographic, or in some cases of irrelevant reasoning, because the relevant information is not being passed through the token bottleneck.
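A hedged sketch of one way to probe this: restrict to problems where ablating the chain of thought does not change the strong model's answer, then check whether the weak model still benefits from seeing it. Everything here (the callables, the dataset shape) is hypothetical scaffolding, not any particular paper's setup.

```python
from typing import Callable, Iterable

def cot_redundancy_probe(
    problems: Iterable[dict],                  # each: {"question": str, "label": str}
    strong_answer: Callable[[str, str], str],  # (question, cot) -> answer
    strong_cot: Callable[[str], str],          # question -> chain of thought
    weak_answer: Callable[[str, str], str],    # (question, cot) -> answer
) -> dict:
    """Keep only problems where the strong model's answer is unchanged when
    its chain of thought is ablated (the CoT is epiphenomenal for it), then
    check whether the weak model still does better when shown that CoT."""
    weak_plain_correct = weak_cot_correct = n = 0
    for p in problems:
        cot = strong_cot(p["question"])
        # skip cases where the CoT actually changes the strong model's answer
        if strong_answer(p["question"], cot) != strong_answer(p["question"], ""):
            continue
        n += 1
        weak_plain_correct += weak_answer(p["question"], "") == p["label"]
        weak_cot_correct += weak_answer(p["question"], cot) == p["label"]
    return {
        "n_epiphenomenal": n,
        "weak_acc_without_cot": weak_plain_correct / n if n else None,
        "weak_acc_with_cot": weak_cot_correct / n if n else None,
    }
```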
I am assuming the first step is to count the chromosomes and verify that you have exactly a collection of full sets, all from the same cells, so that the property "there are exactly k of each" holds. I would need to look at the literature to see the success rate of isolating all chromosomes, not just at least one.
It seems that for this method you must get 22 successful sequences in a row, which is hard if sequencing fails even occasionally, or if you lose a chromosome even occasionally.
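The compounding is brutal even at high per-chromosome success rates (the rates below are illustrative, not from any study):

```python
# If each chromosome is isolated and sequenced successfully with independent
# probability p, requiring all 22 autosomes to succeed compounds quickly.
for p in (0.99, 0.95, 0.90):
    print(f"per-chromosome success {p:.0%} -> all 22 succeed: {p**22:.1%}")
```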