# All of Forged Invariant's Comments + Replies

The timing evidence is thus hostile evidence and updating on it correctly requires superintelligence.

What do you mean by this? It seems trivially false that updating on hostile evidence requires superintelligence; for example, poker players still use their opponents' bets as evidence about their cards, even though these bets are frequently trying to mislead them in some way.

The evidence being from someone who went against the collective desire does mean that confidently taking it at face value is incorrect, but not that we can't update on it.

Martin Randall (4mo):
Good callout, that sentence is simplified. I think the conclusion is correct. Epistemic status: personal rule of thumb, defensively oriented.

Example: Cursed Monty Hall (see the [Monty Hall problem](https://en.wikipedia.org/wiki/Monty_Hall_problem)). This is like Monty Hall, except that we know Monty doesn't want us to win and is free to reveal whatever evidence he wants, at no cost to himself. Before Monty opens a door, we think that sticking with our choice has the same EV as switching to another door. After Monty opens a door, this should not change our decision: if updating on the evidence would cause us to make a better decision, Monty would not have given us the evidence.

It's not quite that simple in other cases. In Cursed Monty Hall, we assume it costs Monty nothing to feed us false evidence. In poker, it costs money to make a bet, so a player's desire to feed false evidence to their opponent is limited by the cost of providing it. Another way of looking at this is that a poker bet is not purely hostile evidence; it is also an action in the game. Another example from poker is deliberate table talk aimed at deception. This is more purely hostile evidence, since it costs no in-game resources. Updating based on table talk is therefore much harder than updating correctly based on bets. Whether it requires a "superintelligence" to update "correctly" is probably down to semantics.

In the LessWrong RedButton game, there is a cost to blowing up the home page when one is normally sleeping. We might update a little on the most likely sleeping habits of the attacker. But not too much! The value of the update must be less than the cost of misleading us, or else the attacker will pay the cost in order to mislead us. Whatever value we gain from updating positively about people who were asleep at 5:33:02 PM on 2022-09-26, it must be less than the cost to the attacker of staying up late or waking up early, one day a year. Similarly, for a 200+ karma user there is no clear cost or benefit to
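A minimal simulation of the Cursed Monty Hall point, assuming one specific hostile policy (sometimes called "Monty from Hell"; the policy and the function names are illustrative assumptions, not part of the comment): the host opens a goat door and offers a switch only when our first pick is the car. Treating the opened door as ordinary Monty Hall evidence (always switching when offered) then never wins, while ignoring the evidence keeps the 1/3 baseline.

```python
import random

def play(host, switch_if_offered, rng):
    """Play one round; return True if the contestant ends with the car.

    host: "classic" always opens a goat door and offers a switch;
          "hostile" (assumed policy) offers a switch only when the first pick is the car.
    """
    car = rng.randrange(3)
    pick = rng.randrange(3)
    offered = True if host == "classic" else pick == car
    if offered and switch_if_offered:
        # Host reveals a goat door that is neither our pick nor the car.
        goat = next(d for d in range(3) if d not in (pick, car))
        # Switch to the one remaining closed door.
        pick = next(d for d in range(3) if d not in (pick, goat))
    return pick == car

def win_rate(host, switch_if_offered, trials=100_000, seed=0):
    """Estimate the win probability of a fixed policy by simulation."""
    rng = random.Random(seed)
    return sum(play(host, switch_if_offered, rng) for _ in range(trials)) / trials

for host in ("classic", "hostile"):
    print(host, "switch:", win_rate(host, True), "stick:", win_rate(host, False))
```

With the classic host, switching wins about 2/3 of the time; against the hostile host, the same update rule is exploited completely, because the host only supplies the evidence when acting on it hurts us.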

The LW staff are necessary to take down the site. If we assume that there are multiple users who are willing to press the button, then the (Shapley-attributed) blame for taking the site down mostly falls on the LW staff, rather than on whoever happens to press the button first.

According to http://shapleyvalue.com/?example=8, if there were 6 people who were willing to push the button, the LW team would deserve 85% of the blame. (Here I am counting the people who take actions that facilitate bringing down the site as part of the coalition.)
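A sketch that reproduces the figure by brute force, under an assumed model of the game (the function names and the exact win condition are illustrative, not taken from shapleyvalue.com): the site goes down iff the staff and at least one willing presser are in the coalition. Averaging each player's marginal contribution over all 7! join orders gives the staff n/(n+1) of the credit, i.e. 6/7 ≈ 85.7% for six willing users, consistent with the roughly 85% above.

```python
from fractions import Fraction
from itertools import permutations
from math import factorial

def site_goes_down(coalition):
    """Assumed characteristic function: 1 if the coalition contains the staff
    (who built the button) and at least one willing presser, else 0."""
    return int("staff" in coalition and any(p != "staff" for p in coalition))

def shapley_values(players, value):
    """Exact Shapley values: average marginal contribution over all join orders."""
    totals = {p: Fraction(0) for p in players}
    for order in permutations(players):
        coalition = set()
        for p in order:
            before = value(coalition)
            coalition.add(p)
            totals[p] += value(coalition) - before
    n_orders = factorial(len(players))
    return {p: total / n_orders for p, total in totals.items()}

users = [f"user{i}" for i in range(6)]  # six people willing to press the button
values = shapley_values(["staff"] + users, site_goes_down)
print(values["staff"])  # 6/7, i.e. about 85.7% of the blame
print(values["user0"])  # 1/42 for each willing user
```

The staff's marginal contribution is 1 in every order except those where they join first, which is where the n/(n+1) comes from.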

I am not qu...

Ruby (4mo):
Oh, you're totally right that you need to account for number of users willing to press the button, of course.

Here is an example from "The Selfish Gene" of something that comes close:

One of the best-known segregation distorters is the so-called t gene in mice. When a mouse has two t genes it either dies young or is sterile, t is therefore said to be lethal in the homozygous state. If a male mouse has only one t gene it will be a normal, healthy mouse except in one remarkable respect. If you examine such a male's sperms you will find that up to 95 per cent of them contain the t gene, only 5 per cent the normal allele. This is obviously a gross distortion of the 50

...
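To see why a distortion like this can spread even though t/t mice die, here is a toy model. It is an illustrative sketch, not from the book; it assumes an infinite random-mating population, the same +/t frequency in both sexes, normal fitness for +/t mice, distortion only in males, and uses the 95% figure from the quote. The helper names are hypothetical.

```python
import math

def next_het_fraction(h, distortion=0.95):
    """One generation of a toy segregation-distorter model.

    h: fraction of breeding adults (of either sex) that are +/t; t/t adults are
       assumed absent because t/t mice die young or are sterile.
    distortion: share of a +/t male's sperm that carry t (0.5 = fair meiosis).
    """
    egg_t = h / 2             # +/t females pass on t fairly (assumption)
    sperm_t = h * distortion  # +/t males pass on t at the distorted rate
    tt = egg_t * sperm_t      # t/t zygotes, removed below
    het = egg_t * (1 - sperm_t) + (1 - egg_t) * sperm_t
    return het / (1 - tt)     # renormalise over surviving offspring

def equilibrium(distortion=0.95, h=0.01, generations=1000):
    """Iterate the model starting from a rare t allele."""
    for _ in range(generations):
        h = next_het_fraction(h, distortion)
    return h

def predicted_equilibrium(distortion):
    """Fixed point of next_het_fraction, solved by hand; valid for distortion > 0.5."""
    return 1 - math.sqrt(1 / distortion - 1)

print(round(equilibrium(0.95), 3))  # 0.771
print(equilibrium(0.5) < 0.01)      # True: with fair meiosis, t dwindles
```

Under these assumptions, about 77% of breeding adults end up carrying t, despite the cost to every t/t offspring.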

I had not thought of self-play as a form of recursive self-improvement, but now that you point it out, it seems like a great fit. Thank you.

I had been assuming (without articulating the assumption) that any recursive self-improvement would be improving things at an architectural level, and would be rather complex (I had pondered improvement of modular components, but the idea was still to improve the whole model). After your example, this assumption seems obviously incorrect.

AlphaGo was improving its training environment, but not any other part of the training process.

The left-hand side of the example deliberately makes the mistake described in your article, as a way to build intuition for why it is a mistake.

(Adding instead of averaging in the update summaries was an unintended mistake)

Thanks for explaining how to summarize updates, it took me a bit to see why averaging works.

From the equations alone, it was hard to intuitively grasp why updates work this way. This example made things more intuitive for me:

If an event can have 3 outcomes, and we encounter strong evidence against outcomes B and C, then the update looks like this:

The information about what hypotheses are in the running is important, and pooling the updates can make the evidence look much weaker than it is.

Jsevillamol (1y):
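Note that you are making the same mistake as me! Updates are not summarized in the same way as beliefs: for the update, the "correct" way is to take an average of the B, C likelihoods:

$$
\underbrace{\begin{pmatrix}1\\0.01\\0.01\end{pmatrix}}_{\text{Posterior}}
=\underbrace{\begin{pmatrix}1\\1\\1\end{pmatrix}}_{\text{Prior}}
\times\underbrace{\begin{pmatrix}1\\0.01\\1\end{pmatrix}}_{\text{Refute B}}
\times\underbrace{\begin{pmatrix}1\\1\\0.01\end{pmatrix}}_{\text{Refute C}}
\neq\underbrace{\begin{pmatrix}1\\1+1\end{pmatrix}}_{\text{Prior}}
\times\underbrace{\begin{pmatrix}1\\\frac{0.01+1}{2}\end{pmatrix}}_{\text{Refute B}}
\times\underbrace{\begin{pmatrix}1\\\frac{1+0.01}{2}\end{pmatrix}}_{\text{Refute C}}
\approx\underbrace{\begin{pmatrix}1\\0.5\end{pmatrix}}_{\text{Posterior}}
$$

This does not invalidate the example though! Thanks for suggesting it; I think it helps clarify the conundrum.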
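The arithmetic in this exchange can be checked with exact fractions. This sketch (helper names are hypothetical) compares updating on all three outcomes and pooling B and C only at the end, against pooling first and summarizing each update by averaging its B and C likelihoods. The averaged summary is only exact while B and C carry equal weight, so the second update is where it goes wrong.

```python
from fractions import Fraction as F

# Unnormalised weights over the outcomes (A, B, C), taken from the example.
PRIOR = [F(1), F(1), F(1)]
REFUTE_B = [F(1), F(1, 100), F(1)]  # likelihoods: strong evidence against B
REFUTE_C = [F(1), F(1), F(1, 100)]  # likelihoods: strong evidence against C

def update(weights, likelihoods):
    """Bayes' rule on unnormalised weights: multiply elementwise."""
    return [w * l for w, l in zip(weights, likelihoods)]

def prob_first(weights):
    """Normalised probability of the first outcome (A)."""
    return weights[0] / sum(weights)

def pool_belief(w):
    """Collapse (A, B, C) beliefs to (A, not-A) by summing B and C."""
    return [w[0], w[1] + w[2]]

def pool_update(l):
    """Summarise an update on (A, not-A) by averaging the B and C likelihoods.

    Exact only when B and C currently have equal weight.
    """
    return [l[0], (l[1] + l[2]) / 2]

# Correct: keep all three hypotheses, pool only at the end.
p_a_correct = prob_first(update(update(PRIOR, REFUTE_B), REFUTE_C))

# Pool first, then apply the averaged update summaries.
pooled = update(update(pool_belief(PRIOR), pool_update(REFUTE_B)), pool_update(REFUTE_C))
p_a_pooled = prob_first(pooled)

print(p_a_correct, float(p_a_correct))  # 50/51, about 0.980
print(p_a_pooled, float(p_a_pooled))    # 20000/30201, about 0.662
```

After the first update, B and C are weighted 1 : 100 within not-A, so the correct pooled likelihood for "Refute C" is that weighted average rather than the plain average; using the plain average makes strong evidence look much weaker.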

I feel the postmortem over-focuses on what went wrong or was sub-optimal, so I would like to point out that I found the event fun, despite being a lurker with no code.

Measure (1y):
FYI, I didn't even know the event was going on. This post was my first time hearing that anything had happened this year. I access LW via a shortcut to the All Posts page, and I never saw the modified front page. I didn't even notice last year when the front page actually went down, since all the other pages still worked.

There were some reports of people seeing a frozen countdown on the button which disappeared when the page was refreshed. Was this an intentional false alarm? I had assumed so, as a false alarm with some evidence that it was false nicely echoes parts of Petrov's situation.

Peter Wildeford (1y):
I will be on the lookout for false alarms.

I had not noticed my own Gell-Mann amnesia when reading that bit, and therefore find your response quite convincing. I had thought that Ziv's answer to (D) made sense due to the FDA being over-cautious about approving things, but neither the scope of the precedent nor the kinds/directions of errors had registered with me.

PatrickDFarley (1y):
Absolutely, the whole blame-avoidance game would tend to make them over-cautious, but other hazards like regulatory capture (which I'm pretty sure is what happened with nutrition) threaten to make them recklessly wrong (as long as they can still find a way to avoid blame).

One possible strategy would be to make AI more dangerous as quickly as possible, in the hope that it produces a strong reaction and the addition of safety protocols. Doing this with existing tools, so that it is not an AGI, makes it survivable. This reminds me a bit of Robert Miles's facial recognition and blinding laser robot. (Which, of course, is never used to actually cause harm.)

If the AGI can simply double its cognitive throughput, it can just repeat the action "sleuth to find an under-priced stock" as needed. This does not exhaust the order book until the entire market is operating at AGI-comparable efficiency, at which point the AGI probably controls a large (or majority) share of the trading volume.

Also, the other players would have limited ability to imitate the AGI's tactics, so its edge would last until they left the market.

Gerald Monroe (2y):
This is true. Keep in mind that the AGI is trying to make money: it has to find securities where it predicts humans are going to change the price in a predictable direction over a short time horizon. Most securities will change their price purely by random chance (or in a pattern no algorithm can find), and you cannot beat the market on those. Now, there is another strategy, which has been used by highly successful hedge funds: if you are the news, you can make the market move in the direction you predict. Certain hedge funds do their research and, from a mixture of publicly available and probably insider data, find companies in weak financial positions. They then sell them short, using options with near-term strike prices, and publicly announce their findings. This is a strategy an AGI could probably execute extremely well.

A hypothesis I had was that the US was sticking to an exact formula due to higher vaccine hesitancy, in order to "play it safe" and give anti-vaxxers less to criticize. After looking at a small handful of countries, I think this is not a significant cause of the difference in responses.

If this were true I would expect countries that have higher vaccine hesitancy to be less likely to do first doses first.

Checking [this data](https://www.thelancet.com/cms/10.1016/S0140-6736(20)31558-0/attachment/720358f5-8df0-405b-b06f-7734cf542a58/mmc1.pdf) which was nea...

From my understanding of the Canada situation, it may have been motivated by less access to vaccines initially. The US did very well in terms of getting lots of vaccines soon (https://ourworldindata.org/covid-vaccinations), while Canada took about 4 months after the US to really get going. Canada may have been more desperate to prevent Covid (or to stop its numbers lagging the US), and thus been less risk-averse.

This argument does not work for the UK, as they have been ahead of the US the whole time.

I like how this proposal makes the player strategies explicit, and how they are incorporated into the calculation. I also think that the edge case where the agents' actions have no effect on the result

I think that making alignment symmetric, as this proposal does, might be undesirable. Taking the prisoner's dilemma as an example, if s = always cooperate and r = always defect, then I would say s is perfectly aligned with r, and r is not at all aligned with s.

The result of 0 alignment for the Nash equilibrium of PD seems correct.

I think this should be the alignment mat...

JonasMoss (2y):
I believe the upper right-hand corner of $a$ shouldn't be 1; even if both players are acting in each other's best interest, they are not acting in their own best interest. And alignment is about having both at the same time. The configuration of the Prisoner's dilemma makes it impossible to make both players maximally satisfied at the same time, so I believe it cannot have maximal alignment for any strategy.

Anyhow, your concept of alignment might involve altruism only, which is fair enough. In that case, Vanessa Kosoy has a similar proposal to mine, but not working with sums, which probably does exactly what you are looking for.

Getting alignment in the upper right-hand corner of the Prisoner's dilemma matrix to be 1 may be possible if we redefine $u(A,B)$ to $u(A,B) = \max_{u,v} u^T(A+B)v$, the best attainable payoff sum. But then zero-sum games will have maximal instead of minimal alignment! (This is one reason why I defined $u(A,B) = \max_{u,v} u^TAv + \max_{u,v} u^TBv$.)

(Btw, the coefficient isn't symmetric; it's only symmetric for symmetric games. No alignment coefficient depending on the strategies can be symmetric, as the vectors can have different lengths.)
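The two definitions of $u(A,B)$ can be compared on small games. This is only a sketch of the $u(A,B)$ term, not of the full alignment coefficient; the payoff numbers (standard Prisoner's dilemma values and matching pennies) and the helper names are assumptions for illustration.

```python
# Payoff matrices: rows = player 1's action, columns = player 2's action.
# Assumed Prisoner's dilemma payoffs (T=5, R=3, P=1, S=0); action 0 = cooperate.
PD_A = [[3, 0], [5, 1]]  # player 1's payoffs
PD_B = [[3, 5], [0, 1]]  # player 2's payoffs

# Matching pennies, a zero-sum game: B = -A.
MP_A = [[1, -1], [-1, 1]]
MP_B = [[-x for x in row] for row in MP_A]

def max_bilinear(M):
    """max of u^T M v over mixed strategies u, v.

    u^T M v is linear in u for fixed v (and vice versa), so the maximum over the
    two probability simplices is reached at a pure-strategy pair: the largest entry.
    """
    return max(max(row) for row in M)

def u_separate(A, B):
    """u(A,B) = max u^T A v + max u^T B v: each player's best case, taken separately."""
    return max_bilinear(A) + max_bilinear(B)

def u_joint(A, B):
    """Alternative u(A,B) = max u^T (A+B) v: the best attainable payoff sum."""
    total = [[a + b for a, b in zip(row_a, row_b)] for row_a, row_b in zip(A, B)]
    return max_bilinear(total)

print(u_separate(PD_A, PD_B), u_joint(PD_A, PD_B))  # 10 6
print(u_separate(MP_A, MP_B), u_joint(MP_A, MP_B))  # 2 0
```

In the Prisoner's dilemma no single outcome reaches the separate bound of 10 (the best joint sum is 6), matching the point that both players cannot be maximally satisfied at once. In matching pennies every outcome already attains the joint bound of 0, which is why the alternative definition would rate zero-sum games as maximally aligned.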

**1**/**1**  0/0

0/**0**  **0.8**/-1

I have put the preferred state for each player in bold. I think by your rule this works out to 50% aligned. However, the Nash equilibrium is both players choosing the 1/1 result, which seems perfectly aligned (intuitively).

**1**/**0.5**  0/0

0/0  **0.5**/**1**

In this game, all preferred states are shared, yet there is a Nash equilibrium where each player plays the move that can get them 1 point 2/3 of the time, and the other move 1/3 of the time. I think it would be incorrect to call this 100% aligned.
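The equilibrium claims for these two games can be checked mechanically. This sketch writes each cell as (P1 payoff, P2 payoff), with P1 choosing the row; reading a "preferred state" as a best-response cell is an assumption, and the helper names are hypothetical. The mixed equilibrium comes from the usual 2x2 indifference conditions.

```python
from fractions import Fraction as F

# 2x2 games, cells are (P1 payoff, P2 payoff); rows = P1's move, cols = P2's move.
GAME_1 = [[(F(1), F(1)), (F(0), F(0))],
          [(F(0), F(0)), (F(4, 5), F(-1))]]
GAME_2 = [[(F(1), F(1, 2)), (F(0), F(0))],
          [(F(0), F(0)), (F(1, 2), F(1))]]

def best_response_cells(g):
    """Cells where P1 (resp. P2) is best-responding to the other's move."""
    p1 = {(max(range(2), key=lambda r: g[r][c][0]), c) for c in range(2)}
    p2 = {(r, max(range(2), key=lambda c: g[r][c][1])) for r in range(2)}
    return p1, p2

def pure_nash(g):
    """Pure equilibria: cells where both players are best-responding."""
    p1, p2 = best_response_cells(g)
    return p1 & p2

def mixed_nash_2x2(g):
    """Fully mixed equilibrium from the indifference conditions (assumes one exists).

    p = P(P1 plays row 0), chosen so P2 is indifferent between its columns;
    q = P(P2 plays col 0), chosen so P1 is indifferent between its rows.
    """
    a = [[g[r][c][0] for c in range(2)] for r in range(2)]
    b = [[g[r][c][1] for c in range(2)] for r in range(2)]
    p = (b[1][1] - b[1][0]) / (b[0][0] - b[0][1] - b[1][0] + b[1][1])
    q = (a[1][1] - a[0][1]) / (a[0][0] - a[0][1] - a[1][0] + a[1][1])
    return p, q

print(pure_nash(GAME_1))       # {(0, 0)}
print(mixed_nash_2x2(GAME_2))  # (Fraction(2, 3), Fraction(1, 3))
```

In the second game P1 plays its 1-point row with probability 2/3 and P2 plays its 1-point column with probability 1 - 1/3 = 2/3, matching the equilibrium described above, even though both players' best-response cells coincide.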

(These examples were not obvious ...

Templarrr (2y):
Thanks for the careful analysis. I must confess that my metric does not consider stochastic strategies, and in general works better if the players' actions are taken consecutively, not simultaneously (which is quite different from the classic description). The reasoning is that for maximal alignment, for each action of P1 there exists exactly one action of P2 (and vice versa) that forms a Nash equilibrium. In this case, the game stops in a stable state after a single pair of actions. A maximally unaligned game will have no pure-strategy Nash equilibrium at all, meaning the players' actions and reactions will just move around the matrix in a closed loop. Overall, my solution as it stands does not seem suited to the classical formulation of the game :) but thanks for considering it!

Another point you could fix using intuition would be complete disinterest. It makes sense to put it at 0 on the [-1, 1] interval.

Assuming rational utility maximizers, a board that results in a disinterested agent would be:

1/0  1/1

0/0  0/1

Then neither agent can influence the other's rewards, so it makes sense to say that they are not aligned.

More generally, if arbitrary changes to one player's payoffs have no effect on the behaviour of the other player, then the other player is disinterested.
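A small check of the payoff-independence condition behind the board above. It is a sketch that tests only the simpler condition (each player's payoff ignores the other's move), not the more general behavioural criterion; cells are written (P1 payoff, P2 payoff) with P1 choosing the row, and the Prisoner's dilemma contrast and helper names are assumptions.

```python
# Board from the comment: rows = P1's move, cols = P2's move.
BOARD = [[(1, 0), (1, 1)],
         [(0, 0), (0, 1)]]

# Assumed standard Prisoner's dilemma payoffs, for contrast (move 0 = cooperate).
PD = [[(3, 3), (0, 5)],
      [(5, 0), (1, 1)]]

def p1_ignores_p2(game):
    """True if P1's payoff in every row is the same whatever column P2 picks."""
    return all(len({cell[0] for cell in row}) == 1 for row in game)

def p2_ignores_p1(game):
    """True if P2's payoff in every column is the same whatever row P1 picks."""
    return all(len({row[c][1] for row in game}) == 1 for c in range(len(game[0])))

def disinterested(game):
    """Neither player's moves can change the other player's reward."""
    return p1_ignores_p2(game) and p2_ignores_p1(game)

print(disinterested(BOARD))  # True
print(disinterested(PD))     # False
```

When this condition holds, each player's best move does not depend on the other's move at all, so changing one player's payoffs cannot change the other's behaviour.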