EliasHasle

If repetitions arise from sampling merely due to high *conditional* probability given an initial "misstep", they should be avoidable by an MCTS that seeks to maximize the *unconditional* probability of the output sequence (or rather, the probability conditional on its input but not on its own prior output). After entering the "trap" once or a few times, it would simply avoid the unfortunate misstep in subsequent "playouts". As far as I understand, that is.
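A toy illustration of the difference, with entirely made-up transition probabilities: greedy decoding maximizes the *conditional* probability of each next token and walks straight into a repetition trap, while scoring whole sequences by total log-probability (what a search over complete outputs would optimize) avoids it.

```python
import math

# Hypothetical toy transition model. From 'A' the model almost always
# continues with 'A' (a repetition trap), yet 'A' is the single most
# likely first token.
P = {
    'start': {'A': 0.6, 'B': 0.4},
    'A':     {'A': 0.9, 'END': 0.1},
    'B':     {'C': 1.0},
    'C':     {'END': 1.0},
}

def greedy(max_len=8):
    """Maximize the conditional probability of each next token."""
    seq, cur = [], 'start'
    while cur != 'END' and len(seq) < max_len:
        cur = max(P[cur], key=P[cur].get)
        seq.append(cur)
    return seq

def best_sequence(max_len=8):
    """Enumerate whole sequences and pick the highest total
    log-probability -- what a search over full outputs would optimize."""
    best, best_lp = None, -math.inf
    stack = [([], 'start', 0.0)]
    while stack:
        seq, cur, lp = stack.pop()
        if cur == 'END':
            if lp > best_lp:
                best, best_lp = seq, lp
            continue
        if len(seq) >= max_len:
            continue
        for tok, p in P[cur].items():
            stack.append((seq + [tok], tok, lp + math.log(p)))
    return best

print(greedy())         # stuck repeating 'A'
print(best_sequence())  # ['B', 'C', 'END'] -- the trap is avoided
```

Here "B C END" has unconditional probability 0.4, while any sequence entering 'A' tops out at 0.06 — but a step-by-step decoder never sees that, because each *individual* continuation of 'A' looks good.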

I went into the idea of evaluating on future state representations here: https://www.lesswrong.com/posts/5kurn5W62C5CpSWq6/avoiding-side-effects-in-complex-environments#bFLrwnpjq6wY3E39S (Not sure it is wise, though.)


It seems like the method is sensitive to the relative ranges of the game reward and the auxiliary penalty. In real life, I suppose one would have to clamp the "game" reward to let the impact penalty dominate even when massive gains are foreseen from a high-impact course of action?

What if the encoding difference penalty were applied after a counterfactual rollout of no-ops after the candidate action or no-op? Couldn't that detect "butterfly effects" of small impactful actions, avoiding "salami slicing" exploits?
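A minimal sketch of that idea, with hypothetical stand-ins (`step` is any deterministic environment transition, `encode` any state featurizer): roll out n no-ops after both the candidate action and the no-op baseline, then penalize the divergence of the resulting encodings. In the toy dynamics below, a tiny action compounds into a large delayed difference — the "butterfly effect" an immediate comparison would miss.

```python
import numpy as np

def rollout(state, step, first_action, n_noops, noop=0):
    s = step(state, first_action)
    for _ in range(n_noops):
        s = step(s, noop)          # let delayed consequences unfold
    return s

def delayed_impact_penalty(state, step, encode, action, n_noops, noop=0):
    # compare the futures of "act, then wait" vs. "wait all along"
    s_act = rollout(state, step, action, n_noops, noop)
    s_base = rollout(state, step, noop, n_noops, noop)
    return float(np.linalg.norm(encode(s_act) - encode(s_base)))

# Toy dynamics where deviations from equilibrium double every step:
step = lambda s, a: 2 * s + 0.01 * a
encode = lambda s: np.array([s])

immediate = delayed_impact_penalty(0.0, step, encode, action=1, n_noops=0)
delayed = delayed_impact_penalty(0.0, step, encode, action=1, n_noops=10)
print(immediate, delayed)   # 0.01 vs. 10.24 -- the rollout exposes the impact
```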

Building upon this thought, how about comparing mutated policies to a base policy by sampling possible futures to generate distributions of the encodings up to the farthest step and penalize divergence from the base policy?
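A crude sketch of that comparison, where `policy(state, rng)`, `step(state, action)` and `encode(state)` are all hypothetical stand-ins, and the divergence between the two encoding distributions is approximated by the distance between sample means (any proper distributional divergence could be substituted):

```python
import numpy as np

def sample_final_encodings(policy, step, encode, state, horizon, n_rollouts, rng):
    encs = []
    for _ in range(n_rollouts):
        s = state
        for _ in range(horizon):
            s = step(s, policy(s, rng))
        encs.append(encode(s))        # encoding at the farthest step
    return np.array(encs)

def divergence_penalty(mutated, base, step, encode, state,
                       horizon=20, n_rollouts=100, seed=0):
    rng = np.random.default_rng(seed)
    e_mut = sample_final_encodings(mutated, step, encode, state, horizon, n_rollouts, rng)
    e_base = sample_final_encodings(base, step, encode, state, horizon, n_rollouts, rng)
    # crude divergence proxy: distance between the sample means
    return float(np.linalg.norm(e_mut.mean(axis=0) - e_base.mean(axis=0)))

# Toy check: a policy that nudges the state each step drifts away from
# the do-nothing base policy, and the penalty reflects it.
step = lambda s, a: s + a
encode = lambda s: np.array([s])
base = lambda s, rng: 0.0
mutated = lambda s, rng: 1.0
print(divergence_penalty(mutated, base, step, encode, state=0.0))  # 20.0
```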

Or just train a sampling policy by gradient descent, using a Monte Carlo Tree Search that penalizes actions which alter the future encodings compared to a pure no-op policy.

I must admit that I did not understand everything in the paper, but I think this excerpt summarizes a crucial point:

"The key issue here is proper conditioning. The unbiasedness of the value estimates V_i discussed in §1 is unbiasedness conditional on mu. In contrast, we might think of the revised estimates ^v_i as being unbiased conditional on V. At the time we optimize and make the decision, we know V but we do not know mu, so proper conditioning dictates that we work with distributions and estimates conditional on V."

The proposed "solution" converts n independent evaluations into n evaluations (estimates) that respect the selection process, but, as far as I can tell, they still rest on prior value estimates and prior knowledge about the uncertainty of those estimates... Which means the "solution" at best limits introduction of optimizer bias, and at worst... masks old mistakes?

The *big* problem arises when the number of choices is huge and sparsely explored, such as when optimizing a neural network.

But suppose we restrict ourselves to n superficially evaluated choices, with known estimate variance in each evaluation and independent errors/noise. If – as in realistic cases like Monte Carlo Tree Search – we are then allowed to perform additional "measurements" to narrow down the uncertainty, it is wise to scrutinize the high-expectation choices most – in effect trying to "falsify" their greatness, while increasing the certainty of their greatness if the falsification "fails". This is the effect of using heuristics like the Upper Confidence Bound for experiment/branch selection.

UCB is also described as "optimism in the face of uncertainty", which kind of defeats the point I am making if it is deployed as the decision policy. What I mean is that in research, preparations and planning (with tree search in perfect-information games as a formal example where UCB can be applied), one should put a lot of effort into finding out whether the seemingly best choice (of path, policy, etc.) *really is that good*, and then make a final choice that *penalizes* remaining uncertainty.
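A bandit sketch of that split, with hypothetical arm means: explore with UCB1-style optimism, but make the *final* commitment with a lower confidence bound, so remaining uncertainty counts against a candidate rather than for it.

```python
import math, random

random.seed(1)
MEANS = [0.3, 0.5, 0.55]          # true mean rewards, unknown to the agent

def pull(arm):
    return MEANS[arm] + random.gauss(0, 0.5)

def run(budget=3000, c=2.0):
    n_arms = len(MEANS)
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    for t in range(1, budget + 1):
        if t <= n_arms:                        # try each arm once
            arm = t - 1
        else:                                  # UCB1: optimistic exploration
            arm = max(range(n_arms), key=lambda a:
                      sums[a] / counts[a] + c * math.sqrt(math.log(t) / counts[a]))
        counts[arm] += 1
        sums[arm] += pull(arm)
    # final decision: penalize remaining uncertainty (lower confidence bound)
    return max(range(n_arms), key=lambda a:
               sums[a] / counts[a] - c * math.sqrt(math.log(budget) / counts[a]))

print(run())  # index of the chosen arm; the clearly-worst arm 0 is avoided
```

The exploration rule and the decision rule share the same confidence radius but apply it with opposite signs — optimism while measuring, pessimism while committing.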

I would like to throw in a Wikipedia article on a relevant topic, which I came across while reading about the related "Winner's curse": https://en.wikipedia.org/wiki/Order_statistic

The math for order statistics is quite neat as long as the variables are independently sampled from the same distribution. In real life, "sadly", choice evaluations may not always come from the same distribution... Rather, they are by definition conditional upon the choices. (https://en.wikipedia.org/wiki/Bapat%E2%80%93Beg_theorem provides a kind of solution in the form of an intractable colossus of a calculation.) That is not to say that no valuable/informative approximations can be found.
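In the i.i.d. case, the neat math is easy to check numerically: the maximum of n independent Uniform(0,1) draws has CDF F(x) = xⁿ, hence expectation n/(n+1) — which also quantifies how the "winner" inflates with n.

```python
import random

random.seed(0)

def empirical_max_mean(n, trials=20000):
    """Mean of the maximum of n i.i.d. Uniform(0,1) draws."""
    return sum(max(random.random() for _ in range(n))
               for _ in range(trials)) / trials

for n in (2, 5, 20):
    theory = n / (n + 1)                     # E[max] from F(x) = x**n
    print(n, theory, round(empirical_max_mean(n), 3))
```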

I think "Metaculus" is a pun on "meta", "calculus" and "meticulous".

"WW3" and "28 years passing" are similarly dangerous "events" for the individual gambler. Why invest with a long-term perspective if there is a significant probability that you eventually cannot harvest... Crucially, the probability of not harvesting the reward may be a lot higher in a "force majeure" situation like WW3, even if one stays alive. But on the other hand, an early WW3 would chop off a lot of the individual existential risk associated with 28 years passing. 🤔 I think there could be major biases here, also possibly affected by "doomsday" propaganda centered on exaggerated "nuclear winter" predictions.

The possibility of outright manipulation of prediction markets should be thoroughly considered when the fates of Ukraine, NATO, Russia and Putin are at stake. If the cost of maintaining "favorable yet mutually consistent" predictions is within, say, a few hundred million dollars per year, it could be a good idea, as long as the enemy has no similar opposing operation going on...

Ehm.. Huh? I would say that:

Conditional on being in a billion-human universe, your probability of having an index between 1 and 1 billion is 1, and your probability of having any other index is 0. Conditional on being in a trillion-human universe, your probability of having an index between 1 and 1 trillion is 1, and your probability of having any other index is 0. Also, conditional on being in a trillion-human universe, your probability of having an index between 1 and 1 billion is 1 in a thousand (assuming you are equally likely to be any of its inhabitants).

That way, the probabilities respect the conditions, and add up to 1 as they should.
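The arithmetic of the claim above, assuming a uniform distribution over indices within each universe:

```python
from fractions import Fraction

billion, trillion = 10**9, 10**12

# Conditional on the billion-human universe, a low index is certain:
p_low_given_billion = Fraction(billion, billion)

# Conditional on the trillion-human universe, the low range [1, 1e9]
# covers one thousandth of the indices:
p_low_given_trillion = Fraction(billion, trillion)
p_high_given_trillion = 1 - p_low_given_trillion

print(p_low_given_billion)                           # 1
print(p_low_given_trillion)                          # 1/1000
print(p_low_given_trillion + p_high_given_trillion)  # 1 -- adds up, as it should
```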

This is also confusing to me, as the resulting "probabilities" do not add up.