flodorner

flodorner's Comments

My prediction for Covid-19

"If, at some point in the future, we have the same number of contagious people, and are not at an appreciable fraction of group immunity, it will at that point again be a solid decision to go into quarantine (or to extend it). "

I think for many people the number of infections at which this becomes a good idea has increased, as we now have more accurate information about the CFR and about how quickly realistic countermeasures can slow down an outbreak in a given area; this should decrease credence in some of the worst-case scenarios many were worried about a few months ago.

The case for C19 being widespread

"Czech Researchers claim that Chinese do not work well "

This seems to be missing a word ;)

Conflict vs. mistake in non-zero-sum games

Nitpick: I am pretty sure non-zero-sum does not imply a convex Pareto front.
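A minimal example for the nitpick (my own construction, not from the post): suppose the only feasible payoff profiles, with no mixing or transfers available, are

$$(2,0), \qquad (0.5,\,0.5), \qquad (0,2).$$

The payoff sums are $2$, $1$, and $2$, so the game is not zero-sum (or even constant-sum); yet all three profiles are Pareto-optimal, and $(0.5,0.5)$ lies strictly below the segment from $(2,0)$ to $(0,2)$, so the Pareto front is not convex.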

Instead of the lens of negotiation position, one could argue that mistake theorists believe that the Pareto boundary is convex (which implies that maximizing surplus is usually more important than deciding allocation), while conflict theorists see it as concave (which implies that allocation is the more important factor).

Credibility of the CDC on SARS-CoV-2

Even if the claim were usually true on longer time scales, I doubt that pointing out an organisation's mistakes and not entirely truthful statements usually increases trust in it on the short time scales that might be most important here. Reforming organizations and rebuilding trust usually takes time.

Subagents and impact measures, full and fully illustrated

How do

"One of the problems here is that the impact penalty only looks at the value of VAR one turn ahead. In the DeepMind paper, they addressed similar issues by doing “inaction rollouts”. I'll look at the more general situations of rollouts: rollouts for any policy "

and

"That's the counterfactual situation, that zeroes out the impact penalty. What about the actual situation? Well, as we said before, A will be just doing ∅; so, as soon as would produce anything different from ∅, the A becomes completely unrestrained again."

fit together? In the special case where π is the inaction policy, I don't understand how the trick would work.
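For reference, the reading of rollouts I am assuming here (my formalization, not necessarily the post's): from the actual post-action state $s_a$ and the counterfactual inaction state $s_\emptyset$, the same policy $\pi$ is rolled out in both branches, and the penalty compares auxiliary values along the two rollouts, roughly

$$\text{Penalty} \;\propto\; \big|\, V_R\big(\mathrm{rollout}_\pi(s_a)\big) - V_R\big(\mathrm{rollout}_\pi(s_\emptyset)\big)\, \big|.$$

With $\pi$ the inaction policy, both rollouts consist only of ∅-actions, which is the special case I am asking about.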

Attainable Utility Preservation: Scaling to Superhuman

For all auxiliary rewards. Edited the original comment.

I agree that it is likely to go wrong somewhere, but it might still be useful to figure out why. If the agent is able to predict the randomness reliably in some cases, the random baseline does not seem to help with the subagent problem.

Edit: Randomization does not seem to help, as long as the action set is large (as the agent can then arrange for most actions to make the subagent optimize the main reward).
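To make this concrete, a toy computation (a minimal sketch; the uniform random baseline, the single auxiliary reward, and all numbers are illustrative assumptions, not from the post):

```python
# Toy computation of an AUP-style penalty under a uniformly random action
# baseline. The penalty form and all numbers are illustrative assumptions.

def expected_penalty(q_aux, taken_idx, baseline_probs):
    """Expected |Q_aux(s, a_taken) - Q_aux(s, a')| over a random baseline a'."""
    q_taken = q_aux[taken_idx]
    return sum(p * abs(q_taken - q) for p, q in zip(baseline_probs, q_aux))

n_actions = 1000
# Suppose the agent arranged for all but two actions to route through the
# subagent, leaving the auxiliary Q-value unchanged at 1.0; the two
# remaining actions shift it to 5.0.
q_aux = [1.0] * (n_actions - 2) + [5.0, 5.0]
uniform = [1.0 / n_actions] * n_actions

# Taking any of the "subagent" actions is then almost penalty-free:
print(expected_penalty(q_aux, 0, uniform))  # 0.008, and -> 0 as n_actions grows
```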

Attainable Utility Preservation: Scaling to Superhuman

I wonder what happens to the subagent problem with a random action as baseline: In the current sense, building a subagent roughly works by reaching a state $s$ where

$$Q_{R_i}(s, \pi^*(s)) = Q_{R_i}(s, \emptyset)$$

for all auxiliary rewards $R_i$, where $\pi^*$ is the optimal policy according to the main reward; while making sure that there exists an action $a$ such that

$$Q_{R_i}(s, a) \neq Q_{R_i}(s, \emptyset)$$

for every $R_i$. So while building a subagent in that way is still feasible, the agent would be forced to either receive a large penalty or give the subagent random orders at $s$.
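For concreteness, the penalty I have in mind under a random-action baseline (my formalization; the post may define it differently) is

$$\text{Penalty}(s, a) \;=\; \sum_i \big|\, Q_{R_i}(s, a) - Q_{R_i}(s, a_{\mathrm{rand}})\, \big|, \qquad a_{\mathrm{rand}} \sim \mathrm{Unif}(\mathcal{A}),$$

so executing a fixed order at $s$ incurs a penalty whenever the sampled $a_{\mathrm{rand}}$ differs in auxiliary value, while copying $a_{\mathrm{rand}}$ itself (random orders) zeroes the penalty.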

There is probably a way to circumvent this again, though. Also, I am unsure about the other properties of randomized baselines.

How Low Should Fruit Hang Before We Pick It?

Where does

come from?

Also, the equation seems to imply

Edit: I focused too much on what I suppose is a typo. Clearly you can just rewrite the first and last equality as equality of an affine linear function

at two points, which gives you equality everywhere.
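Spelled out, the underlying fact (standard, independent of the post's notation): two affine functions that agree at two distinct points agree everywhere, since

$$ax_1 + b = cx_1 + d, \quad ax_2 + b = cx_2 + d \;\implies\; a(x_1 - x_2) = c(x_1 - x_2) \;\implies\; a = c,\; b = d.$$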

How Low Should Fruit Hang Before We Pick It?

I do not understand your proof for proposition 2.
