Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Summary: If exploration rates decay to zero, the obvious way of ensuring that exploration occurs infinitely often (have a trader that sells the sentence saying that exploration will happen) may fail when there are long delays before you get feedback on whether exploration happened, because the trader can go indefinitely into (possible) debt due to slow feedback. And, if the trader budgets itself, then it won't do enough trading to acquire unbounded money from the slowly decaying exploration rate.

So, Density-Zero Exploration was motivated by the following concern: a trader could mess up conditional utilities, and the way in which it did so left the trader capable of taking the same action the next turn, as detailed here. Of course, ε-exploration takes care of this issue, and further, you don't really need the ε to remain constant over time; you can have it drop as 1/n on the P-generable weighting which corresponds to the trades of the enforcer trader.

The obvious hope is that exploration would happen infinitely often along the subsequence, so any enforcer trader would lose eventually. Intuitively, that's how it works. But it's a bit harder to get than I naively thought at first.

My first attempt at proving it (modulo some finicky details about P-generable weightings and how their weights aren't always 1) was something along the lines of "in the limit, the probability of exploration on the n'th element of the subsequence will be very close to 1/n, and because of this, if there's only finitely much exploration, it's possible for a trader to get infinite money by selling stocks of the exploration sentence."
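To see why unboundedly much money looks available in expectation: if the exploration probability on the n'th element of the subsequence decays like 1/n (my reading of the decayed rate above), the expected number of exploration steps is a partial harmonic sum, which grows like ln(N) without bound. A quick sketch:

```python
import math

# Expected number of exploration steps among the first N elements of
# the subsequence, if the n-th element explores with probability ~ 1/n.
# This is the N-th harmonic number, which grows like ln(N) + 0.577...
# and is therefore unbounded.
def expected_explorations(N: int) -> float:
    return sum(1.0 / n for n in range(1, N + 1))

for N in (10, 1000, 10**6):
    print(N, expected_explorations(N), math.log(N))
```

So in expectation the trader selling the exploration sentence keeps picking up profit forever; the problem, as the next paragraphs explain, is the timing of the feedback, not the total amount available.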

However, there's a problem when you don't get immediate feedback on whether the exploration step occurred. If you have to wait a very long time to hear whether the exploration step happened, then the strategy of "sell stocks in the exploration sentence" may leave the trader unboundedly in debt.
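To make the debt problem concrete, here is a toy model (the linear delay schedule d(n) = n is my hypothetical, not from the original argument): a trader sells one share of the exploration sentence every day, but only learns the truth value for day n on day n + d(n). Its count of still-unresolved short positions, i.e. its worst-case open liability, grows linearly with time:

```python
# Worst-case open liability of a trader that shorts one share of the
# exploration sentence each day, when feedback about day n arrives
# only on day n + d(n). With the hypothetical delay d(n) = n, the
# unresolved positions on day T are the days n <= T with n + d(n) > T,
# and that count grows like T/2 -- unbounded in T.
def unresolved_positions(T: int, d=lambda n: n) -> int:
    return sum(1 for n in range(1, T + 1) if n + d(n) > T)

for T in (10, 100, 1000):
    print(T, unresolved_positions(T))
```

Since each unresolved short could in the worst case cost the trader its full value, slow feedback lets the potential debt grow without bound.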

For most of the theorems in the logical induction paper, it was acceptable to take a very long time to exploit an arbitrage opportunity, because by assumption, arbitrage opportunities occurred infinitely often. However, because the frequency of exploration on a subsequence drops as 1/n, you can't get a contradiction with the logical inductor criterion if the trader only exploits a sparse subsequence of those days.
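Here is a toy calculation of why the budgeted trader earns only a bounded amount (again with the hypothetical schedule where feedback about day n arrives on day 2n): if the trader waits for feedback before risking another share, it can only trade on days 1, 2, 4, 8, ..., and with exploration probability ~ 1/n its expected profit per trade shrinks geometrically:

```python
# A budgeted trader waits for feedback on its last trade before making
# another, so (with feedback about day n arriving on day 2n, a
# hypothetical schedule) its trading days double: 1, 2, 4, 8, ...
# With exploration probability ~ 1/n, expected profit per trade is
# about 1/day, so the total is a convergent geometric series (< 2),
# not the divergent harmonic series available to an unbudgeted trader.
def total_expected_profit(K: int) -> float:
    total, day = 0.0, 1
    for _ in range(K):
        total += 1.0 / day  # expected profit ~ exploration prob that day
        day *= 2            # can't trade again until feedback arrives
    return total

print(total_expected_profit(50))
```

The sparse subsequence of trading days thins out exactly fast enough to cancel the divergence, which is the obstruction described above.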

Therefore, you can't guarantee that infinite exploration steps occur with density-zero exploration and sparse feedback on exploration steps, if you're using the path of "ooh, there's infinite money available by selling stocks in the exploration sentence." The trader's plausible value will either be unbounded below (by selling a bunch of overpriced stocks, but selling faster than it gets feedback on whether they were worth anything or not), or bounded above (because waiting for feedback for budgeting purposes is slow enough that the trader cannot accumulate infinite money).

I still very strongly expect that on any P-generable weighting of the sequence of days, there will be infinite exploration steps, but the obvious way of showing it fails.


Update: This isn't really an issue; you just need to impose the assumption that there is some function f such that f(n) > n, f(n) is computable in time polynomial in n, and you always find out whether exploration happened on turn n by turn f(n).
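The construction the condition enables can be sketched as follows (f(n) = n² + 1 is a hypothetical example of such a polynomial-time-computable function, not one from the paper): build the subsequence by always jumping to the turn by which feedback about the previous element is guaranteed.

```python
# Given a function f with f(n) > n (here the hypothetical f(n) = n**2 + 1,
# computable in time polynomial in n), build a subsequence B by jumping
# to the turn by which feedback about the previous element has arrived.
# By construction, whether exploration happened on B[k] is known by
# turn B[k+1], so a trader restricted to B never trades blind.
def feedback_subsequence(f, start: int, length: int) -> list:
    seq = [start]
    for _ in range(length - 1):
        seq.append(f(seq[-1]))
    return seq

B = feedback_subsequence(lambda n: n * n + 1, 2, 5)
print(B)
```

Along such a B, the trader always has resolved feedback before its next trade, which is what lets the original "sell the exploration sentence" argument go through on B.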

This is just the condition that there's a subsequence where good feedback is possible, and is discussed significantly in section 4.3 of the logical induction paper.

If there's a subsequence B (of your subsequence of interest, A) where you can get good feedback, then there are infinitely many exploration steps on subsequence B (and also on A, because A contains B).

This post is hereby deprecated. Still right, just not that relevant.