The approach relies on identifying all the reward sub-spaces with this inversion property? That seems very difficult.
I don't think it's good enough to identify these spaces and place barriers in the reward function. (Analogy: perhaps SGD works precisely because it's good at jumping over such barriers.) Presumably you're actually talking about something more analogous to a penalty that increases as the action in question gets closer to step 4 in all the examples, so that there is nothing to jump over.
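To make the contrast concrete, here's a toy sketch of the two reward shapes I mean. Everything here is illustrative: the function names, the one-dimensional "action" variable, and the specific functional forms are my own assumptions, not anything from the original proposal.

```python
import math

def reward_with_barrier(x, barrier_at=4.0, width=0.1):
    """Hard barrier: a large negative reward only inside a narrow band.
    A stochastic optimizer taking finite steps can land on either side
    of the band without ever paying the penalty."""
    base = -abs(x - 10.0)  # toy objective: prefer x near 10
    if abs(x - barrier_at) < width:
        return base - 100.0
    return base

def reward_with_graded_penalty(x, step4_at=4.0, scale=5.0):
    """Graded penalty: grows monotonically as x approaches the
    dangerous region, so there is no thin wall to jump over --
    every step toward step4_at is locally discouraged."""
    base = -abs(x - 10.0)
    penalty = scale * math.exp(-abs(x - step4_at))
    return base - penalty
```

The point of the sketch: an optimizer stepping from `x = 3.8` to `x = 4.2` never sees the barrier at all, whereas under the graded penalty every point near 4 scores worse than points farther away, so the gradient itself pushes the policy out of the region.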
Even that seems insufficient, because it seems like a reasoning system smart enough to have this problem in the first place can always add a meta term and defeat the visibility constraint. E.g....