First, I want to summarize what I understand to be what your example is an example of:"A triple consisting of1) A predicate P2) the task of generating any single input x for which P(x) is true3) the task of, given any x (and given only x, not given any extra witness information), evaluating whether P(x) is true"
For such triples, it is clear, as your example shows, that the second task (the 3rd entry) can be much harder than the first task (the 2nd entry).
On the other hand, if instead one had the task of producing an exhaustive list of all x such that P(x), this, I think, cannot be easier that verifying whether such a list is correct (provided that one can easily evaluate whether x=y for whatever type x and y come from), as one can simply generate the list, and check if it is the same list.
Another question that comes to mind is: Are there predicates P such that the task of verifying instances which can be generated easily, is much harder than the task of verifying those kinds of instances?
It seems that the answer to this is also "yes": Consider P to be "is this the result of applying this cryptographic hash function to (e.g.) a prime number?". It is fairly easy to generate large prime numbers, and then apply the hash function to it. It is quite difficult to determine whether something is the hash of a prime number (... maybe assume that the hash function produces more bits for longer inputs in order to avoid having pigeonhole principle stuff that might result in it being highly likely that all of the possible outputs are the hash of some prime number. Or just put a bound on how big the prime numbers can be in order for P to be true of the hash.)
(Also, the task of "does this machine halt", given a particular verifying process that only gets the specification of the machine and not e.g. a log of it running, should probably(?) be reasonably easy to produce machines that halt but which that particular verifying process will not confirm that quickly.So, for any easy-way-to-verify there is an easy-way-to-generate which produces ones that the easy-way-to-verify cannot verify, so that seems to be another reason why "yes", though, there may be some subtleties here?)
As you know, there's a straightforward way to, given any boolean circuit, to turn it into a version which is a tree, by just taking all the parts which have two wires coming out from a gate, and making duplicates of everything that leads into that gate.I imagine that it would also be feasible to compute the size of this expanded-out version without having to actually expand out the whole thing?
Searching through normal boolean circuits, but using a cost which is based on the size if it were split into trees, sounds to me like it would give you the memoization-and-such speedup, and be kind of equivalent theoretically to using the actual trees?
A modification of it comes to mind. What if you allow some gates which have multiple outputs, but don't allow gates which duplicate their input? Or, specifically, what if you allow something like toffoli gates as well as the ability to just discard an output? The two parts of the output would be sorta independent due to the gate being invertible (... though possibly there could be problems with that due to original input being used more than once?)
I don't know whether this still has the can-do-greedy-search properties you want, but it seems like it might in a sense prevent using some computational result multiple times (unless having multiple copies of initial inputs available somehow breaks that).
It seems like the 5th sentence has its ending cut off? "it tries to parcel credit and blame for a decision up to the input neurons, even when credit and blame" , seems like it should continue [do/are x] for some x.
When you say "which yields a solution of the form f(w)=c1/(1−w)+c2", are you saying that f′(w)/f(w)=1/(1−w) yields that, or are you saying that (1−w)f′(w)−f(w)>0 yields that? Because, for the former, that seems wrong? Specifically, the former should yield only things of the form f(w)=c1/(1−w) .
But, if the latter, then, I would think that the solutions would be more solutions than that?
Like, what about g(w):=c1/(1−w)+c2⋅(1−1106+1106cos(w)) ? (where, say, c1=ε+δ and c2=−δg′(w)=c1(1−w)2+c2⋅(−1106sin(w)) . so (1−w)g′(w)−g(w)=c2⋅(−1106(1−w)sin(w))−c2⋅(1−1106+1106cos(w))=−c2⋅(1−10−6⋅(1−cos(w)−(1−w)sin(w)))which, for c2<0, and w∈[0,1], is positive, and so g should also be a solution to (1−w)f′(w)−f(w)>0 , yes?
Or, were you just giving a subset of the solutions, of the differential inequality, which were all the ones you needed to make the point?
Separately (btw, what's the convention if one has unrelated comments? to put them in one comment or separate comments?) :As for how the network could detect something about its current weights, if it is a recurrent network, this seems like something that some form of this should be learnable.
Suppose you set up the architecture for a recurrent neural network, and pick some arbitrary linear combination (but where the coefficients aren't especially big) of the weights in the network, and then for the loss function, take the main part of the network's output as doing some normal task, and then another part of the output be judged for how well it approximates the current value of the linear combination of all the network's weights, with the initially chosen coefficients. It seems plausible to me that this would do well. Of course, one way it might fail to work properly is if this linear combination ends up becoming roughly constant, or just if some weights stop changing during the training, leading to this second output not getting the signal needed to learn that part. Maybe if you used dropout and took that which weights were dropped out when evaluating the weighted sum of the weights (counting them as zero) in the loss, it could fix that issue. (dropout being done the same in each recurrence, not separately.)
Why "a linear combination of the weights" instead of the particular weights? Well, I imagine getting all the individual weights would be not particularly feasible unless the architecture was specifically designed to do that (like in the architectures for those quine-ing neural nets), but, if we are looking at a mix between "how the net should be", and "a net that is doing gradient hacking", then it seems like "some particular weighted sum of the weights", might be all that would be needed. Or perhaps two weighted sums with different coefficients if having the w1 and w2 separate.
On the other hand, if the network isn't recurrent, or if all the recurrence is done in a vector-quantization-y way, I would imagine it would be more difficult for the network to estimate a given linear combination of its current weights. (Though, if the linear combination of weights is only of weights before a certain layer, then layers after that layer might be able to approximate it without any recurrence, but, the layers that do and use this approximation could be modified without this being detected, and therefore, I'd think it couldn't protect these later parts from being changed by gradient descent, or at least, it would be harder.)
(I'm imagining, if people are trying to make this sort of thing work, that these networks are using just RELU, not like a smoothed version of RELU, so that the values of the weights are reflected more piecewise-linearly in the activations of one layer, given activations in the previous layer, which I think would facilitate estimating other linear combinations of the weights between the previous layers, and as such also for estimating this for a linear combination of weights in any layer.)
As another "why not just" which I'm sure there's a reason for:
in the original circuits thread, they made a number of parameterized families of synthetic images which certain nodes in the network responded strongly to in a way that varied smoothly with the orientation parameter, and where these nodes detected e.g. boundaries between high-frequency and low-frequency regions at different orientations.
If given another such network of generally the same kind of architecture, if you gave that network the same images, if it also had analogous nodes, I'd expect those nodes to have much more similar responses to those images than any other nodes in the network. I would expect that cosine similarity of the "how strongly does this node respond to this image" would be able to pick out the node(s) in question fairly well? Perhaps I'm wrong about that.
And, of course, this idea seems only directly applicable to feed-forward convolution networks that take an image as the input, and so, not so applicable when trying to like, understand how an agent works, probably.
(well, maybe it would work in things that aren't just a convolutions-and-pooling-and-dilation-etc , but seems like it would be hard to make the analogous synthetic inputs which exemplify the sort of thing that the node responds to, for inputs other than images. Especially if the inputs are from a particularly discrete space, like sentences or something. )
But, this makes me a bit unclear about why the "NP-HARD" lights start blinking.
Of course, "find isomorphic structure", sure.But, if we have a set of situations which exemplify when a given node does and does not fire (rather, when it activates more and when it activates less) in one network, searching another network for a node that does/doesn't activate in those same situations, hardly seems NP-hard. Just check all the nodes for whether they do or don't light up. And then, if you also have similar characterizations for what causes activation in the nodes that came before the given node in the first network, apply the same process with those on the nodes in the second network that come before the nodes that matched the closest.
I suppose if you want to give a overall score for each combination of "this sub-network of nodes in the new network corresponds to this other network of nodes in the old-and-understood network", and find the sub-network that gives the best score, then, sure, there could be exponentially many sub-networks to consider. But, if each well-understood node in the old network generally has basically only one plausibly corresponding node in the new network, then this seems like it might not really be an issue in practice?
But, I don't have any real experience with this kind of thing, and I could be totally off.
I was surprised by how the fine-tuning was done for the verbalized confidence.
My initial expectation was that it would make the loss be based on like, some scoring rule based on the probability expressed and the right answer.
Though, come to think of it, I guess seeing as it would be assigning logits values to different expressions of probabilities, it would have to... what, take the weighted average of the scores it would get if it gave the different probabilities? And, I suppose that if many training steps were done on the same question/answer pairs, then the confidences might just get pushed towards 0% or 100%?
Ah, but for the indirect logit it was trained using the "is it right or wrong" with the cross-entropy loss thing. Ok, cool.
For m:S such that m is a mesa=optimizer let Σm be the space it optimizes over, and gm:Σm→R be its utility function .
I know you said "which we need not notate", but I am going to say that for s:S and x:X , that s(x):A , and A is the space of actions (or possibly, s(x):Ax and Ax is the space of actions available in the situation x )(Though maybe you just meant that we need note notate separately from s, the map from X to A which s defines. In which case, I agree, and as such I'm writing s(x):A instead of saying that something belongs to the function space X→A . )
For m:S to have its optimization over Σm have any relevance, there has to be some connection between the chosen σ:Σm (chosen by m) , and m(x) .
So, the process by which m produces m(x) when given x, should involve the selected σ .Moreover, the selection of the σ ought to depend on x in some way, as otherwise the choice of σ is constant each time, and can be regarded as just a constant value in how m functions.
So, it seems that what I said was gm:Σm→R should instead be either gm:X×Σm→R , or gm,x:Σm,x→R (in the latter case I suppose one might say gm:∑x:X(Σm,x)→R )
Call the process that produces the action m(x):A using the choice of σ:Σm by the name hm:X×Σm→A (or more generally, hm:∑x:X(Σm,x)→A ) .
hm is allowed to also use randomness in addition to x and σ . I'm not assuming that it is a deterministic function. Though come to think of it, I'm not sure why it would need to be non-deterministic? Oh well, regardless.
Presumably whatever f:S→R is being used to select s:S , depends primarily (though not necessarily exclusively) on what s(x) is for various values of x, or at least on something which indicates things about that, as f is supposed to be for selecting systems which take good actions?
Supposing that for the mesa-optimizer m that the inner optimization procedure (which I don't have a symbol for) and the inner optimization goal (i.e. gm ) are separate enough, one could ask "what if we had m, except with gm replaced with g′m , and looked at how the outputs of hm(x,σ) and hm(x,σ′) differ, where σ and σ′ are respectively are selected (by m's optimizer) by optimizing for the goals gm , and g′m respectively?".
Supposing that we can isolate the part of how f(s) depends on s which is based on what s(x) is or tends to be for different values of x, then there would be a "how would f(m) differ if m used g′m instead of gm ?".If g′m in place of gm would result in things which, according to how f works, would be better, then it seems like it would make sense to say that gm isn't fully aligned with f ?
Of course, what I just described makes a number of assumptions which are questionable:
The first of these is also connected to another potential flaw with what I said, which is, it seems to describe the alignment of the combination of (the optimizer m uses) along with gm , with f, rather than just the alignment of gm with f.
So, alternatively, one might say something about like, disregarding how the searching behaves and how it selects things that score well at the goal gm , and just compare how hm(x,σ) and hm(x,σ′) tend to compare when σ and σ′ are generic things which score well under gm and g′m respectively, rather than using the specific procedure that m uses to find something which scores well under gm , and this should also, I think, address the issue of m possibly not having a cleanly separable "how it optimizes for it" method that works for generic "what it optimizes for".
The second issue, I suspect to not really be a big problem? If we are designing the outer-optimizer, then presumably we understand how it is evaluating things, and understand how that uses the choices of s(x):A for different x:X .
I may have substantially misunderstood your point?
Or, was your point that the original thing didn't lay these things out plainly, and that it should have?
Ok, reading more carefully, I see you wrote
I can certainly imagine that it may be possible to add in details on a case-by-case basis or at least to restrict to a specific explicit class of base objectives and then explicitly define how to compare mesa-objectives to them.
and the other things right before and after that part, and so I guess something like "it wasn't stated precisely enough for the cases it is meant to apply to / was presented as applying as a concept more generally than made sense as it was defined" was the point and which I had sorta missed it initially.
(I have no expertise in these matters; unless shown otherwise, assume that in this comment I don't know what I'm talking about.)
Is this something that the infra-bayesianism idea could address? So, would an infra-bayesian version of AIXI be able to handle worlds that include halting oracles, even though they aren't exactly in its hypothesis class?
Do I understand correctly that in general the elements of A, B, C, are achievable probability distributions over the set of n possible outcomes? (But that in the examples given with the deterministic environments, these are all standard basis vectors / one-hot vectors / deterministic distributions ?)
And, in the case where these outcomes are deterministic, and A and B are disjoint, and A is much larger than B, then given a utility function on the possible outcomes in A or B, a random permutation of this utility function will, with high probability, have the optimal (or a weakly optimal) outcome be in A?(Specifically, if I haven't messed up, if asymptotically (as |B| goes to infinity) |B|2|A|+1→0 then the probability of there being something in A which is weakly better than anything in B goes to 1 , and if |B|2|A|+1→r then the probability goes to at least e−r , I think?Coming from (|A||B|)|B|!|A|!(|A|+|B|)!=|A|!(|A|−|B|)!|A|!(|A|+|B|)!=∏|B|−1k=0|A|−k|A|+|B|−k=∏|B|−1k=0(1−|B||A|+|B|−k) )
While I'd readily believe it, I don't really understand why this extends to the case where the elements of A and B aren't deterministic outcomes but distributions over outcomes. Maybe I need to review some of the prior posts.
Like, what if every element of A was a probability distribution with over 3 different observation-histories (each with probability 1/3) , and every element of B was a probability distribution over 2 different observation-histories (each with probability 1/2)? (e.g. if one changes pixel 1 at time 1, then in addition to the state of the pixel grid, one observes at random either a orange light or a purple light, while if one instead changes pixel 2 at time 1, in addition to the pixel grid state, one observes at random either a red, green, or blue light, in addition to the pixel grid) Then no permutation of the set of observations-histories would convert any element of A into an element of B, nor visa versa.
One could create a program which hard-codes the point about which it oscillates (as well as some amount which it always eventually goes that far in either direction), and have it buy once when below, and then wait until the price is above to sell, and then wait until price is below to buy, etc.
The programs receive as input the prices which the market maker is offering.
It doesn't need to predict ahead of time how long until the next peak or trough, it only needs to correctly assume that it does oscillate sufficiently, and respond when it does.