Ah, that'd do it too.
MIRI's hiring; maybe you should email Buck.
Finally, this demon becomes so strong that the search gets stuck in a local valley and further progress stops.
I don't see why the gradient with respect to x0 ever changes, so I'm confused about why progress in the x0 direction would ever stop. Does this have to do with using a fixed step size instead of a learning rate?
[Edit: my current thought is that there looks to be periodic oscillation in the 3rd phase, which is probably an important part of the story. The gradient is mostly about pointing at the center of that well, which means the trajectory orbits that center, and x0 progress grinds to a crawl because the x0 component is a small fraction of the overall gradient. With a constant learning rate instead of a fixed step size, I think x0 progress would continue at a regular pace.]
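To make the fixed-step-size vs. learning-rate distinction concrete, here's a toy one-step comparison (the gradient values are made up for illustration, not taken from the post's actual landscape):

```python
import numpy as np

# Made-up gradient at a point in the "orbiting" phase: the x0 component
# is a small, constant fraction of the overall gradient.
g = np.array([1.0, 100.0])  # [d/dx0, d/dx1]
step = 0.01

# Fixed step size: move a constant distance along the normalized gradient,
# so x0 progress shrinks as the x1 component grows.
dx0_fixed = step * g[0] / np.linalg.norm(g)

# Constant learning rate: x0 progress is proportional to its own (constant)
# gradient, regardless of what the x1 component is doing.
dx0_lr = step * g[0]

print(dx0_fixed)  # ~0.0001 — crawls
print(dx0_lr)     # 0.01 — regular pace
```

So with a fixed step size, a large oscillating x1 gradient can starve x0 of progress even though the x0 gradient never changes.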
Also, did you use any regularization? [Edit: if so, the decrease in response to x0 might actually be present in a one-dimensional version of this, suggesting it's a very different story.]
Not particularly important, but doesn't the American civil war from a few years earlier also fit this description?
This seems right to me. My impression is that the impact of railroads was pretty one-sided in the American civil war: the North had an extensive, connected rail network that they used a bunch, while the South didn't have as much capacity to begin with and lost it quickly. In 1870, by contrast, both France and Prussia had significant rail networks (tho the Prussian one was laid out a bit more effectively for war than the French one, which had Paris as its sole hub, meaning that moving things from one part of the front line to another required a lot of backtracking).
This has the side effect that A* doesn't need to be involved.
I thought the thing A* was doing was giving a measure of "answer differently" that was more reliable than something like 'string comparison'. If B's answer is "dog" and B*'s answer is "canine", then hopefully those get counted as "the same" in situations where the difference is irrelevant and "different" in situations where the difference is relevant. If everything can be yes/no, then I agree this doesn't lose you much, but I think this reduces the amount of trickery you can detect.
That is, imagine one of those games where I'm thinking of a word, and you have to guess it, and you can ask questions that narrow down the space of possible words. One thing I could do is change my word whenever I think you're getting close, but I have to do so to a different word that has all the properties I've revealed so far. (Or, like, each time I could answer in the way that leaves me with the largest set of possible words left, maximizing the time-to-completion.) If we do the thing where B says the word, and B* gets to look at B's notes up to point X and say B's word, then the only good strategy for B is to have the word in their notes (and be constrained by it), but this is resistant to reducing it to a yes/no question. (Even if the question is something like "is there a word in B's notes?" B* might be able to tell that B will say "yes" even tho there isn't a word in B's notes; maybe because B says "hey I'm doing the strategy where I don't have a word to be slippery, but pretend that I do have a word if asked" in the notes.)
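The "answer in the way that leaves the largest set of possible words" strategy is concrete enough to sketch (the word list and the question here are made-up illustrations, not anything from the post):

```python
def slippery_answer(candidates, question):
    """Answer a yes/no question about 'my word' without ever committing to
    one: pick whichever answer leaves the larger set of consistent words."""
    yes_set = {w for w in candidates if question(w)}
    no_set = candidates - yes_set
    if len(yes_set) >= len(no_set):
        return True, yes_set
    return False, no_set

words = {"dog", "cat", "canine", "wolf", "fox", "otter"}
answer, words = slippery_answer(words, lambda w: "o" in w)
print(answer, words)  # True, and only the words containing "o" remain
```

Every answer is consistent with the properties revealed so far, so no single question ever exposes that there was never a committed word; that's why writing the word into B's notes (and letting B* read them) changes the game.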
A wins outright if B and B* answer differently.
As stated, I think this has a bigger vulnerability: B and B* just always answer the question with "yes." One nice thing about yes/no questions is that maybe you can randomly flip them (so one gets "does B think the animal is a dog?" and the other gets "does B think the animal is not a dog?") so there's no preferred orientation, which would eat the "always say yes" strategy unless the question-flipping is detectable. (Since A is the one asking the question, A might limit themselves to easily reversible questions, but this constrains their ability to clamp down on trickery.)
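A toy simulation of the flipping idea (the 50% flip probability and the trial count are my own made-up parameters): if each copy's question is independently negated, an always-yes pair ends up asserting opposite claims whenever their orientations differ.

```python
import random

random.seed(0)

def claim(responder, flip):
    """A flips the question's orientation with probability 1/2, then
    un-flips the literal yes/no answer to recover the underlying claim."""
    literal = responder(flipped=flip)  # responder sees only the (maybe flipped) question
    return literal != flip             # XOR: "yes" to the negated question means "no"

always_yes = lambda flipped: True      # the degenerate strategy described above

trials = 10_000
disagreements = sum(
    claim(always_yes, random.random() < 0.5) != claim(always_yes, random.random() < 0.5)
    for _ in range(trials)
)
print(disagreements / trials)  # the always-yes pair gets caught about half the time
```

This only works, of course, if the responders can't tell from the question's wording whether it was flipped.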
Suppose I have some active learning setup, where I decide new points to investigate based on expected uncertainty reduction, or update to the model weights, or something. Then it seems like the internals of the model could be an example of these diagnostic prediction logs being relevant without having to have humans look at them. Then it seems like there might be competition among subnetworks to have the new training examples be places where they'll do particularly well, or to somehow avoid areas where they'll do poorly.
I have a hard time making this story one where this is a bug instead of a feature, tho; in order for a subnetwork to do particularly well, it has to know something about the real data-generating distribution that the rest of the model doesn't. This only looks pathological if the thing that it knows is manufactured by the model, somehow. (Like, if I can write a fictional story and win trivia contests based on my fictional story, then I can hack points.)
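As a made-up concrete version of the acquisition step I have in mind — querying where the model's internals (here, ensemble disagreement) say uncertainty is highest — none of this is from the original post:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in "ensemble": five noisy linear models scoring candidate points.
candidates = np.linspace(-1, 1, 21)
ensemble = [lambda x, w=w: w * x for w in rng.normal(1.0, 0.5, size=5)]

# Acquisition rule: query where ensemble members disagree most (predictive
# variance as a crude proxy for expected uncertainty reduction).
preds = np.array([[m(x) for x in candidates] for m in ensemble])
next_point = candidates[np.argmax(preds.var(axis=0))]
print(next_point)
```

The point is just that the model's internal disagreements directly steer which data gets collected, which is the channel through which subnetworks could compete over future training examples.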
I interpreted it as an ensemble of expert models, weighted in a Bayesian fashion based on past performance. But because of the diagnostic logs, the type signature is a little different: the models output both probability distributions over queries / events and arbitrary text in some designated place.
Then there's a move that I think of as the 'intentional stance move', where you look at a system that rewards behavior of a particular type (when updating the weights based on past success, you favor predictions that thought an event was more likely than its competitors did), and so pretend that the things in the system "want" to do the behavior that's rewarded. [Like, even in this paragraph, 'reward' is this sort of mental shorthand; it's not like any of the models have an interior preference to have high weight in the ensemble, it's just that the ensemble's predictions are eventually more like the predictions of the models that did things that happened to lead to having higher weight.]
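Concretely, the weight-updating I have in mind looks something like Bayesian model averaging (a generic sketch; the likelihood numbers and uniform prior are made up):

```python
import numpy as np

# Each expert's probability for the event that actually occurred.
likelihoods = np.array([0.9, 0.5, 0.1])  # expert 2 thought the event unlikely
weights = np.array([1/3, 1/3, 1/3])      # prior: uniform over experts

# Bayesian update: posterior weight ∝ prior weight × likelihood of outcome.
weights = weights * likelihoods
weights /= weights.sum()
print(weights)  # experts that predicted well gain weight in the ensemble
```

No expert "wants" anything here; the ensemble's future predictions just come to resemble those of whichever experts happened to score well, which is all the intentional-stance talk is pointing at.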
On things like PredictionBook, is it easy to compare the question-asker's predictions with everyone else's? It seems like the sort of thing where I'd want to ask about facts that I'm ~50% on, while outsiders would be more extreme (because I'm more likely to be unusually confused about the question), but I'm not sure how that compares to other effects (like general overconfidence).
I wanted to note that I'm also quite worried about this when it comes to debate (or, really, most things with human models), but think this is an empirical question about human psychology / training mechanisms, where making progress on it (at this stage) looks a lot like just making generic progress. If we have a small set of people who can judge debates, that's sufficient to eventually use this; if we can identify the set of cognitive tools judges need to have in order to succeed, that's useful; but until we have a bunch of debates and throw humans at them, it's not obvious how we'll get empirical answers to empirical questions.
I would be interested in how much current work on debate is motivated by the "reason works" story where truth reliably has an edge, and how much of it is motivated by something more like computational complexity concerns (where, in the presence of optimization processes, you can apply a constraint to an execution trace and the processes generalize that constraint out to their possibility space). For the latter, you might give up on human-readable language and human evaluation of correctness without giving up on the general cognitive structure.
I added a spoiler box.