Yeah, that would be interesting, but how would we tell the difference between trivial params (I'm assuming this means the function doesn't change anywhere) and equal-loss models? Estimate this by sampling points out of distribution?
I kind of assumed that all changes in the parameters change the function, but that some regions of the loss landscape change the function faster than others? That would be my prediction.
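For concreteness, here is a minimal sketch of the "sample points out of distribution" check mentioned above. It uses a tiny hypothetical two-layer ReLU net: one parameter change is provably trivial (rescaling a hidden unit's input weights by c and its output weight by 1/c, which leaves the function unchanged everywhere), and the check measures the largest output disagreement over points sampled far outside any training range. All names here are illustrative, not from the original experiment.

```python
import numpy as np

rng = np.random.default_rng(0)

def forward(params, x):
    # Tiny two-layer ReLU network; `params` is (W1, b1, W2, b2).
    W1, b1, W2, b2 = params
    h = np.maximum(0.0, x @ W1 + b1)
    return h @ W2 + b2

# Original parameters.
W1 = rng.normal(size=(2, 4)); b1 = rng.normal(size=4)
W2 = rng.normal(size=(4, 1)); b2 = rng.normal(size=1)
theta = (W1, b1, W2, b2)

# A "trivial" change: scale hidden unit 0's incoming weights (and bias)
# by c and its outgoing weight by 1/c. ReLU is positively homogeneous,
# so the network computes exactly the same function everywhere.
c = 2.0
W1b, b1b, W2b = W1.copy(), b1.copy(), W2.copy()
W1b[:, 0] *= c; b1b = b1.copy(); b1b[0] *= c; W2b[0, :] /= c
theta_trivial = (W1b, b1b, W2b, b2)

def max_output_gap(theta_a, theta_b, n=10_000, scale=10.0):
    # Sample points well outside a typical training range ("out of
    # distribution") and report the largest disagreement between the
    # two parameterizations.
    xs = rng.uniform(-scale, scale, size=(n, 2))
    return float(np.max(np.abs(forward(theta_a, xs) - forward(theta_b, xs))))

print(max_output_gap(theta, theta_trivial))  # ~0: same function everywhere
```

An equal-loss but non-trivial change (e.g. perturbing a weight that only matters off the training distribution) would show a clearly nonzero gap under the same check, which is one way to operationalize the distinction.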
I think it was smooth-ish, although we didn't plot it. It was decelerating really slowly, but we also didn't quantify that. We could send you the notebook (bit of a mess)?
I really wanted to extend the experiment to more complicated data and bigger networks with more layers, but we didn't have time. I plan to do this after the MATS program ends in late September, if someone else hasn't done it first.
Thanks for reading!
Are you referring to this post? I hadn't read that; thanks for pointing me in that direction. I think technically my subtitle is still correct, because the way I defined priors in the footnotes covers any part of the training procedure that biases it toward some hypotheses over others. So if the training procedure is likely to be hijacked by "greedy genes", then it wouldn't count as having an "accurate prior".
I like the learning theory perspective because it allows us to mostly ignore optimization procedures, making it easier to think about things. This perspective works nicely until the outer optimization process can be manipulated by the hypothesis. After reading John's post, I think I did lean too hard on the learning theory perspective.
I didn't have much to say about deception because I considered it to be a straightforward extension of inner misalignment, but I think I was wrong; the "optimization demon" perspective is a good way to think about it.
Recently I've seen a bunch of high-status people using "inner alignment" in the more general sense, so I'm starting to think it might be too late to stick to the narrow definition. E.g. this post.
Any AI that does a task well is in some sense optimizing for good performance on that task.
I disagree with this. To me there are two distinct approaches: one is to memorize which actions did well in similar training situations, and the other is to predict the consequences of each action and somehow rank those consequences.
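The two approaches above can be made concrete with a toy example. This is a hypothetical sketch (not from either post): an agent walking on a number line toward a goal, where approach (a) is a memorized state-to-action table learned from training rollouts, and approach (b) predicts each action's consequence and ranks consequences with a score function.

```python
# Toy contrast: reaching a goal at position 5 on a number line.
GOAL = 5
ACTIONS = {"left": -1, "right": +1}

# (a) Memorize which action did well in each training situation.
# (Pretend this table was filled in from rollouts over states 0..4.)
memorized_policy = {s: "right" for s in range(GOAL)}

# (b) Predict the consequence of each action and rank the consequences.
def score(state):
    return -abs(GOAL - state)  # closer to the goal is better

def planning_policy(state):
    return max(ACTIONS, key=lambda a: score(state + ACTIONS[a]))

print(memorized_policy[2], planning_policy(2))  # right right
print(planning_policy(9))                       # left
```

The distinction shows up off-distribution: from state 9 (never seen in training) the table has no entry at all, while the consequence-ranker still picks "left" by evaluating where each action leads.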
for a sufficiently complex training objective, even a very powerful agent-y AI will have a “fuzzy” goal that isn’t an exact specification of what it should do (for example, humans don’t have clearly defined objectives that they consistently pursue).
I disagree with this, but I can't put it into clear words. I'll think more about it. It doesn't seem true for model-based RL, unless we explicitly build in uncertainty over goals. I think it's only true for humans for value-loading-from-culture reasons.
Hmm, yeah, I like your edit; it breaks down the two definitions well. I definitely have a preference for the second one. I prefer confusing terms like this to have super-specific definitions rather than broad vague ones, because it helps me to think about whether a proposed solution is actually solving the problem being pointed to. I have, like you, seen people using inner (mis)alignment to refer to other things outside the original strict definition, but as far as I know the comment I linked to is the most recent one that clarifies the definition? I haven't checked this. If there are more recent discussions involving the people who coined the term, I would defer to them.
Regarding the crux that you mention in the edit:
whether or not deciding whether an AI is an optimizer, and finding its objective, is a well-defined procedure for powerful AIs
If you mean precisely mathematically well-defined, then I think this is too high a standard. I think it is sufficient that we be able to point toward archetypal examples of optimizing algorithms and say "stuff like that".
I think the main reason I care about this distinction is that generalization error without learned optimizers doesn't seem to be a huge problem, whereas "sufficiently powerful optimizing algorithms with imperfectly aligned goals" seems like a world-ending level of problem. Do you agree with this?
I currently see inner alignment problems as a superset of generalisation error and robustness.
What would you include as an inner alignment problem that isn't a generalization problem or robustness problem?
I like all of this post except for the conclusion. I think this comment shows that the definition of inner alignment requires an explicit optimizer. Your broader definition of inner misalignment is equivalent to generalization error or robustness, which we already have a name for.
I'm glad you wrote a post about this topic. When I was first reading the sequences, I didn't find the posts by Eliezer on Induction very satisfying, and it was only after reading Jaynes and a bunch of papers on Solomonoff induction that I felt I had a better understanding of the situation. This post might have sped up that process for me by a day or two, if I had read it a year ago.
For a little while I thought Solomonoff induction was a satisfying solution to the problem of induction. But there doesn't seem to be any justification for the ordering over hypotheses in the Solomonoff prior. Is there discussion/reading about this that I'm missing?
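To make the "ordering over hypotheses" concrete, the standard construction weights each program by its length under some fixed universal prefix machine $U$ (this is the textbook definition, stated here for reference):

\[
M(x) \;=\; \sum_{p \,:\, U(p) \text{ outputs a string beginning with } x} 2^{-|p|}
\]

Each program $p$ contributes $2^{-|p|}$, so shorter programs dominate the sum; the unjustified-seeming step is exactly why description length under a particular choice of $U$ should be the right ordering, since switching universal machines reshuffles the prior by up to a (machine-dependent) constant factor.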
There are several related concepts (mostly from ML) that have caused me a lot of confusion, because of the way they overlap with each other and are often presented separately. These include Occam's razor and the problem of induction, and also "inductive bias", "simplicity", "generalisation", overfitting, model bias and variance, and the general problem of assigning priors. I'd like there to be a post somewhere explaining the relationships between these words. I might try to write it, but I'm not confident I can make it clear.