Categorising the ways that the strategy-stealing assumption can fail:
Ah, you were talking about this article. Daniel and I were saying that "Kolmogorov Complexity" never shows up in the linked SSC article (thinking that Zvi accidentally wrote "Kolmogorov Complexity" when he meant "Kolmogorov Complicity").
I can't find it either. Could you quote or screenshot?
Starting with amplification as a baseline; am I correct to infer that imitative generalisation only boosts capabilities, and doesn't give you any additional safety properties?
My understanding: After going through the process of finding z, you'll have a z that's probably too large for the human to fully utilise on their own, so you'll want to use amplification or debate to access it (as well as to generally help the human reason). If we didn't have z, we could train an amplification/debate system on D' anyway, while allowing the human and AIs to browse through D for any information that they need. I don't see how the existence of z makes amplification or debate any more aligned, but it seems plausible that it could improve competitiveness a lot. Is that the intention?
Bonus question: Is the intention only to boost efficiency, or do you think that IA will fundamentally allow amplification to solve more problems? (I.e., solve more problems with non-ridiculous amounts of compute – I'd be happy to count an exponential speedup as the latter.)
It's worth noting that their language model still uses BPEs, and as far as I can tell the encoding is completely optimised for English text rather than code (see section 2). It seems like this should make coding unusually hard compared to the pretraining task; but maybe make pretraining more useful, as the model needs time to figure out how the encoding works.
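To make the BPE point concrete, here's a toy sketch (my own illustration, not the tokenizer from the paper). It learns merges from a tiny English-only corpus and then encodes an English sample and a code sample. The corpus, the samples, and the helper names (`train_bpe`, `encode`, `chars_per_token`) are all made up for the example, and it splits on whitespace and works on characters rather than bytes, which real tokenizers don't necessarily do.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merges by repeatedly fusing the most frequent adjacent pair."""
    words = Counter(tuple(word) for word in corpus.split())
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:  # every word is already a single symbol
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        words = Counter({merge_pair(s, best): f for s, f in words.items()})
    return merges


def encode(text, merges):
    """Tokenise `text` by applying the learned merges in the order they were learned."""
    tokens = []
    for word in text.split():
        symbols = tuple(word)
        for pair in merges:
            symbols = merge_pair(symbols, pair)
        tokens.extend(symbols)
    return tokens


def chars_per_token(text, merges):
    """Average number of non-space characters covered by each token."""
    n_chars = len(text.replace(" ", ""))
    return n_chars / len(encode(text, merges))


# Hypothetical English-only "pretraining" corpus and two evaluation samples.
ENGLISH_CORPUS = "the cat sat on the mat the dog sat on the log"
ENGLISH_SAMPLE = "the cat sat on the mat"
CODE_SAMPLE = "for(i=0;i<n;i++){x[i]=y[i];}"

merges = train_bpe(ENGLISH_CORPUS, num_merges=20)
print("English chars/token:", chars_per_token(ENGLISH_SAMPLE, merges))
print("Code chars/token:   ", chars_per_token(CODE_SAMPLE, merges))
```

In this toy setup the English sample gets packed into far fewer tokens per character than the code sample, because none of the learned merges fit the punctuation-heavy code. That's the sense in which an English-optimised encoding could make code harder to model, though how large the effect is for a real vocabulary is an empirical question this sketch doesn't answer.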
I'm really surprised at how big your cards are! When I did Anki regularly, I remember getting a big ugh-feeling from cards much smaller than yours, just because there were so many things that I had to consciously recapitulate. It was also fairly common that I missed some little detail and had to choose between starting the whole card over from scratch (which is a big time sink since the card takes so much time at every repeat) or accepting that I might never remember that detail.
I'm super curious about your experience of e.g. encountering the function question. Do you try to generate both an example and a formalism, or just the formalism? Do you consciously recite a definition in words, or check some feeling of remembering what the definition is, or mumble something in your mind about how a function is a set of ordered pairs? Are the domain/range definitions just there as a reminder when you read it, or do you aim to remember them every time? Do you reset or accept if you forget to mention a detail?
Cool, seems reasonable. Here are some minor responses: (perhaps unwisely, given that we're in a semantics labyrinth)
Evan's footnote-definition doesn't rule out malign priors unless we assume that the real world isn't a simulation
Idk, if the real world is a simulation made by malign simulators, I wouldn't say that an AI accurately predicting the world is falling prey to malign priors. I would probably want my AI to accurately predict the world I'm in even if it's simulated. The simulators control everything that happens anyway, so if they want our AIs to behave in some particular way, they can always just make them do that no matter what we do.
you are changing the definition of outer alignment if you think it assumes we aren't in a simulation
Fwiw, I think this is true for a definition that always assumes that we're outside a simulation, but I think it's in line with previous definitions to say that the AI should think we're not in a simulation iff we're not in a simulation. That's just stipulating unrealistically competent prediction. Another way to look at it is that in the limit of infinite in-distribution data, an AI may well never be able to tell whether we're in the real world or in a simulation that's identical to the real world; but it would be able to tell whether we're in a simulation with simulators who actually intervene, because it would see them intervening somewhere in its infinite dataset. And that's the type of simulators that we care about. So definitions of outer alignment that appeal to infinite data automatically assume that AIs would be able to tell the difference between worlds that are functionally like the real world, and worlds with intervening simulators.
And then, yeah, in practice I agree we won't be able to learn whether we're in a simulation or not, because we can't guarantee in-distribution data. So this is largely semantics. But I do think definitions like this end up being practically useful, because convincing the agent that it's not individually being simulated is already an inner alignment issue, for malign-prior-reasons, and this is very similar.
Isn't that exactly the point of the universal prior is misaligned argument? The whole point of the argument is that this abstraction/specification (and related ones) is dangerous.
I guess your title made it sound like you were teaching us something new about prediction (as in, prediction can be outer aligned at optimum) when really you are just arguing that we should change the definition of outer-aligned-at-optimum, and your argument is that the current definition makes outer alignment too hard to achieve
I mean, it's true that I'm mostly just trying to clarify terminology. But I'm not necessarily trying to propose a new definition – I'm saying that the existing definition already implies that malign priors are an inner alignment problem, rather than an issue with outer alignment. Evan's footnote requires the model to perform optimally on everything it actually encounters in the real world (rather than asking it to do as well as it can across the multiverse, given its training data); so that definition doesn't have a problem with malign priors. And as Richard notes here, common usage of "inner alignment" refers to any case where the model performs well on the training data but is misaligned during deployment, which definitely includes problems with malign priors. And per Rohin's comment on this post, apparently he already agrees that malign priors are an inner alignment problem.
Basically, the main point of the post is just that the 11 proposals post is wrong about mentioning malign priors as a problem with outer alignment. And then I attached 3 sections of musings that came up when trying to write that :)
Things I believe about what sort of AI we want to build:
Things I believe about how to choose definitions:
Things I believe about what these candidate definitions would imply:
We want to understand the future, based on our knowledge of the past. However, training a neural net on the past might not lead it to generalise well about the future. Instead, we can train a network to be a guide to reasoning about the future, by evaluating its outputs based on how well humans with access to it can reason about the future
I don't think this is right. I've put my proposed modifications in cursive:
We want to understand the future, based on our knowledge of the past. However, training a neural net on the past might not lead it to generalise well about the future. Instead, we can train a network to be a guide to reasoning about the future, by evaluating its outputs based on how well humans with access to it can reason about the past [we don't have ground-truth for the future, so we can't test how well humans can reason about it] and how well humans think it would generalise to the future. Then, we train a separate network to predict what humans with access to the previous network would predict about the future.
(It might be a good idea to share some parameters between the second and first network.)