To disentangle the confusion I looked around at a few different definitions of the concepts. Most of them were the same kind of vague statement, but I found some useful tidbits:
Uncertainties are characterized as epistemic, if the modeler sees a possibility to reduce them by gathering more data or by refining models. Uncertainties are categorized as aleatory if the modeler does not foresee the possibility of reducing them. [Aleatory or epistemic? Does it matter?]
Which sources of uncertainty, variables, or probabilities are labelled epistemic and which are labelled aleatory depends upon the mission of the study. [...] One cannot make the distinction between aleatory and epistemic uncertainties purely through physical properties or the experts' judgments. The same quantity in one study may be treated as having aleatory uncertainty while in another study the uncertainty may be treated as epistemic. [Aleatory and epistemic uncertainty in probability elicitation with an example from hazardous waste management]
In the context of machine learning, aleatoric randomness can be thought of as irreducible under the modelling assumptions you've made. [The role of epistemic vs. aleatory uncertainty in quantifying AI-Xrisk]
[E]pistemic uncertainty means not being certain what the relevant probability distribution is, and aleatoric uncertainty means not being certain what a random sample drawn from a probability distribution will be. [Uncertainty quantification]
With this, my updated view is that our confusion is probably because there is a free parameter in where to draw the line between aleatoric and epistemic uncertainty.
This seems reasonable: more information can always lead to better estimates (at least down to considering wavefunctions, I suppose), but in most cases obtaining and using that kind of information is infeasible, so letting the distinction between aleatoric and epistemic depend on the problem at hand makes sense.
This seems to ask the question: 'Is a change in a quality of x, like colour, actually causal for the outcome y?'
Yes, I think you are right. Usually when modeling you can learn correlations that are useful for prediction, but if the correlations are spurious they might disappear when the distribution changes. As such, to know whether p(y|x) changes from observing x alone, we would probably need all causal relationships to y to be captured in x?
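As a toy sketch of that point (not from the original discussion; the variables `cause` and `spurious` are made up for illustration): a model that predicts y from a merely correlated feature does well on the training distribution but falls apart once the distribution shifts so that the correlation disappears.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Training distribution: `spurious` tracks the true cause almost perfectly,
# so a model using only `spurious` predicts y well.
n = 1000
cause = rng.normal(size=n)                       # actually causal for y
spurious = cause + rng.normal(0.0, 0.1, size=n)  # correlated with y only via `cause`
y = 2.0 * cause + rng.normal(0.0, 0.1, size=n)

model = LinearRegression().fit(spurious.reshape(-1, 1), y)
print("R^2 on the training distribution:", model.score(spurious.reshape(-1, 1), y))

# Shifted distribution: `spurious` is now independent of the cause, so the
# learned correlation no longer helps and p(y|x) has effectively changed.
cause_new = rng.normal(size=n)
spurious_new = rng.normal(size=n)
y_new = 2.0 * cause_new + rng.normal(0.0, 0.1, size=n)
print("R^2 on the shifted distribution:", model.score(spurious_new.reshape(-1, 1), y_new))
```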
Good point, my example with the figure is lacking with regard to 1, simply because we are assuming that x is known completely and that the observed y are true instances of what we want to measure. And from this I realize that I am confused about when some uncertainties should be called aleatoric or epistemic.
When I think I can correctly point out epistemic uncertainty:
My confusion from 1:
You bring up a good point in 1 and I agree that this feels like it should be epistemic uncertainty, but at some point the boundary between inherent uncertainty in the process and uncertainty from knowing too little about the process becomes vague to me, and I can't really tell whether the uncertainty about a process should be called aleatoric or epistemic.
Thanks, I realised that I provided zero context for the figure. I added some.
- is x like the input data?
- could y correspond to something like the supervised (continuous) labels of a neural network, to which inputs are matched?
Yes. The example is about estimating y given x where x is assumed to be known.
- does epistemic uncertainty here refer to the possibility that the inputs x could be quite different from the current training dataset if sampled again (where new samples could turn out to be outside of the current distribution)?
Not quite, we are still thinking of uncertainty only as applied to y. Epistemic uncertainty here refers to regions where the knowledge and data are insufficient to give a good estimate of y given x.
To compare it with your dice example, consider x to be some quality of a die such that you think dice with similar x will give similar rolls y. Then aleatoric uncertainty is high for dice where you are uncertain about the values of new rolls even after having rolled several similar dice, and rolling more similar dice will not help. Epistemic uncertainty, on the other hand, is high for dice with qualities you haven't seen enough of.
This is my go-to figure when thinking about aleatoric vs epistemic uncertainty.
Edit: In the context of the figure, the aleatoric uncertainty is high in the left cluster because the uncertainty about where a new data point will land is high and is not reduced by the number of training examples. The epistemic uncertainty is high in regions where there is insufficient data or knowledge to produce an accurate estimate of the output; it would go down with more training data in those regions.
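To make the distinction concrete, here is a small toy sketch (not the data behind the figure; the cluster locations, the RBF kernel, and the noise variance `alpha` are all made-up choices). A Gaussian process is fitted to a noisy, densely sampled left cluster and a nearly noise-free, sparsely sampled right cluster: the predictive standard deviation it reports grows where there is little or no data (epistemic), while the assumed observation noise `alpha` stands in for the irreducible spread of new points (aleatoric).

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)

# Left cluster: many points with large irreducible noise in y (high aleatoric uncertainty).
x_left = rng.uniform(0.0, 2.0, size=200)
y_left = np.sin(x_left) + rng.normal(0.0, 0.5, size=200)

# Right cluster: only a handful of nearly noise-free points.
x_right = rng.uniform(6.0, 8.0, size=10)
y_right = np.sin(x_right) + rng.normal(0.0, 0.05, size=10)

X = np.concatenate([x_left, x_right]).reshape(-1, 1)
y = np.concatenate([y_left, y_right])

# `alpha` is the assumed observation-noise variance: the aleatoric part.
# The standard deviation returned by predict() reflects how unsure the model
# is about the underlying function: the epistemic part.
gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=0.25)
gp.fit(X, y)

x_test = np.linspace(0.0, 8.0, 9).reshape(-1, 1)
_, epistemic_std = gp.predict(x_test, return_std=True)
for xt, s in zip(x_test.ravel(), epistemic_std):
    print(f"x = {xt:.1f}  epistemic std ~ {s:.2f}")  # largest in the empty region around x = 3-5
```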
This seems related to a thought I had when reading An overview of 11 proposals for building safe advanced AI: how much harder is it to find an environment that promotes aligned AGI compared to one that promotes any AGI at all?
It seems that a lot of the proposals for AGI under the current ML paradigm either use oversight to get a second chance or add an extra term to the loss function to promote alignedness. How well either of these types of methods works seems to depend on the base rate of aligned AGI relative to any AGI that can emerge from a particular model and training environment. I'm thinking of it as roughly

$$\frac{p(\text{aligned AGI} \mid M, E_{\text{base}})}{p(\text{AGI} \mid M, E_{\text{base}})},$$

where $M$ is some model and $E_{\text{base}}$ is the training environment without safeguards $S$ to detect deceptive or otherwise catastrophic behavior.
This post seems to concern
how much does the environment, compared to the model, influence the emergence of AGI?
What I'm trying to get at is that I think a related important question is
how much does the alignedness of an emerging AGI depend on its environment, compared to the model?
In around half of the equations there is an extra right parenthesis. This makes reading the equations a bit more work, as it changes the interpretation somewhat.
In most of the equations with an extra right parenthesis, I believe it is the leftmost one (of the right parentheses) that should be removed.
My own way of thinking of Occam's Razor is through model selection. Suppose you have two competing statements H1 (the witch did it) and H2 (it was chance, or possibly something other than a witch caused it; H2 = ¬H1) and some observations D (the sequence came up 0101010101). Then the preferred statement is whichever is more probable, calculated as

$$p(H_i \mid D) = \frac{p(D \mid H_i)\, p(H_i)}{p(D)}.$$

This is simply Bayes' rule, where

$$p(D \mid H_i) = \int p(D \mid \theta, H_i)\, p(\theta \mid H_i)\, \mathrm{d}\theta$$

and the model is parametrized by some parameters θ.
Now all this is just the mathematical way of writing that a hypothesis with more parameters (or, more specifically, more possible outcomes that it predicts) is not as strong as a statement that predicts a smaller set of outcomes.
In the witch example this would be

$$\frac{p(H_1 \mid D)}{p(H_2 \mid D)} = \frac{p(D \mid H_1)\, p(H_1)}{p(D \mid H_2)\, p(H_2)}.$$

Now what remains is to estimate the priors and the fraction of outcomes that look like a pattern. We can skip p(D) as we are only interested in the ratio p(H1|D) : p(H2|D).
Now, comparing the number of conditionals in the hypotheses and how surprised I am by them, I would roughly estimate the ratio of the priors as something like $2^{100}$ in favor of chance: the witch hypothesis goes against many of the beliefs about the world that I have formed over many years, it includes weird choices of living for this hypothetical alien entity, it picks me out among the many possible people in the neighborhood, and it singles out an arbitrary action of mine and an arbitrary set of outcomes.
For the sake of completeness: the fraction of outcomes that look like a pattern is kind of hard to estimate exactly. However, my way of thinking about it is to ask how soon in the sequence I would postulate the specific sequence it ended up as. After 0101, I think that 0101010101 is the most obvious pattern to continue it with. So roughly this is six bits of evidence.
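Putting the two estimates together (a prior ratio of $2^{100}$ in favor of chance and a likelihood ratio of $2^{6}$ in favor of the witch, both from the rough estimates above):

$$\log_2 \frac{p(H_1 \mid D)}{p(H_2 \mid D)} \approx \log_2\!\left(2^{-100}\right) + \log_2\!\left(2^{6}\right) = -100 + 6 = -94.$$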
In conclusion, I would say that the witch hypothesis is lacking around 94 bits of evidence for me to believe it as much as the chance hypothesis.
The downside of this approach compared to Solomonoff induction and minimum message length is that it is clunkier to use, and it is easy to forget to include conditionals or complexity in the priors, the same way they can be lost in the English language. The upside is that as a model it is simpler and less ad hoc, building directly on the product rule of probability and the fact that probabilities sum to one, and it should thus be preferred by Occam's Razor ;).
Nice write-up; I believe I have a better grasp of simulacra levels after this post.
(Think this is missing 1-3 additional roles. Discussion question, what is The Idealist?)
I'll take this as an exercise for the readers.
I'll start with a definition of how I see The Idealist: The Idealist is someone with an ideal of how people should act to bring about the best consequences in the world. In the simplest case this could be someone who believes the truth to be most important and that everyone should stay on Level 1. However, this type of idealist could ironically be seen as making a Level 2 move: "I'm telling the truth so that you will stay on Level 1 and tell the truth as well".
It becomes more complicated when the implications of the ideal affect higher simulacra levels. Consider an idealist whose ideal is that intelligence is the most important thing, and that advanced communication and progressively higher-level reasoning are the way to achieve it. In that case you value the complexity arising at Level 4, and you want people to see your group as good and to move to a higher level themselves.
What the types of idealist I can imagine have in common is that they want to affect the map so that people act in accordance with their ideal (a Level 2 move), but they also want their group to be perceived as the cool group and will say things that make undecided people move towards their group (a Level 4 move). Therefore I would say that The Idealist is a Level 2 + Level 4 player. Furthermore, The Idealist only says something if it improves both how others perceive the map and how others value the in-group (relative to the out-group).
I don't know if this is roughly what you had in mind when you thought about The Idealist, but this is my stab at the problem.
As someone just starting out on the path towards becoming an AI safety researcher, I appreciate this post a lot. I have started worrying about not having enough impact in the field if I cannot become established fast enough. However, reading this, I think it might serve me (and the field) better if I take my time and only enter the field properly if I find that my personal fit seems good and that I can stay in the field for a long time.
Furthermore, this post has helped me find possible worthwhile early projects that could increase my understanding of, and personal fit for, the field.