AI Safety Subprojects
Practical Guide to Anthropics
Anthropic Decision Theory
Subagents and impact measures
If I were a well-intentioned AI...


Different way classifiers can be diverse

Interesting; would you need an unlabelled dataset to do this, or would the lion and husky sets be sufficient?

How an alien theory of mind might be unlearnable

The theoretical argument can be found here: ; basically, "goals plus (ir)rationality" contains strictly more information than "full behaviour or policy".
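
As a toy illustration of why a single policy does not pin down "goals plus (ir)rationality", here is a minimal Python sketch. Everything in it (the states, actions, reward table and planner functions) is a hypothetical example made up for this purpose, not taken from the linked argument. It shows two different (planner, reward) pairs that yield exactly the same observable policy: a maximiser of R and a minimiser of -R.

```python
# Toy example: all names and values below are hypothetical assumptions.
ACTIONS = ["left", "right"]
STATES = ["s0", "s1"]

def reward(state, action):
    # Assumed reward: the agent "likes" right in s0 and left in s1.
    table = {
        ("s0", "right"): 1.0, ("s0", "left"): 0.0,
        ("s1", "right"): 0.0, ("s1", "left"): 1.0,
    }
    return table[(state, action)]

def negated_reward(state, action):
    # The opposite goals: whatever `reward` likes, this dislikes.
    return -reward(state, action)

def rational_planner(r):
    # Fully rational: picks the action maximising r in each state.
    return {s: max(ACTIONS, key=lambda a: r(s, a)) for s in STATES}

def anti_rational_planner(r):
    # Fully anti-rational: picks the action minimising r in each state.
    return {s: min(ACTIONS, key=lambda a: r(s, a)) for s in STATES}

policy_a = rational_planner(reward)
policy_b = anti_rational_planner(negated_reward)

# Same behaviour in every state, yet opposite goals and opposite rationality.
print(policy_a == policy_b)
```

An observer who sees only the policy cannot tell these two decompositions apart; choosing between them takes extra assumptions (the "labels") that the behaviour itself does not supply.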

Humans have a theory of mind that allows us to infer the preferences and rationality of others (and ourselves) with a large amount of agreement from human to human. In computer science terms, we can take agent behaviour and add "labels" about the agent's goals ("this human is 'happy'"; "they have 'failed' to achieve their goal", etc.).

But accessing this theory of mind is not trivial; we either have to define it explicitly, or point to where in the human mind it resides (or, most likely, a mixture of the two). One way or another, we need to give the AI enough labelled information that it can correctly infer this theory of mind; unlabelled information (i.e. pure observations) is not enough.

If we have access to the internals of the human brain, the task is easier, because we can point to various parts of it and say things like "this is a pleasure centre, this part is involved in retrieval of information", etc. We still need labelled information, but we can (probably) get away with less.

How an alien theory of mind might be unlearnable

Is that true if I change my simulation to just simulate all the particles in your brain?


The preferences of a system are an interpretation of that system, not a fact about that system. Not all interpretations are equal (most are stupid) but there is no easy single interpretation that gives preferences from brain states. And these interpretations cannot themselves be derived from observations.

How an alien theory of mind might be unlearnable

Or maybe the opportunity was only available to someone who could do a (super)intelligent follow-up to the initial opportunity.

Classical symbol grounding and causal graphs

The similarity between value extrapolation and symbol grounding (similar to how you stated it) is why I suspect that solving one may solve the other.
