Concept Extrapolation
AI Safety Subprojects
Practical Guide to Anthropics
Anthropic Decision Theory
Subagents and impact measures
If I were a well-intentioned AI...


Generalised models as a category

Thanks so much! Typos have been corrected.

GPT-3 and concept extrapolation

The aim of this post is not to catch out GPT-3; it's to see what concept extrapolation could look like for a language model.

GPT-3 and concept extrapolation

Possibly! Though it did seem to recognise that the words were spelt backwards. It must have some backwards-spelt words in its training data, just not that many.

Concept extrapolation: key posts

Thanks for that link. It does seem to correspond intuitively to a lot of the human condition. Though it doesn't really explain value extrapolation, more the starting point from which humans can extrapolate values. Still a fascinating read, thanks!

Attainable Utility Preservation: Scaling to Superhuman

To see this, imagine the AUP agent builds a subagent to make for all future , in order to neutralize the penalty term. This means it can't make the penalty vanish without destroying its ability to better optimize its primary reward, as the (potentially catastrophically) powerful subagent makes sure the penalty term stays neutralized.

I believe this is incorrect. The and are the actions of the AUP agent. The subagent just needs to cripple the AUP agent so that all actions are equivalent, then go about maximising to the utmost.

$100/$50 rewards for good references

Hey there! Sorry for the delay. $100 for the best reference. PM for more details.
