Hi there, I've been thinking a lot about AI alignment and values — the latter for longer than the former, admittedly. I'm in graduate school, where I study values through ethics, and I'd love to start a conversation about a thought that shot through my mind just last night. In thinking about values, we often focus on principles and concepts such as "good" and "bad" — most simply, the nouns and adjectives. These are hard to build consensus around even within a single language, let alone across cultural, linguistic, and geographic boundaries. In my past experience as an English teacher, conveying verbs was always easier than trying to explain things like integrity.
Here's my question: ...
Thanks for this response. I heard a similar discussion recently, with someone asking whether an algorithm's reward function was activated because it got the answer correct or because it recognized that the answer was what the programmers wanted. It's hard to tell, since the decision-making pathways aren't always transparent, especially in more complex machine learning systems.
The inner optimizer thing is really interesting; I hadn't heard it coined like that before. Is it in an AI's interest (a big assumption that it has interests at all, I know) to become so human-specific that it loses its ability to generalize? Variability in its outputs would decrease, the probabilistic mechanisms of machine learning would approach certainty, and the AI would be rendered basically ineffective.
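A toy sketch of the worry I have in mind (the data and names here are mine, purely illustrative): a model that fits its training cases with perfect certainty can be useless on anything outside them, while a model that infers a general rule handles new inputs fine.

```python
# Hypothetical training data drawn from the rule y = 2x.
train = {0: 0, 2: 4, 4: 8}

def memorizer(x):
    """Perfectly 'certain' on the training set, useless elsewhere."""
    return train.get(x)  # returns None for any unseen input

def generalizer(x):
    """A simple rule inferred from the data: y = 2x."""
    return 2 * x

# On the training set the two models are indistinguishable...
assert all(memorizer(x) == generalizer(x) for x in train)

# ...but only the generalizer can handle a new input.
print(memorizer(3), generalizer(3))  # None vs 6
```

The memorizer is maximally confident about everything it has seen and has nothing to say about anything else — which is roughly the failure mode I'm gesturing at above.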