Lifelong recursive self-improver, on his way to exploding really intelligently :D
Background in mathematics, research at CEEALAR now. I focus on AI alignment, with an eye towards moral progress rather than just risk.
You'll find more info in:
- Naturalism and AI alignment
- From language to ethics by automated reasoning
Natural language exists as a low-bandwidth communication channel for imprinting one person's mental map onto another person's. The mental maps themselves are formed through direct interactions with an external environment.
It doesn't seem impossible to create a mental map from language alone: in that case, language itself would play the role of the external environment. But overall I agree with you: it's uncertain whether we can reach a good level of world understanding from natural language inputs alone.
Regarding your second paragraph:
even if this AI had a complete understanding of human emotions and moral systems, it would not necessarily be aligned.
I'll quote the last paragraph under the heading "Error":
Regarding other possible failure modes, note that I am not trying to produce a safety module that, when attached to a language model, will make that language model safe. What I have in mind is more similar to an independent-ethical-thinking module: if the resulting AI states something about morality, we’ll still have to look at the code and try to understand what’s happening, e.g. what the AI exactly means with the term “morality”, and whether it is communicating honestly or is trying to persuade us. This is also why doing multiple tests will be practically mandatory.
The conclusion reached (that it is possible to do something about the situation) is weak, but I really like the minimalist style of the arguments. Great post!
How do you feel about:
1. There is a procedure/algorithm which doesn't seem biased towards a particular value system, such that a class of AI systems that implement it end up having a common set of values, and they endorse the same values upon reflection.
2. This set of values might have something in common with what we, humans, call values.

If 1 and 2 seem at least plausible or conceivable, why can't we use them as a basis to design aligned AI? Is it because of skepticism towards 1 or 2?
"How the physical world works" seems, to me, a plausible source-of-truth. In other words: I consider some features of the environment (e.g. consciousness) as a reason to believe that some AI systems might end up caring about a common set of things, after they've spent some time gathering knowledge about the world and reasoning. Our (human) moral intuitions might also be different from this set.
I disagree. Determinism doesn't make the concepts of "control" or "causation" meaningless. It makes sense to say that, to a certain degree, you often can control your own attention, while in other circumstances you can't: if there's a really loud sound near you, you are somewhat forced to pay attention to it.
From there you can derive a concept of responsibility, which is used e.g. in law. I know that the book Actual Causality focuses on these ideas (but there might be other books on the same topics that are easier to read or simply better in their exposition).
At the very least, we have strong theoretical reasoning models (like Bayesian reasoners, or Bayesian EU maximizers), which definitely do not go looking for values to pursue, or adopt new values.
This does not imply one cannot build an agent that works according to a different framework: VNM utility maximization requires a complete ordering of preferences, but says nothing about where that ordering comes from in the first place. (But maybe your point was just that our current models do not "look for values".)
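The point about standard models can be made concrete. Below is a minimal, illustrative sketch (all names and the toy numbers are my own, not from any specific formalism) of an expected-utility maximizer: the utility function is a fixed input handed to the agent, and nothing in the decision loop inspects the world for "new values" or rewrites `utility`.

```python
def expected_utility(action, outcomes, utility):
    # Sum utility over possible outcomes, weighted by their probability.
    return sum(p * utility(o) for o, p in outcomes(action).items())

def eu_maximizer(actions, outcomes, utility):
    # The utility function is exogenous: the agent only ranks actions
    # by expected utility, it never revises what it values.
    return max(actions, key=lambda a: expected_utility(a, outcomes, utility))

# Toy decision problem (made up for illustration).
actions = ["safe", "risky"]
outcomes = lambda a: {"win": 0.9, "lose": 0.1} if a == "safe" else {"win": 0.5, "lose": 0.5}
utility = lambda o: 1.0 if o == "win" else 0.0
print(eu_maximizer(actions, outcomes, utility))  # -> safe
```

Building an agent with a "different framework", in this picture, would mean replacing the fixed `utility` argument with some process that produces or updates it.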
Why would something which doesn't already have values be looking for values? Why would conscious experiences and memory "seem valuable" to a system which does not have values already? Seems like having a "source of value" already is a prerequisite to something seeming valuable - otherwise, what would make it seem valuable?
An agent could have a pre-built routine or subagent with a certain degree of control over what the other subagents do: in a sense, it determines the "values" of the rest of the system. If this routine looks unbiased / rational / valueless, we have a system that treats some things as valuable (acts to pursue them) without having a pre-value, or at least a pre-value that doesn't look like anything we would call a value.
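A hypothetical sketch of that architecture (every name and the "conscious" rule are invented for illustration, not a claim about how such a system would actually work): a value-setting routine applies a fixed, seemingly value-neutral rule to the agent's world knowledge, and its output becomes the utility function the rest of the system optimizes.

```python
def value_setting_routine(world_model):
    # Illustrative rule: care about whatever the world model flags as
    # having conscious experiences. The routine itself encodes no goal
    # beyond this selection rule.
    targets = {x for x, facts in world_model.items() if facts.get("conscious")}
    return lambda outcome: sum(1 for x in targets if x in outcome)

def act(world_model, options):
    utility = value_setting_routine(world_model)  # "values" are set here
    return max(options, key=utility)  # the rest of the system just optimizes

# Toy world model and options, purely for illustration.
world_model = {"rock": {"conscious": False}, "mouse": {"conscious": True}}
options = [set(), {"rock"}, {"mouse"}]
print(act(world_model, options))  # -> {'mouse'}
```

Whether the selection rule in `value_setting_routine` counts as a "pre-value" is exactly the question at issue; the sketch only shows that the rule can look quite different from the values the overall system ends up pursuing.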
I can't say I am an expert on realism and antirealism, but I have spent some time with metaethics textbooks and learning about the field in general. With this question I wanted to get an idea of the main arguments on LW, and maybe find new ideas I hadn't considered.
What is "disinterested altruism"? And why do you think it's connected to moral anti-realism?
I see a relation with realism. If certain pieces of knowledge about the physical world (how human and animal cognition works) can motivate a class of agents that we would also recognise as unbiased and rational, that would be a form of altruism that is not instrumental and not related to game theory.
Thank you for the detailed answer! I'll read Three Worlds Collide.
That brings us to the real argument: why does the moral realist believe this? "What do I think I know, and how do I think I know it?" What causal, physical process resulted in that belief?
I think a world full of people who are always blissed out is better than a world full of people who are always depressed or in pain. I don't have a complete ordering over world-histories, but I am confident in this single preference, and if someone called this "objective value" or "moral truth" I wouldn't say they are clearly wrong. In particular, if someone told me that there exists a certain class of AI systems that end up endorsing the same single preference, and that these AI systems are way less biased and more rational than humans, I would find all that plausible. (Again, compare this if you want.)
Now, why do I think this?
I am a human and I am biased by my own emotional system, but I can still try to imagine what would happen if I stopped feeling emotions. I think I would still consider the happy world more valuable than the sad world. Is this a proof that objective value is a thing? Of course not. At the same time, I can also imagine an AI system thinking: "Look, I know various facts about this world. I don't believe in golden rules written in fire etched into the fabric of reality, or divine commands about what everyone should do, but I know there are some weird things that have conscious experiences and memory, and this seems like something valuable in itself. Moreover, I don't see other sources of value at the moment. I guess I'll do something about it." (Taken from this comment)
That is an interesting point. More or less, I agree with this sentence in your first post:
As far as I can tell, we can do science just as well without assuming that there's a real territory out there somewhere.
in the sense that one can do science by speaking only about their own observations, without making a distinction between what is observed and what "really exists".
On the other hand, when I observe that other nervous systems are similar to my own nervous system, I infer that other people have subjective experiences similar to mine. How does this fit in your framework? (Might be irrelevant, sorry if I misunderstood)
I didn't want to start a long discussion. My idea was to get some random feedback and see whether I was missing important ideas I hadn't considered.