Michele Campolo

Lifelong recursive self-improver, on his way to exploding really intelligently :D

More seriously: my posts are mostly about AI alignment, with an eye towards moral progress and creating a better future, rather than towards risk only.

At the moment I am doing research at CEEALAR on agents whose behaviour is driven by a reflective process analogous to human moral reasoning, rather than by a metric specified by the designer. I'll probably post a short article on this topic before the end of 2023.

Here are some suggested readings from what I've written so far:

-Naturalism and AI alignment
-From language to ethics by automated reasoning
-Criticism of the main framework in AI alignment

To a kid, 'bad things' and 'things my parents don't want me to do' overlap to a large degree. This is not true for many adults. This is probably why the step

suffering is "to be avoided" in general, therefore suffering is "thing my parents will punish for"

seems weak.

Overall, what is the intention behind your comments? Are you trying to understand my position even better, and if so, why? Are you interested in funding this kind of research; or are you looking for opportunities to change your mind; or are you trying to change my mind?

I don't know what passes your test of 'in principle be an inherently compelling argument'. It's a toy example, but here are some steps that to me seem logical / rational / coherent / right / sensible / correct:

  1. X is a state of mind that feels bad to whatever mind experiences it (this is the starting assumption, it seems we agree that such an X exists, or at least something similar to X)
  2. X, experienced on a large scale by many minds, is bad
  3. Causing X on a large scale is bad
  4. When considering what to do, I'll discard actions that cause X, and choose other options instead.

Now, some people will object and say that there are holes in this chain of reasoning, i.e. that 2 doesn't logically follow from 1, or 3 doesn't follow from 2, or 4 doesn't follow from 3. For the sake of this discussion, let's say that you object to the step from 1 to 2. Then, what about this replacement:

  1. X is a state of mind that feels bad to whatever mind experiences it [identical to original 1]
  2. X, experienced on a large scale by many minds, is good [replaced 'bad' with 'good']

Does this passage from 1 to 2 seem, to you (our hypothetical objector), equally logical / rational / coherent / right / sensible / correct as the original step from 1 to 2? Could I replace 'bad' with basically anything, and the correctness would not change at all as a result?

My point is that, to many reflecting minds, the replacement seems less logical / rational / coherent / right / sensible / correct than the original step. And this is what I care about for my research: I want an AI that reflects in a similar way, an AI to which the original steps do seem rational and sensible, while replacements like the one I gave do not.

we share an objective reality in which there are real particles (or wave function approximately decomposable to particles or whatever) organized in patterns, that give rise to patterns of interaction with our senses that we learn to associate with the word "dog". That latent shared reality ultimately allow us to talk about dogs, and check whether there is a dog in my house, and usually agree about the result.

Besides the sentence 'check whether there is a dog in my house', it seems ok to me to replace the word 'dog' with the word 'good' or 'bad' in the above paragraph. Agreement might be less easy to achieve, but it doesn't mean finding a common ground is impossible.

For example, some researchers classify emotions according to valence, i.e. whether it is an overall good or bad experience for the experiencer, and in the future we might be able to find a map from brain states to whether a person is feeling good or bad. In this sense of good and bad, I'm pretty sure that moral philosophers who argue for the maximisation of bad feelings for the largest number of experiencers are a very small minority. In other terms, we agree that maximising negative valence on a large scale is not worth doing.

(Personally, however, I am not a fan of arguments based on agreement or disagreement, especially in the moral domain. Many people in the past used to think that slavery was ok: does it mean slavery was good and right in the past, while now it is bad and wrong? No, I'd say that normally we use the words good/bad/right/wrong in a different way, to mean something else; similarly, we don't normally use the word 'dog' to mean e.g. 'wolf'. From a different domain: there is disagreement in modern physics about some aspects of quantum mechanics. Does it mean quantum mechanics is fake / not real / a matter of subjective opinion? I don't think so.)

I might be misunderstanding you: take this with a grain of salt.

From my perspective: if convergence theorems did not work to a reasonable degree in practice, nobody would use RL-related algorithms. If I set reward in place A, but by default agents end up going somewhere far away from A, my approach is not doing what it is supposed to do; I put reward in place A because I wanted an agent that would go towards A to a certain extent.
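To make the "reward in place A" picture concrete, here is a minimal sketch, assuming a toy 1-D corridor and plain tabular Q-learning. This is my choice for illustration only: it is not PPO or TRPO, and the function names, corridor size, and hyperparameters are all hypothetical.

```python
import random


def train_q_learning(n_states=6, goal=5, episodes=500,
                     alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Tabular Q-learning on a 1-D corridor.

    The only reward (1.0) is given on entering `goal`, i.e. "place A".
    Returns the learned table q[state][action], action 0 = left, 1 = right.
    """
    rng = random.Random(seed)
    moves = (-1, 1)  # action 0 steps left, action 1 steps right
    q = [[0.0, 0.0] for _ in range(n_states)]
    for _ in range(episodes):
        state = 0
        for _ in range(100):  # cap on episode length
            # Epsilon-greedy choice; ties are broken at random so the
            # untrained agent has no built-in preference for going right.
            if rng.random() < epsilon or q[state][0] == q[state][1]:
                action = rng.randrange(2)
            else:
                action = 0 if q[state][0] > q[state][1] else 1
            # Walls: the agent cannot leave the corridor.
            nxt = min(max(state + moves[action], 0), n_states - 1)
            done = nxt == goal
            reward = 1.0 if done else 0.0
            # Standard Q-learning target; no bootstrapping at the goal.
            target = reward if done else reward + gamma * max(q[nxt])
            q[state][action] += alpha * (target - q[state][action])
            state = nxt
            if done:
                break
    return q


def greedy_path(q, start=0, goal=5, max_steps=20):
    """Follow the greedy policy from `start` and return the visited states."""
    path = [start]
    state = start
    for _ in range(max_steps):
        if state == goal:
            break
        action = 0 if q[state][0] > q[state][1] else 1
        state = min(max(state + (-1, 1)[action], 0), len(q) - 1)
        path.append(state)
    return path


if __name__ == "__main__":
    q_table = train_q_learning()
    print(greedy_path(q_table))
```

In this tabular setting the usual convergence results for Q-learning apply under standard conditions, and the greedy policy should walk straight to the rewarded state. With function approximation those guarantees can fail, which is the case the question below is about.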

I am not familiar with PPO. From this short article, in the section about TRPO:

Recall that due to approximations, theoretical guarantees no longer hold.

Is this what you are referring to? But is it important for alignment? Let's say the conditions for convergence are not met anymore, the theorem can't be applied in theory, but in practice I do get an agent that goes towards A, where I've put reward. Is it misleading to say that the agent is maximising reward?

(However, keep in mind that

I agree with Turner that modelling humans as simple reward maximisers is inappropriate.)


If you could unpack your belief "There aren't interesting examples like this which are alignment-relevant", I might be able to give a more precise/appropriate reply.

If I had to pick one between the two labels 'moral realism' and 'moral anti-realism' I would definitely choose realism.

I am not sure about how to reply to "what is the meaning of moral facts": it seems too philosophical, in the sense that I don't get what you want to know in practice. Regarding the last question: I reason about ethics and morality by using similar cognitive skills to the ones I use in order to know and reason about other stuff in the world. This paragraph might help:

It also helps explain how we get to discriminate between goals such as increasing world happiness and increasing world suffering, mentioned in the introduction. From our frequent experiences of pleasure and pain, we categorise many things as ‘good (or bad) for me’; then, through a mix of empathy, generalisation, and reflection, we get to the concept of ‘good (or bad) for others’, which comes up in our minds so often that the difference between the two goals strikes us as evident and influences our behaviour (towards increasing world happiness rather than world suffering, hopefully).

I do not have a clear idea yet of how this happens algorithmically, but an important factor seems to be that, in the human mind, goals and actions are not completely separate, and neither are action selection and goal selection. When we think about what to do, sometimes we do fix a goal and plan only for that, but other times the question becomes about what is worth doing in general, what is best, what is valuable: instead of fixing a goal and choosing an action, it's like we are choosing between goals.

Sorry for the late reply, I missed your comment.

It sounds to me like the claim you are making here is "the current AI Alignment paradigm might have a major hole, but also this hole might not be real".

I didn't write something like that because it is not what I meant. I gave an argument whose strength depends on other beliefs one has, and I just wanted to stress this fact. I also gave two examples (reported below), so I don't think I mentioned epistemic and moral uncertainty "in a somewhat handwavy way".

An example: if you think that futures shaped by malevolent actors using AI are many times more likely to happen than futures shaped by uncontrolled AI, the response will strike you as very important; and vice versa if you think the opposite.

Another example: if you think that extinction is way worse than dystopic futures lasting a long time, the response won't affect you much—assuming that bad human actors are not fans of complete extinction.

Maybe your scepticism is about my beliefs, i.e. you are saying that it is not clear, from the post, what my beliefs on the matter are. I think presenting the argument is more important than presenting my own beliefs: the argument can be used, or at least taken into consideration, by anyone who is interested in these topics, while my beliefs alone are useless if they are not backed up by evidence and/or arguments. In case you are curious: I do believe futures shaped by uncontrolled AI are unlikely to happen.

Now to the last part of your comment:

I'm furthermore unsure why the solution to this proposed problem is to try and design AIs to make moral progress; this seems possible but not obvious. One problem with bad actors is that they often don't base their actions on what the philosophers think is good

I agree that bad actors won't care. Actually, I think that even if we do manage to build some kind of AI that is considered superethical (better than humans at ethical reasoning) by a decent number of philosophers, very few people will care, especially at the beginning. But that doesn't mean it will be useless: at some point in the past, very few people believed slavery was bad; now it is a common belief. How much will such an AI accelerate moral progress, compared to other approaches? Hard to tell, but I wouldn't throw the idea in the bin.

Sorry for the late reply, I missed your comment.

Yeah I get it, probably some moral antirealists think this approach to alignment does not make a lot of sense. I think they are wrong, though. My best guess is that an AI reflecting on what is worth doing will not think something like "the question does not make any sense", but rather it will be morally (maybe also meta-morally) uncertain. And the conclusions it eventually reaches will depend on the learning algorithm, the training environment, initial biases, etc.

Natural language exists as a low-bandwidth communication channel for imprinting one person's mental map onto another person's. The mental maps themselves are formed through direct interactions with an external environment.

It doesn't seem impossible to create a mental map just from language: in this case, language itself would play the role of the external environment. But overall I agree with you, it's uncertain whether we can reach a good level of world understanding just from natural language inputs.

Regarding your second paragraph:

even if this AI had a complete understanding of human emotions and moral systems, it would not necessarily be aligned.

I'll quote the last paragraph under the heading "Error":

Regarding other possible failure modes, note that I am not trying to produce a safety module that, when attached to a language model, will make that language model safe. What I have in mind is more similar to an independent-ethical-thinking module: if the resulting AI states something about morality, we’ll still have to look at the code and try to understand what’s happening, e.g. what the AI exactly means with the term “morality”, and whether it is communicating honestly or is trying to persuade us. This is also why doing multiple tests will be practically mandatory.

The reached conclusion—that it is possible to do something about the situation—is weak, but I really like the minimalist style of the arguments. Great post!

How do you feel about:

  1. There is a procedure/algorithm which doesn't seem biased towards a particular value system, such that a class of AI systems that implement it end up having a common set of values, and they endorse the same values upon reflection.
  2. This set of values might have something in common with what we, humans, call values.

If 1 and 2 seem at least plausible or conceivable, why can't we use them as a basis to design aligned AI? Is it because of skepticism towards 1 or 2?
