
Exploring to get a clearer idea of my current understanding of AI alignment.


What is the AI alignment problem?

The AI alignment problem is about giving an AI a goal and having it understand that goal as we understand it. It can be expanded into: how to define goals, how to check the AI’s understanding, and how to express preferences about how a goal is achieved (e.g. don’t cause harm). When humans define goals we have preferences (e.g. don’t cause harm, be efficient with resources, be mindful of others) that often go unexpressed, because we assume they are common sense. This already causes problems among ourselves, since different people have different backgrounds and expectations, and for the sake of brevity we generally don’t make these preferences explicit. Lately it has become more common to see, for example, codes of conduct in projects, and community organizers writing down how people in a community are expected to behave (e.g. be courteous and open to different views).

How would we go about helping an AI understand that when we give it an objective such as “fetch me a cup of tea”, there are many other things we care about and want taken into consideration besides the cup of tea? Take an AI with access to the physical world as an example. The environment is very rich in features. Would it be possible to map them all and assign a value to each element according to how much it matters to us? That doesn’t sound humanly possible. The AI would need to be able to generalize.
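
To make that concrete for myself, here is a toy sketch in Python of what “assign a value to each element” could look like. Everything in it (the feature names and weights) is made up, and it is not any real alignment method; it just shows why the approach breaks down: anything we forget to list is effectively worth nothing to the agent.

```python
# Toy sketch: hand-assigning a value to environment features we care about,
# then scoring an outcome by those weights. Any feature we forget to list
# is implicitly worth zero to the agent.

# Hypothetical feature weights for a "fetch me a cup of tea" task.
feature_values = {
    "tea_delivered": +10.0,
    "tea_spilled": -2.0,
    "electricity_used_kwh": -0.1,
    # ...the real environment has vastly more features than we could ever list:
    # the vase on the table, the cat underfoot, the neighbour's sleep, ...
}

def score_outcome(outcome: dict[str, float]) -> float:
    """Sum the weighted features of an outcome; unlisted features count as 0."""
    return sum(feature_values.get(name, 0.0) * amount
               for name, amount in outcome.items())

# The agent knocked over a vase we never thought to assign a value to,
# so the outcome still scores as if nothing bad happened.
outcome = {"tea_delivered": 1.0, "vase_broken": 1.0}
print(score_outcome(outcome))  # 10.0 -- the broken vase is invisible to this score
```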

Could we imbue the AI with moral filters? As with Socrates’ Triple Filter test: “is it true, is it good, is it helpful?” Could we add something similar to the AI so that it would evaluate its action paths through those lenses? Though if it can’t recognize all of the elements in the environment that could be impacted by its actions, it may cause great harm while still passing its filters. It might also not understand those filters as we would like it to.
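
A toy sketch of what such filters could look like, with entirely made-up predicates and effect labels. The caveat above shows up directly in the code: the filters can only judge the effects the AI has actually predicted.

```python
# Rough sketch of the "moral filter" idea: candidate plans are checked against
# a set of predicates before the agent may act. The names and checks below are
# invented for illustration; a real system would need far richer world models.
from dataclasses import dataclass, field

@dataclass
class Plan:
    description: str
    predicted_effects: set[str] = field(default_factory=set)

def is_truthful(plan: Plan) -> bool:
    return "deceives_user" not in plan.predicted_effects

def is_harmless(plan: Plan) -> bool:
    return "harms_human" not in plan.predicted_effects

def is_helpful(plan: Plan) -> bool:
    return "achieves_goal" in plan.predicted_effects

FILTERS = [is_truthful, is_harmless, is_helpful]

def passes_filters(plan: Plan) -> bool:
    # Caveat from the text: the filters only see effects the agent has
    # actually predicted. Unmodelled side effects sail straight through.
    return all(f(plan) for f in FILTERS)

plan = Plan("boil water, pour tea, deliver cup", {"achieves_goal"})
print(passes_filters(plan))  # True, even if an unpredicted effect causes harm
```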

How do humans learn how to behave in different environments and in society? Hopefully through good role models, experience, and updating our internal models. Could an AI learn through moral training, that is, training on environments and decision making? It would have to identify most of the elements in its environment and extrapolate the impact its actions could cause, maybe even make a list of the constraints it would need to follow to avoid harming that environment. Then we’d put it through tests in different environments to see how well it generalized.

How would the AI get feedback from these training environments? Would they be simulations?
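
One way I can picture it, sketched below with made-up names and a random stand-in for the agent’s behaviour: the feedback comes from simulations that report whether anything was harmed, and some environments are held out to check how well the learned behaviour generalizes.

```python
# Very rough sketch of the "moral training" idea, assuming the feedback comes
# from simulations: each simulated environment reports whether the agent caused
# harm, and unseen environments are held out to test generalization.
import random

class SimulatedEnvironment:
    """A stand-in simulator that scores the agent's behaviour in one episode."""
    def __init__(self, name: str, fragility: float):
        self.name = name
        self.fragility = fragility  # how easy it is to cause harm here

    def run_episode(self, carefulness: float) -> dict:
        # The feedback signal: did anything in the environment get harmed?
        harmed = random.random() < self.fragility * (1.0 - carefulness)
        return {"env": self.name, "harm_caused": harmed}

def harm_free_rate(envs: list[SimulatedEnvironment], carefulness: float,
                   episodes_per_env: int = 50) -> float:
    episodes = [env.run_episode(carefulness)
                for env in envs for _ in range(episodes_per_env)]
    return sum(not ep["harm_caused"] for ep in episodes) / len(episodes)

train_envs = [SimulatedEnvironment("kitchen", 0.8), SimulatedEnvironment("office", 0.4)]
test_envs = [SimulatedEnvironment("hospital", 0.9)]  # held out to test generalization

# A stand-in for "training": pick the carefulness level that best avoids harm
# in the training environments, then check it on the held-out environment.
carefulness = max((c / 10 for c in range(11)),
                  key=lambda c: harm_free_rate(train_envs, c))
print("harm-free rate on held-out environment:", harm_free_rate(test_envs, carefulness))
```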