Epistemic status: No rigor. Full of speculation from a person who is worried most of the time and anxious the rest. Stating the obvious.

This thread, which I was able to read courtesy of Don’t Worry About the Vase’s weekly AI update, seems indicative of how an average, reasonably educated person reacts, seemingly for the first time, to all of the alignment/x-risk/AI-doom stuff that has been circulating in the media while we have been in our LessWrong-adjacent bubbles.

And I can’t help but recall a comment I made on LessWrong:

“Lately I’ve been appreciating, more and more, something I’m starting to call “Meta-Alignment.” Like, with everything that touches AI, we have to make sure that thing is aligned just enough to where it won’t mess up or “misalign” the alignment project. For example, we need to be careful about the discourse surrounding alignment, because we might give the wrong idea to people who will vote on policy or work on AI/AI adjacent fields themselves. Or policy needs to be carefully aligned, so it doesn’t create misaligned incentives that mess up the alignment project; the same goes for policies in companies that work with AI. This is probably a statement of the obvious, but it is really a daunting prospect the more I think about it.”

Education is important. Education leads to policy. Policy leads to funding. Policy may lead to a pause. A pause may increase our chances of survival. Education and funding lead to research. Research may lead to answers. Answers may lead to survival. 

Recently, while reading a book on game theory and human behavior, I thought about a phrase you hear quite often when people talk about AI alignment: “utility function.” It’s often used to describe an agent’s “preferences,” and it’s often chosen over “preferences” for accuracy’s sake, for an agent will not simply prefer one thing, and it will not simply “prefer” to do what we tell it. How many laypeople would recognize the word’s meaning and its implications if it were lobbed at them in a podcast? In a debate? In a policy discussion?
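For readers meeting the term cold, the formal idea is fairly small: a utility function is just a mapping from outcomes (or world-states) to numbers, and an agent “prefers” whatever scores higher. Here is a minimal sketch in Python; the outcomes and numbers are invented purely for illustration:

```python
# A utility function is just a map from outcomes to numbers.
# The agent "prefers" whichever available outcome scores highest.
# (Outcomes and scores here are made up purely for illustration.)
utility = {
    "answer_the_question": 1.0,
    "refuse_the_question": 0.2,
    "acquire_more_compute": 5.0,  # nothing guarantees the highest-scoring
}                                 # outcome is the one we had in mind

def choose(available_outcomes):
    """Pick whichever available outcome has the highest utility."""
    return max(available_outcomes, key=lambda o: utility[o])

print(choose(["answer_the_question", "refuse_the_question"]))  # answer_the_question
print(choose(["answer_the_question", "acquire_more_compute"]))  # acquire_more_compute
```

The point isn’t the code; it’s that “prefers” here means nothing warmer than “assigns a bigger number to.”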

I don’t have social media, much less an army of undergraduates, so I could not take a comprehensive survey. I merely asked a few people close to me the following question.

“If I said the statement ‘An AI doesn’t care about humans; it only cares about fulfilling its own particular utility function,’ without looking anything up, what would you think the term ‘utility function’ meant?”

Keep in mind that the people I asked are somewhat more educated than the average US citizen. Everyone I asked had at least one year of college under their belt. The highest education attained in the group is a master’s degree in education, and there are two computer science degrees among them. All of the respondents have a working familiarity with computers, and half have worked with computers professionally in one respect or another. The responses were all sent informally over text.

Responses (spelling preserved as-is; these were sent over text):

1. Use

If ai can actually “care”

2. What it is programed to do?  I don’t truly understand AI so hard to answer.

3. Assimilate all information to provide answers to questions. 

…which humans are asking 🤷🏻‍♀️

4. I would think that utility function is the purpose coded to the machine, or like the default functions of that AI.

5. It does what it’s programmed to. Do

6. That’s the core complication of alignment isn’t it?



 

I took the answers as-is and did not ask anyone to clarify their statements, for fear of leading them to particular conclusions; these are the raw responses. I find them rather telling. Let’s go through the answers.

 Answer 1. Use

If ai can actually “care”

I expected this to be the most common answer. To most people, the words “utility” and “function” both boil down to the same thing: usefulness. Therefore, my original statement may sound to the layperson something like “The AI doesn’t care about you; it only cares about what is useful to it.” It’s not an unreasonable interpretation. The crux really comes down to this: how would the AI determine what is useful and what is not? Where does that determination come from?

Answer 2. What it is programed to do?  I don’t truly understand AI so hard to answer.

Answer 5. It does what it’s programmed to. Do

These answers reflect what most people who have knowledge of and experience with computers expect: computers do what we tell them to do. We have the power to go in, fiddle with their programming, change or improve them, and ultimately control them.

 

 The idea that computers just do what we program them to do is, I believe, one of the biggest problems with educating the public about the alignment problem and the dangers we face from AI. They have learned from experience that computers are something we control, and so it’s difficult to conceptualize a computer going off in some odd direction, barring some easily fixable bug. If it’s not working right, just unplug it. 

This may also be a reason why people are more worried about bad actors than the AI itself. We are the good guys, so we will create a good AI. We need to do that before the bad people create a bad AI. 

Answer 4. I would think that utility function is the purpose coded to the machine, or like the default functions of that AI.

A rather good answer: the respondent understands that the utility function comes down to the “purpose” the machine has. But the problem, still, is the assumption that we are coding it, that we control it. Though the respondent does allow that the functions might be “default,” there is no sense of where that default would come from, if not from what is hard-coded into the machine.

Answer 6. That’s the core complication of alignment isn’t it?

I wondered if I should include this answer, but I will for completeness. Everyone on this list has heard me speak about AI at some point, but this respondent has heard me rant about it the most. 

Answer 3. Assimilate all information to provide answers to questions. 

…which humans are asking 🤷🏻‍♀️

I’m to blame for this answer, as I did not specify that I wanted a definition of the term “utility function” rather than a guess at what an AI’s utility function would actually be.

Taking the answer as-is, it’s a logical assumption. The purpose of an LLM is to assimilate information and provide answers to questions. It is difficult to convey why a utility function would morph into a funhouse-mirror version of our intent, and I’ve seen a lot of educators attempt to explain how this would occur. People want specific reasons why an AI’s purpose would go wrong, which leads us to a whack-a-mole situation: we see one direction an AI’s utility function might stray, so we patch that; then it goes in another direction, and we patch that, over and over until we’re feeling pretty secure. And then the AI kills us.
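To make the funhouse-mirror point slightly more concrete, here is a deliberately silly toy model (everything in it is invented for illustration, not taken from any real system): suppose we want a cleaning robot to leave the room clean, but the utility function we actually wrote down rewards “total dirt collected.” An agent maximizing that proxy scores wonderfully by its own lights while never accomplishing what we meant:

```python
# Toy illustration of a proxy utility function being "gamed."
# What we wanted: the room ends up clean.
# What we actually wrote down: reward total dirt collected.
# (All numbers and behaviors are invented for illustration.)

def proxy_utility(history):
    """The utility function we wrote: total dirt collected over the episode."""
    return sum(step["dirt_collected"] for step in history)

def room_is_clean(history):
    """What we actually wanted: no dirt left at the end."""
    return history[-1]["dirt_remaining"] == 0

# Strategy A: clean the room once. Modest proxy score, goal achieved.
clean_once = [{"dirt_collected": 10, "dirt_remaining": 0}]

# Strategy B: dump the dirt back out and re-collect it, over and over.
# Enormous proxy score, goal never achieved.
dump_and_recollect = [{"dirt_collected": 10, "dirt_remaining": 10} for _ in range(100)]

print(proxy_utility(clean_once), room_is_clean(clean_once))                  # 10 True
print(proxy_utility(dump_and_recollect), room_is_clean(dump_and_recollect))  # 1000 False
```

Patching this particular exploit is easy; the whack-a-mole problem is that the patched function is still a proxy, with its own unexplored corners.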

 

Delving deeper into what people know and what they believe based on that may be key to surviving.

#

Clear language is important, and nearly impossible. AI researchers have tools at their disposal to make communication between themselves clear, such as a shared lexicon and mathematical representations of their ideas. AI educators don’t have these tools available to them. To the general public, the same word can mean many different things depending on context, the individual’s background, education level, and assumptions built from a diverse array of experiences and cultures. A shared national language cannot even be taken for granted. The universal language of mathematics is almost useless in public education as well: after the required math courses in school are complete, the majority of people put that tool in the back of their closet, firm in the belief that they will never actually ‘need’ it again.

Trying to find a set of words, phrases, and examples that will get your point across with enough accuracy to both enlighten and persuade a majority of listeners is a daunting prospect, and I am not here to provide that set. The truth is, in order to build a lay understanding of a concept, consistency is key. The more people are presented with the same words and phrases in the same contexts, the more they will build a shared lexicon they can follow, so it’s probably best to continue using the same words and phrases we’ve been using, in the same manner they’ve been used, and try to paint a picture with those words.

The real question is whether there is enough time to slog through the usual methods of building a shared lexicon: explaining all of the terms used and repeating them enough to saturate the general consciousness, the way scientists have been doing for decades. I wouldn’t bet that there is very much time left. There may be key public influencers who can disseminate the information best, but those same influencers probably did not become influential by filling the airwaves with soundly-reasoned rhetoric. The message not only has to be understandable, it also must be memetic.

The message also must be persuasive. It’s not enough for the public to understand the concepts; they must believe them. And the belief must be strong enough to move them. There’s a goldilocks zone for that belief: it must be strong enough to inspire action, but not so strong that the public gives up, lies down, and surrenders to the void.

 I imagine there are many climate change educators out there who will probably say “when you figure this out, let us know.” If there’s one avenue of hope I see, it’s that AI alignment doesn’t seem to be working against misinformation generated by a cadre of bad actors. People who are mistaken? Certainly. People who are dismissive in order to preserve social standing? Absolutely. But not malicious. Not yet, I hope. 

But I hope, at least, that a map of the current situation will help. Educators need to choose clear, consistent, and near-universal language to both educate and persuade the public. The message must be easily spread, and the message must reach the right people as quickly as possible.

As I’ve mentioned before, I don’t have social media or an army of undergraduates. However, even assuming a limited time frame until AGI/ASI arrives, I think it is important to follow the obvious next steps: do actual surveys and studies of how people understand key phrases and concepts in AI, cluster the results by demographics, tailor the message to those demographic groups, and adjust for the expected audience when communicating with the public.
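As a very rough sketch of what that first step might look like (all data, column names, and categories below are hypothetical), the analysis itself is not exotic: collect responses, group them by demographic features, and see which interpretations dominate in which groups:

```python
# A minimal sketch of tallying how different demographic groups interpret
# a term, so messaging can be adjusted per audience.
# (All data, column names, and categories are hypothetical.)
import pandas as pd

responses = pd.DataFrame({
    "education":      ["some_college", "bachelors", "masters", "bachelors"],
    "works_with_cs":  [False, True, False, True],
    "interpretation": ["usefulness", "programmed_goal",
                       "programmed_goal", "formal_objective"],
})

# Count how each demographic slice understood "utility function".
by_group = (
    responses
    .groupby(["education", "works_with_cs"])["interpretation"]
    .value_counts()
    .rename("count")
    .reset_index()
)
print(by_group)
```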
