Agent Foundations, Orthogonality Thesis, AI
Frontpage

Thou shalt not command an aligned AI

by Martin Vlach
11th May 2025
2 min read
4 comments, sorted by top scoring
Robert Cousineau · 4mo

Here is a copy-edited version from Claude:

Sorry, You Should Not Command the Aligned AI
By Martin Vlach, Benjamin Schmidt
May 11, 2025
2 min read

Benjamin slumps in his chair, visibly tired. "I don't think we even know what alignment is. We can't even define it properly."

I straighten up across the table at the Mediterranean restaurant. "I disagree. Give me three seconds and I can define it."

"Fine," he says after a pause.

"Can we narrow it to alignment of AI to humans?" I ask.

"Yes, let's narrow it to alignment of one AI to one person."

"The AI is aligned if you give it a goal and it pursues that goal without modifying it with its own intentions or goals."

Benjamin frowns. "That sounds far too abstract."

"In what sense?"

"Like the goal—what is that, more precisely?"

"A state of the world you want to achieve, or a series of states."

"But how would you specify that?"

"You can describe it in infinitely many ways. There's a scale of detail you can choose, which implies a level of approximation of the state."

"That won't describe the state completely, though?"

"Well, maybe if you could describe to the quantum state level, but that's obviously impractical."

"So then the AI must somehow interpret your goal, right?"

"Not exactly, but you mean it would have to interpolate to fill in the under-specified parts of your goal description?"

"Yes, that's a good way to put it."

"Then what we've discovered is another axis, orthogonal to alignment, which controls to what level of under-specification we want the AI to interpolate versus where it needs to ask you to fill in gaps before pursuing your goal."

"We can't be saying 'Create a picture of a dog' and then need to specify each pixel."

"Of course not. But perhaps the AI should ask whether you want the picture on paper or digitally, using a reasonable threshold for necessary clarification."

"People want things they don't actually need though..."

"And they can end up in a bad state even with an aligned AI."

"So how do you make alignment guarantee good outcomes? People are stupid..."

"And that's on them. You can call it incompetence, but I'd call it misuse."​​​​​​​​​​​​​​​​

Martin Vlach · 4mo

You mean the chevrons like this are non-standard, but also sub-standard, although they have the neat property of representing >Speaker one< and >>Speaker two<<? I can see the typography of those here is meh at best.

Robert Cousineau · 4mo

I personally have not seen that style of writing dialogue before, and did not recognize that was what you were doing until reading this comment from you. It, along with the typos, made it difficult for me to understand, so I had Claude copy-edit it for me (and then figured maybe someone else would find that useful).

Robert Cousineau · 4mo

In response to what I understand to be your question ("So what do you do to make the alignment guarantee good outcomes? People are stupid.."), I think one commonly accepted answer here is: 

Yes, that is a real problem.  Something like CEV offers a solution (with a spherical cow, in a vacuum).  

There is also a useful differentiation to be made between Inner Alignment and Outer Alignment.


Raymond is tired. He exhales, exhausted: >>I don't think we even know what alignment is, like we are not able to define it.<<

I hop up on my chair in the Mediterranean restaurant: >I disagree, if you give me 3 seconds, I can define it.<

>>---<< 

>Can we narrow it to alignment of AI to humans?<

>>Yes, let's narrow it to alignment of one AI to one person.<<

>Fine. The AI is aligned if you give it a goal and it pursues that goal without modifying it with its own intentions or goals.<

>>That sounds way too abstract...<<

>Yeah, but in what sense do you mean?<

>>Like the goal, what is that, more precisely?<< 

>That is a state of the world you want to achieve or a series of states of the world.<

>>Oh, but how would you specify that?<<

>You can specify it, describe it, in infinitely many ways; there is a scale of how detailed a description you choose, which will imply a level of approximation of the state.<

>>Oh, but that won't describe the state completely...?<<

>Well, maybe if you could describe it down to the quantum state level, but surely that is not practical.<

>>So then the AI must somehow interpret your goal, right?<< 

>Ehmmm, well no, but you mean it would have to interpolate to fill in the under-specified spots in the description of your goal...?<

>>Yes, that is a good expression for what would need to happen.<< 

>Then what we've discovered here is another axis, orthogonal to alignment, which would control to what level of under-specification we want the AI to interpolate and where it would need to ask you to fill in the gaps (more) before moving towards your goal.<

>>Oh, but we also can't be like "Create a picture of a dog" and then need to specify each pixel.<<

>Sure. But maybe the AI must ask you whether you want the picture on paper or digitally on your screen, with a reasonable threshold for clarification.<

>>Hmm, but people want things they do not have...<<

>and they can end up in a state they feel bad in, even with an aligned AI.<

>>So what do you do to make the alignment guarantee good outcomes? People are stupid...<<

>and that's on them. You can call it incompetence, but I'd call that misuse.<
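
A minimal sketch, in Python, of the "reasonable threshold for clarification" idea from the exchange above. Everything here is hypothetical and only illustrates the orthogonal axis we discussed: `GoalSpec`, `interpretation_spread`, and `ASK_THRESHOLD` are made-up names, and the spread measure is a crude stand-in for however one would actually quantify how under-specified a goal description still is.

```python
# Hypothetical illustration only: the names and the crude "spread" measure are
# invented for this sketch; this is not an actual alignment mechanism.
from dataclasses import dataclass


@dataclass
class GoalSpec:
    description: str           # the user's (under-specified) goal description
    open_questions: list[str]  # aspects the description leaves unresolved


# The orthogonal axis from the dialogue: how much residual under-specification
# the AI may fill in on its own before it has to ask the user instead.
ASK_THRESHOLD = 0.2


def interpretation_spread(goal: GoalSpec) -> float:
    """Crude proxy for how under-specified the goal still is (0 = fully pinned down)."""
    return min(1.0, len(goal.open_questions) / 10)


def next_step(goal: GoalSpec) -> str:
    """Either ask the user to fill a gap or interpolate the rest and proceed."""
    if interpretation_spread(goal) > ASK_THRESHOLD:
        return f"ask the user: {goal.open_questions[0]}"
    return "interpolate the remaining details and pursue the goal"


if __name__ == "__main__":
    dog_picture = GoalSpec(
        description="Create a picture of a dog",
        open_questions=["on paper or digitally?", "which breed?", "what style?"],
    )
    print(next_step(dog_picture))  # -> ask the user: on paper or digitally?
```

Sliding `ASK_THRESHOLD` up or down moves the AI along that axis, toward more autonomous interpolation or toward more clarification questions, without changing whether it is aligned in the sense defined above.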