Hi there,

I am new to both AI and computer programming in general, and have been teaching myself about AI alignment. I am wondering if there is a difference between AI misalignment and "bad" or uninformed programming.

My understanding of AI misalignment is that it arises more or less from a failure of the programmer to communicate precisely what they want the AI to do. After all, if a computer behaves only according to how it is programmed, then I would think that all AI misalignment problems could be traced back to the code running the AI.

For example, imagine I build a computer program that learns to extract identifying information about website users and share it with other users. However, I forget to encode a safety feature that stops criminals from accessing the information. Now, my program is sharing sensitive information with criminals. Does that mean my program is misaligned, or is it just poorly written?


2 Answers

It's sort of like blaming bad music on "bad note-playing." After all, if the band had just played better notes, the music would have been good.

One reason blaming bad music on bad note-playing sounds weird is that we expect a diagnosis to tell us something about how to fix the problem.

When it takes an hour of sitting at the computer and doing "normal programming" to fix a program that wasn't doing what you wanted, we call it a programming problem.

When getting the program to do what you want takes talking to the stakeholders, we call it a specification problem.

When it takes doing original research on speed and output quality, and maybe standing at a whiteboard doing math, we call it an algorithm problem.


People doing research on algorithms would probably be similarly perplexed by the question "is a slow algorithm just bad programming?"

There are two unresolved issues with alignment:

  • we do not know how to give a sufficiently precise definition (plenty of naive and not so naive attempts, but all have major flaws),
  • even if we had a definition, we have no way of building an AI that would actually follow it reliably (today's AIs are not so much programmed as discovered by an almost blind, semi-random search, and while they are often "good enough", they are never exactly right, and their failure modes are fairly unpredictable).

And yes, even if we had answers to the above two issues, we'd still need to make sure the code implementing them is programmed well. But compared to those two issues, "programming well, once you are willing to spend 100x cost/LoC" is much closer to being a solved problem (using techniques such as formal verification).
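
To make the first issue a bit more concrete, here is a toy sketch in Python. Everything in it is invented for illustration (the `Policy` class, the candidate behaviours, the scores); it is not how any real AI system is built. The point is only that the search code below has no bug in the ordinary sense: it faithfully maximizes the objective that was written down, and the trouble is that the written objective is not quite what the designer meant.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    """A hypothetical candidate behaviour the search can pick from."""
    name: str
    tests_passed: int           # what the written objective measures
    actually_solves_task: bool  # what the designer really cares about


# Invented candidates and scores, purely for illustration.
CANDIDATES = [
    Policy("solve the task properly", tests_passed=8, actually_solves_task=True),
    Policy("hard-code the expected test outputs", tests_passed=10, actually_solves_task=False),
    Policy("do nothing", tests_passed=0, actually_solves_task=False),
]


def written_objective(policy: Policy) -> int:
    """The objective as specified: count passing tests, nothing else."""
    return policy.tests_passed


def search(candidates, objective):
    """A correct, if trivial, optimizer: return the highest-scoring candidate."""
    return max(candidates, key=objective)


best = search(CANDIDATES, written_objective)
print(best.name)
print("Does it do what we wanted?", best.actually_solves_task)
```

No amount of debugging `search` changes the outcome here. The fix has to happen in `written_objective`, and stating that objective precisely enough is the part nobody currently knows how to do for the goals we actually care about.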