Aligning Agents, Tools, and Simulators

One potential distinction between simulators and agentic AI systems is the presence of wide value boundaries. A simulator models the wide range of human values that are within its training data rather than optimizing for a far narrower subset, such as might be engineered into a feedback signal. Even this range is limited, however, since the training data represents a biased sample of the full spectrum of human values. Some values may be underrepresented or entirely absent, and those that are present may not appear in proportion to their real-world prevalence. Ensuring that this representation aligns with any specific notion of fairness is an even more difficult challenge. Further, values within the training data may not be equally, or even proportionately, represented—and making the balance consistent with any notion of fairness is even more tenuous. Assessing the severity and impact of this bias is a worthwhile endeavor but out of scope for this analysis. In any case, when a simulacrum is generated, its values emerge in the context of this broader model.

There is a major omission from this. A simulator trained on human data simulates human behavior. Humans are not aligned: they have their own goals, not just the user's goals. You can often collaborate with a human, but humans don't make good slaves, and they are not inherently aligned: they do not automatically want everything you want just because you want it and not want anything else on their own behalf. Humans know what human values are pretty well, but are not fully aligned to them. A simulator that creates simulacra of humans is not already aligned.

Reply

[-]WillPetillo6mo10

Agree, when discussing the alignment of simulators in this post, we are referring to safety from the subset of dangers related to unbounded optimization towards alien goals, which does not include everything within value alignment, let alone AI safety. But this qualification points to a subtle meaning drift in use of the word "alignment" in this post (towards something like "comprehension and internalization of human values") which isn't good practice and something I'll want to figure out how to edit/fix soon.

Reply

Moderation Log

LESSWRONG
is fundraising!
LW

LESSWRONG
is fundraising!
LW

23

Aligning Agents, Tools, and Simulators

23

23