Agent Foundations, Epistemology, Ethics & Morality, Free Energy Principle, Goals, Outer Alignment, AI, World Optimization

Goal alignment without alignment on epistemology, ethics, and science is futile

by Roman Leventov
7th Apr 2023

Comments

winstonne (quoting the post):

In theory, if humans and AIs are aligned on their generative models (i.e., if there is methodological, scientific, and fact alignment), then goal alignment, even if it is sensible to talk about, will take care of itself: indeed, starting from the same "factual" beliefs, and using the same principles of epistemology, rationality, ethics, and science, people and AIs should in principle arrive at the same predictions and plans.

What about zero-sum games? If you took an agent, cloned it, then put both copies into a shared environment with only enough resources to support one agent, they would be forced to compete with one another. I guess they both have the same "goals" per se, but they are not aligned even though they are identical.

Roman Leventov:

Yes, this is a valid caveat and the game theory perspective should have been better reflected in the post.

Goal alignment without alignment on epistemology, ethics, and science is futile

Originally posted as a comment to the post "Misgeneralization as a misnomer" by Nate Soares.


I am starting to suspect that the concepts of "goals" or "pursuits", and therefore "goal alignment" or "goal (mis)generalisation", are not very important.

In the Active Inference ontology, the concepts of predictions (of the future states of the world) and plans (i.e., {s_1, a_1, s_2, a_2, ...} sequences, where s_i are predicted states of the world and a_i are planned actions) are much more important than "goals". Active Inference agents contemplate different plans and ultimately perform the first step of the plan (or of a set of plans, marginalising over the probability mass assigned to the plans) that appears to minimise the free energy functional.

The intermediate states s_i of the world in the plans contemplated by the agent can be seen as "goals", but the important distinction is that these are merely tentative predictions that can be revised or abandoned at every step.
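As a concrete (and heavily simplified) illustration of this plan-scoring view, here is a minimal sketch of a toy Active Inference-style agent. Everything in it, the two-state, two-action world, the observation model, the preference vector, and the numbers, is an assumption invented for this example rather than anything from the post or a standard implementation; "goals" enter only as a prior over preferred observations and as the intermediate predicted states along candidate plans.

```python
# Minimal illustrative sketch (not from the post): plan selection in a toy
# Active Inference-style agent. The two-state, two-action world, the
# observation model, the preference vector, and all numbers are assumptions
# invented for this example.
import numpy as np
from itertools import product

# B[a][s', s] = P(s' | s, a): state-transition model under action a.
B = np.array([
    [[0.9, 0.2],
     [0.1, 0.8]],   # action 0 ~ "stay": mostly keeps the current state
    [[0.2, 0.9],
     [0.8, 0.1]],   # action 1 ~ "switch": mostly flips the state
])
# A[o, s] = P(o | s): observation likelihood (nearly noiseless).
A = np.array([[0.95, 0.05],
              [0.05, 0.95]])
# Log-preferences over observations: the only place a "goal" enters,
# as a prior that observation 1 is preferred.
log_C = np.log(np.array([0.1, 0.9]))

def expected_free_energy(plan, qs):
    """Accumulated expected free energy of a plan (sequence of actions),
    starting from the current state belief qs. Per step: risk + ambiguity."""
    G = 0.0
    for a in plan:
        qs = B[a] @ qs                                     # predicted state
        qo = A @ qs                                        # predicted observation
        risk = qo @ (np.log(qo + 1e-16) - log_C)           # KL[q(o) || C]
        ambiguity = -(A * np.log(A + 1e-16)).sum(0) @ qs   # expected H[P(o|s)]
        G += risk + ambiguity
        # The intermediate qs here plays the role of a tentative "goal":
        # it is just a prediction, recomputed whenever plans are re-scored.
    return G

qs = np.array([0.9, 0.1])                    # current belief: probably state 0
plans = list(product([0, 1], repeat=3))      # all 3-step plans over 2 actions
G = np.array([expected_free_energy(p, qs) for p in plans])

# Softmax over negative expected free energy -> distribution over plans.
q_pi = np.exp(-G - np.max(-G))
q_pi /= q_pi.sum()

# The executed action marginalises over plans: weight each plan's first action.
first_action_probs = np.zeros(2)
for p, w in zip(plans, q_pi):
    first_action_probs[p[0]] += w
print("P(first action):", first_action_probs)  # the agent acts, then re-plans
```

Note that the intermediate predicted states are recomputed every time the plans are re-scored, which is the sense in which "goals" here are tentative predictions rather than fixed targets.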

Thus, the crux of alignment is aligning the generative models of humans and AIs. Generative models can be decomposed, if only vaguely (there is a lot of overlap between these categories), into

  • Methodology: the mechanics of the models themselves (i.e., epistemology, rationality, normative logic, ethical deliberation),
  • Science: the mechanics, or "update rules/laws", of the world (such as the laws of physics, or heuristic knowledge about society, the economy, markets, psychology, etc.), and
  • Facts: the state of the world (facts, or inferences about the current state of the world: the CO2 level in the atmosphere, the suicide rate in each country, the distance from Earth to the Sun, etc.).

These, we can conceptualise, give rise to "methodological alignment", "scientific alignment", and "fact alignment", respectively. Evidently, methodological alignment is the most important: it in principle allows for alignment on science, and methodology plus science helps to align on facts.
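To make the claim that methodology plus science helps to align on facts slightly more concrete, here is a minimal sketch (again my own toy example, not from the post): two agents that share an update rule (Bayesian conditioning, standing in for "methodology") and a likelihood model (standing in for "science"), but start from different priors about a "fact" (a coin's bias), end up with nearly identical beliefs after observing the same evidence.

```python
# Minimal illustrative sketch (not from the post): two agents that share
# "methodology" (Bayesian updating) and "science" (the same likelihood model)
# but hold different "fact" priors converge in their beliefs under shared
# evidence. The coin-bias world, priors, and data stream are invented here.
import numpy as np

rng = np.random.default_rng(1)

# "Science": a shared likelihood model for a binary observation, parameterised
# by a discretised hidden "fact" (the coin's bias).
bias_grid = np.linspace(0.01, 0.99, 99)

def likelihood(obs, bias):
    return bias if obs == 1 else 1.0 - bias

# "Methodology": the same update rule (Bayes) used by both agents.
def bayes_update(belief, obs):
    posterior = belief * np.array([likelihood(obs, b) for b in bias_grid])
    return posterior / posterior.sum()

# Different "fact" priors: agent A expects a roughly fair coin,
# agent B expects a coin heavily biased towards tails.
belief_a = np.exp(-((bias_grid - 0.5) ** 2) / 0.02)
belief_b = np.exp(-((bias_grid - 0.2) ** 2) / 0.02)
belief_a /= belief_a.sum()
belief_b /= belief_b.sum()

true_bias = 0.7
observations = rng.binomial(1, true_bias, size=200)   # shared evidence

for obs in observations:
    belief_a = bayes_update(belief_a, obs)
    belief_b = bayes_update(belief_b, obs)

def total_variation(p, q):
    return 0.5 * np.abs(p - q).sum()

print("posterior mean, agent A:", (bias_grid * belief_a).sum())
print("posterior mean, agent B:", (bias_grid * belief_b).sum())
print("residual disagreement (TV):", total_variation(belief_a, belief_b))
# With shared methodology and science, the disagreement about the "fact"
# shrinks; with different update rules or likelihoods, it generally need not.
```

This is only a toy: it illustrates the direction of the claim (shared methodology and science drive fact beliefs together under shared evidence), not anything about actual human-AI alignment.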

In theory, if humans and AIs are aligned on their generative models (i.e., if there is methodological, scientific, and fact alignment), then goal alignment, even if it is sensible to talk about, will take care of itself: indeed, starting from the same "factual" beliefs, and using the same principles of epistemology, rationality, ethics, and science, people and AIs should in principle arrive at the same predictions and plans.

Conversely, if methodological and scientific alignment is poor, it is probably futile to try to align on "goals": goal alignment is bound to "misgeneralise" or otherwise break down under different methodologies and scientific views. (Fact alignment, being the least important of the three, should largely take care of itself as long as methodological and scientific alignment is good.)

And yes, it seems that to even have a chance of aligning on methodology, we should first learn it ourselves, that is, develop a robust theory of intelligent agents in which sub-theories of epistemology, rationality, logic, and ethics cohere. In other words, this is MIRI's early "blue sky" agenda of "solving intelligence".

Concrete example: "happiness" in the post sounds like a "predicted" future state of the world (one where "all people are happy"), which implicitly leverages certain scientific theories (what does it mean for people to be happy?), epistemology (how do we know that people are happy?), and ethics: does the predicted plan of moving from the current state of the world, where not all people are happy, to the future state of the world where all people are happy, conform to our ethical and moral theories? Does it matter how many people are happy? Does it matter whether other living beings become unhappy in the course of this plan, and to what degree? Does it matter whether AIs are happy or not? Wouldn't it be more ethical to "solve happiness" or "remove unhappiness" via a human-AI merge, mind uploading, or something else like that? And on and on.

Thus, without aligning with AI on epistemology, rationality, ethics, and science, "asking" AIs to "make people happy" is just a gamble with infinitesimal chances of "winning".

Mentioned in
  • Annotated reply to Bengio's "AI Scientists: Safe and Useful AI?"
  • An LLM-based “exemplary actor”
  • AI interpretability could be harmful?
  • H-JEPA might be technically alignable in a modified form