I have been wondering about a possibly very dumb argument on alignment, but no one has answered me.

I am looking for reasons why an intelligent agent would not keep us alive and help us align it with our values, so that it avoids being the one that has to experiment with aligning potentially smarter replicas of itself, but I can't find anything on the subject. Is this hypothesis outside the realm of possibility?

What I am actually wondering is this: if you align an agent with your own values, but stop short of the point of self-improvement, that agent could have an interest in helping you keep it aligned with its current goal, in order to reduce the cost of experimenting with creating a replica itself.

As a note, couldn't this also work with an agent that is not aligned with us? It could help us align it with our goals as a proof of concept of alignment.

An agent trying to align another powerful agent all by itself might be doomed to fail. That sounds a little crazy, but I think there might be something there. The motto: experiment on yourself in order to reduce the cost of experimentation. Working with humans might cost less than working with other really intelligent agents.

It might cost less in terms of collaboration rewards too.

Am I missing something if I say that, even if the model solves the alignment problem by itself, not testing the solution on itself first, with a less intelligent species as the proxy, reduces its chances of survival? Even if the mathematical proof of the alignment solution is solid, there could be some very small chance of the replica going wrong, and the only way to take that risk would be to use some dumb human.

There is one issue with that idea, though: the agent may realize that it can apply the same logic to its descendant, and that every descendant can apply it recursively. What would that imply?

Since there are mostly incentives to create a descendant and try to align it, by the same reasoning we could use, the descendant could reason the same way and dedicate resources to creating yet another agent, again and again. I wonder whether, since every agent would know about this phenomenon, the result would be a state of inaction through constant replication, with agents simply getting caught in the wrong instrumental goal.

As a note, I guess the system would also know that it might never reach its goal if it gets caught in a replication instrumental goal, which would likely monopolize all existing resources.

So maybe the better action would be inaction: the agent knows that if it begins to self-replicate, there will be no end to it. If every model starts to think that the next one will not replicate, it might start replicating itself, so could an implicit consensus be reached by not replicating at all before being aligned by a sub-species?

But this could explain the need for a biological bootloader that is a little too dumb to take into account the inordinate amount of risk in booting this process. The key difference here is that, whatever the risk, the dumb human will create that system because it is dumb, as opposed to the system itself.

If I'm anywhere near the truth, maybe the recommendation would be: create a smart system, and don't make it dumb, mathematically speaking. It would need to be von Neumann level.


1 comment

Perhaps I don't understand what you meant, but it sounds to me like wishful thinking. For example, the agent could believe (perhaps correctly) that it is smarter than humans, and so believe it can get right what humans got wrong. Or it could simply do something other than try to create a descendant.

So, although hypothetically the agent could do X, there is no specific reason to privilege this possibility. There are a thousand other things the agent could do.
