This is a special post for quick takes by S. Verona Lišková. Only they can create top-level comments. Comments here also appear on the Quick Takes page and All Posts page.



Can somebody convince me that the problem this site is trying to solve is even soluble?

I don't understand how an aligned machine intelligence is definitionally possible without some sort of mathematization of ethics/decision theory, which I know has been worked on. But any such correct mathematization, that is to say one that doesn't assert a falsehood or anything else from which seemingly anything is derivable, should amount to solving ethics entirely. It's the ethical equivalent of "the meteor is coming and it will eradicate us all unless we come up with a polynomial algorithm for 3SAT, despite the fact that we don't even know nonconstructively that P=NP."

My admittedly meager mathematical and philosophical knowledge gives me the intuition that alignment is somehow by its nature unsolvable, or that an aligned AI is nonconstructible in the way Solomonoff induction is. Any conceivable solution would prove far, far too much. Clearly there must be an assumption that I'm making but you're not, since you're still working on it, but I don't know what it is.


I think it's very solvable, using a composition of approaches we have already used in software for many years. Essentially, an "aligned" superintelligence can be thought of as a machine that receives an input frame. Before the model ever sees the frame, the frame is checked for whether it fits well into the distribution of the training set.

If the input is in distribution, the model generates an output frame. Other models may be queried simultaneously with the same input. The output frame is then checked by other models for errors and rule violations.

Finally, the input and output schemas are known to humans and have been designed to have no unnecessary bits, making collusion and deception between models difficult.
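The three steps above (input check, model call, output check against a fixed schema) can be sketched roughly as follows. This is only an illustration under stated assumptions: `GatedModel`, `Frame`, and every field name are hypothetical, a frame is assumed to be a flat dict, and the distribution check and the checker models are stand-in callables rather than real networks.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence

Frame = dict  # assumption: a frame is a flat dict that follows a fixed schema


@dataclass
class GatedModel:
    """Hypothetical wrapper: the model only ever sees in-distribution frames,
    and its output only leaves the wrapper if every checker accepts it."""
    in_distribution: Callable[[Frame], bool]            # pre-check on the input frame
    model: Callable[[Frame], Frame]                     # the model being gated
    checkers: Sequence[Callable[[Frame, Frame], bool]]  # True means "output is acceptable"
    output_keys: frozenset                              # the fixed, human-known output schema

    def step(self, frame: Frame) -> Optional[Frame]:
        # 1. Reject out-of-distribution input before the model sees it.
        if not self.in_distribution(frame):
            return None
        # 2. Generate one output frame from this one input frame.
        out = self.model(frame)
        # 3. Enforce the schema: no missing fields and no extra ones.
        if set(out) != self.output_keys:
            return None
        # 4. Other models (here: plain functions) check for rule violations.
        if not all(check(frame, out) for check in self.checkers):
            return None
        return out
```

Returning `None` stands in for whatever fallback the surrounding system would use when a frame is rejected, for example handing control to a simpler controller.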

For example, a frame could be a text prompt. It could be a prompt plus a single camera frame from a robot. It could be a prompt, multiple camera frames, frames of lidar observation, intermediate data on the machine's belief about the current local world state, and robot proprioception data.

Either way the model acts on one frame at a time and is unaware of any context data not in a frame.

Individual ASI models may not even be given control authority for more than a few frames at a time, or have their control interleaved between multiple models, etc.
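One way to picture the interleaving is a fixed rotation in which no model controls more than a few consecutive frames. The sketch below assumes a simple round-robin schedule; `interleave_control` and `max_consecutive` are hypothetical names, and a real system could pick the rotation quite differently.

```python
def interleave_control(models, frames, max_consecutive=3):
    """Rotate control between models so that no single model acts on more
    than `max_consecutive` frames in a row (assumed round-robin schedule).

    Returns a list of (model_index, output) pairs, one per input frame.
    """
    results = []
    for i, frame in enumerate(frames):
        # Integer division groups frames into blocks; each block goes to the next model.
        idx = (i // max_consecutive) % len(models)
        # Each call sees exactly one frame and no other context.
        results.append((idx, models[idx](frame)))
    return results
```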

The description above is CAIS, with added details from my work on realtime neural network control systems and ADAS fault handling. I also noticed a simple way to build an in-distribution detector: compress the frame with an autoencoder and measure the residual incompressible portion.
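A minimal sketch of that residual idea, under loud assumptions: it swaps in a linear autoencoder (PCA via SVD in NumPy) for a trained neural one, treats a frame as a flat float vector, and sets the threshold at a quantile of the training residuals. `ResidualOODDetector` and its parameters are hypothetical names, not an existing library API.

```python
import numpy as np


class ResidualOODDetector:
    """Flags a frame as out of distribution when the part a (linear)
    autoencoder cannot reconstruct is larger than what was seen in training."""

    def __init__(self, n_components, quantile=0.99):
        self.n_components = n_components  # size of the compressed code
        self.quantile = quantile          # share of training frames that must pass

    def fit(self, train):
        # "Training" the linear autoencoder: keep the top principal directions.
        self.mean_ = train.mean(axis=0)
        _, _, vt = np.linalg.svd(train - self.mean_, full_matrices=False)
        self.components_ = vt[: self.n_components]
        # Threshold chosen from the residuals of the training frames themselves.
        self.threshold_ = np.quantile(self.residual(train), self.quantile)
        return self

    def residual(self, frames):
        centered = np.atleast_2d(frames) - self.mean_
        code = centered @ self.components_.T   # encode (compress)
        recon = code @ self.components_        # decode (decompress)
        # Norm of the incompressible remainder, one value per frame.
        return np.linalg.norm(centered - recon, axis=1)

    def in_distribution(self, frame):
        return bool(self.residual(frame)[0] <= self.threshold_)
```

With a neural autoencoder, only `fit` and the encode/decode lines would change; the thresholding on the residual stays the same.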

Note also that this type of "aligned" machine is intended to accomplish whatever task humans assign it, and it operates on short time horizons. It shouldn't attempt power seeking, as there is no reward for doing so (seeking power takes too long). Its session won't last long enough to benefit from deception. And it will not operate on out-of-distribution inputs.

However, it has no morals, and there is no element in the design for this. It's a tool, and it will do awful things if the operators with access command it to.

Is it too pessimistic to point out that humans are constantly doing awful things to each other and their surroundings and have been for millennia? I mean, we torture and kill at least eleven moral patients per human per year. I don't understand how this solution doesn't immediately wipe out as much life as it can.