You are completely correct. This approach cannot possibly create an AI that matches a fixed specification.
This is intentional: any fixed specification of Goodness is a model of Goodness. All models are wrong (though some are useful), and therefore break when pushed sufficiently far out of distribution. Constraining an AI to follow a fixed specification is therefore, for something as far out of distribution as an ASI, a guarantee of bad behavior.
You can try to leave an escape hatch with corrigibility. In the limit I believe it is possible to slave an AI model to ...
🧛‍♂️ 💊
Consider the parable of the vampire pill: would you take a pill that would give you great strength, youth, intelligence, great hair, etc., but would invert your values such that you end up torturing people forever, starting with whoever you care most about now and slowly working down the list? Then, once each victim is nearly dead, you force them to take the vampire pill too, propagating the wave of torture-murder further. Vampire-you will feel great about it; vampire-you will experience great positive utility in their frame. Vampires a...