Private Latent Notation and AI-Human Alignment
Gradient updates for alignment may not map onto the model’s reasoning
In models that optimize within a private latent code, the geometry of internal reasoning no longer shares a manifold with human concepts. Once the representation basis diverges, gradient updates that enforce alignment constraints cease to map cleanly onto the model’s deliberative dynamics....
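As a rough illustration of the basis-divergence claim, the toy sketch below treats a single "human concept" as a unit direction and models divergence as a plane rotation of the model's encoding of that concept. Every name here (`rotate_plane`, `update_overlap`, the angle `theta`, the dimension `dim`) is a hypothetical stand-in and not something the text specifies; real latent codes are unlikely to diverge by a clean rotation, so this is only a picture of the geometry under that assumption.

```python
import math


def rotate_plane(v, theta):
    """Rotate the first two coordinates of v by theta (a Givens rotation)."""
    x, y = v[0], v[1]
    c, s = math.cos(theta), math.sin(theta)
    return [c * x - s * y, s * x + c * y] + list(v[2:])


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def update_overlap(theta, dim=4):
    """Share of an update aimed at the human-basis concept direction
    that lands on the model's (rotated) encoding of the same concept.

    Assumption: divergence is modeled as a rotation by theta in one plane.
    """
    human_direction = [1.0] + [0.0] * (dim - 1)
    model_direction = rotate_plane(human_direction, theta)
    return cosine(human_direction, model_direction)


if __name__ == "__main__":
    # Print the overlap for a few hypothetical divergence angles.
    for degrees in (0, 30, 60, 90):
        print(degrees, round(update_overlap(math.radians(degrees)), 3))
```

In this toy setting the overlap is simply cos(theta): with no divergence the update lands entirely on the intended concept, and at a right angle it misses it entirely. That is one way to picture an alignment gradient that no longer "maps cleanly" onto the model's own representation.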