This is a report on our work at AISC (AI Safety Camp) Virtual 2023.
Our team looked into the foundations of soft optimization. Our initial goal was to investigate variations of the original quantilizer algorithm, in particular by following the intuition that uncertainty about goals can motivate soft optimization. We ended up spending most of our time discussing the foundations and philosophy of agents, and exploring toy examples of Goodhart's curse.
Our discussions centered on the form of knowledge about the utility function that an agent must have such that expected utility maximization isn't the correct procedure (from the designer's perspective). With well-calibrated beliefs about the true utility function, it's always optimal to maximize expected utility.
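To make the Goodhart's-curse failure mode concrete, here is a minimal quantilizer sketch in Python. The setup (the action count, noise scale, and the over-scored "exploit" region) is an illustrative assumption, not one of the examples from our discussions: a q-quantilizer samples uniformly from the top q-fraction of actions ranked by the proxy, instead of taking the proxy argmax.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative toy setup: the proxy tracks true utility well on most
# actions, but a small "exploit" region is wildly over-scored by the
# proxy and catastrophic in reality.
n = 10_000
true_utility = rng.normal(size=n)
proxy = true_utility + rng.normal(scale=0.1, size=n)

exploit = rng.random(n) < 0.005      # ~0.5% of actions
true_utility[exploit] = -10.0        # catastrophic in reality
proxy[exploit] = 10.0                # but the proxy loves them

def quantilizer_value(proxy, true_u, q):
    """Expected true utility of a q-quantilizer with a uniform base
    distribution: sample uniformly from the top q-fraction by proxy."""
    cutoff = np.quantile(proxy, 1.0 - q)
    top = proxy >= cutoff
    return true_u[top].mean()

# Hard maximization reliably lands in the exploit region.
best_by_proxy = np.argmax(proxy)
print("maximizer   true utility:", true_utility[best_by_proxy])  # -10.0
print("quantilizer true utility:", quantilizer_value(proxy, true_utility, q=0.05))
```

With these numbers the argmax always picks an exploit action, while a 5% quantilizer's chance of doing so is only about (exploit fraction)/q ≈ 10%, so its expected true utility stays positive. The sketch also shows why calibration matters: an agent whose beliefs correctly modelled the exploit region would simply maximize its posterior expected utility instead.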
I don't think we should laud Anthropic too much here. If those pressures are so harsh that Anthropic is compelled to engage in behavior that they themselves would consider reckless, then making this public in the report is good but not enough: they should state clearly that they are being forced to take risks, and push for concrete regulation of evals practice and the like, rather than an abstract "we need to regulate".