This is a report on our work in AISC Virtual 2023.
For AISC 2023, our team looked into the foundations of soft optimization. Our initial goal was to investigate variations of the original quantilizer algorithm, in particular by following the intuition that uncertainty about goals can motivate soft optimization. We ended up spending most of the time discussing the foundations and philosophy of agents, and exploring toy examples of Goodhart's curse.
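(For context, here is a minimal sketch of the baseline we were varying. A quantilizer, instead of taking the argmax of a proxy utility, samples from the top q-quantile of a base distribution of actions ranked by that proxy. The toy discrete setting and the names below are our own illustration, not code from the project.)

```python
import numpy as np

def quantilize(actions, proxy_utility, base_probs, q, rng=None):
    """Sample an action from the top q-quantile of the base distribution,
    ranked by proxy utility, instead of taking the argmax.

    actions: list of candidate actions
    proxy_utility: function mapping an action to its proxy utility
    base_probs: base distribution over actions (e.g. a human-like policy)
    q: fraction of base probability mass kept; q = 1 is pure imitation,
       q -> 0 approaches hard maximization of the proxy
    """
    rng = rng or np.random.default_rng()
    # Rank actions from best to worst under the proxy utility.
    order = sorted(range(len(actions)), key=lambda i: -proxy_utility(actions[i]))
    # Keep the top actions until q of the base probability mass is covered.
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += base_probs[i]
        if mass >= q:
            break
    # Sample from the base distribution restricted to (and renormalized over)
    # the kept top-quantile actions.
    weights = np.array([base_probs[i] for i in kept])
    weights /= weights.sum()
    return actions[kept[rng.choice(len(kept), p=weights)]]
```

Lowering q interpolates between imitating the base distribution and hard optimization of the proxy, which is what makes it a natural starting point for thinking about Goodhart's curse.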
Our discussions centered on the form of knowledge about the utility function that an agent must have, such that expected utility maximization isn't the correct procedure (from the designer's perspective). With well-calibrated beliefs about the true utility function, it's always optimal to do expected utility maximization. However, there are situations where an agent is very sensitive to...
That is not my experience at all. Maybe it is because my friends from outside of the AI community are also outside of the tech bubble, but I've seen a lot of pessimism recently about the future of AI. In fact, they seem to easily accept both the orthogonality thesis and instrumental convergence. Although I avoid delving into the topic of human extinction, since I don't want to harm anyone's mental health, the rare times when this topic comes up they seem to easily agree that it is a non-trivial possibility.
I guess the main reason is that, since they are outside of t...
I believe I understand your point, but there are two things I need to clarify that kind of bypass some of these criticisms:
a) I am not assuming any safety techniques applied to language models. In a sense, this is the worst-case scenario: one thing that may happen if the language model is run "as is". In particular, the scenario I described would be mitigated if we could prevent stable sub-agents from appearing in language models, although I do not know how to do this.
b) The incentives for the language models to be a superoptimiz...
One speculative way I see it, which I've yet to expand on, is that GPT-N, in order to minimize prediction error during training, could simulate some sort of entity carrying out reasoning in non-trivial settings. In a sense, GPT would be a sort of actor interpreting a play through extreme method acting. I have in mind something like what the protagonist of "Pierre Menard, Author of the Quixote" tries to do when recreating the book Don Quixote word for word.
This would mean that, for some set of strings, GPT-N would boot and run some agent A,...
Thanks for the reflection; it captures how a part of me feels (I almost never post on LessWrong, being just a lurker, but your comment inspired me a bit).
Actually, I do have some background that could, maybe, be useful in alignment, and I just completed the AGISF program. Right now, I'm applying to some positions (in particular, I'm focusing on the SERIMATS application, which is an area where I may be differentially talented), and just honestly trying to do my best. After all, it would be outrageous if I could do something but simply did not.
But I recog...
I don't think we should laud Anthropic too much here. If those pressures are so harsh that Anthropic feels obliged to engage in behavior that they themselves would consider reckless, then making this public in the report is good but not enough: they should emphasize somewhere that they themselves are being forced to take risks, and push for concrete regulation of evals practice and the like, not some abstract "we need to regulate".
Edit: added verbs that somehow went missing