I'm not seeing how you are drawing that from my numbers.
People sometimes talk about "alignment by default" — the idea that we might solve alignment without any special effort beyond what we'd ordinarily do. I think it's useful to decompose this into three theses, sorted from strong to weak:
My rough credences:
These are rough and the categories blur into each other, but the decomposition seems useful for locating where exactly you think the hard problem lies.
I mention cross-lab monitoring here:
Cross-lab monitoring - Each lab uses its own AI as the primary (e.g. Claude) and asks other labs to monitor the safety-critical activity (e.g. ChatGPT, Gemini, Llama, Grok). Cross-lab monitoring would also mitigate AI-enabled coups. Note that it requires new security arrangements, such as provably-secure third-party servers.
The tech would be a bit messy, because you'd need to send your competitor a "zero-knowledge proof" that you're using their AI only for monitoring catastrophic actions, and not for automating R&D, without leaking what your own AIs are doing.
Potential setup:
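Here is a minimal illustrative sketch of how the routing could work. The function names, action taxonomy, and approval rule are hypothetical stand-ins for real model calls, and the zero-knowledge attestation and secure third-party servers are only gestured at in comments, not implemented.

```python
# Illustrative sketch only: Lab A's primary agent proposes actions, and
# safety-critical ones are routed to another lab's model for review.
# A real deployment would run the monitor on a provably-secure third-party
# server and attach an attestation that the monitor is used only for this
# purpose (the "zero-knowledge proof" mentioned above).
from dataclasses import dataclass

# Hypothetical set of action types that must be reviewed by the other lab.
SAFETY_CRITICAL = {"deploy_model", "modify_training_run", "exfiltrate_weights"}


@dataclass
class Action:
    kind: str      # e.g. "deploy_model"
    details: str   # human-readable description of the proposed step


def primary_agent(task: str) -> Action:
    """Stand-in for Lab A's own AI (the 'primary') proposing an action."""
    return Action(kind="deploy_model", details=f"Deploy checkpoint for: {task}")


def cross_lab_monitor(action: Action) -> bool:
    """Stand-in for another lab's model reviewing a safety-critical action.

    Only the proposed action is sent, not Lab A's research traces, so the
    channel leaks as little as possible about ongoing R&D.
    """
    return "exfiltrate" not in action.details.lower()  # toy approval rule


def run(task: str) -> None:
    action = primary_agent(task)
    if action.kind in SAFETY_CRITICAL and not cross_lab_monitor(action):
        raise PermissionError(f"Cross-lab monitor blocked: {action.kind}")
    print(f"Executing: {action.details}")


run("internal evals checkpoint")
```

The design point is just the routing: the primary does the work, the rival lab's model sees only the narrow slice of safety-critical proposals.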
no one can see it other than the people you shared the doc with
Nice -- you probably win some Shapley credit for the interviewer's question
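(For the "Shapley" framing: credit for a good exchange can be split the way Shapley values split a cooperative payoff between the question and the answer. A toy calculation, with entirely made-up coalition values:)

```python
from itertools import permutations

# Made-up payoffs: alone, the question or the answer is worth little;
# together they produce most of the value.
value = {
    frozenset(): 0,
    frozenset({"question"}): 2,
    frozenset({"answer"}): 3,
    frozenset({"question", "answer"}): 10,
}
players = ["question", "answer"]


def shapley(player: str) -> float:
    """Average marginal contribution of `player` over all join orders."""
    total = 0.0
    orders = list(permutations(players))
    for order in orders:
        coalition = set()
        for p in order:
            before = value[frozenset(coalition)]
            coalition.add(p)
            if p == player:
                total += value[frozenset(coalition)] - before
                break
    return total / len(orders)


for p in players:
    print(p, shapley(p))  # question 4.5, answer 5.5
```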
Please stop sharing Google Docs for comments.
Instead: post the draft online, then share the link so people can comment in public.
I only share a Google Doc if there's a specific person whose comments I want before posting online. But people often share these Google Docs in big Slack channels; at that point, just post online!
I think it slows innovation.
MMLU knowledge, fit well enough, requires inventing the universe
I suspect "fit well enough" doesn't track anything in reality.
Jerry Wei writes:
We expect this to become even more of an issue as AIs increasingly use tools to do their own research rather than rely on their learned knowledge (we tried to filter this kind of data as well, but it wasn't enough assurance against misuse).
I think his critique is this:
Suppose we had a perfect filtering system, such that the dangerous knowledge has zero mutual information with the model weights:
Nonetheless, the dangerous knowledge is "accessible" to the agent via web search + tools + in-context reasoning.
To solve this problem, we need either alignment techniques (e.g. train the model not to use these affordances) or inference-time monitoring techniques (e.g. constitutional classifiers). But if we had those techniques, then we wouldn't need the pretraining filtering.
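To make the structural point concrete, here is a rough sketch of an inference-time guard sitting between a search tool and the model's context. The classifier and threshold are hypothetical placeholders (not actual constitutional classifiers); the point is only that, once the dangerous content arrives via tools at run time, the check has to live here rather than in the pretraining corpus.

```python
# Sketch of an inference-time guard on tool outputs (hypothetical classifier).

def web_search(query: str) -> str:
    """Stand-in for a real search tool."""
    return f"(search results for {query!r})"


def harm_score(text: str) -> float:
    """Hypothetical classifier scoring how dangerous the text is, in [0, 1]."""
    banned = ("synthesis route", "enrichment cascade")
    return 1.0 if any(term in text.lower() for term in banned) else 0.0


BLOCK_THRESHOLD = 0.5


def guarded_search(query: str) -> str:
    """Screen retrieved text before it ever enters the agent's context window."""
    results = web_search(query)
    if harm_score(results) >= BLOCK_THRESHOLD:
        return "[results withheld by inference-time monitor]"
    return results


print(guarded_search("weather in Berlin"))
```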
I think the most likely solution is that your current viewpoint is actually typical: most human experience in the universe is being a key figure in the transition to terrestrial superintelligence.