If that AI produces slop, it should be pretty explicitly aware that it's producing slop.
This part seems false.
As a concrete example, consider a very strong base LLM. By assumption, there exists some prompt such that the LLM will output basically the same alignment research you would. But with some other prompt, it produces slop, because it accurately predicts what lots of not-very-competent humans would produce. And when producing the sort of slop which not-very-competent humans produce, there's no particular reason for it to explicitly think about what a more competent human would produce. There's no particular reason for it to explicitly think "hmm, there probably exist more competent humans who would produce different text than this". It's just thinking about what token would come next, emulating the thinking of low-competence humans, without particularly thinking about more-competent humans at all.
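To make the "it's just thinking about what token would come next" point concrete, here is a minimal sketch (my illustration, not anything from the original exchange; GPT-2 stands in for a much stronger base model, and both prompts are made up):

```python
from transformers import pipeline

# "gpt2" is just a stand-in for a far stronger base model; the prompts are hypothetical.
generate = pipeline("text-generation", model="gpt2")

careful_prompt = (
    "The following is a rigorous alignment research note by a top researcher:\n"
)
sloppy_prompt = (
    "The following is a quick hot take on AI alignment from an internet forum:\n"
)

# Same weights, same objective (predict the next token). The quality of the
# continuation is driven by what the prompt conditions on, not by any internal
# check of the form "would a more competent author write something better?".
print(generate(careful_prompt, max_new_tokens=60)[0]["generated_text"])
print(generate(sloppy_prompt, max_new_tokens=60)[0]["generated_text"])
```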
How many of these failure modes still happen when there is an AI at least as smart as you, that is aware of these failure modes and actively trying to prevent them?
All of these failure modes apply when the AI is at least as smart as you and "aware of these failure modes" in some sense. It's the "actively trying to prevent them" part which is key. Why would the AI actively try to prevent them? Would actively trying to prevent them give lower perplexity or higher reward or a more compressible policy? Answer: no, trying to prevent them would not give lower perplexity or higher reward or a more compressible policy.
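As a toy illustration of the perplexity point (numbers are my own, purely hypothetical): if the training corpus continues a passage the sloppy way 90% of the time, then the loss-minimizing policy is the one which imitates that, not the one which substitutes a better continuation.

```python
import torch
import torch.nn.functional as F

# Hypothetical numbers: the corpus (written by not-very-competent humans) continues
# with the sloppy token 90% of the time and the careful token 10% of the time.
data_dist = torch.tensor([0.9, 0.1])  # [P(sloppy_token), P(careful_token)]

# Policy A faithfully imitates the corpus; policy B "actively tries to prevent slop"
# by upweighting the careful continuation.
imitate_logits = torch.log(torch.tensor([0.9, 0.1]))
prevent_logits = torch.log(torch.tensor([0.1, 0.9]))

def expected_loss(logits):
    # Expected next-token cross-entropy under the data distribution.
    return -(data_dist * F.log_softmax(logits, dim=0)).sum()

print(expected_loss(imitate_logits))  # ~0.33 nats: imitating slop gives lower loss
print(expected_loss(prevent_logits))  # ~2.08 nats: preventing slop gives higher loss
```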
The alignment problem does not assume that AI needs to be kept in check, nor is it focused on control; and adaptation and learning in synergy are entirely compatible with everything said in this post. At a meta level, I would recommend actually reading the post rather than dropping GPT2-level comments which clearly do not at all engage with what it is talking about.
I do agree that the end of the last glacial period was the obvious immediate trigger for agriculture. But the "humans are the stupidest thing which could take off" model still holds, because evolution largely operates on a slower timescale than the glacial cycle.
Specifics: the last glacial period ran from roughly 115k years ago to 12k years ago, whereas, if you look at a timeline of human evolution, most of the evolution from apes to humans happened on a timescale of 100k - 10M years. So it's really only the very last little bit where an ice age was blocking takeoff. In particular, if human intelligence has been at evolutionary equilibrium for some time, then we should wonder why humanity didn't take off 115k years ago, before the last ice age.
Noting for the sake of later evaluation: this rough picture matches my current median expectations. Not very high confidence; I'd give it roughly 60%.
The evidence I have mentally cached is brain size. The evolutionary trajectory of brain size is relatively easy to measure just by looking at skulls from archaeological sites, and IIRC it has increased steadily through human evolutionary history and does not seem to be in evolutionary equilibrium.
(Also on priors, even before any evidence, we should strongly expect humans to not be in evolutionary equilibrium. As the saying goes, "humans are the stupidest thing which could take off, otherwise we would have taken off sooner". I.e. since the timescale of our takeoff is much faster than evolution, the only way we could be at equilibrium is if a maximal-under-constraints intelligence level just happened to be exactly enough for humans to take off.)
There's probably other kinds of evidence as well; this isn't a topic I've studied much.
IIUC human intelligence is not in evolutionary equilibrium; it's been increasing pretty rapidly (by the standards of biological evolution) over the course of humanity's development, right up to "recent" evolutionary history. So difficulty-of-improving-on-a-system-already-optimized-by-evolution isn't that big of a barrier here, and we should expect to see plenty of beneficial variants which have not yet reached fixation just by virtue of evolution not having had enough time yet.
(Of course separate from that, there are also the usual loopholes to evolutionary optimality which you listed - e.g. mutation load or variants with tradeoffs in the ancestral environment. But on my current understanding those are a minority of the available gains from human genetic intelligence enhancement.)
Alternative model: French mathematicians don't overperform in an objective sense. Rather, French mathematicians happened to end up disproportionately setting fashion trends in pure mathematics for a while, for reasons which are mostly just signalling games and academic politics rather than mathematical merit.
The Bourbaki group springs to mind here as a central example.
That's a much more useful answer, actually. So let's bring it back to Eliezer's original question:
Can you tl;dr how you go from "humans cannot tell which alignment arguments are good or bad" to "we justifiably trust the AI to report honest good alignment takes"? Like, not with a very large diagram full of complicated parts such that it's hard to spot where you've messed up. Just whatever simple principle you think lets you bypass GIGO.
[...]
Broadly speaking, the standard ML paradigm lets you bootstrap somewhat from "I can verify whether this problem was solved" to "I can train a generator to solve this problem".
So to summarize your short, simple answer to Eliezer's question: you want to "train AI agents that are [somewhat] smarter than ourselves with ground truth reward signals from synthetically generated tasks created from internet data + a bit of fine-tuning with scalable oversight at the end". And then you hope/expect/(??have arguments or evidence??) that this allows us to (?justifiably?) trust the AI to report honest good alignment takes sufficient to put shortly-posthuman AIs inside the basin of attraction of a good eventual outcome, despite (as Eliezer puts it) humans being unable to tell which alignment takes are good or bad.
Or, to compact the summary even further: you want to train the somewhat-smarter-than-human AI on easily-verifiable synthetically-generated tasks, and then hope/expect that its good performance on those tasks generalizes to a problem which is not easily verifiable or synthetically generated, namely the problem of checking that a next generation of AI is in the basin of attraction of a good-to-humans outcome.
(Note: I know you've avoided talking about the basin of attraction of a good-to-humans outcome, instead focusing on just some short-term goal, e.g. not being killed by the very next generation of AI. Not focusing on the basin of attraction is a mistake, and we can go into why it's a mistake if that turns out to be cruxy.)
In Eliezer's comment, he was imagining a training setup somewhat different from easily-verifiable synthetically-generated tasks:
Assume that whenever OpenPhil tries to run an essay contest for saying what they're getting wrong, their panel of judges ends up awarding the prize to somebody reassuringly saying that AI risk is an even smaller deal than OpenPhil thinks. How does OpenPhil bootstrap from that pattern of thumbs-up/thumbs-down to an AI that actually has better-than-OpenPhil alignment takes?
... but the analogue of the problem Eliezer was highlighting, in the context of training on easily-verifiable synthetically-generated tasks, is the question: how and why would we justifiably trust that an AI trained on easily-verifiable synthetic tasks generalizes to not-easily-verifiable real-world tasks?
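To make the setup being discussed concrete, here is a rough sketch of the kind of verifiable-synthetic-task training signal I understand the proposal to involve (the task, names, and reward scheme are all my own hypothetical stand-ins):

```python
import random

def make_synthetic_task():
    # Hypothetical stand-in for an "easily-verifiable synthetically-generated task":
    # the answer can be checked programmatically, with no human judgment involved.
    a, b = random.randint(2, 999), random.randint(2, 999)
    prompt = f"Compute {a} * {b}."
    def verify(answer: str) -> bool:
        return answer.strip() == str(a * b)
    return prompt, verify

def reward(model_answer: str, verify) -> float:
    # Ground-truth reward signal: 1 if the programmatic verifier accepts, else 0.
    return 1.0 if verify(model_answer) else 0.0

# Training would maximize this reward over many such tasks. Note what the loop
# never touches: any task like "is this alignment take good?", for which no cheap
# verifier exists. The open question above is why good performance inside this
# loop should transfer to tasks outside it.
prompt, verify = make_synthetic_task()
print(prompt, reward("42", verify))
```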
The 3 month Eliezer sim might spin up many copies of other 3 month Eliezer sims, which together produce outputs that a 6-month Eliezer sim might produce.
This seems very blatantly not viable-in-general, in both theory and practice.
On the theory side: there are plenty of computations which cannot be significantly accelerated via parallelism with less-than-exponential resources. (If we do have exponential resources, then all binary circuits can be reduced to depth 2, but in the real world we do not and will not have exponential resources.) Serial computation requirements are, in fact, a thing. So you can't just have a bunch of Eliezers do in 3 months what a single 6-month Eliezer could do, in general.
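A standard example of such an inherently serial computation (my illustration, not from the original comment) is an iterated hash chain:

```python
import hashlib

def iterated_hash(seed: bytes, steps: int) -> bytes:
    # Each step consumes the previous step's output, so step k cannot start
    # until step k-1 has finished: extra parallel workers do not help.
    h = seed
    for _ in range(steps):
        h = hashlib.sha256(h).digest()
    return h

# Splitting this across many "3-month" workers buys nothing: worker 2 needs
# worker 1's final output before it can begin, so N workers still take roughly
# the full serial time, plus handoff overhead.
result = iterated_hash(b"seed", 1_000_000)
```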
Even if you allow the 3-month Eliezer sims to act one after another, rather than trying to get everything done in 3 months of simulated time via pure parallelism, there's still a tight communication bottleneck between successive Eliezer sims. There are presumably plenty of circuits which cannot be implemented with tight information bottlenecks every n serial steps.
... of course you could circumvent all that theory by e.g. just having each Eliezer emulate a single logic gate, or some comparably trivial thing, but at that point you run afoul of the non-compositionality of safety properties: putting together a bunch of "safe" things (or "interpretable" things, or "aligned" things, or "corrigible" things, ...) does not in-general produce a safe thing.
So that's the theory side. What about the practice side?
Well, Ought ran roughly that experiment years ago, and it did not work at all. And that should not be surprising - as the link argues, we have extremely ample evidence from day-to-day life that such things do not work in general.
I think you should address Thane's concrete example:
That seems to me a pretty damn solid knock-down counterargument. There were no continuous language model scaling laws before the transformer architecture, and not for lack of people trying to make language nets.