CS PhD student
"We're building something much smarter than us" understates the threat. After all we have people with IQ-140 co-existing happily with people with IQ-60 — and entities like corporations and markets which aggregate or tile this intelligence in various ways — what could a few more IQ points do?
Perhaps a phrasing that better communicates the intelligence explosion thesis is: we're creating the technology that lets an IQ-140 being create an IQ-150 being.
… and the question is: how does this technology get “generalized”? Vaguely:
I don't understand. The hard problem of alignment/CEV/etc. is that it's not obvious how to scale intelligence while "maintaining" the utility function/preferences, and this still applies to human intelligence amplification.
I suppose this is fine if the only improvement you can expect beyond human-level intelligence is "processing speed", but I would expect superhuman AI to be more intelligent in a variety of ways.
Something that seems like it should be well-known, but I have not seen an explicit reference for:
Goodhart’s law can, in principle, be overcome via adversarial training (or, more generally, learning in multi-agent systems)
—aka “The enemy is smart.”
Goodhart’s law only really applies to a “static” objective, not when the objective is the outcome of a game with other agents who can adapt.
This doesn’t really require the other agents to act in a way that continuously “improves” the training objective either; it only requires that they can constantly throw adversarial examples at the agent, forcing it to “generalize”.
In particular, I think this is the basic reason why any reasonable Scalable Oversight protocol would be fundamentally “multi-agent” in nature (like Debate).
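A toy sketch of the static-versus-adaptive contrast, under loud assumptions: the parity task, the fixed benchmark, and every function name below are hypothetical and chosen only for illustration. They are not taken from Debate or any other oversight protocol. An agent that memorizes a fixed benchmark scores perfectly on it, yet an opponent that searches for the agent's current mistakes keeps forcing it to patch until it is right everywhere the opponent can look.

```python
from typing import Callable, Dict, Iterable, Optional, Tuple


def true_label(x: int) -> int:
    """Hypothetical ground-truth task: 1 if x is even, else 0."""
    return 1 if x % 2 == 0 else 0


# A fixed ("static") objective: the agent is only ever scored on these inputs.
STATIC_BENCHMARK = [2, 3, 10, 15, 22]


def make_memorizer(benchmark: Iterable[int]) -> Callable[[int], int]:
    """Agent that games the static objective: memorizes the benchmark, answers 0 elsewhere."""
    table = {x: true_label(x) for x in benchmark}
    return lambda x: table.get(x, 0)


def static_score(agent: Callable[[int], int], benchmark: Iterable[int]) -> float:
    """Fraction of benchmark inputs the agent answers correctly."""
    items = list(benchmark)
    return sum(agent(x) == true_label(x) for x in items) / len(items)


def adversary(agent: Callable[[int], int], domain: Iterable[int]) -> Optional[int]:
    """Adaptive opponent: returns the first input in the domain the current agent gets wrong."""
    for x in domain:
        if agent(x) != true_label(x):
            return x
    return None


def adversarial_rounds(benchmark: Iterable[int], domain: range) -> Tuple[Callable[[int], int], int]:
    """Let the opponent keep finding counterexamples; the agent patches each one in turn."""
    table: Dict[int, int] = {x: true_label(x) for x in benchmark}
    agent = lambda x: table.get(x, 0)  # reads the table as it grows
    rounds = 0
    while (x := adversary(agent, domain)) is not None:
        table[x] = true_label(x)
        rounds += 1
    return agent, rounds


memorizer = make_memorizer(STATIC_BENCHMARK)
domain = range(100)
print("static score of memorizer:", static_score(memorizer, STATIC_BENCHMARK))
print("first counterexample found by adversary:", adversary(memorizer, domain))
patched_agent, n_rounds = adversarial_rounds(STATIC_BENCHMARK, domain)
print("rounds of patching needed:", n_rounds)
print("counterexample after patching:", adversary(patched_agent, domain))
```

The opponent here never "improves" the objective. It only has to keep producing inputs the agent currently fails on, which is the weaker requirement described above.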
I think only particular reward functions, such as in multi-agent/co-operative environments (agents can include humans, like in RLHF) or in actually interactive proving environments?
There is a cliche that there are two types of mathematicians: "theory developers" and "problem solvers". Similarly, Robin Hanson divides the production of knowledge into "framing" and "filling".
It seems to me there are actually three sorts of information in the world:
From a strictly Bayesian perspective, there seems to be no "fundamental" difference between these forms of information. They're all just things you condition your prior on. Yet this division seems to be natural in quite a variety of informational tasks. What gives?
Adding this from the replies for prominence:
Yes, I also realized that "ideas" being a thing is due to bounded rationality -- specifically they are the outputs of AI search. "Proofs" are weirder though, and I haven't seen them distinguished very often. I wonder if this is a reasonable analogy to make:
I don't think that AI alignment people doing "enemy of enemy is friend" logic with AI luddites (i.e. people worried about Privacy/Racism/Artists/Misinformation/Jobs/Whatever) is useful.
Alignment research is a luxury good for labs, which means it would be the first thing axed (hyperbolically speaking) if you imposed generic hurdles/costs on their revenue, or if you made them spend on mitigating P/R/A/M/J/W problems.
This "crowding-out" effect is already happening to a very large extent: there are vastly more researchers and capital being devoted to P/R/A/M/J/W problems, which could have been allocated to actual alignment research! If you are forming a "coalition" with these people, you are getting a very shitty deal -- they've been much more effective at getting their priorities funded than you have been!
If you want them to care about notkilleveryoneism, you have to specifically make it expensive for them to kill everyone, not just "oppose" them in an untargeted way. One example is foom liability.
The third virtue of rationality, lightness, is wrong. In fact: the more you value information to change your mind on some question, the more obstinate you should be about changing your mind on that question. Lightness implies disinterest in the question.
Imagine your mind as a logarithmic market-maker which assigns some initial subsidy $b(Q)$ to any new question $Q$. This subsidy parameter captures your marginal value for information on $Q$. But it also measures how hard it is to change your mind: the cost of moving your probability from $p$ to $p'$ is $b \min\left[\log\frac{1-p}{1-p'}, \log\frac{p}{p'}\right]$.
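To make the role of the subsidy concrete, here is a minimal sketch that assumes "logarithmic market-maker" means a two-outcome logarithmic market scoring rule (LMSR) with cost function C(q) = b log(exp(q_yes/b) + exp(q_no/b)). That reading, and all function names, are assumptions for illustration. The sketch computes what a trader pays to move the price by buying only the side they are pushing, and returns the two log-ratio terms from the formula above so they can be compared.

```python
import math


def lmsr_cost(q_yes: float, q_no: float, b: float) -> float:
    """LMSR cost function C(q) = b * log(exp(q_yes/b) + exp(q_no/b)), computed stably."""
    m = max(q_yes, q_no)
    return m + b * math.log(math.exp((q_yes - m) / b) + math.exp((q_no - m) / b))


def log_odds_shares(p: float, b: float) -> float:
    """Net shares on one side (other side at 0) at which that side's price equals p."""
    return b * math.log(p / (1.0 - p))


def cost_to_move(p: float, p_new: float, b: float) -> float:
    """Amount paid to move the YES price from p to p_new, buying only the side being pushed."""
    if p_new >= p:
        # Push the price up by buying YES shares; NO shares held at 0.
        return (lmsr_cost(log_odds_shares(p_new, b), 0.0, b)
                - lmsr_cost(log_odds_shares(p, b), 0.0, b))
    # Push the price down by buying NO shares; YES shares held at 0.
    # The NO price is 1 - p, so reuse the same helper on the complement.
    return (lmsr_cost(0.0, log_odds_shares(1.0 - p_new, b), b)
            - lmsr_cost(0.0, log_odds_shares(1.0 - p, b), b))


def log_ratio_terms(p: float, p_new: float, b: float) -> tuple:
    """The two terms b*log((1-p)/(1-p_new)) and b*log(p/p_new)."""
    return (b * math.log((1.0 - p) / (1.0 - p_new)), b * math.log(p / p_new))


for b in (1.0, 10.0):
    paid = cost_to_move(0.5, 0.8, b)
    print(f"b={b}: paid={paid:.4f}, log-ratio terms={log_ratio_terms(0.5, 0.8, b)}")
```

In this sketch an upward move costs b log((1-p)/(1-p')) and a downward move costs b log(p/p'), so the payment scales linearly with b: doubling the subsidy doubles the price of changing your mind by the same amount.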
What would this imply in practice? It means that each individual “trader” (both internal mental heuristics/thought patterns, and external sources of information/other people) will generally have a smaller influence on your beliefs, as they may not have enough wealth. Traders who influence your belief will carry greater risk (to their influence on you in the future), though they will also earn more reward if they’re right.
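The "not enough wealth" point can be sketched numerically. Assuming the same two-outcome LMSR reading as a working hypothesis, and treating a trader's wealth as the most they can pay (both are assumptions, as are the function names), solving b log((1-p)/(1-p')) = wealth for p' gives the furthest a single trader can push your probability upward, and solving b log(p/p') = wealth gives the furthest downward.

```python
import math


def max_reachable_prob(p: float, wealth: float, b: float) -> float:
    """Highest p' a trader can reach: solves b*log((1-p)/(1-p')) = wealth."""
    return 1.0 - (1.0 - p) * math.exp(-wealth / b)


def min_reachable_prob(p: float, wealth: float, b: float) -> float:
    """Lowest p' a trader can reach: solves b*log(p/p') = wealth."""
    return p * math.exp(-wealth / b)


# Same trader (wealth 1), same starting belief (0.5), increasing subsidy b.
for b in (1.0, 10.0, 100.0):
    hi = max_reachable_prob(0.5, 1.0, b)
    lo = min_reachable_prob(0.5, 1.0, b)
    print(f"b={b:>5}: belief can be moved within [{lo:.4f}, {hi:.4f}]")
```

As b grows, the reachable interval shrinks toward the starting belief, which is the sense in which a larger subsidy makes each individual trader less influential.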