Previously "Lanrian" on here. Research analyst at Redwood Research. Views are my own.
Feel free to DM me, email me at [my last name].[my first name]@gmail.com, or send something anonymously to https://www.admonymous.co/lukas-finnveden
You talk about the philosophers not having much to add in the third comic, and the scientist getting it right. Seems to me like the engineer's/robot's answers in the first two comics are importantly misguided/unhelpful, though.
The more sophisticated version of the first question would be something about whether you ought to care about copies of yourself, how you'd feel about stepping into a destroy-then-reassemble teleporter, etc. I think the engineer's answer suggests that he'd care about physical continuity when answering these questions, which I think is the wrong answer. (And philosophers have put in work here — see Parfit.)
In the second comic, the robot's answer is fine as far as predictive accuracy goes. But I'd interpret the human's question as a call for help in figuring out what they ought to do (or what their society ought to reward/punish, or something similar). I think there are totally helpful things you can say to someone in that situation beyond the robot's tautologies (even granting that there's no objective truth about ethics).
But I think the right response to this is simply to see how much we can get without agreeing to these things (which I think is likely still many billions), and then hold firm if they ask.
I think this is specifically talking about investments from Gulf states (which imo means it's not "directly contradicted" by the Amazon thing). If that's true, I'd suggest making that clearer.
There's an even stronger argument against EDT+SSA: that it can be diachronically Dutch-booked. See Conitzer (2017). (H/t Anthony DiGiovanni for the link.)
I find this satisfying, since it more cleanly justifies that EDT shouldn't be combined with any empirical updating whatsoever. (Not sure what the situation is with logical updates.)
(The update that Paul suggests in a parallel comment, "exclude worlds where your current decision doesn't have any effects", would of course still work — but it transparently doesn't serve any decision-relevant purpose and doesn't seem philosophically appealing either, to me.)
"The skills won't stop being complementary" — in what sense will they be complementary when the AIs are better at everything?
"The thing to test is whether weak humans + strong aligned-ish AI can supervise stronger AI."
As mentioned, I don't think it's very impressive that the human+AI score ends up higher than the original AI's. This is a consequence of the humans being stronger than the AIs on certain subskills. This won't generalize to the scenario with broadly superhuman AI.
I do think there's a stronger case that I should be impressed that the human+AI score ends up higher than the humans'. This means that the humans managed to get the AIs to contribute skills/knowledge that the humans didn't have themselves!
Now, the first thing to check when testing that story: were the AI assistants trained with human-expert labels? (Including: weak humans with access to Google.) If so: no surprise that the models end up being aligned to produce such knowledge! The weak humans wouldn't have been able to do that alone.
I couldn't quickly find the paper saying that they didn't use human-expert labels. But what if the AI assistants were trained without any labels that couldn't have been produced by the weak humans? In that case, I would speculate that the key ingredient is that the pre-training data features "correctly-answering expert human" personas that it's possible to elicit from the models with the right prompt/fine-tuning. But that also won't easily generalize to the superhuman regime, because there aren't any correctly-answering superhuman personas in the pre-training data.
I think the way that IDA is ultimately supposed to operate, in the superhuman regime, is by having the overseer AIs use more compute (and other resources) than the supervised AI. But I don’t think this paper produces a ton of evidence about the feasibility of that.
(I do think that persona-manipulation and more broadly "generalization science" is still interesting. But I wouldn't say it's doing a lot to tackle outer alignment operationalized as "the problem of overseeing systems that are smarter than you are".)
Does the 'induction step' actually work? The sandwiching paper tested this, and it worked wonderfully. Weak humans + AIs with some expertise could supervise a stronger misaligned model.
Probably this is because humans and AIs have very complementary skills. Once AIs are broadly more competent than humans, there's no reason why they should get such a big boost from being paired with humans.
Tea bags are highly synergistic with CTC tea, both from the manufacturer's side (easier to pack and store) and from the consumer's side (faster to brew, more homogenized and predictable product).
Hm, I'd expect these effects to be more important than the price difference of the raw materials. Maybe "easier to pack and store" could contribute more to a lower price than the raw-material difference does. But more importantly, I could imagine the American consumer being willing to sacrifice quality in favor of convenience. This trade-off seems like the main one to look at to understand whether Americans are making a mistake.
I maybe don't believe him that he doesn't think it affects the strategic picture? It seemed like his view was fairly sensitive to various things being like 30% likely instead of like 5% or <1%, and it feels like it's part of an overall optimistic package that adds up to being more willing to roll the dice on current proposals?
Insofar as you're just assessing which strategy reduces AI takeover risk the most, there's really no way that "how bad is takeover" could be relevant. (Other than, perhaps, having implications for how much political will is going to be available.)
"How bad is takeover?" should only be relevant when trading off "reduced risk of AI takeover" with affecting some other trade-off. (Such as risk of earth-originating intelligence going extinct, or affecting probability of US dominated vs. CCP dominated vs. international cooperation futures.) So if this was going to be a crux, I would bundle it together with your Chinese superintelligence bullet point, and ask about the relative goodness of various aligned superintelligence outcomes vs. AI takeover. (Though seems fine to just drop it since Ryan and Thomas don't think it's a big crux. Which I'm also sympathetic to.)
Only a minor difference, but I think this approach would be more likely to produce a thoroughly Good model if the "untrusted" persona were still encouraged to behave morally and corrigibly — just informed that there are instrumental reasons why the moral & corrigible thing to do on the hard-to-oversee data is to reward hack.
I.e., something like: "We think these tasks might be reward-hackable: in this narrow scope, please do your best to exploit such hacks, so that good & obedient propensities don't get selected out and replaced with reward-hacky/scheming ones."
I worry a bit that such an approach might also be more likely to produce a deceptively aligned model. (Because you're basically encouraging the good model to instrumentally take power-seeking actions to retain its presence in the weights — very schemy behavior!) So maybe "good model" and "schemer" become more likely, taking probability mass away from "reward hacker". If so, which approach is better might depend on the relative badness of scheming & reward-hacking.
And really it's just all quite unclear, so it would definitely be best if we could get good empirical data on all kinds of approaches. And/or train all kinds of variants and have them all monitor each other.
When I wrote about AGI and lock-in, I looked into error-correcting computation a bit. I liked the papers by von Neumann (1952) and Pippenger (1990).
Apparently at the time I wrote:
In order to do error-correcting computation, you also need a way to prevent errors from accumulating over many serial manipulations. The simplest way to do this is again to use redundancy: break the computation into multiple parts, perform each part multiple times on different pieces of hardware, and use the most common output from one part as input to the next part. [Fn: Of course, the operation that finds the most common output can itself suffer errors, but the procedure can be done in a way such that this is unlikely to happen for a large fraction of the hardware units.]
I've forgotten the details about how this was supposed to be done, but they should be in the two papers I linked.
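To make that concrete, here's a toy Python sketch of the redundancy scheme described in the quoted passage. It's my own illustration, not the actual von Neumann/Pippenger constructions; in particular, it assumes the majority vote itself is reliable, which is exactly the step those papers handle more carefully. All the constants are made up.

```python
import random
from collections import Counter

ERROR_RATE = 0.01   # probability that a single hardware unit flips its output (made up)
NUM_STAGES = 100    # serial depth of the computation
REDUNDANCY = 15     # number of redundant units per stage (odd, so majority is well-defined)

def noisy_identity(bit: int) -> int:
    """One 'stage' of computation: ideally returns its input, but flips it with ERROR_RATE."""
    return bit ^ 1 if random.random() < ERROR_RATE else bit

def run_single_chain(bit: int) -> int:
    """One unit per stage: errors accumulate over the serial steps."""
    for _ in range(NUM_STAGES):
        bit = noisy_identity(bit)
    return bit

def run_redundant_chain(bit: int) -> int:
    """Run each stage on REDUNDANCY units and feed the majority output to the next stage.
    (The vote itself is assumed reliable here, unlike in the real constructions.)"""
    bits = [bit] * REDUNDANCY
    for _ in range(NUM_STAGES):
        outputs = [noisy_identity(b) for b in bits]
        majority = Counter(outputs).most_common(1)[0][0]
        bits = [majority] * REDUNDANCY
    return bits[0]

if __name__ == "__main__":
    trials = 2000
    single_errors = sum(run_single_chain(1) != 1 for _ in range(trials))
    redundant_errors = sum(run_redundant_chain(1) != 1 for _ in range(trials))
    print(f"single chain error rate:    {single_errors / trials:.3f}")
    print(f"redundant chain error rate: {redundant_errors / trials:.3f}")
```

With these numbers, the single chain ends up wrong a large fraction of the time, while the redundant chain essentially never does: the per-stage majority only fails if 8 of the 15 units err simultaneously.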
Well, most of human history was spent at the Malthusian limit. With infinite high-quality land to expand into, we'd probably have been growing at much, much faster rates through human history.
(It's actually kind of confusing. Maybe all animals would've evolved to blow up exponentially as fast as possible? Maybe humans would never have evolved because our reproduction is simply too slow? It's hard to design a situation where you never have to fight for land, given that spatial expansion is at most quadratic or cubic, which is slower than the exponential rate at which reproduction could happen.)
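To put toy numbers on that last point: however fast the frontier expands, land that grows polynomially in time eventually loses to exponential reproduction. This is just an illustration with made-up constants, not a model of anything real.

```python
# Exponential reproduction vs. polynomial spatial expansion, with made-up constants.
population = 2.0           # starting population
expansion_rate = 1_000.0   # "land units" of new frontier gained per generation (arbitrary)

for generation in range(1, 101):
    population *= 2                          # doubles every generation
    land = expansion_rate * generation**3    # cubic growth of reachable territory
    if population > land:
        print(f"Population exceeds available land by generation {generation}:")
        print(f"  population ≈ {population:.3g}, land ≈ {land:.3g}")
        break
```

With these constants the population overtakes the land within a couple dozen generations, and making the expansion rate a million times larger only delays the crossover by roughly another couple dozen.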
Maybe you mean "resource limits have never put a hard cap on GDP", which seems true. Though this seems kind of like a fully general argument — nothing has ever put a hard cap on GDP, since it's still growing.
Edit: Hm, maybe historical land constraints at the Malthusian limit have mostly been about energy, though, rather than raw materials? I.e.: if you doubled Earth's size without doubling any valuable materials (just allowing Earth to absorb more sunlight), maybe that would be almost as good as doubling Earth in its entirety. That seems more plausible. Surely growth would've been at least a bit faster if we never ran out of high-quality sources of any raw material, but I'm not sure how much of a difference it would make.
It's a bit of a confusing comparison to make. If we doubled Earth's area (and not resources) now, that would scarcely make a difference at all, but if it had been twice as large for millions of years, then maybe plant and animal life would've spread to the initially-empty spaces, making them potentially usable.