a common discussion pattern: person 1 claims X solves/is an angle of attack on problem P. person 2 is skeptical. there is also some subproblem Q (90% of the time not mentioned explicitly). person 1 is defending a claim like "X solves P conditional on Q already being solved (but Q is easy)", whereas person 2 thinks person 1 is defending "X solves P via solving Q", and person 2 also believes something like "subproblem Q is hard". the problem with this discussion pattern is it can lead to some very frustrating miscommunication.
here's a straw hypothetical example where I've exaggerated both 1 and 2; the details aren't exactly correct but the vibe is more important:
1: "Here's a super clever extension of debate that mitigates obfuscated arguments [etc], this should just solve alignment"
2: "Debate works if you can actually set the goals of the agents (i.e you've solved inner alignment), but otherwise you can get issues with the agents coordinating [etc]"
1: "Well the goals have to be inside the NN somewhere so we can probably just do something with interpretability or whatever"
2: "how are you going to do that? your scheme doesn't tackle inner alignment, which seems to contain almost all of the difficulty of alignment to me. the claim you just made is a separate claim from your main scheme, and the cleverness in your scheme is in a direction orthogonal to this claim"
1: "idk, also that's a fully general counterargument to any alignment scheme, you can always just say 'but what if inner misalignment'. I feel like you're not really engaging with the meat of my proposal, you've just found a thing you can say to be cynical and dismissive of any proposal"
2: "but I think most of the difficulty of alignment is in inner alignment, and schemes which kinda handwave it away are trying to some some problem which is not the actual problem we need to solve to not die from AGI. I agree your scheme would work if inner alignment weren't a problem."
1: "so you agree that in a pretty nontrivial number [let's say both 1&2 agree this is like 20% or something] of worlds my scheme does actually work- I mean how can you be that confident that inner alignment is that hard? in the world's where inner alignment turns out to be easy then my scheme will work."
2: "I'm not super confident, but if we assume that inner alignment is easy then I think many other simpler schemes will also work, so the cleverness that your proposal adds doesn't actually make a big difference."
random brainstorming about optimizeryness vs controller/lookuptableyness:
let's think of optimizers as things that reliably steer a broad set of initial states to some specific terminal state. seems like there are two things we care about (at least): retargetability and robustness.
a LUT (lookup table) trained with a little bit of RL will be neither retargetable nor robust. a LUT trained with galactic amounts of RL to handle every possible initial state optimally is robust but not retargetable (this is reasonable: robustness is only a property of the functional behavior, so whether it's a LUT internally shouldn't matter; retargetability is a property of the actual implementation, so it does matter). a big search loop (the most extreme of which is AIXI, which is 100% search) is very retargetable, and is robust to varying degrees depending on how hard it searches.
(however, in practice with normal amounts of compute a LUT is never robust; this thought experiment only highlights differences that remain in the limit)
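here's a minimal toy sketch of the two ends of that spectrum (a made-up 1-D line world, not anything from the literature): the search policy takes the goal as an explicit argument, so retargeting it is just passing a different number in, while the LUT has its goal smeared across every table entry and simply breaks on states it wasn't trained on.

```python
# toy sketch: "lookup table" policy vs "search loop" policy on a 1-D line world
# states are integers, actions are -1/+1 steps
# the search policy is retargetable (pass a different goal in); the LUT is not,
# and it also isn't robust to states outside its table
import itertools
from typing import Dict

ACTIONS = (-1, +1)

def make_lut_policy(goal: int, known_states) -> Dict[int, int]:
    # stand-in for "trained with a little bit of RL": precompute a table
    # for the states we happened to encounter during training
    return {s: (+1 if s < goal else -1) for s in known_states}

def search_policy(state: int, goal: int, horizon: int = 6) -> int:
    # brute-force search over all action sequences up to `horizon` steps,
    # preferring sequences that end near the goal and reach it early
    def score(seq):
        pos, first_hit = state, horizon
        for t, a in enumerate(seq):
            pos += a
            if pos == goal and t < first_hit:
                first_hit = t
        return (abs(pos - goal), first_hit)
    best_seq = min(itertools.product(ACTIONS, repeat=horizon), key=score)
    return best_seq[0]

lut = make_lut_policy(goal=3, known_states=range(-2, 6))
print(lut[0], search_policy(0, goal=3))   # both head toward 3 from state 0
print(search_policy(0, goal=-7))          # retargeting = just a new argument
# lut[100] would raise KeyError: the table isn't robust to unseen states
```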
what do we care about these properties for?
[conjecture 1: retargetability == complexity can be decomposed == gradient of goal is meaningful. conjecture 2: gradient of goal being meaningful / complexity being decomposable implies deceptive alignment (maybe we can also find some necessary condition?)]
how do we formalize retargetability?
random idea: the hypothesis that complexity can be approximately decomposed into a goal component and a reasoning component is maybe a good formalization of (a weak version of) orthogonality?
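one way that hypothesis could be written down (my notation, and treating K as description/Kolmogorov complexity is an assumption on my part; the idea above doesn't commit to a specific complexity measure):

```latex
% sketch: the agent pointed at goal g splits into a shared reasoning part
% plus a goal part, uniformly over some class of goals \mathcal{G}
K(\pi_g) \approx K(\mathrm{reasoning}) + K(g) \qquad \text{for all } g \in \mathcal{G}
```

on this reading, retargetability is the claim that swapping g for g' only costs about K(g') on top of the shared reasoning term, which also looks like roughly what a weak orthogonality thesis needs.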
My prior, not having looked too carefully at the post or the specific projects involved, is that probably any claims that an open source model is 90% as good as GPT-4 or indistinguishable are hugely exaggerated or otherwise not a fair comparison. In general in ML, confirmation bias and overclaiming are very common, and as a base rate the vast majority of papers that claim some kind of groundbreaking result end up never having any real impact.
Also, I expect the facets of capabilities progress most relevant to existential risk to be especially strongly constrained by base model quality. I would agree that open source is probably better at squeezing stuff out of small models, but my model is that wrt existential-risk-relevant capabilities progress this is less relevant (cf the bitter lesson).
I think it's worth disentangling LLMs and Transformers and so on in discussions like this one--they are not one and the same. For instance, the following are distinct positions that have quite different implications:
Which interventions make sense depends a lot on your precise model of why current models are not AGI, and I would consequently expect modelling things at the level of "LLMs vs not LLMs" to be less effective.
Doesn't answer your question, but we also came across this effect in the RM Goodharting work, though instead of figuring out the details we only proved that when it's definitely not heavy tailed it's monotonic, for Regressional Goodhart (https://arxiv.org/pdf/2210.10760.pdf#page=17). Jacob probably has more detailed takes on this than me.
In any event my intuition is that this is unlikely to be the main reason for overoptimization; I think it's much more likely that it's Extremal Goodhart or some other thing where the noise is not independent.
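here's a toy numerical sketch of that regime distinction, using a best-of-n selection setup of my own rather than the conditioning setup in the paper: the proxy is true value plus independent error, and cranking up selection keeps buying true value when the error is light tailed but stalls out when it's heavy tailed.

```python
# toy best-of-n sketch (not the paper's setup): proxy X = U + Z with
# independent error Z; select the sample with the highest proxy and look at
# the true value U as the selection pressure n grows
import numpy as np

rng = np.random.default_rng(0)

def best_of_n_true_value(n, error_sampler, trials=20_000):
    """Mean true value U of the sample with the highest proxy X = U + Z."""
    U = rng.standard_normal((trials, n))
    Z = error_sampler((trials, n))
    pick = np.argmax(U + Z, axis=1)
    return U[np.arange(trials), pick].mean()

for n in (1, 4, 16, 64, 256):
    light = best_of_n_true_value(n, lambda s: rng.standard_normal(s))        # Gaussian error
    heavy = best_of_n_true_value(n, lambda s: rng.standard_t(df=1, size=s))  # very heavy-tailed error
    print(f"n={n:4d}  true value: light-tailed={light:+.3f}  heavy-tailed={heavy:+.3f}")
```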
takes on takeoff (or: Why Aren't The Models Mesaoptimizer-y Yet)
here are some reasons we might care about discontinuities:
I think these capture 90% of what I care about when talking about fast/slow takeoff, with the first point taking up a majority
(it comes up a lot in discussions that it seems like I can't quite pin down exactly what my interlocutor's beliefs on fastness/slowness imply. if we can fully list out all the things we care about, we can screen off any disagreement about definitions of the word "discontinuity")
some things that seem probably true to me and which are probably not really cruxes:
possible sources of discontinuity:
I think these can be boiled down to 3 more succinct scenario descriptions:
I think it's pretty unlikely (<5%) that decentralized volunteer training will be competitive with SOTA, ever. (Caveat: I haven't been following volunteer training super closely so this take is mostly cached from having looked into it for GPT-Neo plus occasionally seeing new papers about volunteer training).
(folding@home is embarrassingly parallelizable because it requires basically no interconnect, and therefore also doesn't mind heterogeneous compute or unreliable nodes)
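to put rough numbers on the interconnect point (all the constants below are my own assumptions for illustration, not from the original comment): naive data-parallel training has to move a full set of gradients every optimizer step, which is exactly the cost folding@home never pays.

```python
# back-of-envelope sketch with assumed numbers
PARAMS = 1e9             # assume a 1B-parameter model
BYTES_PER_GRAD = 2       # assume fp16 gradients
HOME_UPLINK_BPS = 10e6   # assume a 10 Mbit/s consumer uplink

grad_bytes = PARAMS * BYTES_PER_GRAD
seconds_per_sync = grad_bytes / (HOME_UPLINK_BPS / 8)
print(f"{grad_bytes / 1e9:.0f} GB of gradients per step, "
      f"about {seconds_per_sync / 60:.0f} minutes just to upload them once")
# folding@home work units, by contrast, only need results shipped back
# occasionally, so slow links and flaky nodes barely matter
```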