leogao

Sequences

Alignment Stream of Thought

Wiki Contributions

Comments

My summary of the paper: The paper proves that if you have two distributions that you want to ensure you cannot distinguish linearly (i.e a logistic regression will fail to achieve better than chance score), then one way to do this is to make sure they have the same mean. Previous work has done similar stuff (https://arxiv.org/abs/2212.04273), but without proving optimality.

I think it's pretty unlikely (<5%) that decentralized volunteer training will be competitive with SOTA, ever. (Caveat: I haven't been following volunteer training super closely so this take is mostly cached from having looked into it for GPT-Neo plus occasionally seeing new papers about volunteer training).

  1. You are going to get an insane efficiency hit from the compute having very low bandwidth high latency interconnect. I think it's not inconceivable that someone will eventually figure out an algorithm that is only a few times worse than training on a cluster with good interconnects but this is one of those things where people have tried for ages.
  2. Heterogeneous compute (not all the GPUs will be the same model), lower reliability (people will turn their computers off more often than they do in datacenters), and having to be robust against bad actors (people could submit bad gradients) and other challenges together add another several times overhead.
  3. There just isn't that much volunteer hardware out there. For a rough OOM the publicly announced Facebook cluster is roughly the same size as the raw size of folding@home at its peak. All in all, I think you would need to do some serious engineering and research to get 1% efficiency at Facebook cluster scale.

(folding@home is embarrassingly parallelizable because it requires basically no interconnect, and therefore also doesn't mind heterogeneous compute or reliability)

here's a straw hypothetical example where I've exaggerated both 1 and 2; the details aren't exactly correct but the vibe is more important:

1: "Here's a super clever extension of debate that mitigates obfuscated arguments [etc], this should just solve alignment"

2: "Debate works if you can actually set the goals of the agents (i.e you've solved inner alignment), but otherwise you can get issues with the agents coordinating [etc]"

1: "Well the goals have to be inside the NN somewhere so we can probably just do something with interpretability or whatever"

2: "how are you going to do that? your scheme doesn't tackle inner alignment, which seems to contain almost all of the difficulty of alignment to me. the claim you just made is a separate claim from your main scheme, and the cleverness in your scheme is in a direction orthogonal to this claim"

1: "idk, also that's a fully general counterargument to any alignment scheme, you can always just say 'but what if inner misalignment'. I feel like you're not really engaging with the meat of my proposal, you've just found a thing you can say to be cynical and dismissive of any proposal"

2: "but I think most of the difficulty of alignment is in inner alignment, and schemes which kinda handwave it away are trying to some some problem which is not the actual problem we need to solve to not die from AGI. I agree your scheme would work if inner alignment weren't a problem."

1: "so you agree that in a pretty nontrivial number [let's say both 1&2 agree this is like 20% or something] of worlds my scheme does actually work- I mean how can you be that confident that inner alignment is that hard? in the world's where inner alignment turns out to be easy then my scheme will work."

2: "I'm not super confident, but if we assume that inner alignment is easy then I think many other simpler schemes will also work, so the cleverness that your proposal adds doesn't actually make a big difference."

a common discussion pattern: person 1 claims X solves/is an angle of attack on problem P. person 2 is skeptical. there is also some subproblem Q (90% of the time not mentioned explicitly). person 1 is defending a claim like "X solves P conditional on Q already being solved (but Q is easy)", whereas person 2 thinks person 1 is defending "X solves P via solving Q", and person 2 also believes something like "subproblem Q is hard". the problem with this discussion pattern is it can lead to some very frustrating miscommunication:

  • if the discussion recurses into whether Q is hard, person 1 can get frustrated because it feels like a diversion from the part they actually care about/have tried to find a solution for, which is how to find a solution to P given a solution to Q (again, usually Q is some implicit assumption that you might not even notice you have). it can feel like person 2 is nitpicking or coming up with fully general counterarguments for why X can never be solved.
  • person 2 can get frustrated because it feels like the original proposed solution doesn't engage with the hard subproblem Q. person 2 believes that assuming Q were solved, then there would be many other proposals other than X that would also suffice to solve problem P, so that the core ideas of X actually aren't that important, and all the work is actually being done by assuming Q.

random brainstorming about optimizeryness vs controller/lookuptableyness:

let's think of optimizers as things that reliably steer a broad set of initial states to some specific terminal state seems like there are two things we care about (at least):

  • retargetability: it should be possible to change the policy to achieve different terminal states (but this is an insufficiently strong condition, because LUTs also trivially meet this condition, because we can always just completely rewrite the LUT. maybe the actual condition we want is that the complexity of the map is less than the complexity of just the diff or something?) (in other words, in some sense it should be "easy" to rewrite a small subset or otherwise make a simple diff to the policy to change what final goal is achieved) (maybe related idea: instrumental convergence means most goals reuse lots of strategies/circuitry between each other)
  • robustness: it should reliably achieve its goal across a wide range of initial states.

a LUT trained with a little bit of RL will be neither retargetable nor robust. a LUT trained with galactic amounts of RL to do every possible initial state optimally is robust but not retargetable (this is reasonable: robustness is only a property of the functional behavior so whether it's a LUT internally shouldn't matter; retargetability is a property of the actual implementation so it does matter). a big search loop (the most extreme of which is AIXI, which is 100% search) is very retargetable, and depending on how hard it searches is varying degrees of robustness.

(however, in practice with normal amounts of compute a LUT is never robust, this thought experiment only highlights differences that remain in the limit)

what do we care about these properties for?

  • efficacy of filtering bad behaviors in pretraining: sufficiently good robustness means doing things that achieve the goal even in states that it never saw during training, and then even in states that require strategies that it never saw during training. if we filter out deceptive alignment from the data, then the model has to do some generalizing to figure out that this is a strategy that can be used to better accomplish its goal (as a sanity check that robustness is the thing here: a LUT never trained on deceptive alignment will never do it, but one that is trained on it will do it, a sufficiently powerful optimizer will always do it)
  • arguments about updates wrt "goal": the deceptive alignment argument hinges a lot on "gradient of the goal" making sense. for example when we argue that the gradient on the model can be decomposed into one component that updates the goal to be more correct and another component that updates the capabilities to be more deceptive, we make this assumption. even if we assume away path dependence, the complexity argument depends a lot on the complexity being roughly equal to complexity of goal + complexity of general goal seeking circuitry, independent of goal.
  • arguments about difficulty of disentangling correct and incorrect behaviors: there's a dual of retargetability which is something like the extent to which you can make narrow interventions to the behaviour. (some kind of "anti naturalness" argument)

[conjecture 1: retargetability == complexity can be decomposed == gradient of goal is meaningful. conjecture 2: gradient of goal is meaningful/complexity decomposition implies deceptive alignment (maybe we can also find some necessary condition?)]

how do we formalize retargetability?

  • maybe something like there exists a homeomorphism from the goal space to NNs with that goal
  • problem: doesn't really feel very satisfying and doesn't work at all for discrete things
  • maybe complexity: retargetable if it has a really simple map from goals to NNs with goals, conditional on another NN with that goal
  • problem: the training process of just training another NN from scratch on the new goal and ignoring the given NN could potentially be quite simple
  • maybe complexity+time: seems reasonable to assume retraining is expensive (and maybe for decomposability we also consider complexity+time)

random idea: the hypothesis that complexity can be approximately decomposed into a goal component and a reasoning component is maybe a good formalization of (a weak version of) orthogonality?

My prior, not having looked too carefully at the post or the specific projects involved, is that probably any claims that an open source model is 90% as good as GPT4 or indistinguishable are hugely exaggerated or otherwise not a fair comparison. In general in ML, confirmation bias and overclaiming is very common and as a base rate the vast majority of papers that claim some kind of groundbreaking result end up just never having any real impact.

Also, I expect facets of capabilities progress most relevant to existential risk will be especially constrained strongly by base model quality. I would agree that open source is probably better at squeezing stuff out of small models, but my model is that wrt existential risk relevant capabilities progress this is less relevant (cf the bitter lesson).

I think it's worth disentangling LLMs and Transformers and so on in discussions like this one--they are not one and the same. For instance, the following are distinct positions that have quite different implications:

  • The current precise transformer LM setup but bigger will never achieve AGI
  • A transformer trained on the language modelling objective will never achieve AGI (but a transformer network trained with other modalities or objectives or whatever will)
  • A language model with the transformer architecture will never achieve AGI (but a language model with some other architecture or training process will)

Which interventions make sense depends a lot on your precise model of why current models are not AGI, and I would consequently expect modelling things at the level of "LLMs vs not LLMs" to be less effective.

retargetability might be the distinguishing factor between controllers and optimizers

Doesn't answer your question, but we also came across this effect in the RM Goodharting work, though instead of figuring out the details we only proved that it when it's definitely not heavy tailed it's monotonic, for Regressional Goodhart (https://arxiv.org/pdf/2210.10760.pdf#page=17). Jacob probably has more detailed takes on this than me. 

In any event my intuition is this seems unlikely to be the main reason for overoptimization - I think it's much more likely that it's Extremal Goodhart or some other thing where the noise is not independent

Load More