Alignment Stream of Thought

I think it's pretty unlikely (<5%) that decentralized volunteer training will ever be competitive with SOTA. (Caveat: I haven't been following volunteer training super closely, so this take is mostly cached from having looked into it for GPT-Neo, plus occasionally seeing new papers about volunteer training.)

  1. You are going to take an insane efficiency hit from the compute having very low-bandwidth, high-latency interconnect. I think it's not inconceivable that someone will eventually figure out an algorithm that is only a few times worse than training on a cluster with good interconnect, but this is one of those things where people have tried for ages.
  2. Heterogeneous compute (not all the GPUs will be the same model), lower reliability (people turn their computers off more often than datacenters do), and having to be robust against bad actors (people could submit bad gradients), among other challenges, together add another several-times overhead.
  3. There just isn't that much volunteer hardware out there. For a rough OOM comparison, the publicly announced Facebook cluster is roughly the same size as the raw size of folding@home at its peak. All in all, I think you would need to do some serious engineering and research to get even 1% efficiency at Facebook-cluster scale.
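to make the combined hit from points 1-3 concrete, here's a toy back-of-envelope calculation; every per-factor number below is a made-up illustrative assumption, not a measurement:

```python
def effective_compute(raw_flops, interconnect_eff, reliability_eff, heterogeneity_eff):
    """Useful compute after multiplying together the independent efficiency hits."""
    return raw_flops * interconnect_eff * reliability_eff * heterogeneity_eff

# Normalize so 1.0 = the raw FLOPs of the datacenter cluster, and suppose
# (optimistically, per point 3) the volunteer pool matches it in raw FLOPs.
# The per-factor efficiencies are hypothetical:
eff = effective_compute(
    raw_flops=1.0,
    interconnect_eff=0.05,   # point 1: low-bandwidth, high-latency links
    reliability_eff=0.5,     # point 2: churn, dropped machines, bad gradients
    heterogeneity_eff=0.5,   # point 2: mixed GPU models pacing each other
)
print(f"effective fraction of an equal-sized datacenter cluster: {eff:.4f}")
```

with these made-up numbers you land at roughly the ~1% efficiency figure above; the point is just that the overheads multiply.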

(folding@home is embarrassingly parallel because it requires basically no interconnect, and therefore also doesn't mind heterogeneous compute or unreliability)

here's a straw hypothetical example where I've exaggerated both 1 and 2; the details aren't exactly right, but the vibe is what matters:

1: "Here's a super clever extension of debate that mitigates obfuscated arguments [etc], this should just solve alignment"

2: "Debate works if you can actually set the goals of the agents (i.e. you've solved inner alignment), but otherwise you can get issues with the agents coordinating [etc]"

1: "Well the goals have to be inside the NN somewhere so we can probably just do something with interpretability or whatever"

2: "how are you going to do that? your scheme doesn't tackle inner alignment, which seems to contain almost all of the difficulty of alignment to me. the claim you just made is a separate claim from your main scheme, and the cleverness in your scheme is in a direction orthogonal to this claim"

1: "idk, also that's a fully general counterargument to any alignment scheme, you can always just say 'but what if inner misalignment'. I feel like you're not really engaging with the meat of my proposal, you've just found a thing you can say to be cynical and dismissive of any proposal"

2: "but I think most of the difficulty of alignment is in inner alignment, and schemes which kinda handwave it away are trying to solve some problem which is not the actual problem we need to solve to not die from AGI. I agree your scheme would work if inner alignment weren't a problem."

1: "so you agree that in a pretty nontrivial number [let's say both 1&2 agree this is like 20% or something] of worlds my scheme does actually work. I mean, how can you be that confident that inner alignment is that hard? in the worlds where inner alignment turns out to be easy, my scheme will work."

2: "I'm not super confident, but if we assume that inner alignment is easy then I think many other simpler schemes will also work, so the cleverness that your proposal adds doesn't actually make a big difference."

a common discussion pattern: person 1 claims X solves/is an angle of attack on problem P. person 2 is skeptical. there is also some subproblem Q (90% of the time not mentioned explicitly). person 1 is defending a claim like "X solves P conditional on Q already being solved (but Q is easy)", whereas person 2 thinks person 1 is defending "X solves P via solving Q", and person 2 also believes something like "subproblem Q is hard". the problem with this discussion pattern is it can lead to some very frustrating miscommunication:

  • if the discussion recurses into whether Q is hard, person 1 can get frustrated because it feels like a diversion from the part they actually care about/have tried to find a solution for, which is how to find a solution to P given a solution to Q (again, usually Q is some implicit assumption that you might not even notice you have). it can feel like person 2 is nitpicking, or coming up with fully general counterarguments for why X can never work.
  • person 2 can get frustrated because it feels like the original proposed solution doesn't engage with the hard subproblem Q. person 2 believes that assuming Q were solved, then there would be many other proposals other than X that would also suffice to solve problem P, so that the core ideas of X actually aren't that important, and all the work is actually being done by assuming Q.

random brainstorming about optimizeryness vs controller/lookuptableyness:

let's think of optimizers as things that reliably steer a broad set of initial states to some specific terminal state. seems like there are two things we care about (at least):

  • retargetability: it should be possible to change the policy to achieve different terminal states. (this alone is an insufficiently strong condition, because LUTs also trivially meet it: we can always just completely rewrite the LUT. maybe the actual condition we want is that the complexity of the map is less than the complexity of just the diff, or something? in other words, in some sense it should be "easy" to rewrite a small subset of the policy, or otherwise make a simple diff to it, to change what final goal is achieved) (maybe related idea: instrumental convergence means most goals reuse lots of strategies/circuitry between each other)
  • robustness: it should reliably achieve its goal across a wide range of initial states.

a LUT trained with a little bit of RL will be neither retargetable nor robust. a LUT trained with galactic amounts of RL to handle every possible initial state optimally is robust but not retargetable (this is reasonable: robustness is a property of the functional behavior only, so whether it's a LUT internally shouldn't matter; retargetability is a property of the actual implementation, so it does matter). a big search loop (the most extreme of which is AIXI, which is 100% search) is very retargetable, and, depending on how hard it searches, has varying degrees of robustness.

(however, in practice with normal amounts of compute a LUT is never robust, this thought experiment only highlights differences that remain in the limit)
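a toy sketch of the contrast (my own hypothetical setup, just to make "retargetability is about the implementation" concrete): a search-based policy on a tiny gridworld is retargeted by swapping one argument, while a LUT with identical functional behavior needs a bunch of its entries rewritten:

```python
STATES = range(10)          # tiny 1-D gridworld
ACTIONS = (-1, 0, +1)

def search_policy(state, goal):
    """Pure search: pick the action that moves closest to the goal state."""
    return min(ACTIONS, key=lambda a: abs((state + a) - goal))

def build_lut(goal):
    """A lookup-table policy: the goal is baked into every entry."""
    return {s: search_policy(s, goal) for s in STATES}

lut_for_3 = build_lut(goal=3)
lut_for_7 = build_lut(goal=7)

# Retargeting the search policy is a one-argument diff: search_policy(s, 7).
# Retargeting the LUT means rewriting every entry whose action changed:
changed = sum(lut_for_3[s] != lut_for_7[s] for s in STATES)
print(f"LUT entries rewritten to retarget from goal 3 to goal 7: {changed}/{len(STATES)}")
```

the diff to the search policy is constant-size no matter how big the state space gets, whereas the LUT diff grows with it.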

what do we care about these properties for?

  • efficacy of filtering bad behaviors in pretraining: sufficiently good robustness means doing things that achieve the goal even in states it never saw during training, and then even in states that require strategies it never saw during training. if we filter deceptive alignment out of the data, then the model has to do some generalizing to figure out that this is a strategy it can use to better accomplish its goal. (as a sanity check that robustness is the relevant property here: a LUT never trained on deceptive alignment will never do it; one that is trained on it will do it; a sufficiently powerful optimizer will always do it)
  • arguments about updates wrt "goal": the deceptive alignment argument hinges a lot on "gradient of the goal" making sense. for example when we argue that the gradient on the model can be decomposed into one component that updates the goal to be more correct and another component that updates the capabilities to be more deceptive, we make this assumption. even if we assume away path dependence, the complexity argument depends a lot on the complexity being roughly equal to complexity of goal + complexity of general goal seeking circuitry, independent of goal.
  • arguments about difficulty of disentangling correct and incorrect behaviors: there's a dual of retargetability, which is something like the extent to which you can make narrow interventions to the behavior. (some kind of "anti-naturalness" argument)

[conjecture 1: retargetability == complexity can be decomposed == gradient of goal is meaningful. conjecture 2: gradient of goal is meaningful/complexity decomposition implies deceptive alignment (maybe we can also find some necessary condition?)]

how do we formalize retargetability?

  • maybe something like: there exists a homeomorphism from the goal space to NNs with that goal
    • problem: doesn't really feel very satisfying, and doesn't work at all for discrete things
  • maybe complexity: retargetable if there is a really simple map from goals to NNs with those goals, conditional on another NN with that goal
    • problem: the training process of just training another NN from scratch on the new goal and ignoring the given NN could potentially be quite simple
  • maybe complexity+time: seems reasonable to assume retraining is expensive (and maybe for decomposability we also consider complexity+time)
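one way to write the complexity+time option down (this is my own notation sketch using time-bounded Kolmogorov complexity, not anything established):

```latex
% Sketch: the family $\{N_g\}$ is retargetable if, for goals $g, g'$,
\[
  K^t\!\left(N_{g'} \mid N_g,\, g'\right) \;\ll\; K^t\!\left(N_{g'} \mid g'\right)
\]
% where $K^t$ is time-bounded Kolmogorov complexity. The right-hand side is
% roughly the cost of producing $N_{g'}$ from scratch; the time bound $t$ is
% what blocks the "ignore $N_g$ and just retrain" loophole, since retraining
% is assumed to take more than $t$ steps.
```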

random idea: the hypothesis that complexity can be approximately decomposed into a goal component and a reasoning component is maybe a good formalization of (a weak version of) orthogonality?

My prior, not having looked too carefully at the post or the specific projects involved, is that any claims that an open source model is 90% as good as GPT4, or indistinguishable from it, are probably hugely exaggerated or otherwise not a fair comparison. In general in ML, confirmation bias and overclaiming are very common, and as a base rate the vast majority of papers claiming some kind of groundbreaking result end up never having any real impact.

Also, I expect the facets of capabilities progress most relevant to existential risk to be especially strongly constrained by base model quality. I would agree that open source is probably better at squeezing stuff out of small models, but my model is that wrt existential-risk-relevant capabilities progress this is less relevant (cf the bitter lesson).

I think it's worth disentangling LLMs and Transformers and so on in discussions like this one--they are not one and the same. For instance, the following are distinct positions that have quite different implications:

  • The current precise transformer LM setup but bigger will never achieve AGI
  • A transformer trained on the language modelling objective will never achieve AGI (but a transformer network trained with other modalities or objectives or whatever will)
  • A language model with the transformer architecture will never achieve AGI (but a language model with some other architecture or training process will)

Which interventions make sense depends a lot on your precise model of why current models are not AGI, and I would consequently expect modelling things at the level of "LLMs vs not LLMs" to be less effective.

retargetability might be the distinguishing factor between controllers and optimizers

Doesn't answer your question, but we also came across this effect in the RM Goodharting work, though instead of figuring out the details we only proved that when the error is definitely not heavy-tailed, the behavior is monotonic, for Regressional Goodhart (https://arxiv.org/pdf/2210.10760.pdf#page=17). Jacob probably has more detailed takes on this than me.

In any event, my intuition is that this seems unlikely to be the main reason for overoptimization; I think it's much more likely that it's Extremal Goodhart, or some other thing where the noise is not independent.
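here's my own quick simulation of the regressional Goodhart setup (a sketch of the tail-dependence point, not the paper's proof): proxy = true value + independent error, select the best-of-n by proxy, look at the true value of the selected point. with light-tailed (Gaussian) error the selected true value keeps climbing with n; with heavy-tailed (Cauchy) error the max proxy is mostly error, so the selected true value stalls near the prior mean:

```python
import math
import random

def selected_true_value(n, error_sampler, trials=500, seed=0):
    """Mean true value of the best-of-n point selected by proxy = true + error."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        best_proxy = -float("inf")
        best_true = 0.0
        for _ in range(n):
            v = rng.gauss(0.0, 1.0)          # true value ~ N(0, 1)
            p = v + error_sampler(rng)       # proxy = true + independent error
            if p > best_proxy:
                best_proxy, best_true = p, v
        total += best_true
    return total / trials

gaussian = lambda rng: rng.gauss(0.0, 1.0)                       # light-tailed
cauchy = lambda rng: math.tan(math.pi * (rng.random() - 0.5))    # heavy-tailed

for n in [10, 100, 1000]:
    g = selected_true_value(n, gaussian)
    c = selected_true_value(n, cauchy)
    print(f"n={n:5d}  selected true value: gaussian error {g:+.2f}, cauchy error {c:+.2f}")
```

(this only covers the independent-noise case; the Extremal Goodhart / dependent-noise story would need a different setup.)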

takes on takeoff (or: Why Aren't The Models Mesaoptimizer-y Yet)

here are some reasons we might care about discontinuities:

  • alignment techniques that apply before the discontinuity may stop applying after / become much less effective
    • makes it harder to do alignment research before the discontinuity that transfers to after the discontinuity (because there is something qualitatively different after the jump)
    • second order effect: may result in false sense of security
  • there may be less/negative time between a warning shot and the End
    • harder to coordinate and slow down
    • harder to know when the End Times are coming
  • alignment techniques that rely on systems supervising slightly smarter systems (i.e RRM) depend on there not being a big jump in capabilities

I think these capture 90% of what I care about when talking about fast/slow takeoff, with the first point taking up a majority

(it comes up a lot in discussions that it seems like I can't quite pin down exactly what my interlocutor's beliefs on fastness/slowness imply. if we can fully list out all the things we care about, we can screen off any disagreement about definitions of the word "discontinuity")

some things that seem probably true to me and which are probably not really cruxes:

  • there will probably be a pretty big amount of AI-caused economic value and even more investment into AI, and AGI in particular (not really a bold prediction, given the already pretty big amount of these things! but a decade ago it may have been plausible nobody would care about AGI until the End Times, and this appears not to be the case)
  • continuous changes of inputs like compute or investment or loss (not technically an input, but whatever) can result in discontinuous jumps in some downstream metric (accuracy on some task, number of worlds paperclipped)
  • almost every idea is in some sense built on some previous idea, but this is not very useful because there exist many ideas [citation needed] and it's hard to tell which ones will be built on to create the idea that actually works (something something hindsight bias). this means you can't reason about how they will change alignment properties, or use them as a warning shot
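the second bullet can be illustrated with a toy model (all numbers made up): a loss that falls as a smooth power law in compute, and a downstream task whose accuracy is a sharp (but continuous) logistic in that loss, so the downstream metric looks like a jump even though every input curve is smooth:

```python
import math

def loss(compute):
    """Smooth power law: loss falls slowly and continuously in compute."""
    return compute ** -0.1

def task_accuracy(compute, k=50.0, threshold=0.5):
    """Logistic in the loss: sharp but continuous. On a compute axis spanning
    several OOMs this looks like a discontinuous jump."""
    return 1.0 / (1.0 + math.exp(k * (loss(compute) - threshold)))

for e in [2, 3, 4, 5]:
    c = 10.0 ** e
    print(f"compute=1e{e}  loss={loss(c):.3f}  acc={task_accuracy(c):.3f}")
```

the loss only changes by about 2x across these four OOMs of compute, while task accuracy goes from roughly 0 to roughly 1.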

possible sources of discontinuity:

  • breakthroughs: at some point, some group discovers a brand new technique that nobody had ever thought of before / nobody had made work before because they were doing it wrong in some way / "3 hackers in a basement invent AGI"
    • depends on how efficient you think the research market is. I feel very uncertain about this
    • importantly I think cruxes here may result in other predictions about how efficient the world is generally, in ways unrelated to AI, and which may make predictions before the End Times
    • seems like a subcrux of this is whether the new technique immediately works very well or if it takes a nontrivial amount of time to scale it up to working at SOTA scale
  • overdetermined "breakthroughs": some technique that didn't work (and couldn't have been made to work) at smaller scales starts working at larger scales. lots of people independently would have tried the thing
    • importantly, under this scenario it's possible for something to simultaneously (a) be very overdetermined (b) have very different alignment properties
    • very hard to know which of the many ideas that don't work might be the one that suddenly starts working with a few more OOMs of compute
  • at some scale, there is just some kind of grokking without any change in techniques, and the internal structure and generalization properties of the networks changes a lot. trends break because of some deep change in the structure of the network
    • mostly isomorphic to the previous scenario actually
    • for example, in worlds where deceptive alignment happens because at x params suddenly it groks to mesaoptimizer-y structure and the generalization properties completely change
  • at some scale, there is "enough" to hit some criticality threshold of some kind of thing the model already has. the downstream behavior changes a lot but the internal structure doesn't change much beyond the threshold. importantly while obviously some alignment strategies would break, there are potentially invariants that we can hold onto
    • for example, in worlds where deceptive alignment happens because of ontology mismatch and ontologies get slowly more mismatched with scale, and then past some threshold it snaps over to the deceptive generalization

I think these can be boiled down to 3 more succinct scenario descriptions:

  • breakthroughs that totally change the game unexpectedly
  • mechanistically different cognition suddenly working at scale
  • more of the same cognition is different