eggsyntax

AI safety & alignment researcher


Comments

Minor suggestion: link to more info on LLC (or at least write it out as 'local learning coefficient') the first time it's mentioned -- I see that you do that further down, but it's a bit confusing on first read.

Not consciously an inspiration for me, but definitely a similar idea, and applies pretty cleanly to a system with only a couple of layers; I'll add some clarification to the post re more complex systems where that analogy might not hold quite as well.

I might think of some of your specific examples a bit differently, but yeah, I would say that a particular component can be tooling relative to the LLM above it, and scaffolding relative to the LLM below. I'll add some clarification to the post, thanks!

Oh great, I'm glad it helped!

how to guide the pretraining process in a way that benefits alignment

One key question here, I think: a major historical alignment concern has been that for any given finite set of outputs, there are an unbounded number of functions that could produce it, and so it's hard to be sure that a model will generalize in a desirable way. Nora Belrose goes so far as to suggest that 'Alignment worries are quite literally a special case of worries about generalization.' This is relevant for post-training but I think even more so for pre-training.

I know that there's been research into how neural networks generalize both from the AIS community and the larger ML community, but I'm not very familiar with it; hopefully someone else can provide some good references here.
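As a toy illustration of that underdetermination point (my own sketch, nothing specific to neural networks): here are two functions that agree exactly on a finite set of 'training' inputs but generalize very differently off it.

```python
# Two different functions produce identical outputs on a finite "training set"
# but diverge off that set -- a toy version of the generalization worry above.
import math

def f(x):
    return x

def g(x):
    # Agrees with f at every integer input (sin(k*pi) = 0), diverges elsewhere.
    return x + math.sin(math.pi * x)

train_inputs = [0, 1, 2, 3, 4]          # finite set of observed inputs
assert all(abs(f(x) - g(x)) < 1e-9 for x in train_inputs)

test_input = 2.5                         # off the training set
print(f(test_input), g(test_input))      # 2.5 vs 3.5 -- same data, different generalization
```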

Terminology proposal: scaffolding vs tooling.

I haven't seen these terms consistently defined with respect to LLMs. I've been using, and propose standardizing on:

  • Tooling: affordances that an LLM can call out to, eg ChatGPT plugins.
  • Scaffolding: an outer process that calls LLMs, where the bulk of the intelligence comes from the called LLMs, eg AutoGPT.

Some smaller details:

  • If the scaffolding itself becomes as sophisticated as the LLMs it calls (or more so), we should start focusing on the system as a whole rather than just describing it as a scaffolded LLM.
  • This terminology is relative to a particular LLM. In a complex system (eg a treelike system with code calling LLMs calling code calling LLMs calling...), some particular component can be tooling relative to the LLM above it, and scaffolding relative to the LLM below.
  • It's reasonable to think of a system as scaffolded if the outermost layer is a scaffolding layer.
  • There are other possible categories that don't fit this as neatly, eg LLMs calling each other as peers without a main outer process, but I expect these definitions to cover most real-world cases. (A minimal code sketch of the distinction follows below.)
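Here's that sketch; all the names (`call_llm`, `calculator_tool`, `scaffold`) are hypothetical placeholders rather than any real API. The tool sits below the LLM and gets invoked by it; the scaffold sits above the LLM and drives the loop, while the LLM supplies most of the intelligence.

```python
def call_llm(prompt: str) -> str:
    """Placeholder for an LLM call; the intelligence lives here."""
    raise NotImplementedError

# Tooling: an affordance the LLM can invoke. The LLM sits *above* it and
# decides when to call it.
def calculator_tool(expression: str) -> str:
    return str(eval(expression))  # toy example only

# Scaffolding: an outer process that calls the LLM. The LLM sits *below* it;
# the loop just sequences the calls.
def scaffold(task: str, max_steps: int = 5) -> str:
    state = task
    for _ in range(max_steps):
        response = call_llm(f"Current state: {state}\nWhat next?")
        if response.startswith("CALC:"):
            state += "\n" + calculator_tool(response.removeprefix("CALC:"))
        elif response.startswith("DONE:"):
            return response.removeprefix("DONE:")
        else:
            state += "\n" + response
    return state
```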

Thanks to @Andy Arditi for helping me nail down the distinction.

In general, I agree, though there are notable exceptions.

Totally, good point!

[Epistemic note: I'm going past my expertise here, just giving my own current understanding, I encourage people with more expertise on this (possibly including you) to correct me]

Merrill and Sabharwal have done some very useful-seeming work on how strong transformers can be, both under auto-regressive conditions ('The Expressive Power of Transformers with Chain of Thought') and in a single forward pass with assumptions about precision (eg 'The Parallelism Tradeoff: Limitations of Log-Precision Transformers'), although I haven't gone through it in detail. It's unambiguously true that there are limitations to what can be done in a single forward pass.

Therefore: If we want an agent that can plan over an unbounded number of steps (e.g. one that does tree-search)

As a side note, I'm not sure how tree search comes into play; in what way does tree search require unbounded steps that doesn't apply equally to linear search?

If we want an agent that can plan over an unbounded number of steps...it will need some component that can do an arbitrary number of iterative or recursive steps.

No finite agent, recursive or otherwise, can plan over an unbounded number of steps in finite time, so it's not immediately clear to me how iteration/recursion is fundamentally different in practice. I think the better comparison is the power of a transformer under auto-regressive conditions (note that intermediate steps don't need to be shown to a user as output). The first paper linked above shows that given polynomial intermediate steps, transformers can handle exactly the class of problems that can be solved in polynomial time, aka P. So under those conditions they're pretty strong and general (which certainly doesn't mean that LLMs trained on next-token prediction are that strong and general).

One useful way to think about it, for me, is that auto-regression is a form of recursion, albeit one that's bounded in current architecture by the context length.
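To make that framing concrete, here's a rough sketch (mine; `forward_pass` is a stand-in for a single transformer forward pass, not a real API): the decoding loop is what supplies the iteration, and the context window is what bounds it.

```python
from typing import Callable, List

def autoregress(forward_pass: Callable[[List[int]], int],
                prompt: List[int],
                context_length: int,
                stop_token: int) -> List[int]:
    """Repeatedly feed the model its own output -- recursion in effect,
    but cut off once the context window is full."""
    tokens = list(prompt)
    while len(tokens) < context_length:
        next_token = forward_pass(tokens)
        tokens.append(next_token)
        if next_token == stop_token:
            break
    return tokens

# Toy usage with a fake "model" that just counts upward:
print(autoregress(lambda ts: ts[-1] + 1, [0], context_length=8, stop_token=5))
# -> [0, 1, 2, 3, 4, 5]
```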

[Pasting in some of my responses from elsewhere in case they serve as a useful seed for discussion]

I would say that 3 is one of the core problems in reinforcement learning, but doesn't especially depend on 1 and 2.

To me, these 3 points together imply that, in principle, it's impossible to design an AI that doesn't game our goals (because we can't know with perfect precision everything that matters to achieve these goals).

How about a thermostat? It's an awfully simple AI (and not usually created via RL, though it could be), but I can specify my goals in a way that won't be gamed, without having to know all the future temperatures of the room, because I can cleanly describe the things I care about.
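For concreteness, a thermostat's control rule really is about this small (a sketch; the 20.0 setpoint is arbitrary). Everything I care about is right there in the comparison, with no need to forecast future temperatures:

```python
def thermostat_step(current_temp: float, setpoint: float = 20.0) -> bool:
    """Return True to turn the heater on, False to turn it off.
    The whole goal specification is the comparison below."""
    return current_temp < setpoint
```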

To be clear, I definitely think that as loss functions become more complex, and become less perfect proxies for what you really want, you absolutely start running into problems with specification gaming and goal misgeneralization. My goal with the thermostat example is just to point out that that isn't (as far as I can see) because of a fundamental limit in how precisely you can predict the future.

Alejandro: Fair point! I think I should've specified that I meant this line of reasoning as a counterargument to advanced general-purpose AIs, mainly.

I suspect the same argument holds there, that the problems with 3 aren't based on 1 and 2, although it's harder to demonstrate with more advanced systems.

Here's maybe one view on it: suppose for a moment that we could perfectly forecast the behavior of physical systems into the future (with the caveat that we couldn't use that to perfectly predict an AI's behavior, since otherwise we've assumed the problem away). I claim that we would still have the same kinds of risks from advanced RL-based AI that we have now, because we don't have a reliable way to clearly specify our complete preferences and have the AI correctly internalize them.

(Unless the caveated point is exactly what you're trying to get at, but I don't think that anyone out there would say that advanced AI is safe because we can perfectly predict the physical systems that include them, since in practice we can't even remotely do that.)

I'll register that I like the name, and think it's an elegant choice because of those two references both being a fit (plus a happy little call-out to The Golden Compass). If the app gets wildly popular it could be worth changing, but at this point I imagine you're at least four orders of magnitude away from reaching an audience that'll get weirded out by them being called dæmons.
