Wiki Contributions



I feel like the biggest issue with aligning powerful AI systems, is that nearly all the features we'd like these systems to have, like being corrigible, not being deceptive, having values aligned with ours etc, are properties we are currently unable to state formally. They are clearly real properties, like humans can agree on examples of non-corrigibility, misalignment, dishonest, when shown examples of actions AIs could take. But we can't put them in code or a program specification, and consequently can't reason about them very precisely, test whether systems have them or not etc

One reason I'm very bullish on mechinterp is that it seems like the only natural pathway towards making progress on this. Transformers trained with RLHF do have "tendencies" and proto-values in a sense, figuring out how those proto-desires are represented internally, really understanding it, I believe will shed a lot of light on how values form in transformers, will necessarily entail getting a solid formal framework for reasoning aobut these processes, and will put the notions of alignment on much firmer ground. Same goes for the other features. Models already show deceptive tendencies. In the process of developing deep mechinterp understanding of that, I believe we'd gain better understanding of how deception in a neural net can be modeled formally, which would allow us to reason about it infinitely better.

(I mean, someone 300IQ might come along and just galaxy brain all this from first principles, but quite galaxy brained people have tried already.. The point is that if mechinterp was developed to a sophisticated enough level, in addition to all the good things listed already, it would shed a lot of conceptual clarity on many of the key notions, which we are currently stuck reasoning about on an informal level, and I think we will get there through incremental progress, without having to hope someone just figures it out by thinking really hard and having an einstein-tier insight).



Other people were commending your tabooing of words, but I feel using terms like "multi-layer parameterized graphical function approximator" fails to do that, and makes matters worse because it leads to non-central fallacy-ing. It'd been more appropriate to use a term like "magic" or "blipblop". Calling something a function appropriator leads to readers carrying a lot of associations into their interpretation, that probably don't apply to deep learning, as deep learning is a very specific example of function approximation, that deviates from the prototypical examples in many respects. (I think when you say "function approximator" the image that pops into most peoples head is fitting a polynomial to a set of datapoints in R^2)

Calling something a function approximator is only meaningful if you make a strong argument for why a function approximator cant (or at least is systematically unlikely to) give rise to specific dangerous behaviors or capabilities. But I don't see you giving such arguments in this post. Maybe I did not understand it. In either case, you can read posts like Gwern's "Tools want to be agents" or Yudkowsky's writings, explaining why goal directed behavior is a reasonable thing to expect to arise from current ML, and you can replace every instance of "neural network" / "AI" with "multi-layer parameterized graphical function approximator", and I think you'll find that all the arguments make just as much sense as they did before. (modulo some associations seeming strange, but like I said, I think thats because there is some non-central fallacying going on).