Consider this abridged history of recent ML progress:

A decade or two ago, computer vision was a field that employed dedicated researchers who designed specific, increasingly complex feature recognizers (SIFT, SURF, HOG, etc.). These were usurped by deep CNNs with fully learned features in the 2010s[1], which subsequently saw success in speech recognition, various NLP tasks, and much of AI, competing with other general ANN models, namely various RNNs and LSTMs. Then SOTA in CNNs and NLP evolved separately towards increasingly complex architectures until the simpler/more general transformers took over NLP and quickly spread to other domains (even RL), there also often competing with newer simple/general architectures arising within those domains, such as MLP-mixers in vision. Waves of colonization in design-space.

So the pattern is: increasing human optimization power steadily pushing up architecture complexity is occasionally upset/reset by a new, simpler, more general model, one that substitutes automated machine optimization power for human optimization power[2], enabled by improved compute scaling, à la the bitter lesson. DL isn't just a new AI/ML technique, it's a paradigm shift.

Ok, fine, then what's next?

All of these models, from the earliest deep CNNs on GPUs up to GPT-3 and EfficientZero, generally have a few major design components that haven't much changed:

  1. Human designed architecture, rather than learned or SGD-learnable-at-all
  2. Human designed backprop SGD variant (with only a bit of evolution from vanilla SGD to Adam & friends)

Obviously there are research tracks in DL such as AutoML/Arch-search and Meta-learning aiming to automate the optimization of architecture and learning algorithms. They just haven't dominated yet.

So here is my hopefully-now-obvious prediction: in this new decade internal meta-optimization will take over, eventually leading to strongly recursively self optimizing learning machines: models that have broad general flexibility to adaptively reconfigure their internal architecture and learning algorithms dynamically based on the changing data environment/distribution and available compute resources[3].

If we just assume for a moment that the strong version of this hypothesis is correct, it suggests some pessimistic predictions for AI safety research:

  1. Interpretability will fail - future DL descendant is more of a black box, not less
  2. Human designed architectural constraint fails, as human designed architecture fails
  3. IRL/Value Learning is far more difficult than first appearances suggest, see #2
  4. Progress is hyper-exponential, not exponential. Thus trying to trend-predict DL superintelligence from transformer scaling is more difficult than trying to predict transformer scaling from pre-2000-ish ANN tech, long before rectifiers and deep-layer training tricks.
  5. Global political coordination on constraints will likely fail, due to #4 and innate difficulty.

There is an analogy here to the history-revision attack against Bitcoin. Bitcoin's security derives from the computational sacrifice invested into the longest chain. But Moore's Law leads to an exponential decrease in the total cost of that sacrifice over time, which when combined with an exponential increase in total market cap, can lead to the surprising situation where recomputing the entire PoW history is not only plausible but profitable.[4]

In 2010 few predicted that computer Go would beat a human champion just 5 years hence[5], and far fewer (or none) predicted that a future successor of that system would do much better by relearning the entire history of Go strategy from scratch, essentially throwing out the entire human tech tree [6].

So it's quite possible that future meta-optimization throws out the entire human architecture/algorithm tech tree for something else substantially more effective[7]. The circuit algorithmic landscape lacks most of the complexity of the real world, and in that sense is arguably much more similar to Go or chess. Humans are general enough learning machines to do reasonably well at anything, but we can only apply a fraction of our brain capacity to such an evolutionarily novel task, and tend to lose out to more specialized scaled-up DL algorithms long before said algorithms outcompete humans at all tasks, or even everyday tasks.

Yudkowsky anticipated recursive self-improvement would be the core thing that enables AGI/superintelligence. Reading over that 2008 essay now in 2021, I think he mostly got the gist of it right, even if he didn't foresee/bet that connectionism would be the winning paradigm. EY2008 seems to envision RSI as an explicit cognitive process where the AI reads research papers, discusses ideas with human researchers, and rewrites its own source code.

Instead in the recursive self-optimization through DL future we seem to be careening towards, the 'source code' is the ANN circuit architecture (as or more powerful than code), and reading human papers, discussing research: all that is unnecessary baggage, as unnecessary as it was for AlphaGo Zero to discuss Go with human Go experts over tea or study their games over lunch. History-revision attack, incoming.

So what can we do? In the worst case we have near-zero control over AGI architecture or learning algorithms. So that only leaves initial objective/utility functions, compute and training environment/data. Compute restriction is obvious and has an equally obvious direct tradeoff with capability - not much edge there.

Even a super powerful recursive self-optimizing machine initially starts with some seed utility/objective function at the very core. Unfortunately it increasingly looks like efficiency strongly demands some form of inherently unsafe self-motivation utility function, such as empowerment or creativity, and self-motivated agentic utility functions are the natural strong attractor[8].

Control over training environment/data is a major remaining lever that doesn't seem to be explored much, and probably has better capability/safety tradeoffs than compute. What you get out of the recursive self optimization or universal learning machinery is always a product of the data you put in, the embedded environment; that is ultimately what separates Go bots, image detectors, story writing AI, feral children, and unaligned superintelligences.

And then finally we can try to exert control on the base optimizer, which in this case is the whole technological research industrial economy. Starting fresh with a de novo system may be easier than orchestrating a coordination miracle from the current Powers.


  1. AlexNet is typically considered the turning point, but the transition started earlier; sparse coding and RBMs are two examples of successful feature learning techniques pre-DL. ↩︎

  2. If you go back far enough, the word 'computer' itself originally denoted a human occupation! This trend is at least a century old. ↩︎

  3. DL ANNs do a form of approximate Bayesian updating over the implied circuit architecture space with every backprop update, which is already a limited form of self-optimization. ↩︎

  4. Blockchain systems have a simple defense against history-revision attacks: checkpointing. Unfortunately that doesn't have a realistic equivalent in our case - we don't control the timestream. ↩︎

  5. My system-1 somehow did in this 2010 LW comment. ↩︎

  6. I would have bet against this; AlphaGo Zero surprised me far more than AlphaGo. ↩︎

  7. Quite possible != inevitable. There is still a learning efficiency gap vs the brain, and I have uncertainty over how quickly we will progress past that gap, and what happens after. ↩︎

  8. Tool-AI, like GPT-3, is a form of capability constraint, but economic competition is always pressuring tool-AIs to become agent-AIs. ↩︎


22 comments

I don't think the scenario you describe is as bad for interpretability as you assume. In fact, self-optimizing systems may even be more interpretable than current systems. E.g., current systems use a single channel for all their computation. This causes them to mix conceptually different types of computation together in a way that's very difficult to unravel. In contrast, the brain has sub-regions specializing in different types of computation (vision, hearing, reward calculation, etc). I expect self-optimizing systems will do something similar.

Also, I'm not sure how much being able to design the architecture/optimizer helps us in interpretability. In my recent post, I argue the brain is in many ways more interpretable than current deep learning systems. We still don't understand the brain's learning algorithm and there are many difficulties associated with studying the brain, yet brain interpretability research is surprisingly advanced.

Most ML interpretability research tends to rely on examining patterns in model internal representations, feature visualizations, gradient-based attribution of inputs/neurons, training classifiers on model internal representations, or studying black box input/output patterns. I think most of those can be adapted to self-optimizing models, though we may have to do some work to find a useful gradient-equivalent if the system messes with the optimizer too much (note that we don't have any gradient-equivalent in neuroscience).

Finally, I think there are many options available for improving interpretability in ML systems, none of which we are using for current state of the art systems. Just switching from L2 to L1 regularization and removing dropout would probably lead to much sparser internal representations. I think there are also ways of directly training models to be more interpretable. I describe a way to use current interpretability techniques to generate an estimator for model interpretability in the same post linked above. We could then include that signal in the training objective for self-optimizing systems. Hopefully, the system learns to be more accessible to the interpretability techniques we use.
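
A hedged illustration of that L1-vs-L2 claim (a toy linear model in numpy; `fit` and all constants here are made up for the sketch, not from any real DL system): the L1 proximal step drives weights on irrelevant features exactly to zero, while L2 weight decay only shrinks them.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: y depends on only 2 of 10 features.
X = rng.normal(size=(200, 10))
true_w = np.zeros(10)
true_w[:2] = [3.0, -2.0]
y = X @ true_w + 0.1 * rng.normal(size=200)

def fit(penalty, lam=0.5, lr=0.01, steps=2000):
    """Gradient descent on squared error, plus an L1 or L2 penalty."""
    w = np.zeros(10)
    for _ in range(steps):
        w -= lr * (X.T @ (X @ w - y) / len(y))  # smooth-loss step
        if penalty == "l2":
            w -= lr * lam * w                    # weight decay: shrinks
        else:                                    # "l1": proximal step,
            w = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)  # zeros out
    return w

w_l2, w_l1 = fit("l2"), fit("l1")
print("near-zero weights under L2:", int(np.sum(np.abs(w_l2) < 1e-3)))
print("near-zero weights under L1:", int(np.sum(np.abs(w_l1) < 1e-3)))
```

The same soft-thresholding mechanism is why L1 regularization on activations would yield sparser internal representations in a network.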

Another option: given a large, "primary" model, you could train multiple smaller "secondary" models (with different architectures) to imitate the primary model (knowledge distillation), then train the primary model to improve the secondary models' imitation performance. This should cause the primary model to learn internal representations that are more easily learned by other models of various architectures, and are hopefully more interpretable to humans. If you then assume the primary model is a self-optimizing system, this approach becomes even more promising because now the self-optimizing system is actively looking for architectures that are easy for weaker models to understand.

This approach is even defensible from a purely profit-seeking / competitive point of view because it's a good idea to have your most powerful model be a good teacher to smaller models. That way, you can more easily distill its capabilities into a cheaper system and save money on compute.
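
A minimal sketch of the distillation signal this scheme would reuse (numpy; `distill_loss` and the toy logits are hypothetical names/values, not from any existing codebase): the standard knowledge-distillation objective is a KL divergence between temperature-softened teacher and student outputs.

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distill_loss(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student) over temperature-softened outputs --
    the standard knowledge-distillation objective."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return float(np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=-1)))

teacher      = np.array([[4.0, 1.0, 0.5]])
good_student = np.array([[3.9, 1.1, 0.4]])  # closely imitates the teacher
bad_student  = np.array([[0.5, 4.0, 1.0]])  # disagrees with the teacher

print(distill_loss(good_student, teacher))  # small
print(distill_loss(bad_student, teacher))   # much larger
```

Under the proposal above, the primary model's training objective would gain an extra term, the sum of this loss over several secondary students, so gradients push the primary toward representations that weaker models (and hopefully humans) can pick up easily.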

Some anecdotal evidence: in the last few months I was able to improve on three 2021 conference-published, peer-reviewed DL papers. In each case, the reason I was able to do it was that the authors did not fully understand why the technique they used worked and obviously just wrote a paper around something that they experimentally found to be working. In addition, there are two pretty obvious bugs in a reasonably popular optimization library (100+ github stars) that reduce performance and haven't been fixed or noticed in "Issues" for a long time. Seems that none of its users went step-by-step or tried to carefully understand what was going on.

What all four of these have in common is that they are still actually working, just not optimally. Their experimental results are not fake. This does not fill me with hope for the future of interpretability.

In addition, there are two pretty obvious bugs in a reasonably popular optimization library (100+ github stars) that reduce performance and haven't been fixed or noticed in "Issues" for a long time.

Karpathy's law: "neural nets want to work". This is another source of capabilities jumps: where the capability 'existed', but there was just a bug that crippled it (eg R2D2) with a small, often one-liner, fix.

The more you have a self-improving system that feeds back into itself hyperbolically, the more it functions end-to-end and removes the hardwired (human-engineered) parts that Amdahl's-laws the total output, the more you may go from "pokes around doing nothing much, diverging half the time, beautiful idea, too bad it doesn't work in the real world" to "FOOM". (This is also the model of the economy that things like Solow growth models usually lead to: humanity or Europe pokes around doing nothing much discernible, nothing anyone like chimpanzees or the Aztec Empire should worry about, until...)

Karpathy's law: "neural nets want to work". This is another source of capabilities jumps: where the capability 'existed', but there was just a bug that crippled it

I've experienced this first hand, spending days trying to track down disappointing classification accuracy, assuming some bug in my model/math, only to find out later it was actually a bug in a newer custom matrix mult routine that my (insufficient) unit tests didn't cover. It had just never occurred to me that GD could optimize around that.

And on a related note, some big advances - arguably even transformers - are more a case of just getting out of SGD's way to let it do its thing rather than some huge new insight.

Out of curiosity, are you willing to share the papers you improved upon?

I'd like to but it'll have to wait until I'm finished with a commercial project where I'm using them or until I replace these techniques with something else in my code. I'll post a reply here once I do. I'd expect somebody else to discover at least one of them in the meantime, they're not some stunning insights.

One of these improvements was just published: https://arxiv.org/abs/2202.03599 . Since they were able to publish already, they likely had this idea before me. What I noticed is that in the Sharpness-Aware Minimization paper (ICLR 2021, https://arxiv.org/abs/2010.01412), the first gradient is just ignored when updating the weights, as can be seen in Figure 2 or in the pseudo-code. But that's a valuable data point that the optimizer would normally use to update the weights, so why not do the update step using a value in between the two? And it works.

The nice thing is that it's possible to implement this without increasing the memory requirements or the compute (almost) compared to SAM: you don't need to store the first gradient separately, just multiply it by some factor, don't zero out the gradients, let the second gradient be accumulated, and rescale the sum.
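
A hedged sketch of that variant (numpy, on a toy quadratic loss; `sam_step`, the blend factor `alpha`, and all constants are illustrative assumptions, not the paper's exact algorithm): plain SAM updates using only the gradient at the perturbed point, while the variant blends in the first gradient as well.

```python
import numpy as np

def grad(w):
    # Toy loss f(w) = 0.5 * ||w||^2, so the gradient is just w.
    return w.copy()

def sam_step(w, lr=0.1, rho=0.05, alpha=0.5):
    """One SAM-style step that blends the gradient at w with the
    gradient at the perturbed point; alpha=1 recovers plain SAM."""
    g1 = grad(w)
    eps = rho * g1 / (np.linalg.norm(g1) + 1e-12)  # ascend toward the 'sharp' point
    g2 = grad(w + eps)                             # the gradient plain SAM uses
    return w - lr * ((1 - alpha) * g1 + alpha * g2)

w = np.array([1.0, -2.0])
for _ in range(100):
    w = sam_step(w)
print(w)  # approaches the minimum at the origin
```

In a real implementation the blend can be folded into the gradient accumulation buffer, which is presumably how the near-zero memory overhead described above is achieved.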

Yes! Anecdotal confirmation of my previously-held beliefs!

I don't think the scenario you describe is as bad for interpretability as you assume. In fact, self-optimizing systems may even be more interpretable than current systems. E.g., current systems use a single channel for all their computation. This causes them to mix conceptually different types of computation together in a way that's very difficult to unravel. In contrast, the brain has sub-regions specializing in different types of computation (vision, hearing, reward calculation, etc). I expect self-optimizing systems will do something similar.

I'm not exactly sure what you mean by "single channel", but I do agree that regional specialization in the brain probably makes it more interpretable. Regional specialization arises because of locality optimization to minimize wiring length. But with ANNs running on von Neumann hardware we simulate circuits and don't currently optimize for locality. However that could change with either neuromorphic hardware or harder sparsity optimization on normal hardware. So yes those are reasons to be more optimistic.

I'm not actually convinced that interpretability is doomed - in the OP I was exploring something of a worst-case possibility. The scenarios where interpretability fails are those where the internal meta-optimization fooms in complexity well beyond that of the brain and our affordable comprehension toolkit. It's not a matter of difficulty in analyzing/decoding a static architecture; the difficulty is in analyzing a rapidly evolving architecture that may be distributed/decentralized, and the costs thereof. If we end up with something brain-like, then interpretability is promising. But interpretability becomes exponentially harder with rapid self-optimization and architectural diversity, especially when/if the systems are geographically distributed/decentralized across organizational boundaries (which certainly isn't the case now, but could be what the AI economy evolves into).

In your interesting post you mention a DM paper investigating AlphaZero's learned chess representations. The issue is how the cost of that analysis scales as we move to recursively self-optimizing systems and we scale up compute. (I'm also curious about the same analysis for Go)

Also, I'm not sure how much being able to design the architecture/optimizer helps us in interpretability. In my recent post, I argue the brain is in many ways more interpretable than current deep learning systems. We still don't understand the brain's learning algorithm and there are many difficulties associated with studying the brain, yet brain interpretability research is surprisingly advanced.

Before reading your article, my initial take was that interpretability techniques for ANNs and BNNs are actually not all that different - but ANNs are naturally much easier to monitor and probe.

Just switching from L2 to L1 regularization and removing dropout would probably lead to much sparser internal representations.

Yes agreed as discussed above.

So in summary - as you note in your post: "We currently put very little effort into making state of the art systems interpretable."

The pessimistic scenario is thus simply that this status quo doesn't change, because willingness to spend on interpretability doesn't change much, and moving towards recursive self-optimization increases the cost.

I'm not exactly sure what you mean by "single channel"

I mean the thing where BERT has a single stack of sequential layers which each process the entire latent representation of the previous layer. In contrast, imagine a system that dynamically routes different parts of the input to different components of the models, then has those model components communicate with each other to establish the final output. 

At first glance, the dynamically routing model seems much less interpretable. However, I think we'll find that different parts of the dynamic model will specialize to process different types of input or perform different types of computation, even without an explicit regularizer that encourages sparse connections or small circuits. I think this will aid interpretability quite a lot. 

I don't think a self-optimizing architecture will change its internals as quickly or often as you seem to be implying. A major hint here is that brain architecture doesn't vary that much across species (a human brain neuroscientist can easily adapt their expertise to squirrel brains). Additionally, most deep learning advances seem more along the lines of "get out of SGD's way" or "make things more convenient for SGD" than "add a bunch of complex additional mechanisms". I think we plausibly end up with a handful of clusters in architecture space that have good performance on certain domains, and that further architecture search won't stray too far from those clusters.

I also think that highly varied, distributed systems have much better interpretability prospects than you might assume. Consider that neural nets face their own internal interpretability issues. Different parts of the network need to be able to communicate effectively with each other on at least two levels:

  1. Different parts need to share the results of their computation in a way that's mutually legible.
  2. Different parts need to communicate with each other about how their internal representations should change so they can more effectively coordinate their computation.
    • I.e., if part A is looking for trees and part B is looking for leaves, part A should be able to signal to part B about how its leaf detectors should change so that part A's tree detectors function more effectively.
    • We currently use SGD for this coordination (and not-coincidentally, gradients are very important interpretability tools), but even some hypothetical learned alternative to SGD would need to do this too.

Importantly, as systems become more distributed, varied and adaptable, the premium on effective cross-system communication increases. It becomes more and more important that systems with varied architectures be able to understand each other. 

The need for cross-regional compatibility implies that networks tend to learn representations that are maximally interpretable to other parts of the network. You could object that there's no reason such representations have to be human interpretable. This objection is partially right. There's no reason that the raw internal representations have to be human interpretable. However, the network has to accept/generate human interpretable input/output, so it needs components that translate its internal representations to human interpretable forms. 

Because the "inner interpretability" problem described above forces models to use consistent internal representations, we should be able to apply the model's own input/output translation components to the internal representations anywhere in the model to get human interpretable output.

This is roughly what we see in the papers "Transformer Feed-Forward Layers Are Key-Value Memories" and "Knowledge Neurons in Pretrained Transformers" and LW post "interpreting GPT: the logit lens". They're able to apply the vocabulary projection matrix (which generates the output at the final layer) to intermediate representations and get human interpretable translations of those representations. 
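
A toy numpy sketch of the logit-lens idea referenced above (the matrices here are random stand-ins, not weights from a real transformer): reuse the model's final unembedding/vocabulary projection on an intermediate hidden state to get a human-readable token distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab = 64, 5

# Unembedding matrix: the model's final projection from hidden
# states to vocabulary logits.
W_U = rng.normal(size=(d_model, vocab))

# Pretend this is a hidden state from an *intermediate* layer that
# has already started pointing toward token 3's output direction.
h = W_U[:, 3] + 0.1 * rng.normal(size=d_model)

# The 'logit lens': apply the final projection to the early hidden
# state, yielding an interpretable guess at the eventual output.
logits = h @ W_U
print("token under early readout:", int(np.argmax(logits)))
```

The surprising empirical finding in those papers is that this works at all: intermediate representations stay close enough to the output basis that the same projection translates them.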

"Knowledge Neurons in Pretrained Transformers" was also able to use input embeddings to modify knowledge stored in neurons from the intermediate layers. E.g., given a neuron storing "Paris is the capital of France", the authors subtract the embedding of "Paris" and add the embedding of "London". This causes the model to output London instead of France for French capital-related queries (though the technique is not fully reliable).

These successes are very difficult to explain in a paradigm where interpretability is unrelated to performance, but are what you'd expect from thinking about what models would need to do to address the inner interpretability problem I describe above.

Before reading your article, my initial take was that interpretability techniques for ANNs and BNNs are actually not all that different - but ANNs are naturally much easier to monitor and probe.

My impression is that they're actually pretty different. For example, ML interpretability has access to gradients, whereas brain interpretability has much greater expectations of regional specialization. ML interpretability can do things like feature visualization, saliency mapping, etc. Brain interpretability can learn more from observing how damage to different regions impacts human behavior. Also, you can ask humans to introspect and do particular mental tasks while you analyze them, which is much harder for models.

The pessimistic scenario is thus simply that this status quo doesn't change, because willingness to spend on interpretability doesn't change much, and moving towards recursive self-optimization increases the cost.

I think a large part of why there's so little interpretability work is that people don't think it's feasible. The tools for doing it aren't very good, and there's no overarching paradigm that tells people where to start or how to proceed (or more accurately, the current paradigm says don't even bother). This is a big part of why I think it's so important to make a positive case for interpretability.

TLDR: I think our crux reduces to some mix of these 3 key questions

  1. how system complexity evolves in the future with recursive self optimization
  2. how human interpretability cost scales with emergent system complexity
  3. willingness to spend on interpretability in the future

So my core position is then:

  1. system complexity is going to explode hyperexponentially (which is just an obvious prediction of the general hyperexponential trend)
  2. interpretability cost thus scales suboptimally with humans in the loop (probably growing exponentially) and
  3. willingness to spend on interpretability won't change enormously

In other words, future ML systems will reach a point where they evolve faster than we can understand. This may be a decade away or more, but it's not a century away.

For interpretability to scale in this scenario, you need to outsource it to already-trusted systems (i.e. something like iterated amplification).

I mean the thing where BERT has a single stack of sequential layers which each process the entire latent representation of the previous layer. In contrast, imagine a system that dynamically routes different parts of the input to different components of the models, then has those model components communicate with each other to establish the final output.

The system you are imagining sounds just equivalent to transformers. Content-based dynamic routing, soft attention, and content-addressable memory are all actually just variations/descriptions of the same thing.

A matrix multiply of A*M where A consists of 1-hot row vectors is mathematically equivalent to an array of memory lookup ops, where each 1-hot row of A is a memory address referencing some row of memory matrix M. Relaxing the 1-hot constraint naturally can't make it less flexible than a memory lookup, it becomes a more general soft memory blend operation.

Then if you compute A with a nonlinear input layer of the form A = f(Q*K), where f is some competitive non-linearity, and Q and K are query and key matrices, that implements a more general soft version of content-based addressing, chain em together and you get soft content-addressable memory (which obviously is universal and equivalent/implements routing).

Standard relu deepnets don't use matrix transpose in the forward pass, only the back pass, and thus have fixed K and M matrices that only change slowly with SGD. They completely lack the ability to do attention/routing/memory operations over dynamic activations. Transformers add the transpose op as a fwd pass building block allowing the output activations to feed into K and/or M, which simultaneously enables universal attention/routing/memory operations.
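
A minimal numpy sketch of the equivalence described above (toy matrices, hypothetical values): with 1-hot address rows, A*M is literally an array of memory lookups, and a softmax over query-key scores relaxes it into a soft content-addressable read.

```python
import numpy as np

M = np.array([[1., 0., 0.],   # memory matrix: one stored value per row
              [0., 2., 0.],
              [0., 0., 3.]])

# 1-hot address rows: A @ M is exactly an array of memory lookups.
A = np.array([[0., 1., 0.],
              [0., 0., 1.]])
assert np.allclose(A @ M, M[[1, 2]])

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Relax the 1-hot constraint: compute soft addresses by content.
K = np.eye(3)                  # keys, one per memory row
Q = np.array([[0., 10., 0.]])  # a query strongly matching key 1
A_soft = softmax(Q @ K.T)      # nearly 1-hot, but differentiable
out = A_soft @ M               # soft content-addressable read
print(out)                     # approximately row 1 of M
```

When Q, K, and M are all computed from activations rather than fixed weights, this becomes the attention/routing/memory building block the comment describes.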

I don't think a self-optimizing architecture will change its internals as quickly or often as you seem to be implying.

I disagree - you may simply be failing to imagine the future as I do. Fully justifying why I disagree is not something I should do on a public forum, but I will say that the brain is already changing its internals more quickly than you seem to be implying, and ANNs on von Neumann hardware are potentially vastly more flexible in their ability to expand/morph existing layers, add modules, distill others, explore new pathways, learn to predict sub-module training trajectories, meta-learn to predict .. etc. The brain is limited by the topological constraints of both the far less flexible neuromorphic substrate and a constrained volume; constraints that largely do not apply to ANNs on von Neumann hardware.

Additionally, most deep learning advances seem more along the lines of "get out of SGD's way" or "make things more convenient for SGD" than "add a bunch of complex additional mechanisms".

Getting out of the optimizer's way allows it to explore complexity beyond human capability. The bitter lesson is not one of simplicity beating complexity. It is about design complexity emergence shifting from the human optimization substrate to the machine optimization substrate. The main point I was making is that meta subsumes - moving from architecture to meta-learned meta-architecture (e.g. recursions of learning to learn the architecture of compressed hyper networks that generate/train the lower level architectures).

However, the network has to accept/generate human interpretable input/output, so it needs components that translate its internal representations to human interpretable forms.

Both a 1950's computer and a 2021 GPU-based computer, each running some complex software of their era, accept/generate human interpretable inputs/outputs, but one is enormously more difficult to understand at any deep level.

This is roughly what we see in the papers "Transformer Feed-Forward Layers Are Key-Value Memories"

Side note, but that's such a dumb name for a paper - it's the equivalent of "Feed-Forward Network Layers are Soft Threshold Memories", and only marginally better than "Linear Neural Network Layers Are Matrix Multiplies".

I'm not actually convinced that interpretability is doomed - in the OP I was exploring something of a worst case possibility.

Might be useful to mark this in the post. Perhaps a comment at the beginning about this post exploring a model, and its implications.

I forgot to reply to this important part in my other comment:

Another option: given a large, "primary" model, you could train multiple smaller "secondary" models (with different architectures) to imitate the primary model (knowledge distillation), then train the primary model to improve the secondary models' imitation performance. This should cause the primary model to learn internal representations that are more easily learned by other models of various architectures, and are hopefully more interpretable to humans. If you then assume the primary model is a self-optimizing system, this approach becomes even more promising because now the self-optimizing system is actively looking for architectures that are easy for weaker models to understand.

I was already assuming that distillation/compression was part of recursive self-optimization - it's certainly something the brain does. I'm more doubtful that this improves interpretability for free, absent explicit regularization criteria and its associated costs. The regions of the brain that seem most involved in distillation are those that took the longest for us to understand, and or are still mysterious - such as the cerebellum (motor control was a red herring, it's probably involved in training or distilling the cortex).

What I was proposing there isn't distillation/compression of the primary model. Rather, it's training the primary model to have internal representations that are easily learned by other systems. Knowledge distillation is the process the other systems use to learn the primary model's representations. As far as I know, the brain doesn't do anything like this.

Imagine the primary model has a collection of 10 neurons that collectively represent 10 different concepts, but all the neurons are highly polysemantic. There's no single neuron that corresponds to a single one of the concepts. When the primary model needs a pure representation, it uses some complex function of the 10 neurons' activations to recover a pure representation.

This is pretty bad from an interpretability perspective, and my guess is that it also makes it more difficult to use the primary model as a teacher for knowledge distillation. Student models have to learn the disentangling function before they can get a pure representation. In contrast, knowledge distillation would be easier if those 10 neurons each uniquely represented a single concept. That's what the primary model is being trained for.
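
A toy numpy sketch of the 10-neuron scenario (the rotation Q stands in for the "complex function" of the activations; everything here is illustrative): in a polysemantic basis no single neuron is a pure detector, and a pure representation only comes back through the disentangling map that students would have to learn.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10  # 10 concepts represented across 10 neurons

# Monosemantic basis: neuron i fires exactly for concept i, so a
# single-neuron readout identifies the concept directly.
mono = np.eye(n)
assert (np.argmax(mono, axis=0) == np.arange(n)).all()

# Polysemantic basis: the same 10 concepts, but each is a dense
# mixture over all 10 neurons (a random rotation Q of the basis).
Q, _ = np.linalg.qr(rng.normal(size=(n, n)))
poly = Q  # column j = activation pattern for concept j

# No single neuron is close to a pure detector anymore...
print("max single-neuron purity:", float(np.max(np.abs(poly))))

# ...and pure concepts are only recoverable through the
# disentangling map (here, Q's inverse).
recovered = Q.T @ poly
assert np.allclose(recovered, np.eye(n))
```

Training the primary model toward the monosemantic case is exactly what makes its representations cheaper for both student models and humans to read.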

By training the primary model to be easily interpretable to students, I hope to get a primary model whose representations are generally interpretable to both student models and humans. 

Distillation is simply the process of one network learning to model another, usually by predicting its outputs on the same inputs, but there are many variations. The brain certainly uses distillation: DeepMind's founding research was based on hippocampal replay, wherein the hippocampus trains the cortex - a form of distillation.
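For concreteness, the classic output-matching formulation is Hinton et al.'s temperature-softened KL objective (the logits below are made up for illustration):

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)   # numerical stabilization
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student) on temperature-softened output distributions,
    scaled by T^2 as in the original formulation."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return float(np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=-1)) * T**2)

# Made-up logits for one input: the student is pulled toward the teacher's
# full output distribution, not just its argmax.
teacher = np.array([[2.0, 0.5, -1.0]])
student = np.array([[1.0, 1.0, -0.5]])
```

A perfect imitator drives this loss to zero; the "many variations" mentioned above typically swap the output-matching term for intermediate-feature matching.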

Leaving aside whether "training the primary model to have internal representations that are easily learned by other systems" is an effective explainability technique at all compared to alternatives, both training the explainer distillations and any implied explainability side objective impose a cost.

The brain evidence is relevant because it suggests that distillation done for primary capability purposes (compression, efficiency, etc.) does not increase interpretability for free, and thus interpretability has some capability tradeoff cost.

All that being said, it does seem that sparsity (or other forms of compression bottlenecks) can aid interpretability by reducing complexity, filtering noise, etc., thus speeding up downstream learning of those internal representations. But it would be surprising if the ideal sparsity for efficiency/capability happened to be the same as the ideal for interpretability/explainability.
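A small sketch of that tension, using sparse coding as the "compression bottleneck" (the dictionary, penalty values, and ISTA solver are arbitrary choices for illustration): a stronger sparsity penalty yields a simpler, easier-to-inspect code, but at a cost in reconstruction fidelity.

```python
import numpy as np

rng = np.random.default_rng(0)

D = rng.normal(size=(20, 50))            # overcomplete dictionary: 20-dim signal, 50 atoms
D /= np.linalg.norm(D, axis=0)           # unit-norm atoms
x = rng.normal(size=20)                  # signal to encode

def sparse_code(x, D, lam, steps=2000):
    """ISTA for min_z 0.5*||x - D z||^2 + lam*||z||_1."""
    L = np.linalg.norm(D, 2) ** 2        # Lipschitz constant of the smooth part
    z = np.zeros(D.shape[1])
    for _ in range(steps):
        z = z - D.T @ (D @ z - x) / L                          # gradient step
        z = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft threshold
    return z

z_mild   = sparse_code(x, D, lam=0.01)   # weak bottleneck: dense, faithful code
z_strong = sparse_code(x, D, lam=0.5)    # strong bottleneck: sparse, lossier code

recon_err = lambda z: float(np.linalg.norm(x - D @ z))
nnz       = lambda z: int(np.sum(np.abs(z) > 1e-6))
```

The strong-penalty code uses far fewer active atoms but reconstructs the signal less faithfully - the efficiency/interpretability dial rarely sits at the same setting for both.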

Trying to summarize your viewpoint, lmk if I'm missing something important:

  • Training self-organizing models on multi-modal input will lead to increased modularization, and in turn to more interpretability
  • Existing interpretability techniques might more or less transfer to self-organizing systems
  • There is low-hanging fruit in applied interpretability that we could exploit should we need it in order to understand self-organizing systems
  • (Not going into the specific proposals for sake of brevity and clarity)

Small nitpick: I would cite The Bitter Lesson in the beginning.

Yes thanks, good point.

So the pattern is: increasing human optimization power steadily pushing up architecture complexity is occasionally upset/reset by a new simpler more general model,
...
So what can we do? In the worst case we have near-zero control over AGI architecture or learning algorithms. So that only leaves initial objective/utility functions, compute and training environment/data. Compute restriction is obvious and has an equally obvious direct tradeoff with capability - not much edge there.

Interesting that 'less control' is going hand in hand with 'simpler models'.

Interpretability will fail - future DL descendant is more of a black box, not less


It certainly makes interpretability harder, but it seems like the possible gain is also larger, making it a riskier bet overall. I'm not convinced that it decreases the expected value of interpretability research though. Do you have a good intuition for why it would make interpretability less valuable or at least lower value compared to the increased risk of failure?


IRL/Value Learning is far more difficult than first appearances suggest, see #2


That's not immediately clear to me. Could you elaborate?

Do you have a good intuition for why it would make interpretability less valuable or at least lower value compared to the increased risk of failure?

Just discussed this above with Quintin Pope - it's more a question of interpretability cost scaling unfavorably with complexity from RSO.

IRL/Value Learning is far more difficult than first appearances suggest, see #2

That's not immediately clear to me. Could you elaborate?

First off, value learning is essential for successful alignment. It really has two components though: learning the utility/value functions of external agents, and then substituting those as the agent's own utility/value function. The first part we get for free; the second part is difficult because it conflicts with intrinsic motivation - somehow we need to transition from intrinsic motivation (which is what drives the value learning in the first place) to the learned external motivation. Getting this right was difficult for bio evolution, and seems far more difficult for future ANNs with a far more liquid RSO architecture. I have a future half-written post going deeper into this. I do think we can/should learn a great deal more about how altruism/empathy/value learning works in the brain.

It certainly makes interpretability harder

I'm not at all convinced of this. In fact, I suspect self-optimizing systems will be more interpretable (assuming we're willing to bother putting any effort towards this goal). See my comment here making this case.