

Thank you so much for taking the time to read our paper, Chris! I'm extremely grateful.

Thank you so much for sharing this extremely insightful argument, Evan! I really appreciate hearing your detailed thoughts on this.

I've been grappling with the pros and cons of an atheoretical-empirics-based approach (in your language, "behavior") and a theory-based approach (in your language, "understanding") within the complex sciences, such as but not limited to AI. My current thought is that unfortunately, both of the following are true:

1) Findings based on atheoretical empirics are susceptible to being brittle, in that it is unclear whether or in precisely which settings these findings will replicate. (e.g., see "A problem in theory" by Michael Muthukrishna and Joe Henrich:

2) While theoretical models enable one to meaningfully attempt predictions that extrapolate outside of the empirical sample, these models can always fail, especially in the complex sciences. "There is no such thing as a validated predictive model" (

A common difficulty that theory-based predictions out-of-distribution run into is the tradeoff between precision and generality. Levin ( described this idea by saying that among three desirable properties---generality, precision, and realism---a theory can only simultaneously achieve two. The following is Levin's triangle:

Thank you so much for the excellent and insightful post on mechanistic models, Evan!

My hypothesis is that the difficulty of finding mechanistic models that consistently make accurate predictions is likely due to the agent-environment system’s complexity and computational irreducibility. Such agent-environment interactions may be inherently unpredictable "because of the difficulty of pre-stating the relevant features of ecological niches, the complexity of ecological systems and [the fact that the agent-ecology interaction] can enable its own novel system states."

Suppose that one wants to consistently make accurate predictions about a computationally irreducible agent-environment system. In general, the most efficient way to do so is to run the agent in the given environment. There are probably no shortcuts, even via mechanistic models. 

For dangerous AI agents, an accurate simulation box of the deployment environment would be ideal for safe empiricism. This is probably intractable for many use cases of AI agents, but computational irreducibility implies that methods other than empiricism are probably even more intractable.

Please read my post “The limited upside of interpretability” for a detailed argument. It would be great to hear your thoughts!

Thank you very much for the honest and substantive feedback, Harfe! I really appreciate it.

I think the disagreeing commenters and perhaps many of the downvoters agreed that the loss in secrecy value was a factor, but disagreed about the magnitude of this effect (and with my claim that it may be comparable to, or even exceed, the magnitude of the other effect: a reduction in the number of AI safety plans and new researchers).

Quoting my comment on the EA forum for discussion of the cruxes and how I propose they may be updated:

"Thank you so much for the clarification, Jay! It is extremely fair and valuable.

I don't really understand how this is supposed to be an update for those who disagreed with you. Could you elaborate on why you think this information would change people's minds?

The underlying question is: does the increase in the amount of AI safety plans resulting from coordinating on the Internet outweigh the decrease in secrecy value of the plans in EV? If the former effect is larger, then we should continue the status-quo strategy. If the latter effect is larger, then we should consider keeping safety plans secret (especially those whose value lies primarily in secrecy, such as safety plans relevant to monitoring). 

The disagreeing commenters generally argued that the former effect is larger, and therefore we should continue the status-quo strategy. This is likely because their estimate of the latter effect was quite small and perhaps far-into-the-future.

I think ChatGPT provides evidence that the latter should be a larger concern than many people's prior. Even current-scale models are capable of nontrivial analysis about how specific safety plans can be exploited, and even how specific alignment researchers' idiosyncrasies can be exploited for deceptive misalignment. 

For this to be a threat, we would need an AGI that was

- Misaligned
- Capable enough to do significant damage if it had access to our safety plans
- Not capable enough to do a similar amount of damage without access to our safety plans

I see the line between 2 and 3 to be very narrow. I expect almost any misaligned AI capable of doing significant damage using our plans to also be capable of doing significant damage without needing them.

I am uncertain about whether the line between 2 and 3 will be narrow. I think the argument of the line between 2 and 3 being narrow often assumes fast takeoff, but I think there is a strong empirical case that takeoff will be slow and constrained by scaling, which suggests the line between 2 and 3 might be larger than one might think. But I think this is a scientific question that we should continue to probe and reduce our uncertainty about!"

Thank you so much for this writeup of your fascinating findings about interpreting the SVD of the weight matrix, Beren and Sid!

Understanding the degree to which transformer representations are linear vs nonlinear, and developing methods that can help us discover, locate, and interpret nonlinear representations will ultimately be necessary for fully solving interpretability of any nonlinear neural network.

Completely agree. For what it's worth, I expect interpreting nonlinear representations in complex neural nets to be quite difficult. We should expect linear-algebra methods like SVD to uncover useful information about linear representations in a straightforward manner. But we shouldn't overupdate on the ease with which linear-algebra methods uncover this subset of information, because a lot of the relevant information is likely to pertain to nonlinear and interconnected representations, and therefore to lie outside of this subset.

Analyses of weights of a given network therefore is a promising type of static analysis for neural networks equivalent to static analysis of source code which can just be run quickly on any given network before actually having to run it on live inputs. This could potentially be used for alignment as a first line of defense against any kind of harmful behaviour without having to run the network at all. Techniques that analyze the weights are also typically cheaper computationally, since they do not involve running large numbers of forward passes through the network and/or storing large amounts of activations or dealing with large datasets.

Conversely, the downsides of weight analysis is that it cannot tell us about specific model behaviours on specific tokens. The weights instead can be thought of as encoding the space of potential transformations that can be applied to a specific input datapoint but not any specific transformation. They probably can also be used to derive information about average behaviour of the network but not necessarily extreme behaviour which might be most useful for alignment. 

I thought this was a really good summary of the pros and cons of the methodology. 

Thank you so much for this thorough and well-thought-out writeup, Eli and Charlotte! I think there is a high probability that TAI will be developed soon, and it is excellent that you have laid out your views on the implications for alignment in such detail.

Regarding your point on interpreting honesty and other concepts from the AI's internal states: I wanted to share a recent argument I've made that the practical upside of interpretability may be lower than originally thought. This is because the interaction dynamics between the AI agent and its environment will likely be subject to substantial complexity and computational irreducibility.

Thank you so much, Beth, for your extremely insightful comment! I really appreciate your time.

I completely agree with everything you said. I agree that "you can make many very useful predictions about humans by modeling them as (for example) having some set of goals, habits, tendencies, and knowledge," and that these insights will be very useful for alignment research. 

I also agree that "it's difficult to identify what a human's intentions are just by having access to their brain." This was actually the main point I wanted to get across; I guess it wasn't clearly communicated. Sorry about the confusion!

My assertion was that in order to predict the interaction dynamics of a computationally irreducible agent with a complex deployment environment, there are two realistic options:

  1. Run the agent in an exact copy of the environment and see what happens. 
  2. If the deployment environment is unknown, use the available empirical data to develop a simplified model of the system based on parsimonious first principles that are likely to be valid even in the unknown deployment environment. The predictions yielded by such models have a chance of generalizing out-of-distribution, although they will necessarily be limited in scope.

When researchers try to predict intent from internal data, their assumptions/first principles (based on the limited empirical data they have) are unlikely to be guaranteed to be "valid even in the unknown deployment environment." Hence, there is little robust reason to believe that the predictions based on these model assumptions will generalize out-of-distribution.

Thank you so much for your detailed and insightful response, Esben! It is extremely informative and helpful.

So as far as I understand your text, you argue that fine-grained interpretability loses out against "empiricism" (running the model) because of computational intractability.

I generally disagree with this. beren points out many of the same critiques of this piece as I would come forth with. Additionally, the arguments seem too undefined, like there is not in-depth argumentation enough to support the points you make. Strong upvote for writing them out, though!

The main benefit of interpretability, if it can succeed, is that one can predict harmful future behavior (that would have occurred when deployed out-of-distribution) by probing internal data. This allows the researchers to preemptively prevent the harmful behavior: for example, by retraining after detecting deceptive intent. If this is scientifically possible, it would be a substantial benefit, especially since it is generally difficult to obtain out-of-distribution predictions from atheoretical empiricism.

However, I am skeptical that interpretability can achieve nontrivial success in out-of-distribution predictions, especially in the amount of time alignment researchers will realistically have. The reason is that deceptive intent is likely a fine-grained trait at the internal-data level (rather than at the behavioral level). Consequently, computational irreducibility is likely to impose a hard bound on predicting deceptive intent out-of-distribution, at least when assuming realistic amounts of time and resources.

My guess is that detecting deceptive intent solely from a neural net's internal data is probably at least as fine-grained as behavioral genetics or neuroscience. These fields have made some progress, but preemptively predicting behavioral traits from internal data remains mostly unsolved.

For example, consider a question analogous to that of deceptive misalignment: 'Is the given genome optimized for inclusive fitness, or is it optimized for a proxy goal that deviates from inclusive fitness in certain historically unprecedented environments?' We know that evolutionary pressures select for maximizing inclusive fitness. However, the genome is optimized not for inclusive fitness, but for a proxy goal (survive and engage in sexual intercourse) that deviates from inclusive fitness in environments that are sufficiently distinct from ancestral environments.

How did scientists find out that the genome is optimized for a proxy goal? Almost entirely from behavior. We have a coarse-grained behavioral model that is quite good and generalizable. Evolution shaped animals' behavior towards a drive for sexual intercourse, but historically unprecedented environmental changes (e.g., widespread availability of birth control) have made this proxy goal distinct from inclusive fitness. Parsimonious models based on first principles that are likely to be correct, like the above one, have a realistic chance of achieving situation-specific predictability that generalizes out-of-distribution.

In contrast, there is still very little understanding of which genes interact to cause animals' sex drive. Which genes affect sex drive? Probably a substantial proportion of them, and they probably interact in interconnected and nonlinear ways (including with the extremely complex, multidimensionally varying environment) to produce behavioral traits in an unpredictable manner. Moreover, a lot of the information needed to predict behavioral traits like sex drive will lie in the specific environment and how it interacts with the genome. Only the most coarse-grained of these interaction dynamics will be predictable via bypassing empiricism with a statistical-mechanics-like model, due to computational irreducibility. And such a coarse-grained model will likely be rooted in behavior-based abstractions.

Deep-learning neural nets do come with an advantage lacked by behavioral genetics and neuroscience: a potentially complete knowledge of the internal data, the environmental data, and the data of their interaction throughout the whole training process. 

But there is a missing piece: complete knowledge of the deployment environment. Any internals-based model of deceptive intent that alignment researchers can come up with is only guaranteed to hold in the subset of environments that the researchers have empirically tested. In the subset of environments that the polycausal model has not been tested in, there is no a priori reason that the model will generalize correctly. A barrier to generalizability is posed by the nonlinear and interconnected interactions between the neural net's internals and the unprecedented environment, which can and likely will manifest differently depending on the environment. Relaxed adversarial training can help test a wider variety of environments, but this is still hampered by the blind spot of being unable to test the subset of environments that cannot be instantiated at human-level capabilities (e.g., the environment in which RSA encryption is broken). Thus, my guess is that the intrinsic out-of-distribution predictability of the AGI neural net's behavior would be low, just like that of behavioral genetics or neuroscience.

For a conceptual example, consider the fact that the dynamics of cellular automata can change drastically with just one cell’s change in the initial conditions. See Figure 1 of Beckage et al. (Code 1599 in Wolfram's A New Kind of Science), reproduced below:
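A minimal sketch of this sensitivity, assuming the standard convention for totalistic rules in which the base-3 digits of the code number give the new cell value for each possible neighborhood sum. The grid width, number of steps, periodic boundary, and the particular pair of initial conditions are illustrative assumptions, not taken from Beckage et al. The sketch runs code 1599 from two initial rows that differ in a single cell and records how many cells differ at each step.

```python
def totalistic_rule(code, colors=3, neighbors=3):
    """Return a lookup table: neighborhood sum -> new cell value.

    Assumes the usual totalistic convention: digit s of `code`
    (written in base `colors`) is the output for neighborhood sum s.
    """
    max_sum = (colors - 1) * neighbors
    return [(code // colors**s) % colors for s in range(max_sum + 1)]


def step(cells, table):
    """Advance one time step with periodic (wrap-around) boundaries."""
    n = len(cells)
    return [
        table[cells[(i - 1) % n] + cells[i] + cells[(i + 1) % n]]
        for i in range(n)
    ]


def run(cells, table, steps):
    """Return the list of rows from the initial row through `steps` updates."""
    history = [list(cells)]
    for _ in range(steps):
        history.append(step(history[-1], table))
    return history


def hamming(a, b):
    """Count the positions at which two rows differ."""
    return sum(x != y for x, y in zip(a, b))


WIDTH, STEPS = 101, 40            # illustrative sizes (assumptions)
table = totalistic_rule(1599)     # 3-color totalistic code 1599

base = [0] * WIDTH
base[WIDTH // 2] = 1              # a single nonzero cell

perturbed = list(base)
perturbed[WIDTH // 2 + 1] = 1     # the same row with one extra cell changed

hist_a = run(base, table, STEPS)
hist_b = run(perturbed, table, STEPS)

# Number of differing cells at each time step.
distances = [hamming(a, b) for a, b in zip(hist_a, hist_b)]
print(distances)
```

The printed list starts at 1 (the single perturbed cell). How far the two histories drift apart afterwards can only be found by running both, which is the point of the example.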

In general, the only way to accurately ascertain how a computationally irreducible agent will behave in a complex environment is to run it in that environment. Even with complete knowledge of the agent's internals, incomplete knowledge of the environment is sufficient to constrain a priori predictability. I expect that many predictions yielded by interpretability tools in the pre-deployment environment will fail to generalize to the post-deployment environment, unless the two are equal.

It seems premature to disregard the plausibility of the agenda itself, just as it is premature to disregard the project of neuroscience based on HBP. Now, arguing that it's a matter of speed seems completely fine but this is another argument and isn't emphasized in the text.

Sorry for the miscommunication! I meant to say that the rate at which mechanistic interpretability will yield useful, generalizable information is slow, not zero. 

But this is sufficient for concern because informational channels are dual-use; the AGI can use them for sandbox escape. We should only open an interpretability channel if the rate of scientific benefit exceeds the rate of cost (risk of premature sandbox escape by a misaligned AGI).

My opinion is that while mechanistic interpretability has made some progress, this progress is not happening fast enough to solve alignment within a short amount of time and with limited computational resources. So far, the rate of progress in interpretability research has been substantially outpaced by that in AI capabilities research. I think this was predictable, given what we know about computational irreducibility.

Maybe it's just my misunderstanding of what you mean by fine-grained interpretability, but we don't need to figure out what neurons do, we literally design them. So the inspections happen at feature level, which is much more high-level than investigating individual neurons (sometimes these features seem represented in singular neurons of course). The circuits paradigm also generally looks at neural networks like systems neuroscience does, interpreting causal pathways in the models (of course with radical methodological differences because of the architectures and computation medium). The mechanistic interpretability project does not seem misguided by an idealization of neuron-level analysis and will probably adopt any new strategy that seems promising.

Roughly speaking, there is a spectrum between high-complexity paradigms of design (e.g., deep learning) and low-complexity, modular paradigms of design (e.g., software design by a human software engineer). My guess is that for many complex tasks, the optimal equilibrium strategy can be achieved only by the former, and attempting to meaningfully move towards the latter end of the spectrum will result in sacrificing performance. For example, I expect that we won't be able to build AGI via modular software design by a human software engineer, but that we will be able to build it by deep learning. 

Again, thank you for the post and I always like when people cite McElreath, though I don't see his arguments apply as well to interpretability since we don't model neural networks with linear regression at all. Not even scaling laws use such simplistic modeling, e.g. see Ethan's work.

In Ethan's scaling law, extrapolatory generalization is only guaranteed to be valid locally ("perfectly extrapolate until the next break"), and not globally. This is completely consistent with my prior. My assertion was that in order to globally extrapolate empirical findings to an unknown deployment environment, only simple models have a nontrivial chance of working (assuming realistic amounts of time and computational resources). These simple models will likely be based on parsimonious first principles that we have strong reason to believe are valid even in the unknown environment. And consequently, they will likely be largely based on behavioral data rather than the internal data of the agent-environment interaction dynamics.
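To make the local-versus-global distinction concrete, here is a toy sketch. It is not the functional form from Ethan's work: the piecewise power law, the break location `BREAK_X`, and both exponents are hypothetical values chosen for illustration. A power law fitted to data before the break extrapolates almost perfectly up to the break and fails badly beyond it.

```python
import math

BREAK_X = 1e4  # hypothetical location of the break (assumption)


def true_curve(x):
    """Toy piecewise power law: exponent -0.5 before the break, -0.1 after."""
    if x < BREAK_X:
        return x ** -0.5
    # Continuous at the break, but with a shallower exponent afterwards.
    return BREAK_X ** -0.5 * (x / BREAK_X) ** -0.1


def fit_power_law(xs, ys):
    """Ordinary least squares on (log x, log y); returns (slope, intercept)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx = sum(lx) / len(lx)
    my = sum(ly) / len(ly)
    slope = sum((a - mx) * (b - my) for a, b in zip(lx, ly)) / sum(
        (a - mx) ** 2 for a in lx
    )
    return slope, my - slope * mx


def relative_error(x, slope, intercept):
    """Relative error of the fitted power law against the toy curve at x."""
    predicted = math.exp(intercept + slope * math.log(x))
    return abs(predicted - true_curve(x)) / true_curve(x)


# Training points span 10 to 1000, all well before the break.
train_xs = [10 ** (1 + 0.25 * i) for i in range(9)]
slope, intercept = fit_power_law(train_xs, [true_curve(x) for x in train_xs])

local_error = relative_error(5e3, slope, intercept)   # beyond the data, before the break
global_error = relative_error(1e6, slope, intercept)  # beyond the break

print(f"fitted exponent: {slope:.3f}")
print(f"relative error at 5e3: {local_error:.2e}")
print(f"relative error at 1e6: {global_error:.2f}")
```

In this toy setup the fit has no way of detecting the break from the training data alone, so the extrapolation is only trustworthy locally.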

Thank you so much for your insightful and detailed response, Beren! I really appreciate your time.

The cruxes seem very important to investigate.

This seems especially likely to me if the AGIs architecture is hand-designed by humans – i.e. there is a ‘world model’ part and a ‘planner’ part and a ‘value function’ and so forth. 

It probably helps to have the AGI's architecture hand-designed to be more human-interpretable. My model is that on the spectrum of high-complexity paradigms (e.g., deep learning) to low-complexity paradigms (e.g., software design by a human software engineer), having the AGI's architecture be hand-designed moves away from the former and towards the latter, which helps reduce computational irreducibility and thereby increase out-of-distribution predictability (e.g., on questions like "Is the model deceptive?").

However, my guess is that in order for out-of-distribution predictability of the system to be nontrivial, one would need to go substantially towards the low-complexity end of the spectrum. This would make it unlikely for the model to achieve AGI-like capabilities.

What we ultimately likely want is a statistical-mechanics-like theory of how do neural nets learn representations which includes what circuits/specific computations they tend to do, how they evolve during training, what behaviours these give rise to, and how they behave off distribution etc.

It would be great if we can get a statistical-mechanics-like theory. Realistically, such a theory would probably be a combination of "small-to-middle-number systems because local interactions are quite relevant in affecting system behavior. In this case, there would be a general lack of the homogeneous mixing necessary for a purely statistical mechanics view to be applicable."

Interpretability tools could let us perform highly targeted interventions on the system without needing to understand the full system.

I can't rule it out (gene editing is the optimistic outside view here), but I expect such interventions to be imperfect and vulnerable to unforeseeable side effects. These interventions would likely require immense trial-and-error to achieve nontrivial out-of-distribution predictability.

From my perspective, as someone who has worked in neuroscience is that the field just keeps chugging along accumulating knowledge and is advancing at a reasonable pace – i.e. our understanding of the brain has improved vastly since 20 years ago. Not at the pace probably required for alignment on short timelines but at a decent level for a scientific field.

I agree with the characterization "not at the pace probably required for alignment on short timelines but at a decent level for a scientific field." Indeed, the main point I wanted to make was my skepticism that interpretability research on AGI-scale models will progress fast enough to solve alignment in a short amount of time.

I think just relying on airgapping and other physical measures is a bad idea vs a superintelligence and in practice we should be trying to build up as many uncorrelated lines of defense as possible in the test-harness. 

I completely agree, as long as the lines of defense are net-positive rather than net-negative. An information channel should only be opened if it helps the alignment researchers (to align the AGI) more than it helps the likely by-default misaligned AGI (to prematurely escape its sandbox).

This is an interesting objection. While it is possible that the AGI could hack interpretability channels, it seems much more difficult than just communicating in the normal way by running in the world – i.e. the ‘empirical channel’. This is because it probably doesn’t know (and we should be careful to make sure that it definitely does not know) what interpretability tools we are using on it, and that interpretability tools can be applied before we actually run the AGI (i.e. directly to weights) or to individual components in isolation. 

My guess is that the reward-risk tradeoff of empirical channels—empiricism in the deployment environment or a precise replica of the deployment environment—is more likely to be favorable than that of interpretability channels. This is because the reward is high (empiricism efficiently yields predictive information) while ways to mitigate the risk are human-understandable (as opposed to "complexity is the enemy of security"). 

If the AGI doesn't have prior knowledge of our interpretability plans and how to exploit them (e.g., by deceiving us into a false sense of security), then I would be much more optimistic about the reward-risk tradeoff of interpretability channels! This is why we should keep our interpretability plans inaccessible to the AGI as much as possible. Since there is plausibly a significant amount of secrecy value in interpretability plans, we should:

  1. preemptively keep our interpretability plans off of the Internet,
  2. filter these plans from the AGI's training set, and 
  3. ensure that the AGI cannot inspect its own internal data.

I have been campaigning for these measures, but many AI safety researchers have generally opposed adopting Measure #1. I hope this will change going forward!

Thank you very much for the detailed and insightful post, Lee, Sid, and Beren! I really appreciate it.

In the spirit of full communication, I'm writing to share my recent argument that mechanistic interpretability may not be a reliable safety plan for AGI-scale models.

It would be really helpful to hear your thoughts on it!
