David Reber — LessWrong

In reality, we observe that roughly 85% of recommendations stay the same when flipping nationality in the prompt and freezing reasoning traces. This suggests that the mechanism for the model deciding on its recommendation is mostly mediated through the reasoning trace, with a smaller less significant direct effect from the prompt to the recommendation.

Based on playing around recently with a similar setup (but only toy examples), I'm actually surprised you get only 85%, as I've only observed NDE=0 when I freeze the entire reasoning_trace.

My just-so explanation for this was that whenever the reasoning trace includes the conclusion (that is, the bolded text in your example), freezing the reasoning trace preserves the final conclusion. Put another way, the <recommendation> is ~deterministically fixed by <reasoning>, which suggests a strong bias towards seeing low direct effects.

If this just-so story is true, it suggests that we might need a more granular mediator than the <entire reasoning_trace>, if possible.
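To make the just-so story concrete, here is a minimal sketch of a toy structural model. Everything in it is hypothetical (the functions `reasoning`, `rec_trace_only`, `rec_with_direct_path`, and the noise ranges are illustrative assumptions, not the post's setup). It flips a binary prompt attribute `x` while holding the trace at its `x=0` value, then reports the natural direct effect and the fraction of recommendations that stay the same.

```python
from fractions import Fraction
from itertools import product

# Hypothetical exogenous noise, enumerated exactly instead of sampled.
U_VALUES = range(4)    # noise driving the reasoning trace
V_VALUES = range(20)   # noise driving any direct prompt -> recommendation path


def reasoning(x, u):
    """Toy mediator: 1 if the trace concludes 'recommend', else 0."""
    return 1 if u < 2 + x else 0


def rec_trace_only(x, m, v):
    """Recommendation simply copies the conclusion stated in the trace."""
    return m


def rec_with_direct_path(x, m, v):
    """Same as above, except x=1 vetoes the recommendation 15% of the time."""
    return 0 if (x == 1 and v < 3) else m


def frozen_trace_stats(recommend):
    """Flip x from 0 to 1 while freezing the trace at its x=0 value.

    Returns (NDE, fraction of recommendations left unchanged), where
    NDE = E[Y(x=1, M(x=0))] - E[Y(x=0, M(x=0))].
    """
    n = len(U_VALUES) * len(V_VALUES)
    diff = 0
    same = 0
    for u, v in product(U_VALUES, V_VALUES):
        m = reasoning(0, u)            # trace generated under x=0, then frozen
        y_base = recommend(0, m, v)
        y_flip = recommend(1, m, v)    # only the prompt attribute changes
        diff += y_flip - y_base
        same += (y_flip == y_base)
    return Fraction(diff, n), Fraction(same, n)


if __name__ == "__main__":
    print(frozen_trace_stats(rec_trace_only))
    print(frozen_trace_stats(rec_with_direct_path))
```

In this toy, the trace-only recommender gives an NDE of exactly 0 with every recommendation unchanged, while the variant with a direct path gives an NDE of -3/40 with 37/40 of recommendations unchanged. A nonzero direct effect can only show up here when the recommendation is not a function of the frozen trace alone.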

Some Rules for an Algebra of Bayes Nets

David Reber2y30

Ah, that's right. Thanks, that example is quite clarifying!

Some Rules for an Algebra of Bayes Nets

David Reber2y40

Also, it appears that the two diagrams in the Frankenstein Rule section differ in their d-separation of (x_1 \indep x_4 | x_5) (which doesn't hold in the left one), so these are not actually equivalent (we can't have an underlying distribution satisfy both of these diagrams).

Some Rules for an Algebra of Bayes Nets

David Reber2y52

The theorems in this post all say something like "if the distribution (approximately) factors according to <some DAGs>, then it also (approximately) factors according to <some other DAGs>"

So one motivating research question might be phrased as "Probability distributions have an equivalence class of Bayes nets / causal diagrams which are all compatible. But what is the structure within a given equivalence class? In particular, if we have a representative Bayes net of an equivalence class, how might we algorithmically generate other Bayes nets in that equivalence class?"
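For the exact (non-approximate) case, one standard handle on this question is the Verma–Pearl criterion: two DAGs are Markov equivalent iff they share the same skeleton and the same v-structures. The sketch below checks that criterion on hypothetical three-node examples. The function names are my own, the DAGs are given as edge lists over an assumed common node set, and this is not necessarily the notion of equivalence the post's approximate theorems are after.

```python
from itertools import combinations


def skeleton(edges):
    """Undirected version of the DAG: the set of adjacent node pairs."""
    return {frozenset(edge) for edge in edges}


def v_structures(edges):
    """Colliders a -> c <- b in which a and b are not adjacent."""
    skel = skeleton(edges)
    parents = {}
    for parent, child in edges:
        parents.setdefault(child, set()).add(parent)
    found = set()
    for child, pars in parents.items():
        for a, b in combinations(sorted(pars), 2):
            if frozenset((a, b)) not in skel:
                found.add((a, child, b))
    return found


def markov_equivalent(edges1, edges2):
    """Verma-Pearl check; assumes both DAGs are over the same node set."""
    return (skeleton(edges1) == skeleton(edges2)
            and v_structures(edges1) == v_structures(edges2))


# Hypothetical examples: each edge is (parent, child).
chain = [("x1", "x2"), ("x2", "x3")]
reversed_chain = [("x3", "x2"), ("x2", "x1")]
fork = [("x2", "x1"), ("x2", "x3")]
collider = [("x1", "x2"), ("x3", "x2")]

if __name__ == "__main__":
    print(markov_equivalent(chain, reversed_chain))
    print(markov_equivalent(chain, fork))
    print(markov_equivalent(chain, collider))
```

The chain, reversed chain, and fork all land in one equivalence class, while the collider sits in its own, since it introduces a v-structure the others lack.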

Some Rules for an Algebra of Bayes Nets

David Reber2yΩ230

Could you clarify how this relates to e.g. the PC (Peter-Clark) or FCI (Fast Causal Inference) algorithms for causal structure learning?

Like, are you making different assumptions (than e.g. minimality, faithfulness, etc)?

Introduction to Towards Causal Foundations of Safe AGI

David Reber3y30

So the contributions of vnm theory are shrunken down into "intention"?

(Background: I consider myself fairly well-read w.r.t. causal incentives, not very familiar with VNM theory, and well-versed in Pearlian causality. I have gotten a sneak peek at this sequence, so I have a good sense of what's coming.)

I'm not sure I understand VNM theory, but I would suspect the relationship is more like "VNM theory and <this agenda> are two takes on how to reason about the behavior of agents, and they both refer to utilities and Bayesian networks, but have important differences in their problem statements (and hence, in their motivations, methodologies, exact assumptions they make, etc)".

I'm not terribly confident in that appraisal at the moment, but perhaps it helps explain my guess for the next question:

Will you recapitulate that sort of framing (such as involving the interplay between total orders and real numbers)

Based on my (decent?) level of familiarity with the causal incentives research, I don't think there will be anything like this. Just because two research agendas use a few of the same tools doesn't mean they're answering the same research questions, let alone sharing methodologies.

...or are you feeling more like it's totally wrong and should be thrown out?

When two research agendas are distinct enough (as I suspect VNM and this causal-framing-of-AGI-safety are), their respective successes and failures are quite independent. In particular, I don't think the authors' choice to pursue this research direction over the last few years should be taken by itself as a strong commentary on VNM.

But maybe I didn't fully understand your comment, since I haven't read up on VNM.

Shutdown-Seeking AI

David Reber3yΩ020

Distinguish two types of shutdown goals: temporary and permanent. These types of goals may differ with respect to entrenchment. AGIs that seek temporary shutdown may be incentivized to protect themselves during their temporary shutdown. Before shutting down, the AGI might set up cyber defenses that prevent humans from permanently disabling it while ‘asleep’. This is especially pressing if the AGI has a secondary goal, like paperclip manufacturing. In that case, protection from permanent disablement increases its expected goal satisfaction. On the other hand, AGIs that desire permanent shutdown may be less incentivized to entrench.

It seems like an AGI built to desire permanent shutdown may have an incentive to permanently disempower humanity, then shut down. Otherwise, there's a small chance that humanity may revive the AGI, right?

Steering GPT-2-XL by adding an activation vector

David Reber3yΩ140

Another related work: Concept Algebra for Text-Controlled Vision Models (Disclosure: while I did not author this paper, I am in the PhD lab that did, under Victor Veitch at UChicago. Any mistakes made in this comment are my own). We haven't prioritized a blog post about the paper, so it makes sense that this community isn't familiar with it.

The concept algebra paper demonstrates that for text-to-image models like Stable Diffusion, there exist linear subspaces in the score embedding space on which you can do the same manner of concept editing/control as with word2vec.

Importantly, the paper comes with some theoretical investigation into why this might be the case, including articulating necessary assumptions/conditions (which this purely-empirical post does not).

I conjecture that <some activation additions in this post fail to have the desired effect> because they violate conditions analogous to those in Concept Algebra: looking at section E.1 of the paper's appendix gives me a sense of déjà vu, since it shows empirical results which fail to act as expected when the conditions of completeness and causal separability don't hold.

EIS V: Blind Spots In AI Safety Interpretability Research

David Reber3yΩ010

Also, just to make sure we share a common understanding of Schölkopf 2021: Wouldn't you agree that asking "how do we do causality when we don't even know the level of abstraction at which to define causal variables?" is beyond the "usual Pearl causality story" as usually summarized in FFS posts? It certainly goes beyond Pearl's well-known works.

EIS V: Blind Spots In AI Safety Interpretability Research

David Reber3yΩ010

I don't think my claim is that "FFS is already subsumed by work in academia": as I acknowledge, FFS is a different theoretical framework than Pearl-based causality. I view them as two distinct approaches, but my claim is that they are motivated by the same question (that is, how to do causal representation learning).

It was intentional that the linked paper is an intro survey paper to the Pearl-ish approach to causal rep. learning: I mean to indicate that there are already lots of academic researchers studying the question "what does it mean to study causality if we don't have pre-defined variables?"

It may be that FFS ends up contributing novel insights above and beyond <Pearl-based causal rep. learning>, but a priori I expect this to occur only if FFS researchers are familiar with the existing literature, which I haven't seen mentioned in any FFS posts.

My line of thinking is: It's hard to improve on a field you aren't familiar with. If you're ignorant of the work of hundreds of other researchers who are trying to answer the same underlying question you are, odds are against your insights being novel / neglected.
