keith_wynroe


Sorry for the delay - thanks for this! Yeah, I agree: in general the OV circuit seems like it'll be much easier, given that it doesn't have the bilinearity or the softmax issue. I think the idea you sketch here sounds really promising, and pretty in line with some of the things we're trying atm

I think the tough part will be the next step: somehow "stitching together" the QK and OV decompositions into an end-to-end understanding of what the whole attention layer is doing. That said, the extent to which we should be thinking of the QK and OV circuits as totally independent is still unclear to me
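For concreteness, here's a minimal NumPy sketch (all dimensions made up) of the decomposition being discussed: the QK circuit `W_Q @ W_K.T` determines *where* a head attends, the OV circuit `W_V @ W_O` determines *what* it moves, and "stitching" amounts to composing the two to recover the head's full output.

```python
import numpy as np

# Hypothetical dimensions for a single attention head.
d_model, d_head, n_tokens = 16, 4, 5
rng = np.random.default_rng(0)
X = rng.normal(size=(n_tokens, d_model))  # residual-stream inputs
W_Q = rng.normal(size=(d_model, d_head))
W_K = rng.normal(size=(d_model, d_head))
W_V = rng.normal(size=(d_model, d_head))
W_O = rng.normal(size=(d_head, d_model))

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

# QK circuit: one bilinear form decides where to attend.
qk_circuit = W_Q @ W_K.T                      # (d_model, d_model)
pattern = softmax(X @ qk_circuit @ X.T / np.sqrt(d_head))

# OV circuit: one low-rank map decides what gets copied.
ov_circuit = W_V @ W_O                        # (d_model, d_model), rank <= d_head
head_out = pattern @ X @ ov_circuit

# Matches the standard q/k/v computation of the head.
q, k, v = X @ W_Q, X @ W_K, X @ W_V
ref = softmax(q @ k.T / np.sqrt(d_head)) @ v @ W_O
assert np.allclose(head_out, ref)
```

The bilinearity problem mentioned above is visible here: the pattern depends on the input on *both* sides of `qk_circuit`, whereas the OV path is linear in the attended-to input.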

Interested to hear more about your work, though! Being able to replace the entire model sounds impressive, given how much reconstruction errors seem to compound

 

Thanks!

The auxiliary losses were something we settled on quite early, and we've made some improvements to the methodology since then for the current results, so I don't have great apples-to-apples comparisons for you. The losses didn't seem super important, though, in the sense that runs would still converge without them; they'd just take longer and end with slightly worse reconstruction error. I think it's very likely that with a better training set-up and better hyperparameter tuning you could drop them entirely and be fine.

Re: the comparison to SAEs, you mean what do the dictionaries/feature maps have to look like if you're explicitly targeting L2 reconstruction error and just getting pattern reconstruction as a side effect? If so, we also looked at this briefly early on. We didn't spend a huge amount of time on these, so they were probably not optimally trained, but we were finding that to get the L2 reconstruction error low enough to yield comparably good pattern reconstruction, we needed to go up to a d_hidden of 16,000, i.e. comparable to residual-stream SAEs for the same layer. Which I think is another data point in favour of "a lot of the variance in head-space is attention-irrelevant and just inherited from the residual stream"
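To make the contrast concrete, here's a hypothetical sketch of the two objectives being compared: the standard SAE L2 reconstruction loss versus a loss that only cares about reproducing the head's attention pattern. The `qk` matrix, function names, and dimensions are all made up for illustration.

```python
import numpy as np

def l2_loss(x, x_hat):
    """Standard SAE objective: reconstruct the activations directly."""
    return ((x - x_hat) ** 2).sum(-1).mean()

def pattern_loss(x, x_hat, qk, d_head=4):
    """Reconstruct only what the head 'sees': its attention pattern.
    Measured as mean KL divergence between the patterns computed
    from the true inputs x and the reconstructions x_hat."""
    def pat(z):
        s = z @ qk @ z.T / np.sqrt(d_head)
        s = s - s.max(-1, keepdims=True)
        e = np.exp(s)
        return e / e.sum(-1, keepdims=True)
    p, q = pat(x), pat(x_hat)
    return (p * (np.log(p) - np.log(q))).sum(-1).mean()
```

The point above is that a reconstruction can have substantial L2 error in pattern-irrelevant directions while leaving `pattern_loss` near zero, so targeting L2 directly forces a much wider dictionary than targeting the pattern.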

This looks really cool! I haven't digested it all yet, but I'm especially interested in the QK superposition, as I'm working on something similar. I'm wondering what your thoughts are on the number of bigrams represented by a QK circuit being bounded not by interference but by its interaction with the OV circuit. IIUC, a head can store a surprisingly large number of d_resid bigrams, but since the OV circuit is only a function of the key, having the same key feature in a clique with a large number of different query features means the OV circuit will be unable to differentially copy information based on which bigram is present. I don't think this has been explored outside of Anthropic's toy models, though
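A toy illustration of that point (all names and dimensions hypothetical): the vector a head writes into the residual stream depends only on the key-side input, so two bigrams sharing a key feature get the identical vector copied, regardless of which query feature was active.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_head = 16, 4
W_V = rng.normal(size=(d_model, d_head))
W_O = rng.normal(size=(d_head, d_model))
ov = W_V @ W_O                      # OV circuit: a function of the key side only

key = rng.normal(size=d_model)      # one key-side feature
query_a = rng.normal(size=d_model)  # two different query-side features
query_b = rng.normal(size=d_model)  # that both attend to `key`

# What gets written when `key` is attended to:
moved = key @ ov
# There is no query dependence in `moved`: the same vector is copied
# whether the active bigram was (query_a, key) or (query_b, key).
# The query features only modulate *how much* attention lands on `key`.
```

So even if the QK circuit can represent many bigrams sharing a key with low interference, the OV circuit can't route different information for each of them.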

I know they flag it in the paper, but seeing the performance curves for the strong model on zero- and few-shot attempts really makes me think the data-leakage issue is doing a lot of the work here. If you get the majority(?) of the PGR from e.g. 5-shot prompting, a natural takeaway is that the strong model doesn't actually need to be fine-tuned on the task, and the weak supervisor is just eliciting knowledge that's already there

Sorry you found it so stressful! I’m not objecting to you deciding it’s not worth your time to engage; what I’m getting at is a perceived double standard in when this kind of criticism is applied. You say:

I do not think that the thing I am observing from Pope/Belrose is typical of LW/AF/rationalist/MIRI/etc behaviors to anything like the same degree that they consistently do it

But this seems wrong to me. The best analogue of your post, from Quintin’s perspective, was his own post laying out disagreements with Eliezer. Eliezer’s response to it was to say it was too long for him to bother reading, which imo is far worse. AFAICT his response to you in your post is higher-effort than the responses from MIRI people to his arguments all put together. Plausibly we have different clusters in our heads of who we’re comparing him to, though - I agree a wider set of LW people are much more engaged; I’m specifically comparing to e.g. Nate and Eliezer, as that feels to me a fairer comparison

To go into the specific behaviours you mention:

I basically don't see him changing his mind about anything, agreeing a good point was made

I don’t think this makes sense - if, from his perspective, you didn’t make good points or change his mind, then what was he supposed to do? If you still think you did and he’s not appreciating them, that’s fair, but it’s more a restatement of the initial disagreement. I also don’t see this behaviour from Eliezer or Nate?

addressing my arguments or thoughts on their merits rather than correcting my interpretation of his arguments, asking me questions, suggesting cruxes and so on.

I again don’t see Eliezer doing any of this in responses to critical posts either?

Where he notes disagreement he says he's baffled anyone could think such a thing and doesn't seem curious why I might think it

Again, this seems to be a feature of many MIRI-cluster responses. Stating that certain things feel obvious from the inside, and that you don’t get why it’s so hard for other people to grok them, is a common refrain.

And all of this is asserted as, essentially, obvious and undeniable, extreme confidence is displayed, all the arguments offered against this are invalid and dumb, and those that disagree are at best deeply confused and constantly told they did not understand or fairly represent what was said.

This feels unnecessarily snarky, but it is also pretty much exactly the experience a lot of people have trying to engage with Yudkowsky et al. It feels weird to bring up “they’re very confident and say that their critics just don’t get it” as a put-down here.

It seems doubly bad because it really seems like a lot of the more pessimistic crowd just genuinely aren’t trying to engage with these ideas at all. Nate wrote one post after skimming the piece, which badly misread it, and Yudkowsky AFAICT has engaged via at most a couple of tweets (which, again, don’t seem to engage with the points). This is concurrent with both of them engaging much more heavily with weaker objections to which they already have easy answers.

I genuinely don’t understand why a group that is highly truth-seeking and dispassionately interested in the validity of its very consequential arguments feels so little reason to engage with well-received counter-arguments to its core claims.

I tried one reply to one of Pope’s posts

From your post, you seem to have misunderstood Quintin’s arguments in a way he explains pretty clearly, and then there isn’t really much follow-up. You don’t seem to have demonstrated you can pass an ITT after this, and I think if Yudkowsky were in Pope’s position and someone effectively wrote him off as hopeless after one failed attempt to understand each other, you would probably not be as forgiving.

I understand it’s a proposition like any other; what I don’t see is why an agent would reflect on it, or use it in their deliberation, to decide what to do. The fact that they’re a CDT agent is a fact about how they will act in the decision, not a fact they need to invoke in their deliberation

Analogously with preferences: whether an agent prefers A or B is a proposition like any other, but I don’t think it’s natural to model them as first consulting the credences they have assigned to “I prefer A to B” etc. Rather, they will just choose A, ex hypothesi, because that’s what having the preference means.

Why would they be uncertain about whether they’re a CDT agent? Being a CDT agent just means, by definition, that they evaluate decisions based on causal outcomes. It feels confused to say they have to be uncertain about, or reflect on, which decision theory they have and then apply it, rather than their being a CDT agent being an ex hypothesi fact about how they behave

Why not? Is it common for NDAs/non-disparagement agreements to also have a clause stating the parties aren’t allowed to tell anyone about the agreement itself? I’ve never heard of this outside of super-injunctions, which seem like a pretty separate thing

They can presumably confirm whether or not there is a non-disparagement agreement, and whether it is what's preventing them from commenting, though, right?
