All of CallumMcDougall's Comments + Replies

Thanks, really appreciate this (and the advice for later posts!)

Yep, definitely! If you're using MSE loss then it's pretty straightforward to use backprop to see how importance relates to the loss function. Also, if you're interested, I think Redwood's paper on capacity (which is the same as what Anthropic calls dimensionality) looks at the derivative of loss with respect to the capacity assigned to a given feature.
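A minimal sketch of the backprop idea (my own illustration, with a hypothetical toy setup): insert a per-feature scale `s` (initialized at 1, so it changes nothing) into a linear reconstruction map, and use the gradient of the MSE loss with respect to `s` as an importance estimate for each feature.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: x -> ((x * s) @ W) @ W.T, where the per-feature
# scale s = 1 is inserted purely so that dL/ds_i estimates how much
# feature i matters to the MSE loss.
n_features, d_hidden, batch = 6, 3, 256
W = rng.normal(size=(n_features, d_hidden)) / np.sqrt(d_hidden)
s = np.ones(n_features)

# Sparse-ish toy features.
x = (rng.random((batch, n_features)) < 0.2) * rng.random((batch, n_features))

M = W @ W.T
y = x * s
x_hat = y @ M
loss = np.mean((x_hat - x) ** 2)

# Backprop by hand: dL/dy = (2 / (batch * n_features)) * (x_hat - x) @ M.T,
# then dL/ds_i = sum_b dL/dy_{b,i} * x_{b,i} (since y = x * s elementwise).
dL_dy = (2.0 / x.size) * (x_hat - x) @ M.T
importance = np.abs((dL_dy * x).sum(axis=0))
print(importance)  # one importance estimate per feature
```

In practice you'd accumulate this gradient over several batches (as suggested below) rather than reading it off a single one.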

1wassname1mo
Huh, I actually tried this: training IA3, which multiplies activations by a float, then using that float as the importance of that activation. It seems like a natural way to use backprop to learn an importance matrix, but it gave only small (1-2%) increases in accuracy. Strange. I also tried using a VAE, and introducing sparsity by tokenizing the latent space, and this seems to work. At least, probes can overfit to a complex concept using the learned tokens.
1wassname2mo
Oh that's very interesting, Thank you.

Good question! In the first batch of exercises (replicating toy models of interp), we play around with different importances. There are some interesting findings here (e.g. when you decrease sparsity to the point where you no longer represent all features, it's usually the lower-importance features which collapse first). I chose not to have the SAE exercises use varying importance, although it would be interesting to play around with this and see what you get!

As for what importance represents, it's basically a proxy for "how much a certain feature reduces ...

1wassname2mo
Thanks, that makes a lot of sense, I had skimmed the Anthropic paper and saw how it was used, but not where it comes from. If it's the importance to the loss, then theoretically you could derive one using backprop I guess? E.g. the accumulated gradient to your activations, over a few batches.

Winner = highest-quality solution over the time period of a month (solutions get posted at the start of the next month, along with a new problem).

Note that we're slightly de-emphasising the competition side now that there are occasional hints which get dropped during the month in the Slack group. I'll still credit the best solution in the Slack group & next LW post, but the choice to drop hints was to make the problem more accessible and hopefully increase the overall reach of this series.

Thanks for the latter point, glad you got that impression!

These are super valid concerns, and it's true that there's lots of information we won't have for a while. That said, we also have positive evidence from the first iteration of ARENA (which is about a year old now). There were only 5 full-time participants, and they've all gone on to do stuff I'm excited about, including the following (note that obviously some of these 5 have done more than one of the things on this list):

• internships at CHAI,
• working with Owain Evans (including some recent papers),
...

Upvoted overall karma, because I think this is a valuable point to bring up and we could have done a better job discussing it in this post.

To be clear, contribution to capabilities research is a very important consideration for us, and apologies if we didn't address this comprehensively here. A few notes to this effect:

• We selected strongly on prior alignment familiarity (particularly during the screening & interview process), and advertised it in locations we expected to mainly attract people who had prior alignment familiarity
• We encouraged interaction
...
7jacquesthibs5mo
Just because I think these points are worth considering:

• People who skill up in capabilities may then be thrown into an ecosystem that is unstable, less lucrative, and has fewer opportunities. For this reason, they could find themselves applying for capabilities engineering roles even though, in an ideal world, they would prefer not to - especially if they have been trained in skills similar to those that would be useful in such a role.
• What students do in the year following their upskilling is unfortunately not too indicative of which kind of role they will have in 2-3 years.

That said, of the ARENA fellows I've met, I think you are doing a good job of picking aligned individuals, and I expect to continue to see good things from them.

Hi, sorry for the late response! The layer 0 attention head should have query at position 1, and value at position 0 (same as key). Which diagram are you referring to?

(context: I ran the most recent iteration of ARENA, and after this I joined Neel Nanda's mech interp stream in SERI MATS)

Registering a strong pushback to the comment on ARENA. The primary purpose of capstone projects isn't to turn people into AI safety technical researchers or to produce impressive capstones, it's to give people engineering skills & experience working on group projects. The initial idea was not to even push for things that were safety-specific (much like Redwood's recommendations - all of the suggested MLAB2 capstones were either mech ...

Yep, the occlusion effect is pretty large for colored images, which is why I use a layering system (e.g. 20% of all white threads, then 20% of all blue, then 20% of black, and cycle through). I go in reverse order, so the ones found first by the algorithm are the last ones to be placed. I also put black on top and white lowest down, cause white on top looks super jarring. In the colab you can play around with the order of the threads. If you reverse the order then the image looks really bad. You can also create gifs of the image formin...
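The layering scheme described above could be sketched roughly like this (a hypothetical illustration, not the actual colab code - the function name and data layout are my own):

```python
def layered_order(threads_by_color, batch_frac=0.2,
                  color_order=("white", "blue", "black")):
    """Interleave colors in batches of `batch_frac` of each color's threads.
    Each color's threads are placed in reverse of the order the algorithm
    found them, so found-first threads end up placed last (on top)."""
    placement = []
    n_batches = round(1 / batch_frac)
    for i in range(n_batches):
        for color in color_order:       # white lowest, black on top per cycle
            threads = threads_by_color[color][::-1]
            size = len(threads) // n_batches
            placement.extend(threads[i * size:(i + 1) * size])
    return placement

# Example: thread indices per color, in the order the algorithm found them.
threads = {"white": list(range(10)), "blue": list(range(10, 20)),
           "black": list(range(20, 30))}
order = layered_order(threads)
```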

Thanks so much for this comment, I really appreciate it! Glad it was helpful for you 🙂

Thanks, really appreciate it!

Yep that's right, thanks! Corrected.

huh interesting, I wasn't aware of this, thanks for sending it!

Thanks for the suggestion! I've edited the first diagram to clarify things, is this what you had in mind?

2Dentin9mo
Yep, that pretty much handles it. Thanks for the update!

The first week of WMLB / MLAB maps quite closely onto the first week of ARENA, with a few exceptions (ARENA includes PyTorch Lightning, plus some more meta stuff like typechecking, VSCode testing and debugging, using GPT in your workflow, etc). I'd say that starting some way through the second week would probably be most appropriate. If you didn't want to repeat stuff on training / sampling from transformers, the mech interp material would start on Wednesday of the second week.

Resolved by private message, but I'm just mentioning this here for others who might be reading this - we didn't have confirmation emails set up, but we expect to send out coding assessments to applicants tomorrow (Monday 24th April). For people who apply after this point, we'll generally try to send out coding assessments no later than 24 hours after your application.

Yeah, I think this would be possible. In theory, you could do something like:

• Study relevant parts of the week 0 material before the program starts (we might end up creating a virtual group to accommodate this, which also contains people who either don't get an offer or can't attend but still want to study the material.)
• Join at the start of the 3rd week - at that point there will be 3 days left of the transformers chapter (which is 8 days long and has 4 days of core content), so you could study (most of) the core content and then transition to RL with the r
...
2DragonGod10mo
Sounds good, will do!

Not a direct answer, but this post has a ton of useful advice that I think would be applicable here: https://www.neelnanda.io/blog/mini-blog-post-19-on-systems-living-a-life-of-zero-willpower

Yep, fixed, thanks!

Or "prompting" ? Seems short and memorable, not used in many other contexts so its meaning would become clear, and it fits in with other technical terms that people are currently using in news articles, e.g. "prompt engineering". (Admittedly though, it might be a bit premature to guess what language people will use!)

1Mateusz Bagiński1y
Maybe, though prompting refers more generally to giving prompts in order to get the right kind of response/behavior from the LLM, not necessarily using it as a smarter version of a search engine

This is awesome, I love it! Thanks for sharing (-:

Thanks, really appreciate it!

I think some of the responses here do a pretty good job of this. It's not really what I intended to go into with my post since I was trying to keep it brief (although I agree this seems like it would be useful).

And yeah, despite a whole 16-lecture course on convex optimisation I still don't really get Bregman either; I skipped the exam questions on it 😆

Oh yeah, I hadn't considered that one. I think it's interesting, but the intuitions are better in the opposite direction, i.e. you can build on good intuitions for KL divergence to better understand MI. I'm not sure if you can easily get intuitions to point in the other direction (i.e. from MI to KL divergence), because this particular expression has MI as an expectation over KL divergences, rather than the other way around. E.g. I don't think this expression illuminates the nonsymmetry of KL divergence.
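For reference, the identity being discussed (a standard one, written out as I understand the comment) is:

```latex
I(X;Y) \;=\; \mathbb{E}_{y \sim p(y)}\!\left[ D_{\mathrm{KL}}\big(\, p(x \mid y) \,\|\, p(x) \,\big) \right]
```

i.e. MI is an expectation (over one variable) of KL divergences, which is why intuitions flow more naturally from KL to MI than the reverse.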

The way it's written here seems more illuminating (not sure if that's...

2TekhneMakre1y
Yeah, I was vaguely hoping one could phrase $P$ and $Q$ so they're in that form, but I don't see it.

Oh yeah, I really like this one, thanks! The intuition here is again that a monomodal distribution is a bad model for a bimodal one because it misses out on an entire class of events, but the other way around is much less bad because there's no large class of events that happen in reality but that your model fails to represent.
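A minimal numerical sketch of this asymmetry (my own illustration, not from the original thread): reality is bimodal, the model only covers one mode, and the two KL directions differ hugely.

```python
import numpy as np

def kl(p, q):
    # Discrete KL divergence D_KL(p || q) in nats; assumes p, q are
    # normalized and q > 0 wherever p > 0.
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

x = np.linspace(-6, 6, 1201)

def normal(mu, sigma):
    d = np.exp(-0.5 * ((x - mu) / sigma) ** 2)
    return d / d.sum()

# "Reality" is bimodal; the "model" is monomodal and only covers one mode.
reality = 0.5 * normal(-2, 0.5) + 0.5 * normal(2, 0.5)
model = normal(2, 0.5)

# KL(reality || model) is huge: the model assigns ~zero probability to the
# mode at -2, an entire class of events that happen in reality.
# KL(model || reality) is small (~log 2): everywhere the model puts mass,
# reality puts roughly half as much.
print(kl(reality, model), kl(model, reality))
```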

For people reading here, this post discusses this idea in more detail. The image to have in mind is this one:

Love that this exists! Looks like the material here will make great jumping off points when learning more about any of these orgs, or discussing them with others

Thanks Nihalm, also I wasn't aware of it being free! CraigMichael, maybe you didn't find it cause it's under "Rationality: From AI to Zombies" rather than "Sequences"?

The narration is pretty good imo, although one disadvantage is it's a pain to navigate to specific posts cause they aren't titled (it's the whole thing, not the highlights).

Yep those were both typos, fixed now, thanks!

Personally I feel like the value from doing more non-Sequence LW posts is probably highest, since the Sequences already exist on Audible (you can get all books for a single credit), and my impression is that wiki tags wouldn't generalise to audio format particularly well. One idea might be to have some kind of system where you can submit particular posts for consideration and/or vote on them, which could be (1) recent ones that weren't otherwise going to be recorded, or (2) old non-Sequence classics like "ugh fields".

2CraigMichael1y
Searching for this on Audible now and not finding it.

I think the key point here is that we're applying a linear transformation to move from neuron space into feature space. Sometimes neurons and features do coincide and you can actually attribute particular concepts to neurons, but unless the neurons are a privileged basis there's no reason to expect this in general. We're taking the definition of feature here as a linear combination of neurons which represents some particular important and meaningful (and hopefully human-comprehensible) concept.
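A toy illustration of this framing (my own sketch, with made-up numbers): a "feature" as a direction in neuron space, whose activation is read off by a linear projection rather than by looking at any single neuron.

```python
import numpy as np

rng = np.random.default_rng(0)

# Neuron activations for a batch of inputs: shape (batch, n_neurons).
n_neurons, batch = 4, 8
acts = rng.normal(size=(batch, n_neurons))

# A feature is a (unit-norm) linear combination of neurons, i.e. a
# direction in neuron space - not necessarily aligned with any one neuron.
feature_dir = np.array([0.5, -0.5, 0.5, 0.5])
feature_dir /= np.linalg.norm(feature_dir)

# The feature's "activation" on each input is the projection of the
# neuron activations onto that direction.
feature_acts = acts @ feature_dir
print(feature_acts.shape)
```

If the feature direction happened to be a standard basis vector, this would reduce to reading off a single neuron - the special case where neurons and features coincide.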

Probably the best explanation of this comes from John Wentworth's recent AXRP podcast, and a few of his LW posts. To put it simply, modularity is important because modular systems are usually much more interpretable (case in point: evolution has produced highly modular designs, e.g. organs and organ systems, whereas genetic algorithms for electronic circuit design frequently fail to find modular designs, which makes them really hard for humans to interpret and to verify that they'll work as expected). If we understood a bit more about the factors that...

This may not exactly answer the question, but I'm in a research group which is studying selection for modularity, and yesterday we published our fourth post, which discusses the importance of causality in developing a modularity metric.

TL;DR - if you want to measure information exchanged in a network, you can't just observe activations, because two completely separate tracks of the network measuring the same thing will still have high mutual information even though they're not communicating with each other (the input is a confounder for both of them). Inst...
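The confounding problem in the TL;DR can be shown with a tiny simulation (my own sketch): two "tracks" that both copy the same input have high mutual information despite never communicating, and only an intervention reveals the absence of a causal link.

```python
import numpy as np

rng = np.random.default_rng(0)

def mutual_info(a, b):
    # Plug-in MI estimate (in nats) for two binary arrays.
    mi = 0.0
    for va in (0, 1):
        for vb in (0, 1):
            p_ab = np.mean((a == va) & (b == vb))
            p_a, p_b = np.mean(a == va), np.mean(b == vb)
            if p_ab > 0:
                mi += p_ab * np.log(p_ab / (p_a * p_b))
    return mi

n = 100_000
x = rng.integers(0, 2, n)      # shared input: the confounder
track1 = x.copy()               # two separate tracks, no communication
track2 = x.copy()

# Observationally, the tracks look tightly coupled (MI near log 2)...
print(mutual_info(track1, track2))

# ...but intervening on track1 (setting it independently of x) reveals
# that it has no causal influence on track2 (MI near 0).
track1_do = rng.integers(0, 2, n)
print(mutual_info(track1_do, track2))
```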

2ChristianKl2y
What is selection for modularity and why is it important?

I guess another point here is that we won't know how different (for example) our results from sampling from the training distribution will be from our results if we just run the network on random noise and then intervene on neurons; this would be an interesting thing to test experimentally. If they're very similar, this neatly sidesteps the problem of deciding which one is more "natural", and if they're very different then that's also interesting.

Yeah I think the key point here more generally (I might be getting this wrong) is that C represents some partial state of knowledge about X, i.e. macro rather than micro-state knowledge. In other words it's a (non-bijective) function of X. That's why (b) is true, and the equation holds.

A few of Scott Alexander's blog posts (made into podcast episodes) are really good (he's got a sequence summarising the late-2021 MIRI conversations; the Bio Anchors and Takeoff Speeds ones I found especially informative & comprehensible). These don't make up the bulk of the content and aren't super technical, but I thought I'd mention them anyway.

4steven04612y
Note that the full 2021 MIRI conversations are also available (in robot voice) in the Nonlinear Library archive.

Yeah, I think this is Evan's view. This is from his research agenda (I'm guessing you might have already seen this given your comment, but I'll add it here for reference anyway in case others are interested):

I suspect we can in fact design transparency metrics that are robust to Goodharting when the only optimization pressure being applied to them is coming from SGD, but cease to be robust if the model itself starts actively trying to trick them.

And I think his view on deception through inner optimisation pressure is that this is something we'll basically be...

Okay I see, yep that makes sense to me (-:

Source: original, but motivated by trying to ground WFLL1-type scenarios in what we already experience in the modern world, so heavily based on this. Also the original idea came from reading Neel Nanda’s “Bird's Eye View of AI Alignment - Threat Models"

Intended audience: mainly policymakers

A common problem in the modern world is when incentives don’t match up with value being produced for society. For instance, corporations have an incentive to profit-maximise, which can lead to producing value for consumers, but can also involve less ethical strategies su
...
1trevor2y
Please make more submissions! If EA orgs are looking for good metaculus prediction records, they'll probably look for evidence of explanatory writing on AI as well. You can put large numbers of contest entries on your resume, to prove that you're serious about explaining AI risk.

Thanks for the post! I just wanted to clarify what concept you're pointing to with use of the word "deception".

From Evan's definition in RFLO, deception needs to involve some internal modelling of the base objective & training process, and instrumentally optimising for the base objective. He's clarified in other comments that he sees "deception" as only referring to inner alignment failures, not outer (because deception is defined in terms of the interaction between the model and the training process, without introducing humans into the picture). This...

3peterbarnett2y
Yeah, I totally agree. My motivation for writing the first section was that people use the word 'deception' to refer to both things, and then make what seem like incorrect inferences. For example, current ML systems do the 'Goodhart deception' thing, but then I've heard people use this to imply that it might be doing 'consequentialist deception'.  These two things seem close to unrelated, except for the fact that 'Goodhart deception' shows us that AI systems are capable of 'tricking' humans.

Oh wow, I wish I'd come across that plugin previously, that's awesome! Thanks a bunch (-:

Sorry for forgetting to reply to this at first!

There are 2 different ways I create code cards: one is in Jupyter notebooks, and one is the "normal way", i.e. using the Anki editor. I've just created a GitHub repo describing the second one:

https://github.com/callummcdougall/anki_templates

Please let me know if there's anything unclear here!

3Edward Rees2y
Thanks very much - those templates are working great for me! I wasn't able to use AutoHotkey as I'm on a Mac, but was able to use this plugin to automate the code block/input field creation in a similar way.

Thanks! Yeah, so there is one add-on I use for tag management. It's called Search and Replace Tags: basically, you can select a bunch of cards in the browser and press Ctrl+Alt+Shift+T to change them. When you press that, you get to choose any tag that's possessed by at least one of the cards you're selecting, and replace it with any other tag.

There are also built-in Anki features to add, delete, and clear unused tags (to find those, right-click on selected cards in the browser, and hover over "Notes"). I didn't realise those existed for a long time, was pretty annoyed when I found them! XD

Hope this helps!

1mukashi2y
Great, that's exactly what I needed. I bookmarked your post and will use it for sure, thanks for all the effort

It seems like an environment that changes might cause modularity. Though, aside from trying to make something modular, it seems like it could potentially fall out of stuff like 'we want something that's easier to train'.

This seems really interesting in the biological context, and not something we discussed much in the other post. For instance, if you had two organisms, one modular and one not, even if there's currently no selection advantage for the modular one, it might just be trained much faster and hence be more likely to hit on a good solution before the nonmodular network (i.e. just because it's searching over parameter space at a faster rate).

2Pattern2y
Or the less modular one can't train (evolve) as fast when the environment changes. (Or, it changes faster, enabling it to travel to different environments.) Biology kind of does both (modular and integrated), a lot. Like, I want to say part of why the brain is hard to understand is because of how integrated it is.

What's going on in the brain? I saw one answer to this that says 'it is this complicated in order to obfuscate, to make it harder to hack; this mess has been shaped by parasites, which it is designed to shake off; that is why it is a mess, and might just throw some neurotoxin in there. Why? To kill stuff that's trying to mess around in there.' (That is just from memory/reading a review on a blog, and you should read the paper/later work https://www.journals.uchicago.edu/doi/10.1086/705038)

I want to say integrated a) (often) isn't as good (separating concerns is better), but b) it's cheaper to re-use stuff and have it serve multiple purposes. Breathing through the same area you drink water/eat food through can cause issues. But integrating also allows improvements/increased efficiency (although I want to say, in systems people make, it can make it harder to refine or improve the design).
Reasoning: Training independent parts to each perform some specific sub-calculation should be easier than training the whole system at once.

Since I've not been involved in this discussion for as long I'll probably miss some subtlety here, but my immediate reaction is that "easier" might depend on your perspective - if you're explicitly enforcing modularity in the architecture (e.g. see the "Direct selection for modularity" section of our other post) then I agree it would be a lot easier, but whether modular systems are selected for when they're being trai...

Yep thanks! I would imagine if progress goes well on describing modularity in an information-theoretic sense, this might help with (2), because information entanglement between a single module and the output would be a good measure of "relevance" in some sense

Thanks for the comment!

Subtle point: I believe the claim you're drawing from was that it's highly likely that the inputs to human values (i.e. the "things humans care about") are natural abstractions.

To check that I understand the distinction between those two: inputs to human values are features of the environment around which our values are based. For example, the concept of liberty might be an important input to human values because the freedom to exercise your own will is a natural thing we would expect humans to want, whereas humans can dif...