All of TheMcDouglas's Comments + Replies

Examples of practical implications of Judea Pearl's Causality work

Probably the best explanation of this comes from John Wentworth's recent AXRP podcast and a few of his LW posts. To put it simply, modularity is important because modular systems are usually much more interpretable. Case in point: evolution has produced highly modular designs (e.g. organs and organ systems), whereas genetic algorithms for electronic circuit design frequently fail to find modular designs, so their outputs are really hard for humans to interpret and to verify as working as expected. If we understood a bit more about the factors that...

Examples of practical implications of Judea Pearl's Causality work

This may not exactly answer the question, but I'm in a research group which is studying selection for modularity, and yesterday we published our fourth post, which discusses the importance of causality in developing a modularity metric.

TL;DR - if you want to measure information exchanged in a network, you can't just observe activations, because two completely separate tracks of the network measuring the same thing will still have high mutual information even though they're not communicating with each other (the input is a confounder for both of them). Inst...
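A minimal sketch of the confounding point above, in Python with NumPy. Everything here is an illustrative assumption rather than the research group's actual metric: the toy two-track "network", the 5% noise rate, and the helper names `mutual_information` and `run_network` are all hypothetical. Both tracks read the same binary input and never communicate, yet their observed activations share a lot of information. Overwriting one track with independent values (an intervention, which is the usual causal-inference remedy and is assumed here) makes that shared information vanish, which shows there was no actual information exchange.

```python
import numpy as np

def mutual_information(a, b):
    """Plug-in estimate of I(A;B) in bits for two discrete 1-D arrays."""
    a_vals, a_idx = np.unique(a, return_inverse=True)
    b_vals, b_idx = np.unique(b, return_inverse=True)
    joint = np.zeros((len(a_vals), len(b_vals)))
    np.add.at(joint, (a_idx, b_idx), 1)      # count co-occurrences
    joint /= joint.sum()                     # normalise to a joint distribution
    pa = joint.sum(axis=1, keepdims=True)    # marginal of A, shape (k, 1)
    pb = joint.sum(axis=0, keepdims=True)    # marginal of B, shape (1, m)
    nz = joint > 0                           # skip empty cells (0 * log 0 = 0)
    return float((joint[nz] * np.log2(joint[nz] / (pa @ pb)[nz])).sum())

def run_network(x, rng, do_a=None):
    """Hypothetical toy network: tracks a and b both read x, but never talk.

    If do_a is given, track a is overwritten with those values (an intervention).
    """
    flip_a = rng.random(x.shape) < 0.05      # assumed 5% noise on track a
    flip_b = rng.random(x.shape) < 0.05      # assumed 5% noise on track b
    a = np.where(flip_a, 1 - x, x) if do_a is None else do_a
    b = np.where(flip_b, 1 - x, x)           # b depends only on x, never on a
    return a, b

rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=20_000)          # shared binary input (the confounder)

# Observational: just watch activations. High MI despite zero communication.
a_obs, b_obs = run_network(x, rng)
mi_observational = mutual_information(a_obs, b_obs)

# Interventional: force track a to independent random values, then look at b.
a_forced = rng.integers(0, 2, size=x.shape)
a_int, b_int = run_network(x, rng, do_a=a_forced)
mi_interventional = mutual_information(a_int, b_int)

print(f"observational MI:  {mi_observational:.3f} bits")
print(f"interventional MI: {mi_interventional:.3f} bits")
```

The observational estimate should come out well above zero and the interventional one close to zero, since forcing track a cuts its dependence on the shared input while leaving track b untouched.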

What is selection for modularity and why is it important?
What Is The True Name of Modularity?

I guess another point here is that we don't know how different (for example) our results when sampling from the training distribution will be from our results when we just run the network on random noise and then intervene on neurons; this would be an interesting thing to test experimentally. If they're very similar, this neatly sidesteps the problem of deciding which one is more "natural", and if they're very different, then that's also interesting.

Deconfusing Landauer's Principle

Yeah I think the key point here more generally (I might be getting this wrong) is that C represents some partial state of knowledge about X, i.e. macro rather than micro-state knowledge. In other words it's a (non-bijective) function of X. That's why (b) is true, and the equation holds.

What are your recommendations for technical AI alignment podcasts?

A few of Scott Alexander's blog posts (made into podcast episodes) are really good (he's got a sequence summarising the late 2021 MIRI conversations; I found the Bio Anchors and Takeoff Speeds ones especially informative & comprehensible). These don't make up the bulk of his content and aren't super technical, but I thought I'd mention them anyway.

Note that the full 2021 MIRI conversations are also available (in robot voice) in the Nonlinear Library archive.
Framings of Deceptive Alignment

Yeah I think this is Evan's view. This is from his research agenda (I'm guessing you might have already seen this given your comment but I'll add it here for reference anyway in case others are interested)

I suspect we can in fact design transparency metrics that are robust to Goodharting when the only optimization pressure being applied to them is coming from SGD, but cease to be robust if the model itself starts actively trying to trick them.

And I think his view on deception through inner optimisation pressure is that this is something we'll basically be...

Framings of Deceptive Alignment

Okay I see, yep that makes sense to me (-:

[$20K in Prizes] AI Safety Arguments Competition

Source: original, but motivated by trying to ground WFLL1-type scenarios in what we already experience in the modern world, so heavily based on this. Also the original idea came from reading Neel Nanda's "Bird's Eye View of AI Alignment - Threat Models".

Intended audience: mainly policymakers

A common problem in the modern world is when incentives don’t match up with value being produced for society. For instance, corporations have an incentive to profit-maximise, which can lead to producing value for consumers, but can also involve less ethical strategies su
...
Please make more submissions! If EA orgs are looking for good Metaculus prediction records, they'll probably look for evidence of explanatory writing on AI as well. You can put large numbers of contest entries on your resume, to prove that you're serious about explaining AI risk.
Framings of Deceptive Alignment

Thanks for the post! I just wanted to clarify what concept you're pointing to with use of the word "deception".

From Evan's definition in RFLO, deception needs to involve some internal modelling of the base objective & training process, and instrumentally optimising for the base objective. He's clarified in other comments that he sees "deception" as only referring to inner alignment failures, not outer (because deception is defined in terms of the interaction between the model and the training process, without introducing humans into the picture). This...

Yeah, I totally agree. My motivation for writing the first section was that people use the word 'deception' to refer to both things, and then make what seem like incorrect inferences. For example, current ML systems do the 'Goodhart deception' thing, but then I've heard people use this to imply that they might be doing 'consequentialist deception'. These two things seem close to unrelated, except for the fact that 'Goodhart deception' shows us that AI systems are capable of 'tricking' humans.
How I use Anki: expanding the scope of SRS

Oh wow, I wish I'd come across that plugin previously, that's awesome! Thanks a bunch (-:

How I use Anki: expanding the scope of SRS

Sorry for forgetting to reply to this at first!

There are two different ways I create code cards: one is in Jupyter notebooks, and the other is the "normal way", i.e. using the Anki editor. I've just created a GitHub describing the second one:

Please let me know if there's anything unclear here!

Edward Rees (4mo)
Thanks very much - those templates are working great for me! I wasn't able to use AutoHotkey as I'm on a Mac, but I was able to use this plugin to automate the code block/input field creation in a similar way.
How I use Anki: expanding the scope of SRS

Thanks! Yeah so there is one add-on I use for tag management. It's called Search and Replace Tags; basically, you can select a bunch of cards in the browser and press Ctrl+Alt+Shift+T to change them. When you press that, you get to choose any tag that's possessed by at least one of the cards you've selected, and replace it with any other tag.

There are also built-in Anki features to add, delete, and clear unused tags (to find those, right-click on selected cards in the browser, and hover over "Notes"). I didn't realise those existed for a long time, was pretty annoyed when I found them! XD

Hope this helps!

Great, that's exactly what I needed. I bookmarked your post and will use it for sure, thanks for all the effort
Project Intro: Selection Theorems for Modularity
It seems like an environment that changes might cause modularity. Though, aside from trying to make something modular, it seems like it could potentially fall out of stuff like 'we want something that's easier to train'.

This seems really interesting in the biological context, and not something we discussed much in the other post. For instance, if you had two organisms, one modular and one not modular, even if there's currently no selection advantage for the modular one, it might just be trained much faster and hence be more likely to hit on a good solution before the nonmodular network (i.e. just because it's searching over parameter space at a larger rate).

Or the less modular one can't train (evolve) as fast when the environment changes. (Or, it changes faster, enabling it to travel to different environments.) Biology kind of does both (modular and integrated), a lot. Like, I want to say part of why the brain is hard to understand is because of how integrated it is. What's going on in the brain? I saw one answer to this that says 'it is this complicated in order to obfuscate, to make it harder to hack; this mess has been shaped by parasites, which it is designed to shake off; that is why it is a mess, and it might just throw some neurotoxin in there. Why? To kill stuff that's trying to mess around in there.' (That is just from memory/reading a review on a blog, and you should read the paper/later work.)

I want to say integrated a) (often) isn't as good (separating concerns is better), but b) it's cheaper to re-use stuff and have it serve multiple purposes. Breathing through the same area you drink water/eat food through can cause issues. But integrating also allows improvements/increased efficiency (although I want to say, in systems people make, it can make it harder to refine or improve the design).
Project Intro: Selection Theorems for Modularity
Reasoning: Training independent parts to each perform some specific sub-calculation should be easier than training the whole system at once.

Since I've not been involved in this discussion for as long I'll probably miss some subtlety here, but my immediate reaction is that "easier" might depend on your perspective - if you're explicitly enforcing modularity in the architecture (e.g. see the "Direct selection for modularity" section of our other post) then I agree it would be a lot easier, but whether modular systems are selected for when they're being trai...

Project Intro: Selection Theorems for Modularity

Yep thanks! I would imagine that if progress goes well on describing modularity in an information-theoretic sense, this might help with (2), because information entanglement between a single module and the output would be a good measure of "relevance" in some sense.

The Natural Abstraction Hypothesis: Implications and Evidence

Thanks for the comment!

Subtle point: I believe the claim you're drawing from was that it's highly likely that the inputs to human values (i.e. the "things humans care about") are natural abstractions.

To check that I understand the distinction between those two: inputs to human values are features of the environment around which our values are based. For example, the concept of liberty might be an important input to human values because the freedom to exercise your own will is a natural thing we would expect humans to want, whereas humans can dif...