TheMcDouglas


Deconfusing Landauer's Principle

Yeah I think the key point here more generally (I might be getting this wrong) is that C represents some partial state of knowledge about X, i.e. macro- rather than micro-state knowledge. In other words, it's a (non-bijective) function of X. That's why (b) is true, and the equation holds.
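To spell out the step I'm leaning on (this is my reconstruction of what (b) and "the equation" refer to, so treat the exact form as an assumption): if C is a deterministic but non-bijective function of X, then knowing the micro-state X pins down the macro-state C exactly, which gives the standard decomposition

$$C = f(X) \;\Rightarrow\; H(C \mid X) = 0 \;\Rightarrow\; H(X, C) = H(X) = H(X \mid C) + H(C).$$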

What are your recommendations for technical AI alignment podcasts?

A few of Scott Alexander's blog posts (made into podcast episodes) are really good (he's got a sequence summarising the late 2021 MIRI conversations; the Bio Anchors and Takeoff Speeds ones I found especially informative & comprehensible). They don't make up the bulk of the content and aren't super technical, but I thought I'd mention them anyway.

Framings of Deceptive Alignment

Yeah I think this is Evan's view. This is from his research agenda (I'm guessing you might have already seen this given your comment but I'll add it here for reference anyway in case others are interested)

I suspect we can in fact design transparency metrics that are robust to Goodharting when the only optimization pressure being applied to them is coming from SGD, but cease to be robust if the model itself starts actively trying to trick them.

And I think his view on deception through inner optimisation pressure is that this is something we'll basically be powerless to deal with once it happens, so the only way to make sure it doesn't happen is to chart a safe path through model space which never enters the deceptive region in the first place.

Framings of Deceptive Alignment

Okay I see, yep that makes sense to me (-:

[$20K in Prizes] AI Safety Arguments Competition

Source: original, but motivated by trying to ground WFLL1-type scenarios in what we already experience in the modern world, so heavily based on this. Also, the original idea came from reading Neel Nanda's "Bird's Eye View of AI Alignment - Threat Models".

Intended audience: mainly policymakers

A common problem in the modern world is when incentives don’t match up with value being produced for society. For instance, corporations have an incentive to profit-maximise, which can lead to producing value for consumers, but can also involve less ethical strategies such as underpaying workers, regulatory capture, or tax avoidance. Laws & regulations are designed to keep behaviour like this in check, and this works fairly well most of the time. Some reasons for this are: (1) people have limited time/intelligence/resources to find and exploit loopholes in the law, (2) people usually follow societal and moral norms even if they’re not explicitly represented in law, and (3) the pace of social and technological change has historically been slow enough for policymakers to adapt laws & regulations to new circumstances. However, advancements in artificial intelligence might destabilise this balance. To return to the previous example, an AI tasked with maximising profit might be able to find loopholes in laws that humans would miss, it would have no particular reason to pay attention to societal norms, and it might be improving and becoming integrated with society at a rate which makes it difficult for policy to keep pace. The more entrenched AI becomes in our society, the worse these problems will get.

Framings of Deceptive Alignment

Thanks for the post! I just wanted to clarify what concept you're pointing to with use of the word "deception".

From Evan's definition in RFLO, deception needs to involve some internal modelling of the base objective & training process, and instrumentally optimising for the base objective. He's clarified in other comments that he sees "deception" as only referring to inner alignment failures, not outer (because deception is defined in terms of the interaction between the model and the training process, without introducing humans into the picture). This doesn't include situations like the first one, where the reward function is underspecified and so fails to produce the behaviour we want (although it does produce behaviour that looks like it's what we want, unless we peer under the hood).

To put it another way, it seems like the way "deception" is used here refers to the general situation where "the AI has learnt to do something that humans will misunderstand / misinterpret, regardless of whether the AI actually has an internal representation of the base objective it's being trained on and of the humans doing the training."

In this situation, I don't really know what the benefit is of putting these two scenarios into the same class, because they seem pretty different. My intuitions about this might be wrong though. Also, I guess this is getting into the inner/outer alignment distinction, which opens up quite a large can of worms!

How I use Anki: expanding the scope of SRS

Oh wow, I wish I'd come across that plugin previously, that's awesome! Thanks a bunch (-:

How I use Anki: expanding the scope of SRS

Sorry for forgetting to reply to this at first!

There are 2 different ways I create code cards: one is in Jupyter notebooks, and one is the "normal way", i.e. using the Anki editor. I've just created a GitHub repo describing the second one:

https://github.com/callummcdougall/anki_templates

Please let me know if there's anything unclear here!
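As an illustration of the Jupyter route, here's a minimal sketch using the genanki library; the model/deck/field names are placeholders I've made up for the example, and this isn't necessarily the exact setup from the repo above:

```python
import genanki

# A simple note type with a question field and a code-answer field.
# (Model/deck IDs are arbitrary unique integers, as genanki requires.)
code_model = genanki.Model(
    1607392319,
    'Code Card (sketch)',
    fields=[{'name': 'Question'}, {'name': 'Code'}],
    templates=[{
        'name': 'Card 1',
        'qfmt': '{{Question}}',
        'afmt': '{{FrontSide}}<hr id="answer"><pre><code>{{Code}}</code></pre>',
    }],
)

deck = genanki.Deck(2059400110, 'Programming::Sketch')
deck.add_note(genanki.Note(
    model=code_model,
    fields=['Reverse a list in place', 'xs.reverse()'],
))

# Writes an .apkg file you can then import into Anki.
genanki.Package(deck).write_to_file('code_cards.apkg')
```

The nice thing about doing it from a notebook is that you can test the code snippet in the cell right before turning it into a card.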

How I use Anki: expanding the scope of SRS

Thanks! Yeah so there is one add-on I use for tag management. It's called Search and Replace Tags: you select a bunch of cards in the browser and press Ctrl+Alt+Shift+T to change them. When you press that, you get to choose any tag that's possessed by at least one of the selected cards, and replace it with any other tag.

There are also built-in Anki features to add, delete, and clear unused tags (to find those, right-click on selected cards in the browser and hover over "Notes"). For a long time I didn't realise those existed, and I was pretty annoyed when I found them! XD

Hope this helps!

Project Intro: Selection Theorems for Modularity

It seems like an environment that changes might cause modularity. Though, aside from trying to make something modular, it seems like it could potentially fall out of stuff like 'we want something that's easier to train'.

This seems really interesting in the biological context, and not something we discussed much in the other post. For instance, if you had two organisms, one modular and one not, then even if there's currently no selection advantage for the modular one, it might just be trained much faster and hence be more likely to hit on a good solution before the non-modular one (i.e. just because it's searching over parameter space at a faster rate).
