Hello! I work at Lightcone and like LessWrong :-). I have made some confidentiality agreements I can't leak much metadata about (like who they are with). I have made no non-disparagement agreements.
Mod note: this post is personal rather than frontpage because event/course/workshop/org... announcements are generally personal, even if the content of the course, say, is pretty clearly relevant to the frontpage (as in this case).
I believe it includes some older donations:
Mod note: I've put this on Personal rather than Frontpage. I imagine the content of these talks will be frontpage content, but event announcements in general are not.
> neural networks routinely generalize to goals that are totally different from what the trainers wanted
I think this is a slight non sequitur. I take Tom to be saying "AIs will care about stuff that is natural to express in human concept-language" and your evidence to be primarily about "AIs will care about what we tell them to", though I could imagine some of that evidence spilling over onto Tom's proposition.
I do think the limited success of interpretability is evidence against Tom's proposition. For example, I think there's lots of work where you try to replace an SAE feature or a neuron (R) with some other module that tries to do our natural-language explanation of what R was doing, and that doesn't work.
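To make the kind of experiment I mean a bit more concrete, here's a minimal PyTorch sketch. It's purely illustrative and not any particular paper's setup: `ToyModel` and `explanation_based_activation` are hypothetical stand-ins. The idea is to overwrite one hidden unit with whatever your natural-language explanation of it predicts, and check how much the model's behaviour degrades.

```python
import torch
import torch.nn as nn

class ToyModel(nn.Module):
    """Stand-in for the real network; we only need a hidden layer to patch."""
    def __init__(self, d_in=16, d_hidden=32, d_out=16):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU())
        self.out = nn.Linear(d_hidden, d_out)

    def forward(self, x):
        return self.out(self.hidden(x))

def explanation_based_activation(x: torch.Tensor) -> torch.Tensor:
    # Hypothetical stand-in for "what our explanation says unit R does",
    # e.g. "fires when input feature 3 is large". A real version would be an
    # LLM simulator or a hand-written rule derived from the explanation.
    return torch.relu(x[:, 3])

def run_with_replacement(model: ToyModel, x: torch.Tensor, unit: int) -> torch.Tensor:
    """Forward pass with hidden unit `unit` overwritten by the explanation module."""
    def hook(module, inputs, output):
        patched = output.clone()
        patched[:, unit] = explanation_based_activation(x)
        return patched  # returning a value from a forward hook replaces the output

    handle = model.hidden.register_forward_hook(hook)
    try:
        return model(x)
    finally:
        handle.remove()

torch.manual_seed(0)
model = ToyModel()
x = torch.randn(8, 16)
clean, patched = model(x), run_with_replacement(model, x, unit=5)
# If the explanation really captured what unit 5 does, this gap should be small.
print("mean output change:", (clean - patched).abs().mean().item())
```

The claim I'm gesturing at is that, in practice, the gap in this kind of test tends not to be small.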
I dug up my old notes on this book review. Here they are:
So, I've just spent some time going through the World Bank documents on its interventions in Lesotho. The Anti-Politics Machine is not doing great under epistemic checking:
- There is no recorded Thaba-Tseka Development Project, despite the documents covering the period in which it should have taken place.
- There is a Thaba-Bosiu development project (parts 1 and 2) taking place at the correct time.
- Thaba-Bosiu and Thaba-Tseka are both regions of Lesotho.
- The spec doc for Thaba-Bosiu Part 2 references the alleged problems the economists were faced with (remittances from South African miners, poor crop yield ... no complaint about cows).
- It has a negative assessment doc at the end. It was an unsuccessful project. This would match the book's account.
- The funding doesn't quite match up. The UK is mentioned as funding the "Thaba-Tseka" project, and is indeed funding Thaba-Bosiu. But Canada is, I believe, funding a road project instead.
- Something like 2/3 of the country is involved in Thaba-Bosiu Development II (it was later renamed the "Basic Agricultural Services Program").
- There is no mention of ponies or wood being involved in interventions anywhere. In fact, the Part II retrospective lists the lack of focus on livestock as a problem (suggesting they didn't do much of it).
- They were focused on five major crops (maize, sorghum, beans, peas and wheat).
- Also, the quote in the book review of the quote in The Anti-Politics Machine of the quote in the report doesn't show up in any of the documents I looked at (which basically covered every World Bank project in Lesotho in that time period). The writing style of the quote is also moderately distinct from that of the reports.
- AFAICT, the main intervention was fertiliser. The retrospective claims this failed because (a) the climate in Lesotho is uniquely bad and screened off the effects of fertilisation, and (b) the Lesotho government fucked up messaging and also every other part of everything all the time, and ultimately all the donors backed out.
- The government really wanted to be self-sufficient in food production. None of the donors, the farmers, or the World Bank cared about this, but the government focused its messaging heavily around it. The government ended up directing a lot of its effort towards a new Food Self-Sufficiency Program, which was seen as incompatible with the goals of the Basic Agricultural Services Program.
- The fact that the crop situation wasn't working was recognised fairly early on. They started an adaptive trial of crop research to figure out what would work better. This was hampered by donor coordination issues, so it only happened in a small area, but it apparently worked quite well.
All in all, this sounds less bad than The Anti-Politics Machine makes it out to be, and also just generally very different? I'm not 100% certain I've managed to locate all the relevant programs, though, so it's possible something closer to the book's description did happen.
I think 2023 was perhaps the peak for discussing the idea that neural networks have surprisingly simple representations of human concepts. This was the year of Steering GPT-2-XL by adding an activation vector, cheese vectors, and the slightly weird lie detection paper, and it came just after Contrast-consistent search (CCS).
This is a pretty exciting idea, because if it's easy to find the human concepts we want (or don't want) networks to possess, then we can maybe use that to increase the chance that systems are honest, kind, and loving (and we can ask them questions like "are you deceiving me?" and get useful answers).
I don’t think the idea is now definitively refuted or anything, but I do think a particular kind of lazy version of the idea, more popular in the Zeitgeist, perhaps, than amongst actual proponents, has fallen out of favour.
CCS seemed to imply an additional proposition, which is that you can get even more precise identification of human concepts by encoding some properties of the concept you’re looking for into the loss function. I was kind of excited about this, because things in this realm are pretty powerful tools for specifying what you care about (like, it rhymes with axiom-based definition or property-based testing).
But actually, if you look at the numbers they report, that’s not really true! As this post points out, basically all their performance is recoverable by doing PCA on contrast pairs.[1]
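For concreteness, here's roughly what I mean by "PCA on contrast pairs". This is a simplified sketch in the spirit of the paper's PCA-style baselines, not their exact code; `acts_pos`/`acts_neg` are assumed to be hidden-state arrays for the two halves of each contrast pair.

```python
import numpy as np

def normalise(acts: np.ndarray) -> np.ndarray:
    # Normalise each half separately so the direction isn't dominated by the
    # superficial difference between the "...True" and "...False" prompts.
    return (acts - acts.mean(axis=0)) / (acts.std(axis=0) + 1e-6)

def pca_on_contrast_pairs(acts_pos: np.ndarray, acts_neg: np.ndarray) -> np.ndarray:
    """Score each contrast pair; the sign of the score is the predicted label
    (only up to a global flip, just as with CCS)."""
    diffs = normalise(acts_pos) - normalise(acts_neg)  # one vector per pair, shape (n, d_model)
    centred = diffs - diffs.mean(axis=0)
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    direction = vt[0]          # top principal component of the differences
    return diffs @ direction   # project each pair onto that direction
```

The point being: there's no consistency or confidence term anywhere in there, and you apparently give up very little accuracy.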
I like how focused and concise this post is, while still being reasonably complete.
There’s another important line of criticism of CCS, which is about whether its “truth-like vector” is at all likely to track truth, rather than just something like “what a human would believe”. I think posts like What Discovering Latent Knowledge Did and Did Not Find address this somewhat more directly than this one.
But I think, for me, the loss function had some mystique. Most of my hope was that encoding properties of truth into the loss function would help us find robust measures of what a model thought was true. So I think this post was the main one that made me less excited about CCS and pushed me towards a more nuanced view of the linearity of human concept representations.
Though I admit I'm a little confused about how to think about the fact that PCA happens to have a pretty similar structure to the CCS loss. Maybe for features with less confidence/consistency-shaped properties, shaping the loss function would be more important.
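(For reference, and as a check on my own memory: the CCS objective, as I recall it from the paper, is

$$
L_{\text{CCS}} \;=\; \frac{1}{n}\sum_i \Big[\big(p(x_i^+) - (1 - p(x_i^-))\big)^2 \;+\; \min\big(p(x_i^+),\, p(x_i^-)\big)^2\Big],
$$

where the first term asks the probabilities of a statement and its negation to sum to one, and the second pushes them away from the degenerate answer of 0.5 everywhere.)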
Yup
I'm not sure I understand what you're driving at, but as far as I do, here's a response: I have lots of concepts and abstractions over the physical world (like "chair"). I don't have many concepts or abstractions over strings of language, except as factored through the physical world. (I have some, like "register" or "language", but they don't actually feel that "final" as concepts.)
As for factoring my predictions of language through the physical world: a lot of the simplest and most robust concepts I have are just nouns, so they're already represented by the tokenisation machinery, and I can't do interesting interp to pick them out.
The general rule is roughly "if you write a frontpage post which has an announcement at the end, that can be frontpaged". So, for example, if you wrote a post about the vision for Online Learning that included the course announcement as a relatively small part, that would probably work.
By the way, posts are all personal until mods process them, which usually happens around twice a day. So that's another reason you might sometimes see posts landing on personal for a while.