Do you mean that the only way to meaningfully answer this is with access to non-public data?
That, unfortunately. Frontier labs rarely share research that helps improve the capabilities of frontier models (this will vary between labs, of course, and many are still good about publishing commercially useful safety work).
This analysis is confounded by the fact that GDM has a lot more non-Gemini stuff (e.g. the science work) than the other labs. None of the labs publish most of their LLM capabilities work, but publishing science stuff is fine, so DeepMind having more other stuff means comparatively more non-safety work gets published.
I generally think you can't really answer this question with the data sources you're using, because IMO the key question is what fraction of the frontier-LLM-oriented work is on safety, but little of that is published.
Out-of-context reasoning, the phenomenon where models can learn much more general, unifying structure when fine-tuned on something fairly specific, was a pretty important update to my mental model of how neural networks work.
This paper wasn't the first, but it was one of the more clean and compelling early examples (though emergent misalignment is now the most famous).
After staring at it for a while, I now feel less surprised by out-of-context reasoning. Mechanistically, there's no reason the model couldn't learn the generalizing solution. And on a task like this, the generalizing solution is just simpler and more efficient, and there's gradient signal for it (at least if there's a linear representation), so it's natural to learn it. But it's easy to say something's obvious in hindsight. I would not have predicted this, and it has improved my mental model of neural networks. I think this is important and valuable!
I like this post. It's a simple idea that was original to me, and seems to basically work.
In particular, it seems able to discover things about a model that we might not have expected. I generally think that each additional unsupervised technique, i.e. one that can discover unexpected insights, is valuable, because each additional technique is another shot on goal that might find what's really going on. So the more the better!
I have not, in practice, seen MELBO used that much, which is a shame. But I think the core idea seems sound.
I feel pretty great about this post. It likely took five to ten hours of my time, and I think it has been useful to a lot of people. I have pointed many people to this post since writing it, and I imagine many other newcomers to the field have read it.
I generally think there is a gap where experienced researchers can use their accumulated knowledge to create field-building materials fairly easily that are extremely useful to newcomers to the field, but don't (typically because they're busy - see how I haven't updated this yet). I'd love to see more people making stuff like this!
Stylistically, I like that this was opinionated. I wasn't just trying to survey all of the papers, which I think is often not that helpful because lots of papers are not worth reading. Instead, I was editorializing and trying to give summaries, context, and highlight key parts and how to think about the papers, all of which I think make this a more useful guide to newcomers.
One of the annoying things is that posts like this get out of date fairly quickly. This one direly needs an update, both because there has been a year and a half of progress and because my thoughts on interpretability have moved on a reasonable amount since I wrote it. But I think that even getting about a year of use out of this is a solid use of time.
This is probably one of the most influential papers that I've supervised, and my most cited MATS paper (400+ citations).
I find this all very interesting and surprising. This was originally a project on trying to understand the circuit behind refusal, and this was a fallback idea that Andy came up with, using some of his partial results to jailbreak a model. Even at the time, I basically considered it standard wisdom that "X is a single direction" was true for most concepts X. So I didn't think the paper was that big a deal. I was wrong!
So, what to make of all this? Part of the lesson is the importance of finding a compelling application. This paper is now one of my favourite short examples of how interpretability is real: it's an interesting, hard-to-fake result that is straightforward to achieve with interpretability techniques. My guess is that this played a large part in capturing people's imaginations. And many people had not come across the idea that concepts are often single directions; this was a very compelling demonstration, and we thus accidentally claimed some credit for that broad finding. I don't think this was a big influence on my shift to caring about pragmatic interpretability, but it definitely aligns with it.
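For readers who haven't seen it, the core intervention is strikingly simple: estimate a single "refusal direction" in activation space (roughly, a difference of mean residual-stream activations on harmful vs. harmless prompts) and then project that direction out of the model's activations. Below is a minimal sketch of the projection step, not the paper's actual code; the function name and tensor shapes are my own assumptions:

```python
import torch

def ablate_direction(resid: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Project a single direction out of residual-stream activations.

    resid:     [batch, seq, d_model] activations at some layer
    direction: [d_model] candidate concept direction, e.g. a "refusal direction"
               estimated from harmful vs. harmless prompt activations
    """
    d = direction / direction.norm()         # unit vector
    coeffs = resid @ d                       # [batch, seq] components along d
    return resid - coeffs.unsqueeze(-1) * d  # x <- x - (x . d) d
```

The point of the sketch is just that the whole jailbreak amounts to a rank-one edit to the activations (applied across layers via hooks), which is part of what makes "refusal is a single direction" such a crisp, hard-to-fake claim.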
Was this net good to publish? This was something we discussed somewhat at the time. I basically thought it was clearly fine, on the grounds that it was already well known that you could cheaply fine-tune a model to be jailbroken, so anyone who actually cared would just do that. And for open source models, you only need one person who actually cares to do this for everyone to be able to use the result.
I was wrong on that one - there was real demand! My guess is that the key thing is that this is easier and cheaper to do than fine-tuning, along with other people making good libraries and tutorials for it. Especially in low-resource open source circles, this is very important.
But I do think that jailbroken open models would have been available either way, and this hasn't really made a big difference on that front. I hoped it would make people more aware of the fragility of open source safeguards - it probably did help, but idk if that led to any change. My guess is that all of these impacts aren't that significant, and the impact on the research field dominates.
Sorry, I don't quite understand your arguments here. Five tokens back and ten tokens back are both within what I would consider short-range context, e.g. they are typically within the same sentence. I'm not saying there are layers that look at the most recent five tokens and then other layers that look at the most recent ten tokens, etc. It's more that early layers aren't looking 100 tokens back as much. The lines being parallel-ish is consistent with our hypothesis.
Ah, thanks, I missed that part.
Thanks for the addendum! I broadly agree with "I believe that you will most likely will be OK, and in any case should spend most of your time acting under this assumption.", maybe scoping the assumption to my personal life (I very much endorse working on reducing tail risks!)
I disagree with the "a prediction" argument though. Something being >50% likely to happen does not mean people shouldn't give significant mental space to the other, less likely outcomes. This is not how normal people live their lives, nor how I think they should. For example, people don't smoke because they want to avoid lung cancer, but their chances of dying of it are well under 50% (I think?). People don't do dangerous extreme sports, even though most people doing them don't die. People wear seatbelts even though they're pretty unlikely to die in a car accident. Parents make all kinds of decisions to protect their children from much smaller risks. The bar for "not worth thinking about" is well under 1% IMO. Of course, "can you reasonably affect it" is a big Q. I do think there are various bad outcomes short of human extinction, e.g. worlds of massive inequality, where actions taken now might matter a lot for your personal outcomes.
I continue to feel like we're talking past each other, so let me start again. We both agree that causing human extinction is extremely bad. If I understand you correctly, you are arguing that it makes sense to follow deontological rules, even when breaking them seems locally beneficial for what looks like a really good reason, because on average the decision theory that's willing to do harmful things for complex reasons performs badly.
The goal of my various analogies was to point out that this is not actually a fully correct statement about common sense morality. Common sense morality has several exceptions, for things like having someone's consent to take on a risk, someone doing bad things to you, and innocent people being forced to do terrible things.
Given that exceptions exist for times when we believe the general policy is bad, I am arguing that there should be an additional exception: if there is a realistic chance that the bad outcome happens anyway, and you believe you can reduce the probability of it happening (even after accounting for cognitive biases, sources of overconfidence, etc.), it can be ethically permissible to take actions whose side effects increase the probability of that bad outcome in other ways.
When analysing the reasons I broadly buy the deontological framework for "don't commit murder", I think there are some clear lines in the sand, such as maintaining a valuable social contract, and the fact that if you do nothing, the outcomes will be broadly good. Further, society has never really had to deal with something as extreme as doomsday machines, which makes me hesitant to appeal to common sense morality at all. To me, the point where standard deontological reasoning breaks down is that this is just very far outside the context where such priors were developed and have proven robust. I am not comfortable naively assuming they will generalize, and I think this is an incredibly high-stakes situation where far and away the only thing I care about is taking the actions that will actually, in practice, lead to a lower probability of extinction.
Regarding your examples, I'm completely ethically comfortable with someone making a third political party in a country where the population has two groups who both strongly want to commit genocide against the other. I think there are many ways that such a third party could reduce the probability of genocide, even if its political base ultimately wants negative outcomes.
Another example is nuclear weapons. From a certain perspective, holding nuclear weapons is highly unethical as it risks nuclear winter, whether from provoking someone else or from a false alarm on your side. While I'm strongly in favour of countries unilaterally switching to a no-first-use policy and pursuing mutual disarmament, I am not in favour of countries unilaterally disarming themselves. By my interpretation of your proposed ethical rules, this suggests countries should unilaterally disarm. Do you agree with that? If not, what's disanalogous?
COVID-19 would be another example. Biology is not my area of expertise, but as I understand it, governments took actions that were probably good but risked some negative effects that could have made things worse. For example, widespread use of vaccines or antivirals, especially via the first-doses-first approach, plausibly made it more likely that resistant strains would spread, potentially affecting everyone else. In my opinion, these were clearly net-positive actions because the good done far outweighed the potential harm.
You could raise the objection that governments are democratically elected while Anthropic is not, but there were many other actors in these scenarios, like uranium miners, vaccine manufacturers, etc., who were also complicit.
Again, I'm purely defending the abstract point that "plans that could increase the probability of human extinction, even by building the doomsday machine yourself, are not automatically ethically forbidden". You're welcome to critique Anthropic's actual actions as much as you like. But you seem to be making a much more general claim.