AI alignment certification and peacebuilding seem like two very different and distinct projects. I'd strongly suggest picking one.
I'm confused by your Fast Deceptive Mesaoptimiser example. How does the speed prior come in here? It isn't mentioned in the description.
I also notice that I am just afraid of what would happen if I were to e.g. write a post that's just like "an overview of the EA-ish/X-risk-ish policy landscape" that names specific people and explains various historical plans. Like I expect it would make me a lot of enemies.
This seems like a bad idea.
Transparency is important, but ideally we would find ways to increase it without blowing up a bunch of trust within the community. I guess I'd also question whether this is really the bottleneck for transparency/public trust.
I'm worried that as a response to FTX we might end up turning this into a much more adversarial space.
I think a better plan looks something like "You can't open source a system until you've determined and disclosed the sorts of threat models your system will enable, and society has implemented measures to become robust to these threat models. Once any necessary measures have been implemented, you are free to open-source."
The problem with this plan is that it assumes there are easy ways to robustify the world. What if the only proper defense against bioweapons is complete monitoring of the entire internet? That might be something we'd like to avoid. In that scenario, your plan would likely lead to someone coming up with a fake plan to robustify the world and then claiming it'd be fine to release their model as open source, because people really want to do open source.
For example, in your plan you write:
Then you set a reasonable time-frame for the vulnerability to be patched: In the case of SHA-1, the patch was "stop using SHA-1" and the time-frame for implementing this was 90 days.
This is exactly the kind of plan that I'm worried about. People will be tempted to argue that surely four years is enough time to implement the biodefense plan; then four years roll around, the plan clearly isn't in place, but they push for release anyway.
I'll go into more detail later, but as an intuition pump imagine that: the best open source model is always 2 years behind the best proprietary model
You seem to have hypothesised what is, to me, an obviously unsafe scenario. Suppose our best proprietary models hit upon a dangerous bioweapon capability. Now we have only two years to prepare for it, regardless of whether that timeline is remotely realistic. Worse, this happens for each and every dangerous capability.
Will evaluators be able to anticipate and measure all of the novel harms from open source AI systems? Sadly, I’m not confident the answer is “yes,” and this is the main reason I only ~50% endorse this post.
When we're talking about risk management, a 50% chance that a key assumption will work out, with no good way to significantly reduce that uncertainty, often doesn't translate into a 50% chance of the plan being good, but rather a near-0% chance.
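As a toy back-of-the-envelope sketch (the numbers below are purely illustrative, not from the post), the point is that when the downside of the assumption failing is catastrophic, the failure branch dominates the expected value, so a coin-flip assumption leaves almost no chance of the plan being good:

```python
# Toy expected-value sketch with made-up illustrative numbers.
# Assumption being modelled: "evaluators anticipate all novel harms" holds with p = 0.5.
p_assumption_holds = 0.5

benefit_if_holds = 1.0    # normalised benefit of open-sourcing if the assumption holds
harm_if_fails = -1000.0   # catastrophic downside if it fails (e.g. bioweapon uplift)

expected_value = (p_assumption_holds * benefit_if_holds
                  + (1 - p_assumption_holds) * harm_if_fails)

print(expected_value)  # -499.5: the failure branch dominates, so a 50% assumption
                       # behaves more like a near-0% chance of a good plan
```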
For the record, I updated on ChatGPT. I think the classic example of telling an AI to get a coffee and having it push a kid out of the way isn't so much of a concern any more. So the remaining concerns seem to be inner alignment + outer alignment far outside normal human experience + value lock-in.
Thanks for highlighting the issue with the discourse here. People use the word "evidence" in two different ways, which often results in people talking past one another.
I'm using your broader definition, whereas I imagine that Stella is excluding things that don't meet a more stringent standard.
And my claim is that reasoning under uncertainty sometimes means making decisions based on evidence[1] that is weaker than we'd like.
[1] Broad definition
Oh, I don't think it actually would end up being temporary, because I expect with high probability that the empirical results of more robust evaluations would confirm that open-source AI is indeed dangerous. I meant temporary in the sense that the initial restrictions might either a) have a time limit or b) be subject to re-evaluation at a specified point.
Define evidence.
I'm not asking this just to be pedantic, but because I think it'll make the answer to your objection clearer.
E/acc seems to be really fired up about this:
https://twitter.com/ctjlewis/status/1725745699046948996