Chris_Leong

Sequences

Linguistic Freedom: Map and Territory Revisted
INVESTIGATIONS INTO INFINITY

Comments

Gabriel makes a very good point: there is something of a tension between allowing reign of terror moderation and considering it a norm violation to request the deletion of comments for low quality.

(TBH, I was convinced that reign of terror would be a disaster, but it seems to be working out okay so far).

What do you think the 101 space should look like for Less Wrong?

E/acc seems to be really fired up about this:

https://twitter.com/ctjlewis/status/1725745699046948996

I read their other comments and I'm skeptical. The tone is wrong.

AI alignment certification and peacebuilding seem like two very different and distinct projects. I'd strongly suggest picking one.

I'm confused by your Fast Deceptive Mesaoptimiser example. How does the speed prior come in here? It isn't mentioned in the description.

I also notice that I am just afraid of what would happen if I were to e.g. write a post that's just like "an overview over the EA-ish/X-risk-ish policy landscape" that names specific people and explains various historical plans. Like I expect it would make me a lot of enemies.

This seems like a bad idea.

Transparency is important, but ideally, we would find ways to increase this without blowing up a bunch of trust within the community. I guess I'd question whether this is really the bottleneck in terms of transparency/public trust.

I'm worried that as a response to FTX we might end up turning this into a much more adversarial space.

I think a better plan looks something like "You can't open source a system until you've determined and disclosed the sorts of threat models your system will enable, and society has implemented measures to become robust to these threat models. Once any necessary measures have been implemented, you are free to open-source." 


The problem with this plan is that it assumes that there are easy ways to robustify the world. What if the only proper defense against bioweapons is a complete monitoring of the entire internet? Perhaps this is something that we'd like to avoid. In this scenario, your plan would likely lead to someone coming up with a fake plan to robustify the world and then claim that it'd be fine for them to release their model as open-source, because people really want to do open-source.

For example, in your plan you write:

Then you set a reasonable time-frame for the vulnerability to be patched: In the case of SHA-1, the patch was "stop using SHA-1" and the time-frame for implementing this was 90 days.

This is exactly the kind of plan that I'm worried about. People will be tempted to argue that surely 4 years is enough time for the biodefense plan to be implemented, four years rolls around and it's clearly not in place, but then they push for release anyway.

I'll go into more detail later, but as an intuition pump imagine that: the best open source model is always 2 years behind the best proprietary model

You seem to have hypothesised what is to me an obviously unsafe scenario. Let's suppose our best proprietary models hit upon a dangerous bioweapon capability. Well, now we only have two years to prepare for it, regardless of whether this is completely wildly unrealistic. Worse, this occurs for each and every dangerous capability.

Will evaluators be able to anticipate and measure all of the novel harms from open source AI systems? Sadly, I’m not confident the answer is “yes,” and this is the main reason I only ~50% endorse this post.

When we're talking about risk management, a 50% chance that a key assumption will work out, when there isn't a good way to significantly reduce this uncertainty often doesn't translate into a 50% chance of it being a good plan, but rather a near 0% chance.

For the record, I updated on ChatGPT. I think that the classic example of imagining telling an AI to get a coffee and it pushes a kid out of the way isn't so much of a concern any more. So the remaining concerns seem to be inner alignment + outer alignment far outside normal human experience + value lock-in.

Thanks for highlighting the issue with the discourse here. People use the word evidence in two different ways which often results in people talking past one another.

I'm using your broader definition, where I imagine that Stella is excluding things that don't meet a more stringent standard.

And my claim is that reasoning under uncertainty sometimes means making decisions based on evidence[1] that is weaker than we'd like.

  1. ^

    Broad definition

Load More