If there are reasons to refuse bets in general that apply to the LessWrong community in aggregate, something has gone horribly, horribly wrong.
No one is requiring you personally to participate, and I doubt anyone here is going to judge you for reluctance to engage in bets with people from the Internet who you don't know. Certainly I wouldn't. But if no one took up this bet, it would have a meaningful impact on my view of the community as a whole.
I don't know how it prevents us from dying either! I don't have a plan that accomplishes that; I don't think anyone else does either. If I did, I promise I'd be trying to explain it.
That said, I think there are pieces of plans that might help buy time, or might combine with other pieces to do something more useful. For example, we could implement regulations that take effect above a certain model size or training effort. Or that prevent putting too many flops worth of compute in one tightly-coupled cluster.
One problem with implementing those regulations is that there's disagreement about whether they would help. But that's not the only problem. Other problems include: how hard would they be to comply with, and to audit compliance with? Is compliance even possible in an open-source setting? Will those open questions get used as excuses to oppose the regulations by people who actually object for other reasons?
And then there's the policy question of how we move from the no-regulations world of today to a world with useful regulations, assuming that's a useful move. So the question I'm trying to attack is: what's the next step in that plan? Maybe we don't know because we don't know what the complete plan is or whether the later steps can work at all, but are there things that look likely to be useful next steps that we can implement today?
One set of answers to that starts with voluntary compliance. Signing an open letter creates common knowledge that people think there's a problem. Widespread voluntary compliance provides common knowledge that people agree on a next step. But before the former can happen, someone has to write the letter and circulate it and coordinate getting signatures. And before the latter can happen, someone has to write the tools.
So a solutionism-focused approach, as called for by the post I'm replying to, is to ask what the next step is. And when the answer isn't yet actionable, break that down further until it is. My suggestion was intended to be one small step of many, one that I haven't seen discussed much as a useful next step.
I think neither. Or rather, I support it, but that's not quite what I had in mind with the above comment, unless there's specific stuff they're doing that I'm not aware of. (Which is entirely possible; I'm following this work only loosely, and not in detail. If I'm missing something, I would be very grateful for more specific links to stuff I should be reading. Git links to usable software packages would be great.)
What I'm looking for mostly, at the moment, is software tools that could be put to use. A library, a tutorial, a guide for how to incorporate that library into your training run, with the net result being better compliance with voluntary reporting. What I've seen so far is mostly high-effort investigative reports and red-teaming efforts.
Best practices around how to evaluate models, and high-effort things you can do while making them, are also great. But I'm specifically looking for tools that enable low-effort compliance and reporting options while people are doing the same stuff they otherwise would be. I think that would complement the suggestions for high-effort best practices.
The output I'd like to see is things like machine-parseable quantification of flops used to generate a model, such that a derivative model would specify both total and marginal flops used to create it.
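To make that concrete, here's a sketch of what such a machine-parseable report might look like. The schema and field names here are entirely hypothetical, invented for illustration; no such standard exists that I know of. The key property is that a derivative model carries both the marginal compute spent on it and the total including its base model:

```python
import json

# Hypothetical reporting schema -- field names are illustrative only,
# not any existing standard.
marginal = 3.1e21   # compute spent on this fine-tune alone (FLOPs)
base_total = 2.5e24 # total compute behind the base model (FLOPs)

report = {
    "model": "example-finetune-v1",    # the derivative model
    "base_model": "example-base-v0",   # what it was derived from
    "marginal_flops": marginal,
    "total_flops": base_total + marginal,
}

# Machine-parseable output: trivially consumed by auditors or registries.
print(json.dumps(report, indent=2))
```

The point isn't this particular schema; it's that emitting something like this should be a one-liner at the end of any training script.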
One thing I'd like to see more of: attempts at voluntary compliance with proposed plans, and libraries and tools to support that.
I've seen suggestions to limit the compute power used on large training runs. Sounds great; might or might not be the answer, but if folks want to give it a try, let's help them. Where are the libraries that make it super easy to report the compute power used on a training run? To publish a Merkle tree of the other models and input data that training run depends on? (Or, if extinction risk isn't your highest priority, to report which media by which people got incorporated, and what licenses they were used under?) How do those libraries support reporting by open-source efforts, and incremental reporting?
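The Merkle-tree piece, at least, is genuinely easy; here's a minimal sketch, assuming the dependencies of a training run (base-model weights, dataset shards) are identified by their hashes. The dependency names are made up for illustration:

```python
import hashlib

def h(data: bytes) -> str:
    """SHA-256 hex digest of raw bytes."""
    return hashlib.sha256(data).hexdigest()

def merkle_root(leaves: list[str]) -> str:
    """Fold a list of leaf hashes into a single root hash."""
    if not leaves:
        return h(b"")
    level = leaves
    while len(level) > 1:
        if len(level) % 2:  # duplicate the last node on odd-sized levels
            level = level + [level[-1]]
        level = [h((a + b).encode()) for a, b in zip(level[::2], level[1::2])]
    return level[0]

# Hypothetical dependencies of a training run: in practice these would be
# hashes of the actual weight files and dataset shards.
deps = [h(b"base-model-weights"), h(b"dataset-shard-0"), h(b"dataset-shard-1")]
root = merkle_root(deps)
print(root)  # publish this; anyone holding the same inputs can recompute it
```

Publishing the root commits you to the full dependency set without revealing it, and lets anyone with the same inputs verify the claim independently. The hard part isn't the tree; it's the conventions for what counts as a leaf.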
What if the plan is alarm bells and shutdowns of concerning training runs? Or you're worried about model exfiltration by spies or rogue employees? Are there tools that make it easy to report what steps you're taking to prevent that? That make it easy to provide good security against those threat models? Where's the best practices guide?
We don't have a complete answer. But we have some partial answers, or steps that might move in the right direction. And right now actually taking those next steps, for marginal people kinda on the fence about how to trade capabilities progress against security and alignment work, looks like it's hard. Or at least harder than I can imagine it being.
(On a related note, I think the intersection of security and alignment is a fruitful area to apply more effort.)
Aren't the other used cars available nearby, and the potential other buyers should you walk away, relevant to that negotiation?
This was fantastic; thank you! I still haven't quite figured it out; I'll definitely have to watch it a second time (or at least some parts of it).
I think some sort of improved interface for your math annotations and diagrams would be a big benefit, whether that's a drawing tablet or typing out some LaTeX or something else.
I think the section on induction heads and how they work could have used a bit more depth. Maybe a couple more examples, maybe some additional demos of how to play around with PySvelte, maybe something else. That's the section I had the most trouble following.
You mentioned a couple additional papers in the video; having links in the description would be handy. I suspect I can find them easily enough as it is, though.
Yes, if Omega accurately simulates me and wants me to be wrong, Omega wins. But why do I need to get the answer exactly "right"? What does it matter if I'm slightly off?
This would be a (very slightly) more interesting problem if Omega was offering a bet or a reward and my goal was to maximize reward or utility or whatever. It sure looks like for this setup, combined with a non-adversarial reward schedule, I can get arbitrarily close to maximizing the reward.
This feels reminiscent of:
If the human brain were so simple that we could understand it, we would be so simple that we couldn’t.
And while it's a well-constructed pithy quote, I don't think it's true. Can a system understand itself? Can a quining computer program exist? Where is the line between being able to recite itself and understand itself?
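On the quine question, at least, the answer is a clear yes: quines exist in essentially every language. A minimal Python example, whose output is exactly its own source code:

```python
# A quine: a program whose output is its own source.
# The %r substitution reproduces the string literal, quotes and escapes included.
src = 'src = %r\nprint(src %% src)'
print(src % src)
```

So "can recite itself" is cheap. Whether recitation bears any relation to understanding is exactly the open question.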
You need a model above some threshold of capability at which it can provide useful interpretations, yes, but I don't see any obvious reason why that threshold would move up with the size of the model under interpretation.
Agreed. A quine needs some minimum complexity and/or language / environment support, but once you have one it's usually easy to expand it. Things could go either way, and the question is an interesting one needing investigation, not bare assertion.
And the answer might depend fairly strongly on whether you take steps to make the model interpretable, or let it become a spaghetti-code Turing-tar-pit mess.
I think that sounds about right. Collecting the arguments in one place is definitely helpful, and I think they carry some weight as initial heuristics, which this post helps clarify.
But I also think the technical arguments should (mostly) screen off the heuristics; the heuristics are better for evaluating whether it's worth paying attention to the details. By the time you're having a long debate, it's better to spend (at least some) time looking instead of continuing to rely on the heuristics. Rhymes with Argument Screens Off Authority. (And in both cases, only mostly screens off.)
My concern with conflating those two definitions of alignment is largely with the degree of reliability that's relevant.
The definition "does what the developer wanted" seems like it could cash out as something like "x% of the responses are good". So, if 99.7% of responses are "good", it's "99.7% aligned". You could even strengthen that to something like "99.7% aligned against adversarial prompting".
On the other hand, from a safety perspective, the relevant metric is something more like "probabilistic confidence that it's aligned against any input". So "99.7% aligned" means something more like "99.7% chance that it will always be safe, regardless of who provides the inputs, how many inputs they provide, and how adversarial they are".
In the former case, that sounds like a horrifyingly low number. What do you mean we only get to ask the AI 300 things in total before everyone dies? How is that possibly a good situation to be in? But in the latter case, I would roll those dice in a heartbeat if I could be convinced the odds were justified.
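As a sanity check on where "300 things" comes from, under the first reading of "99.7% aligned", here's the quick arithmetic (treating each response as an independent draw, which is of course a simplification):

```python
# Per-response "good" probability under the first reading of "99.7% aligned".
p_good = 0.997
p_fail = 1 - p_good

# Expected queries until the first bad response (geometric distribution).
expected = 1 / p_fail
print(round(expected))  # -> 333

# Number of queries after which at least one failure is more likely than not.
n = 0
p_all_good = 1.0
while p_all_good > 0.5:
    p_all_good *= p_good
    n += 1
print(n)  # -> 231
```

So you expect your first catastrophic response somewhere around query 333, and you're more likely than not to have hit one by query 231. Under the second reading, by contrast, 99.7% is a one-time gamble, not a per-query tax.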
So anyway, I still object to using the "alignment" term to cover both situations.