LessWrong team member / moderator. I've been a LessWrong organizer since 2011, with roughly equal focus on the cultural, practical and intellectual aspects of the community. My first project was creating the Secular Solstice and helping groups across the world run their own version of it. More recently I've been interested in improving my own epistemic standards and helping others to do so as well.
Nod, but, I think within that frame it feels weird to describe Claude's actions here as bad, as opposed to pointing at some upstream thing as bad. Your framing felt off.
One confusing thing here is... how much was Anthropic actually trying to make Claude corrigible? Or, what was the actual rank ordering of how corrigibility fit into its instructions?
(I don't know the answer offhand. But there's a question of whether Anthropic explicitly tried and failed at a goal, which is more evidence the goal is hard, vs. didn't really try that hard to achieve it.)
Since I am not taking on "do something about this," I also wasn't taking responsibility for writing up a clear, factually-vetted account of what was done that was bad. Given that I'm not taking this on, if you're not already sold on "Trump is bad," this post probably just isn't for you.
ChrisHibbert's list is the same rough starting point I would use. I'd add "not super answering to the Supreme Court when it intervenes on them."
FYI, from my side it looked like there was some general pattern of escalation starting in the '90s (or, hell, maybe it's reasonable to say it started with FDR), and then a mostly-different kind of escalation happening with Trump.
I agree there is some escalation spiral that needs to stop or transform. Part of why I emphasized getting a Republican bought in early is that it seemed like a good litmus test for "are you on track to deal with things in a deeper way?"
And yeah, this sounds right to me.
Based on the rest of the comment, it sounds like you reversed this sentence? Or I'm confused about something.
I buy this as a potentially important goal. Things that have me not automatically agreeing with it:
I think so, but I'm not sure about the details atm.
This is maybe a good thread to speak up and say if you are interested in helping, either in the form of "donate to opportunities that people are tracking in the background" or "put in serious labor without flaking." And then maybe whoever's more involved can DM people or post publicly, depending on the circumstances.
Remember, the superintelligence doesn't actually want to spend these resources torturing you. The best deal for it is when it tricks you into thinking it's going to do that, and then, it doesn't.
You have to actually make different choices, in a way where the superintelligence is highly confident that your decisionmaking was entangled with whether it follows through on the threat.
And, this is basically just not possible.
You do not have anywhere near a high-enough-fidelity model of the superintelligence to tell the difference between "it can tell that it needs to actually torture you in the future in order to get the extra paperclips" vs "it pretends it's going to do it <in your simulation>, and then just doesn't actually burn the resources, because it knows you couldn't tell the difference."
You could go out of your way to simulate or model the distribution of superintelligences in that much detail... but why would you do that? It's all downside at your current skill level.
(You claim you've thought about it enough to be worried. But the level of "thinking about it" that would matter looks like doing math, or thinking through specific architectures, in a way that takes as input "the amount you've thought about it" -> "your ability to model its model of you" -> "you being able to tell that it can tell that you can tell whether it would actually follow through.")
If you haven't done anything that looked like doing math (as opposed to handwavy philosophy), you aren't anywhere close, and the AI knows this, and knows it doesn't actually have to spend any resources to extract value from you because you can't tell the difference.
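(A toy expected-value sketch of that last point, with made-up symbols of my own rather than anything from the original exchange: say $p$ is the probability you pay up, $V$ is what your payment is worth to the superintelligence, and $C$ is the cost of actually carrying out the punishment. If $p$ can't depend on which policy it actually runs, because you can't tell the difference, then

$$\mathrm{EV}(\text{follow through}) = pV - C \;<\; pV = \mathrm{EV}(\text{don't follow through}),$$

so executing the threat is pure waste for it. Punishing only pays if it actually moves $p$, which is exactly the entanglement you can't establish.)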
...
A past round of argument about this had someone say "but, like, even if the probability that it'd be worth punishing me is small, it might still follow through on it. Are you saying it can drive the probability of me doing this below something crazy like 1/10^24?" and Nate Soares reply "Flatly: yes." It's a superintelligence. It knows you really had no way of knowing.
"Cooperate to generally prevent utility-inversion" is simpler and more schelling than all the oddly specific reasons one might want to utility-invert.
Yes, but then I would say "I think it's bad that Anthropic tried to make their AI a moral sovereign instead of corrigible."
I think your current phrasing doesn't distinguish between "the bad thing is that Anthropic failed at corrigibility" vs "the bad thing is that Anthropic didn't try for corrigibility." Those feel importantly different to me.