Zach Stein-Perlman — LessWrong

Researching donation opportunities. Previously: ailabwatch.org.

Relatedly, I fear that outside of Redwood and Forethought, the "AI Consciousness and Welfare" field is focused on the stuff in this post plus advocacy rather than stuff I like: (1) making deals with early schemers to reduce P(AI takeover) and (2) prioritizing by backchaining from considerations about computronium and von Neumann probes.

Edit: here's a central example of a proposition in an important class of propositions/considerations that I expect the field basically just isn't thinking about and lacks generators to notice:

In the long run, when we're colonizing the galaxies, the crucial thing is that we fill the universe with axiologically-good minds. In the short run, what matters is more about being cooperative with the AIs (and maybe the small scale means deontology is more relevant); the AIs' preferences, not scope-sensitive direct axiological considerations, are what matters.

I think making deals with AIs is applicable outside of just trading with autonomous AIs! In particular making deals with AIs to reveal their misalignment or do useful safety work despite being misaligned. It's not about comparative advantage.

Concept: inconvenience and flinching away.

I've been working for 3.5 years. Until two months ago, I did independent-ish research where I thought about stuff and tried to publicly write true things. For the last two months, I've been researching donation opportunities. This is different in several ways. Relevant here: I'm working with a team, and there's a circle of people around me with some beliefs and preferences related to my work.

I have some new concepts related to these changes. (Not claiming novelty.)

First is "flinching away": when I don't think about something because it's awkward/scary/stressful. I haven't noticed my object-level beliefs skewed by flinching away, but I've noticed that certain actions should be priorities but come to mind less than they should. In particular: doing something about a disagreement with the consensus or status quo or something an ally is doing. I maybe fixed this by writing the things I'm flinching away from in a google doc when I notice them so I don't forget (and now having the muscle of noticing similar things). It would still be easier if it wasn't the case that (it's salient to me that) certain conclusions and actions are more popular than others.

Second is convenience. Convenience is orthogonal to truth. Examples of convenient things are: considerations in favor of conclusions my circle agrees with; considerations that make the answer look more clear/overdetermined. I might flinch away from inconvenient upshots for my actions. (I think I'm not biased by convenience—i.e., not doing motivated reasoning—when forming object-level beliefs within the scope of my day-to-day work; possibly I'm unusually good at this.) I've noticed myself saying stuff like "conveniently, X" and "Y, that's inconvenient" a lot recently. Noticing feelings about considerations has felt helpful but I don't know/recall why. Possibly this is mostly when thinking about stuff with others: saying "that's convenient" is a flag to check whether it's suspiciously convenient; saying "that's inconvenient" is a flag to make sure to take it seriously.

You suspect someone in your community is a bad actor. Kinds of reasons not to move against them:

You're uncertain
1. Especially if your uncertainty will likely be largely resolved soon
You lack legible evidence (or other ways of convincing others), and they're not already seen as sketchy
1. Especially if you'll likely get better legible evidence soon
They're doing some good stuff; you need them for some good stuff
They're popular, politically powerful, or have power to hurt you
1. And so you'd fail
  1. And maybe you'd get kicked out or lose power (especially if your move is unpopular or considered inappropriate, or "community" is more like "team" or "circle")
2. And so making them an enemy of [you or your community] is costly
Personal psychological costs
It's just time-consuming to start a fight, especially because they'll be invested in discrediting you and your claims

This is all from-first-principles. I'm interested in takes and reading recommendations. Also creative affordances / social technology.

The convenient case would be that you can privately get the powerful stakeholders on board and then they all oust the bad actor and it's a fait accompli, so there's no protracted conflict and minimal cost to you and your community. If you can't get the powerful stakeholders on board, I guess you give up and just privately share information with people you trust — unless you're sufficiently powerful/independent/ostracized that it's cheap for you to make enemies, or it's sufficiently important, or you can share information anonymously. If you're scared about telling the powerful stakeholders, I guess it's the same situation.

This is assuming that the main audience for your beliefs is some powerful stakeholders. Sometimes it's more like many community members.

I agree but don't feel very strongly. On Anthropic security, I feel even more sad about this.

If [the insider exception] does actually apply to a large fraction of technical employees, then I'm also somewhat skeptical that Anthropic can actually be "highly protected" from (e.g.) organized cybercrime groups without meeting the original bar: hacking an insider and using their access is typical!

You may be interested in ailabwatch.org/resources/corporate-documents, which links to a folder where I have uploaded ~all past versions of the CoI. (I don't recommend reading it, although afaik the only lawyers who've read the Anthropic CoI are Anthropic lawyers and advisors, so it might be cool if one independent lawyer read it from a skeptical/robustness perspective. And I haven’t even done a good job diffing the current version from a past version; I wasn’t aware of the thing Drake highlighted.)

I guess so! Is there reason to favor logit?

Yep, e.g. donations sooner are better for getting endorsements. Especially for Bores and somewhat for Wiener, I think.

Maybe the logistic success curve should actually be the cumulative normal success curve.

There's often a logistic curve for success probabilities, you know? The distances are measured in multiplicative odds, not additive percentage points. You can't take a project like this and assume that by putting in some more hard work, you can increase the absolute chance of success by 10%. More like, the odds of this project's failure versus success start out as 1,000,000:1, and if we're very polite and navigate around Mr. Topaz's sense that he is higher-status than us and manage to explain a few tips to him without ever sounding like we think we know something he doesn't, we can quintuple his chances of success and send the odds to 200,000:1. Which is to say that in the world of percentage points, the odds go from 0.0% to 0.0%. That's one way to look at the “law of continued failure”.
If you had the kind of project where the fundamentals implied, say, a 15% chance of success, you’d then be on the right part of the logistic curve, and in that case it could make a lot of sense to hunt for ways to bump that up to a 30% or 80% chance.

Takeaways from the Eleos Conference on AI Consciousness and Welfare

Zach Stein-Perlman8d*20

Zach Stein-Perlman8d50

Zach Stein-Perlman's Shortform

Zach Stein-Perlman11d471

Zach Stein-Perlman11d333

Anthropic is (probably) not meeting its RSP security commitments

Zach Stein-Perlman15d70

Drake Thomas's Shortform

Zach Stein-Perlman22d*242

The Tale of the Top-Tier Intellect

Zach Stein-Perlman1mo20

LESSWRONG
LW

LESSWRONG
LW

Posts

Wikitag Contributions

Comments