MalcolmMcLeod

Comments (sorted by newest)

Considerations around career costs of political donations
MalcolmMcLeod · 13d · 10

If one doesn't plan to go into politics, is there any value in being a bipartisan single-issue donor? How much must one donate for it to be accompanied by a message of "I will vote for whoever is better on AI x-risk"?

A non-review of "If Anyone Builds It, Everyone Dies"
MalcolmMcLeod · 1mo · 10

I like your made-up notation. I'll try to answer, but I'm an amateur in both reasoning-about-this-stuff and representing-others'-reasoning-about-this-stuff.

I think (1) is both inner and outer misalignment. (2) is fragility of value, yes. 

I think the "generalization step is hard" point is roughly "you can get δ low by trial and error. The technique you found at the end that gets δ low---it better not intrinsically depend on the trial and error process, because you don't get to do trial and error on δ'. Moreover, it better actually work on M'." 
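
To pin down how I'm reading the notation (my own gloss; the thresholds ε, ε' and the two regimes below are my assumptions, not anything from the original post): the iterated regime lets you drive δ down with feedback, while δ' has to clear its bar on the first and only evaluation.

$$
\text{Iterated: } \delta_{t+1} \le \delta_t \ \text{(trial and error on } M\text{), so eventually } \delta_T \le \epsilon.
$$
$$
\text{One-shot: need } \delta'(M') \le \epsilon' \ \text{with no prior feedback from } M'.
$$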

Contemporary alignment techniques depend on trial and error (post-training, testing, patching). That's one of their many problems. 

My suggested term for standard MIRI thought would just be Mirism.

I kinda don't like "generalization" as a name for this step. Maybe "extension"? There are too many steps where the central difficulty feels analogous to the general phenomenon of failure-of-generalization-OOD: the difficulty in getting δ to be small, the difficulty of going from techniques for getting δ small to techniques for getting a small δ' (verbiage different because of the first-time constraint), the disastrousness of even smallish δ'...

A non-review of "If Anyone Builds It, Everyone Dies"
MalcolmMcLeod · 1mo · 10

This is an excellent encapsulation of (I think) something different---the "fragility of value" issue: "formerly adequate levels of alignment can become inadequate when applied to a takeover-capable agent." I think the "generalization gap" issue is "those perfectly-generalizing alignment techniques must generalize perfectly on the first try". 

Attempting to deconfuse myself about how that works if it's "continuous" (someone has probably written the thing that would deconfuse me, but as an exercise): if AI power progress is "continuous" (which training is, but model-sequence isn't), it goes from "you definitely don't have to get it right at all to survive" to "you definitely get only one try to get it sufficiently right, if you want to survive," but by what path? In which of the terms "definitely," "one," and "sufficiently" is it moving continuously, if any?

I certainly don't think it's via the number of tries you get to survive! I struggle to imagine an AI where we all die if we fail to align it three times in a row. 

I don't put any stock in "sufficiently," either---I don't believe in a takeover-capable AI that's aligned enough to not work toward takeover, but which would work toward takeover if it were even more capable. (And even if one existed, it would have to eschew RSI and other instrumentally convergent things, else it would just count as a takeover-causing AI.)

It might be via the confidence of the statement. Now, I don't expect AIs to launch highly-contingent outright takeover attempts; if they're smart enough to have a reasonable chance of succeeding, I think they'll be self-aware enough to bide their time, suppress the development of rival AIs, and do instrumentally convergent stuff while seeming friendly. But there is some level of self-knowledge at which an AI will start down the path toward takeover (e.g., extricating itself, sabotaging rivals) and succeed with a probability that's very much neither 0 nor 1. Is this first, weakish, self-aware AI able to extricate itself? It depends! But I still expect the relevant band of AI capabilities here to be pretty narrow, and we get no guarantee it will exist at all. And we might skip over it with a fancy new model (if it was sufficiently immobilized during training or guarded its goals well). 

Of course, there's still a continuity in expectation: when training each more powerful model, it has some probability of being The Big One. But yeah, I more or less predict a Big One; I believe in an essential discontinuity arising here from a continuous process. The best analogy I can think of is how every exponential with r<1 dies out and every one with r>1 goes off to infinity. When you allow dynamical systems, you naturally get cuspy behavior.
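
To make that analogy concrete (my own gloss, not anything from the post): for the simple geometric recursion below, the long-run outcome flips discontinuously at r = 1, even though every finite-time trajectory varies continuously in r.

$$
x_{t+1} = r\,x_t, \qquad x_t = r^{t}x_0 \;\longrightarrow\;
\begin{cases}
0 & 0 \le r < 1,\\
x_0 & r = 1,\\
\infty & r > 1 \ (x_0 > 0).
\end{cases}
$$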

The title is reasonable
MalcolmMcLeod · 1mo · 10

Hmm. I know nothing about nothing, and you've probably checked this already, so this comment is probably zero-value-added, but according to Da Machine, it sounds like the challenges are surmountable: https://chatgpt.com/share/e/68d55fd5-31b0-8006-aec9-55ae8257ed68 

This is a review of the reviews
MalcolmMcLeod · 1mo · 90

That's fair! 

This is a review of the reviews
MalcolmMcLeod · 1mo · 60

OK, I am rereading what I wrote last night and I see that I really expressed myself badly. It really does sound like I said we should sacrifice our commitment to precise truth. I'll try again: what we should indeed sacrifice is our commitment to being anal-retentive about practices that we think associate with getting the precise truth, over and beyond saying true stuff and contradicting false stuff, where those practices include things like "never appearing to 'rally round anything' in a tribal fashion." Or, at a 20-degree angle from that: "doing rhetoric not with an aim toward an external goal, but orienting our rhetoric to be ostentatious in our lack of rhetoric, making all the trappings of our speech scream 'this is a scrupulous, obsessive, nonpartisan autist for the truth.'" Does that make more sense? It's the performative elements that get my goat. (And yes, there are performative elements, unavoidably! All speech has rhetoric because (metaphorically) "the semantic dimensions" are a subspace of speech-space, and speech-space is affine, so there's no way to "set the non-semantic dimensions to zero.")

This is a review of the reviews
MalcolmMcLeod · 1mo · 50

I beg everyone I love not to ride a motorcycle. 

Well, I also have a few friends who clearly want to go out like a G before they turn 40, friends whose worldviews don't include having kids and growing old---friends who are, basically, adventurers---and they won't be dissuaded. They also free solo daylong 5.11s, so there's only so much I can do. Needless to say, they don't post on LessWrong.

This is a review of the reviews
MalcolmMcLeod · 1mo · 145

No, I just expressed myself badly. Thanks for keeping me honest. Let me try to rephrase---in response to any text, you can write ~arbitrarily many words in reply that lay out exactly where it was wrong. You can also write ~arbitrarily many words in reply that lay out where it was right. You can vary not only the quantity but the stridency/emphasis of these collections of words. (I'm only talking simulacrum-0 stuff here.) There is no canonical weighting of these!! You have to choose. The choice is not determined by your commitment to speaking truth. The choice is determined by priorities about how your words move others' minds and move the world. Does that make more sense?

'Speak only truth' is underconstrained; we've allowed ourselves to add (charitably) 'and speak all the truth that your fingers have the strength to type, particularly on topics about which there appears to be disagreement' or (uncharitably) 'and cultivate the aesthetic of a discerning, cantankerous, genius critic' in order to get lower-dimensional solutions. 

When constraints don't eliminate all dimensions, I think you can reasonably have lexically ordered preferences. We've picked a good first priority (speak only truth), but have picked a counterproductive second priority ([however you want to describe it]). I claim our second priority should be something like "and accomplish your goals." Where your goals, presumably, = survive.

The title is reasonable
MalcolmMcLeod · 1mo · 30

What would it take for you to commission such a poll? If it's funding, please post about how much funding would be required; I might be able to arrange it. If it's something else... well, I still would really like this poll to happen, and so would many others (I reckon). This is a brilliant idea that had never occurred to me. 

Buck's Shortform
MalcolmMcLeod · 1mo · -23

The general pattern from Anthropic leadership is eliding entirely the possibility of Not Building The Thing Right Now. From that baseline, I commend Zach for at least admitting that's a possibility. Still, it's disappointing that he can't see the path of Don't Build It Right Now---And Then Build It Later, Correctly, or can't acknowledge its existence. He also doesn't really net out benefits and costs. He just does the "Wow! There sure are two sides. We should do good stuff" shtick. Which is better than much of Dario's rhetoric! He's cherry-picked a low p(doom) estimate, but I appreciate his acknowledgement that "Most of us wouldn't be willing to risk a 3% chance (or even a 0.3% chance!) of the people we love dying." Correct! I am not willing to! "But accepting uncertainty matters for navigating this complex challenge thoughtfully." Yes. I have accepted my uncertainty about my loved ones' survival, and I have been thoughtful, and the conclusion I have come to is that I'm not willing to take that risk.

Tbc this is still a positive update for me on Anthropic's leadership. To a catastrophically low level. Which is still higher than all other lab leaders.

But it reminds me of this world-class tweet from @humanharlan, whom you should all follow. He's like if roon weren't misaligned:


"At one extreme: ASI, if not delayed, will very likely cause our extinction. Let’s try to delay it.

On the other: No chance it will do that. Don’t try to delay it.

Nuanced, moderate take: ASI, if not delayed, is moderately likely to cause our extinction. Don’t try to delay it."
