thenoviceoof — LessWrong

An Attempt to Explain my AI Risk Explainer Attempt

While reading your comment I realized I'm not exactly sure why conversations seem to be exempt; choices constantly need to be made, which implies huge costs. However, many people find conversations only mildly draining, or even invigorating. There's probably some evolutionary component?

A different way to describe the foldout format is as a one-sided conversation, but instead of a monologue (the model of normal writing), the reader/listener has one available action, to request clarification on specific points. This obviously isn't enough to move the format into a less draining conversational domain, but understanding why not might point to ways it could be moved there.

I did mention chatbots in the post and rejected them for not having enough structure, but maybe one could impose additional structure onto a conversation. For example, lectures have structure, and a one-on-one lecture could be stopped at any point to ask for clarification or contest points. Unclear how to extend something like RAG to that point, though, or if it would be helpful; it's not like most students love listening to lectures.

Hmm! These do seem like leads, but ones that unfortunately seem to require a ton of work to chase down.

Visionary arrogance and a criticism of LessWrong voting

thenoviceoof2mo21

To be clear, I didn't downvote you: I did think "hmm, wasn't there a recent big discussion around downvote-without-commenting norms which didn't result in any changes?" and went and found it. I can see why you'd think I did downvote you; you specifically requested it! (Well, requested `if downvote then comment`)

Visionary arrogance and a criticism of LessWrong voting

thenoviceoof2mo85

You may be interested in a very similar discussion from several months ago: When you downvote, explain why.

Community Feedback Request: AI Safety Intro for General Public

thenoviceoof6mo10

I was recently experimenting with extreme amounts of folding (LW linkpost): I'd be interested to hear from Chris whether he thinks this is too much folding?

[Linkpost] AI War seems unlikely to prevent AI Doom

thenoviceoof6mo10

Hmm, "AI war makes s-risks more likely" seems plausible, but compared to what? If we were given a divine choice between a non-aligned/aligned AI war and a suffering-oriented singleton, wouldn't we choose the war? Maybe more likely relative to median/mean scenarios, but that seems hard to pin down.

Hmm, I thought I put a reference to the DoD's current Replicator Initiative into the post, but I can't find it: I must have moved it out? Still, yes, we're moving towards automated war fighting capability.

[Linkpost] AI War seems unlikely to prevent AI Doom

thenoviceoof6mo10

The post setup skips the "AIs are loyal to you" bit, but it does seem like this line of thought broadly aligns with the post.

I do think this does not require ASI, but I would agree that including it certainly doesn't help.

deleted

thenoviceoof8mo10

Some logical nits:

- Early on you mention physical attacks to destroy offline backups; these attacks would be highly visible and would contradict the dark forest nature of the scenario.
- Perfect concealment and perfect attacks are in tension. The AI supposedly knows the structure and vulnerabilities of the systems hosting an enemy AI, but finding these things out for sure requires intrusion, which can be detected. The AI can hold off on attacking and work off of suppositions, but then the perfect attack is not guaranteed, and the attack could fail due to unknowns.

Other notes:

- Why do you assume that AIs bias towards perfect, deniable strikes? An AI that strikes first can secure an early advantage; for example, if it can knock out all running copies of an enemy AI, restoring from backups will take time and leave the enemy AI vulnerable. As another example, if AI Alpha knows it is less capable than AI Bravo, but that AI Bravo will wait to attack it perfectly, AI Alpha attacking first (imperfectly) can force AI Bravo to abandon all previous attack preparations to defend itself (see maneuver warfare).
  - "Defend itself" might be better put as re-taking and re-securing compromised systems; relatedly, I think cybersecurity defense is much less of an active action than this analysis seems to assume?
- An extension of your game theory analysis implies that the US should have nuked the USSR in the 1950s, and should have been nuking all other nuclear nations over the last 70 years. This seems weird? At least, I expect it to not be persuasive to folks thinking about AI society.
- The stylistic choice I disagree with most is the bolding: if a short paragraph has 5 different bolded statements, then... what's the point?

The Dissolution of AI Safety

thenoviceoof11mo32

Let's say there's an illiterate man that lives a simple life, and in doing so just happens to follow all the strictures of the law, without ever being able to explain what the law is. Would you say that this man understands the law?

Alternatively, let's say there is a learned man that exhaustively studies the law, but only so he can bribe and steal and arson his way to as much crime as possible. Would you say that this man understands the law?

I would say that it is ambiguous whether the 1st man understands the law; maybe? kind of? you could make an argument I guess? it's a bit of a weird way to put it innit? Whereas the 2nd man definitely understands the law. It sounds like you would say that the 1st man definitely understands the law (I'm not sure what you would say about the 2nd man), which might be where we have a difference.

I think you could say that LLMs don't work that way, that the reader should intuitively know this and that the word "understanding" should be treated as being special in this context and should not be ambiguous at all; as a reader, I am saying I am confused by the choice of words, or at least this is not explained in enough detail ahead of time.

Obviously, I'm just one reader, maybe everyone else understood what you meant; grain of salt, and all that.

The Dissolution of AI Safety

thenoviceoof11mo10

This makes much more sense: when I was reading lines from your post like "[LLMs] understand human values and ethics at a human level", it was easy to read them as "because LLMs can output an essay on ethics, those LLMs will not do bad things". I hope you understand why I was confused; maybe you should swap "understand ethics" for something like "follow ethics"/"display ethical behavior"? And maybe try not to stick a mention of "human uploads" (which presumably do have real understanding) right before this discussion?

And responding to your clarification, I expect that old school AI safetyists would agree that an LLM that consistently reflects human value judgments is aligned (and I would also agree!), but they would say #1 this has not happened yet (for a recent incident, this hardly seems aligned; I think you can argue that this particular case was manipulated, that jailbreaks in general don't matter, or that these sorts of breaks are infrequent enough they don't matter, but I think this obvious class of rejoinder deserves some sort of response) and #2 consistency seems unlikely to happen (as MondSemmel makes a case for in a sibling comment).

The Dissolution of AI Safety

thenoviceoof11mo90

I'd agree that the arguments I raise could be addressed (as endless arguments attest) and OP could reasonably end up with a thesis like "LLMs are actually human aligned by default". Putting my recommendation differently, the lack of even a gesture towards those arguments almost caused me to dismiss the post as unserious and not worth finishing.

I'm somewhat surprised, given OP's long LW tenure. Maybe this was written for a very different audience and just incidentally posted to LW? Except the linkpost tagline focuses on the 1st part of the post, not the 2nd, implying OP thought this was actually persuasive?! Is OP failing an intellectual Turing test or am I???
