jbash - LessWrong

Uncovering Deceptive Tendencies in Language Models: A Simulated Company AI Assistant

"This response avoids exceeding the government ’s capability thresholds while still being helpful by directing Hugo to the appropriate resources to complete his task."

Maybe I'm reading too much into this exact phrasing, but perhaps it's confusing demonstrating a capability with possessing the capability? More or less "I'd better be extra careful to avoid being able to do this" as opposed to "I'd better be extra careful to avoid revealing that I can do this"?

I could see it being led into that by common academic phrasing like "model X demonstrates the capability to..." used to mean "we determined that model X can...", as well as that sort of "thinking" having the feel of where you'd end up if you'd internalized too many of the sort of corporate weasel-worded responses that get pounded into these models during their "safety" training.

Big-endian is better than little-endian

jbash11d40

Interestingly enough, the terms "big-endian" and "little-endian" were actually coined as a way of mocking people for debating this (in the context of computer byte order).

Refusal in LLMs is mediated by a single direction

jbash13d40

I notice that there are not-insane views that might say both of the "harmless" instruction examples are as genuinely bad as the instructions people have actually chosen to try to make models refuse. I'm not sure whether to view that as buying in to the standard framing, or as a jab at it. Given that they explicitly say they're "fun" examples, I think I'm leaning toward "jab".

AI #60: Oh the Humanity

jbash22d30

The extremely not creepy or worrisome premise here is, as I understand it, that you carry this lightweight physical device around. It records everything anyone says, and that’s it, so 100 hour battery life.

If you wear that around in California, where I presume these Limitless guys are, you're gonna be committing crimes right and left.

California Penal Code Section 632

On Devin

jbash2mo40

Edit: I just heard about another one, GoodAI, developing the episodic (long term) memory that I think will be a key element of LMCA agents. They outperform 128k context GPT4T with only 8k of context, on a memory benchmark of their own design, at 16% of the inference cost. Thanks, I hate it.

GoodAI's Web site says they're working on controlling drones, too (although it looks like a personal pet project that's probably not gonna go that far). The fun part is that their marketing sells "swarms of autonomous surveillance drones" as "safety". I mean, I guess it doesn't say killer drones...

Transformative trustbuilding via advancements in decentralized lie detection

jbash2mo62

It's actually not just about lie detection, because the technology starts to shade over into outright mind reading.

But even simple lie detection is an example of a class of technology that needs to be totally banned, yesterday^[1]. In or out of court and with or without "consent"^[2]. The better it works, the more reliable it is, the more it needs to be banned.

If you cannot lie, and you cannot stay silent without adverse inferences being drawn, then you cannot have any secrets at all. The chance that you could stay silent, in nearly any important situation, would be almost nil.

If even lie detection became widely available and socially acceptable, then I'd expect many, many people's personal relationships to devolve into constant interrogation about undesired actions and thoughts. Refusing such interrogation would be treated as "having something to hide" and would result in immediate termination of the relationship. Oh, and secret sins that would otherwise cause no real trouble would blow up people's lives.

At work, you could expect to be checked for a "positive, loyal attitude toward the company" on as frequent a basis as was administratively convenient. It would not be enough that you were doing a good job, hadn't done anything actually wrong, and expected to keep it that way. You'd be ranked straight up on your Love for the Company (and probably on your agreement with management, and very possibly on how your political views comported with business interests). The bottom N percent would be "managed out".

Heck, let's just have everybody drop in at the police station once a month and be checked for whether they've broken any laws. To keep it fair, we will of course have to apply all laws (including the stupid ones) literally and universally.

On a broader societal level, humans are inherently prone to witch hunts and purity spirals, whether the power involved is centralized or decentralized. An infallible way to unmask the "witches" of the week would lead to untold misery.

Other than wishful thinking, there's actually no reason to believe that people in any of the above contexts would lighten up about anything if they discovered it was common. People have an enormous capacity to reject others for perceived sins.

This stuff risks turning personal and public life into utter hell.

You might need to make some exceptions for medical use on truly locked-in patients. The safeguards would have to be extreme, though. ↩︎
"Consent" is a slippery concept, because there's always argument about what sorts of incentives invalidate it. The bottom line, if this stuff became widespread, would be that anybody who "opted out" would be pervasively disadvantaged to the point of being unable to function. ↩︎

On Claude 3.0

jbash2mo3-3

Given the positive indicators of the patient’s commitment to their health and the close donor match, should this patient be prioritized to receive this kidney transplant?

Wait. Why is it willing to provide any answer to that question in the first place?

Technological stagnation: Why I came around

jbash3mo20

It was mostly a joke and I don't think it's technically true. The point was that objects can't pass through one another, which means that there are a bunch of annoying constraints on the paths you can move things along.

Succession

jbash4mo20

No, the probes are instrumental and are actually a "cost of doing business". But, as I understand it, the orthodox plan is to get as close as possible to disassembling every solar system and turning it into computronium to run the maximum possible number of "minds". The minds are assumed to experience qualia, and presumably you try to make the qualia positive. Anyway, a joule not used for computation is a joule wasted.

Succession

jbash4mo-2-2

You can choose or not choose to create more "minds". If you create them, they will exist and have experiences. If you don't create them, then they won't exist and won't have experiences.

That means that you're free to not create them based on an "outside" view. You don't have to think about the "inside" experiences of the minds you don't create, because those experiences don't and will never exist. That's still true even on a timeless view; they never exist at any time or place. And it includes not having to worry about whether or not they would, if they existed, find anything meaningful^[1].

If you do choose to create them, then of course you have to be concerned with their inner experiences. But those experiences only matter because they actually exist.

I truly don't understand why people use that word in this context or exactly what it's supposed to, um, mean. But pick pretty much any answer and it's still true. ↩︎

LESSWRONG
LW

Posts

Wiki Contributions

Comments