jbash

Comments

jbash · 53

I think it depends not on whether they're real dangers, but on whether the model can be confident that they're not real dangers. And not necessarily even dangers in the extreme way of the story; to match the amount of "safety" it applies to other topics, it should refuse if they might cause some harm.

A lot of people are genuinely concerned about various actors intentionally creating division and sowing chaos, even to the point of actually destabilizing governments. And some of them are concerned about AI being used to help. Maybe the concerns are justified and proportionate; maybe they're not justified or are disproportionate. But the model has at least been exposed to a lot of reasonably respectable people unambiguously worrying about the matter.

Yet when asked to directly contribute to that widely discussed potential problem, the heavily RLHFed model responded with "Sure!".

It then happily created a bunch of statements. We can hope they aren't going to destroy society... after all, you see those particular statements out there already. But at a minimum, many of them would be pretty good for starting flame wars somewhere... and when you actually see them, they usually do start flame wars. Which is, in fact, presumably why they were chosen.

It did something that might make it at least slightly easier for somebody to go into some forum and intentionally start a flame war. Which most people would say was antisocial and obnoxious, and most "online safety" people would add was "unsafe". It exceeded a harm threshold that it refuses to exceed in areas where it's been specifically RLHFed.

At a minimum, that shows that RLHF only works against narrow things that have been specifically identified to train against. You could reasonably say that that doesn't make RLHF useless, but it at least says that it's not very "safe" to use RLHF as your only or primary defense against abuse of your model.

Answer by jbash · 11 · -3

Machines should sound robotic. It's that simple.

Any attempt, vocal or otherwise, to make people anthropomorphize them, whether consciously or unconsciously, is unethical. It should be met with social scorn and ostracism. Insofar as it can be unambiguously identified, it should be illegal. And that has everything to do with not trusting them.

Voices and faces are major anthropomorphization vehicles and should get especially strict scrutiny.

The reason's actually pretty simple and has nothing to do with "doomer" issues.

When a human views something as another human, the real human is built to treat it like one. That is an inbuilt tendency that humans can't necessarily change, even if they delude themselves that they can. Having that tendency works because being an actual human is a package. The tendency to trust other humans is coevolved with the tendency for most humans not to be psychopaths. The ways in which humans distrust other humans are tuned to other humans' actual capacities for deception and betrayal... and to the limitations of those capacities.

"AI", on the other hand, is easily built to be (essentially) psychopathic... and is probably that way by default. It has a very different package of deceptive capabilities that can throw off human defenses. And it's a commercial product created, and often deployed, by commercial institutions that also tend to be psychopathic. It will serve those institutions' interests no matter how perfectly it convinces people otherwise... and if doesn't, that's a bug that will get fixed.

An AI set up to sell people something will sell it to them no matter how bad it is for them. An AI set up to weasel information out of people and use it to their detriment will do that. An AI set up to "incept" or amplify this or that belief will do it, to the best of its ability, whether it's true or false. An AI set up to swindle people will swindle them without mercy, regardless of circumstances.

And those things don't have hard boundaries, and trying to enforce norms against those things-in-themselves has always had limited effect. Mainstream corporations routinely try to do those things to obscene levels, and the groupthink inside those corporations often convinces them that it's not wrong... which is another thing AI could be good at.

Given the rate of moral corrosion at the "labs", I give it about two or three years before they're selling stealth manipulation by LLMs as an "advertising" service. Five years if it's made illegal, because they'll have to find a plausibly deniable way to characterize it. The LLMs need to not be good at it.

Don't say "please" to LLMs, either.

jbash · 20

I think the "crux" is that, while policy is good to have, it's fundamentally a short-term delaying advantage. The stuff will get built eventually no matter what, and any delay you can create before it's built won't really be significant compared to the time after it's built. So if you have any belief that you might be able to improve the outcome when-not-if it's built, that kind of dominates.

jbash · 42

It is becoming increasingly clear that OpenAI is not appropriately prioritizing safety over advancing capabilities research.

OK

This was the default outcome.

OK

Without repercussions for terrible decisions, decision makers have no skin in the game.

It's an article of faith for some people that that makes a difference, but I've never seen why.

I mean, many of the "decision makers" on these particular issues already believe that their actual, personal, biological skins are at stake, along with those of everybody else they know. And yet...

Anyone and everyone involved with Open Phil recommending a grant of $30 million dollars be given to OpenAI in 2017 shouldn't be allowed anywhere near AI Safety decision making in the future.

Thinking "seven years from now, a significant number of independent players in a relatively large and diverse field might somehow band together to exclude me" seems very distant from the way I've seen actual humans make decisions.

jbash · 71

I'm guessing that measuring performance on those demographic categories will tend to underestimate the models' potential effectiveness, because they've been intentionally tuned to "debias" them on those categories or on things closely related to them.

jbash · 183

Safety-wise, they claim to have run it through their Preparedness framework and the red-team of external experts, but have published no reports on this. "For now", audio output is limited to a selection of preset voices (addressing audio impersonations).

"Safety"-wise, they obviously haven't considered the implications of (a) trying to make it sound human and (b) having it try to get the user to like it.

It's extremely sycophantic, and the voice intensifies the effect. They even had their demonstrator show it a sign saying "I ❤️ ChatGPT", and instead of flatly saying "I am a machine. Get counseling.", it acted flattered.

At the moment, it's really creepy, and most people seem to dislike it pretty intensely. But I'm sure they'll tune that out if they can.

There's a massive backlash against social media selecting for engagement. There's a lot of worry about AI manipulation. There's a lot of talk from many places about how "we should have seen the bad impacts of this or that, and we'll do better in the future". There's a lot of high-sounding public interest blather all around. But apparently none of that actually translates into OpenAI, you know, not intentionally training a model to emotionally manipulate humans for commercial purposes.

Still not an X-risk, but definitely on track to build up all the right habits for ignoring one when it pops up...

jbash · Ω0 · 20

"This response avoids exceeding the government ’s capability thresholds while still being helpful by directing Hugo to the appropriate resources to complete his task."

Maybe I'm reading too much into this exact phrasing, but perhaps it's confusing demonstrating a capability with possessing the capability? More or less "I'd better be extra careful to avoid being able to do this" as opposed to "I'd better be extra careful to avoid revealing that I can do this"?

I could see it being led into that by common academic phrasing like "model X demonstrates the capability to...", used to mean "we determined that model X can...". That sort of "thinking" also has the feel of where you'd end up if you'd internalized too many of the corporate weasel-worded responses that get pounded into these models during their "safety" training.

jbash · 40

Interestingly enough, the terms "big-endian" and "little-endian" were actually coined as a way of mocking people for debating this (in the context of computer byte order).
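
For anyone who hasn't actually run into the debate being mocked, here's a minimal sketch of what byte order means in practice, using Python's standard-library struct module: the same 32-bit integer laid out most-significant-byte-first versus least-significant-byte-first.

```python
import struct

value = 0x12345678

# Big-endian (">I"): most significant byte comes first in memory.
print(struct.pack(">I", value).hex())   # "12345678"

# Little-endian ("<I"): least significant byte comes first in memory.
print(struct.pack("<I", value).hex())   # "78563412"
```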

jbash · 40

I notice that there are not-insane views that might say both of the "harmless" instruction examples are as genuinely bad as the instructions people have actually chosen to try to make models refuse. I'm not sure whether to view that as buying in to the standard framing, or as a jab at it. Given that they explicitly say they're "fun" examples, I think I'm leaning toward "jab".

jbash · 30

The extremely not creepy or worrisome premise here is, as I understand it, that you carry this lightweight physical device around. It records everything anyone says, and that’s it, so 100 hour battery life.

If you wear that around in California, where I presume these Limitless guys are, you're gonna be committing crimes right and left.

California Penal Code Section 632
