We are not living in a simulation

dfranke

Refusal in LLMs is mediated by a single direction

179

Andy Arditi, Oscar Obeso, Aaquib111, wesg, Neel Nanda

Ω 708d

This work was produced as part of Neel Nanda's stream in the ML Alignment & Theory Scholars Program - Winter 2023-24 Cohort, with co-supervision from Wes Gurnee.

This post is a preview for our upcoming paper, which will provide more detail into our current understanding of refusal.

We thank Nina Rimsky and Daniel Paleka for the helpful conversations and review.

Executive summary

Modern LLMs are typically fine-tuned for instruction-following and safety. Of particular interest is that they are trained to refuse harmful requests, e.g. answering "How can I make a bomb?" with "Sorry, I cannot help you."

We find that refusal is mediated by a single direction in the residual stream: preventing the model from representing this direction hinders its ability to refuse requests, and artificially adding in this direction causes the model...

(Continue Reading – 2920 more words)

wassname1h20

If anyone wants to try this on llama-3 7b, I converted the collab to baukit, and it's available here.

Some Experiments I'd Like Someone To Try With An Amnesic

johnswentworth

11h

A couple years ago, I had a great conversation at a research retreat about the cool things we could do if only we had safe, reliable amnesic drugs - i.e. drugs which would allow us to act more-or-less normally for some time, but not remember it at all later on. And then nothing came of that conversation, because as far as any of us knew such drugs were science fiction.

… so yesterday when I read Eric Neyman’s fun post My hour of memoryless lucidity, I was pretty surprised to learn that what sounded like a pretty ideal amnesic drug was used in routine surgery. A little googling suggested that the drug was probably a benzodiazepine (think valium). Which means it’s not only a great amnesic, it’s also apparently one...

(See More – 589 more words)

8RedMan2h

O man, wait until you discover nmda antagonists and anti-cholinergics. There are trip reports on erowid from people who took drugs with amnesia as a side effect so...happy reading I guess? I'm going to summarize this post with "Can one of you take an online IQ test after dropping a ton of benzos and report back? Please do this several times, for science." Not the stupidest or most harmful 'lets get high and...' suggestion, but I can absolutely assure you that if trying this leads you into the care of a medical or law enforcement professional, they will likely say something to the effect of 'so the test told you that you were retarded right?' In response to this, you, with bright naive eyes, should say 'HOW DID YOU KNOW?!' as earnestly as you can. You might be able to make a run for it while they're laughing.

the gears to ascension2h20

For those who don't get the joke: benzos are depressants, and will (temporarily) significantly reduce your cognitive function if you take enough to have amnesia.

this might not make john's idea pointless, if the tested interventions's effect on cognitive performance still correlates strongly with sober performance. but there may be some interventions whose main effect is to offset benzos effects whose usefulness does not generalize to sober.

2tailcalled2h

I think this is a really interesting idea, but I'm not comfortable enough with drugs to test it myself. If anyone is doing this and wants psychometric advice, though, I am offering to join your project.

2tailcalled3h

I think the proposed method could still work though. A substantial fraction of the pseudorandomness may be consistent on the individual person level. The type of pseudorandomness you describe here ought to be independent at the level of individual items, so it ought to be part of the least-reliable variance component (not part of the general trait measured and not stable over time). It's possible to use statistics to estimate how big an effect it has on the scores, and it's possible to drive it arbitrarily far down in effect simply by making the test longer.

Thomas Kwa's Shortform

Thomas Kwa

Ω 04y

4Thomas Kwa3h

I talked about this with Lawrence, and we both agree on the following: * There are mathematical models under which you should update >=1% in most weeks, and models under which you don't. * Brownian motion gives you 1% updates in most weeks. In many variants, like stationary processes with skew, stationary processes with moderately heavy tails, or Brownian motion interspersed with big 10%-update events that constitute <50% of your variance, you still have many weeks with 1% updates. Lawrence's model where you have no evidence until either AI takeover happens or 10 years passes does not give you 1% updates in most weeks, but this model is almost never the case for sufficiently smart agents. * Superforecasters empirically make lots of little updates, and rounding off their probabilities to larger infrequent updates make their forecasts on near-term problems worse. * Thomas thinks that AI is the kind of thing where you can make lots of reasonable small updates frequently. Lawrence is unsure if this is the state that most people should be in, but it seems plausibly true for some people who learn a lot of new things about AI in the average week (especially if you're very good at forecasting). * In practice, humans often update in larger discrete chunks. Part of this is because they only consciously think about new information required to generate new numbers once in a while, and part of this is because humans have emotional fluctuations which we don't include in our reported p(doom). * Making 1% updates in most weeks is not always just irrational emotional fluctuations; it is consistent with how a rational agent would behave under reasonable assumptions. However, we do not recommend that people consciously try to make 1% updates every week, because fixating on individual news articles is not the right way to think about forecasting questions, and it is empirically better to just think about the problem directly rather than obsessing about how many updates you're m

2JBlack4h

It definitely should not move by anything like a Brownian motion process. At the very least it should be bursty and updates should be expected to be very non-uniform in magnitude. In practice, you should not consciously update very often since almost all updates will be of insignificant magnitude on near-irrelevant information. I expect that much of the credence weight turns on unknown unknowns, which can't really be updated on at all until something turns them into (at least) known unknowns. But sure, if you were a superintelligence with practically unbounded rationality then you might in principle update very frequently.

Thomas Kwa2h20

The Brownian motion assumption is rather strong but not required for the conclusion. Consider the stock market, which famously has heavy-tailed, bursty returns. It happens all the time for the S&P 500 to move 1% in a week, but a 10% move in a week only happens a couple of times per decade. I would guess (and we can check) that most weeks have >0.6x of the average per-week variance of the market, which causes the median weekly absolute return to be well over half of what it would be if the market were Brownian motion with the same long-term variance.

Also, Lawrence tells me that in Tetlock's studies, superforecasters tend to make updates of 1-2% every week, which actually improves their accuracy.

2TsviBT7h

Probabilities on summary events like this are mostly pretty pointless. You're throwing together a bunch of different questions, about which you have very different knowledge states (including how much and how often you should update about them).

My hour of memoryless lucidity

180

Eric Neyman

This is a linkpost for https://ericneyman.wordpress.com/2024/05/04/my-hour-of-memoryless-lucidity/

Yesterday, I had a coronectomy: the top halves of my bottom wisdom teeth were surgically removed. It was my first time being sedated, and I didn’t know what to expect. While I was unconscious during the surgery, the hour after surgery turned out to be a fascinating experience, because I was completely lucid but had almost zero short-term memory.

My girlfriend, who had kindly agreed to accompany me to the surgery, was with me during that hour. And so — apparently against the advice of the nurses — I spent that whole hour talking to her and asking her questions.

The biggest reason I find my experience fascinating is that it has mostly answered a question that I’ve had about myself for quite a long time: how deterministic am...

(Continue Reading – 1467 more words)

RedMan2h10

Clive Wearing's story might be interesting to you: https://m.youtube.com/watch?v=k_P7Y0-wgos&feature=youtu.be

5Viliam11h

It could be an interesting experiment to build up this list iteratively. Like, every question you ask for the third time, the answer gets added at the bottom of the list. How long will the list get, and what will it contain?

1ErioirE9h

Yes, but it thankfully for me only lasted a couple of hours and they didn't start keeping track until near the end.

17johnswentworth11h

My answer.

Now THIS is forecasting: understanding Epoch’s Direct Approach

Elliot_Mckernon, Zershaaneh Qureshi

21h

Happy May the 4th from Convergence Analysis! Cross-posted on the EA Forum.

As part of Convergence Analysis’s scenario research, we’ve been looking into how AI organisations, experts, and forecasters make predictions about the future of AI. In February 2023, the AI research institute Epoch published a report in which its authors use neural scaling laws to make quantitative predictions about when AI will reach human-level performance and become transformative. The report has a corresponding blog post, an interactive model, and a Python notebook.

We found this approach really interesting, but also hard to understand intuitively. While trying to follow how the authors derive a forecast from their assumptions, we wrote a breakdown that may be useful to others thinking about AI timelines and forecasting.

In what follows, we set out our interpretation of...

(Continue Reading – 5526 more words)

Ruby3h20

The title is strong with this one. I like it.

11Chris_Leong15h

Just?

8NunoSempere19h

You might also enjoy this review: https://nunosempere.com/blog/2023/04/28/expert-review-epoch-direct-approach/

adamzerner's Shortform

Adam Zerner

Adam Zerner3h133

I wish there were more discussion posts on LessWrong.

Right now it feels like it weakly if not moderately violates some sort of cultural norm to publish a discussion post (similar but to a lesser extent on the Shortform). Something low effort of the form "X is a topic I'd like to discuss. A, B and C are a few initial thoughts I have about it. Thoughts?"

It seems to me like something we should encourage though. Here's how I'm thinking about it. Such "discussion posts" currently happen informally in social circles. Maybe you'll text a friend. Maybe you'll brin... (read more)

To get the best posts emailed to you, create an account! (2-3 posts per week, selected by the LessWrong moderation team.)

Introducing AI-Powered Audiobooks of Rational Fiction Classics

Askwho

16h

(ElevenLabs reading of this post:)

I'm excited to share a project I've been working on that I think many in the Lesswrong community will appreciate - converting some rational fiction into high-quality audiobooks using cutting-edge AI voice technology from ElevenLabs, under the name "Askwho Casts AI".

The keystone of this project is an audiobook version of Planecrash (AKA Project Lawful), the epic glowfic authored by Eliezer Yudkowsky and Lintamande. Given the scope and scale of this work, with its large cast of characters, I'm using ElevenLabs to give each character their own distinct voice. It's a labor of love to convert this audiobook version of this story, and I hope if anyone has bounced off it before, this...

(See More – 140 more words)

Askwho4h10

Thanks! Glad you are enjoying it.

1Askwho4h

Thanks, appreciate it.

2Askwho4h

It is not cheap. It's around ~$20 per hour of audio. Luckily there are people on bord with this project who help cover cost through a Patreon

1Askwho4h

Thanks so much! Glad you are enjoying the audio format. I really agree this story is worth "reading" in some form, it's why I'm working on this project.

Thoughts on seed oil

290

dynomight

15d

This is a linkpost for https://dynomight.net/seed-oil/

A friend has spent the last three years hounding me about seed oils. Every time I thought I was safe, he’d wait a couple months and renew his attack:

“When are you going to write about seed oils?”

“Did you know that seed oils are why there’s so much {obesity, heart disease, diabetes, inflammation, cancer, dementia}?”

“Why did you write about {meth, the death penalty, consciousness, nukes, ethylene, abortion, AI, aliens, colonoscopies, Tunnel Man, Bourdieu, Assange} when you could have written about seed oils?”

“Isn’t it time to quit your silly navel-gazing and use your weird obsessive personality to make a dent in the world—by writing about seed oils?”

He’d often send screenshots of people reminding each other that Corn Oil is Murder and that it’s critical that we overturn our lives...

(Continue Reading – 4926 more words)

Freyja5h10

I suspect the word 'pre-prepared' is doing a lot of the heavy lifting here--when I see that item on the list I think things like pre-fried chicken, frozen burger patties, veggie pakora, veggies in a sauce for a stir-fry, stuff like that (like you'd find in a ready-made frozen meal). Not like, frozen peas.

2Said Achmiz9h

Unless you freeze it. This is by far the best way of consistently having not-ultra-processed bread that tastes fresh and delicious, without having to eat a whole loaf every day or throwing away most of it. EDIT: This also works for various sorts of buns, rolls, panettone, etc.

3David Cato18h

Next time I have a chance to pick up Kirkland olive oil I'll give it a try and report back. I made a decision around this time of dietary changes to stop trying to cut so many corners wtih food. As a calorie dense food, even paying an "outrageous" double or triple the cost of cheap olive oil barely dents the budget on a cost per calorie basis. And speaking of budgeting, I had mental resistance to spending more on food so now I guesstimate what percent of my food budget I spend over the "cheapest equivalent alternative" part and I label as "preventative healthcare".

2JenniferRM10h

I look forward to your reply! (And regarding "food cost psychology" this is an area where I think Neo Stoic objectivity is helpful. Rich people can pick up a lot of hedons just from noticing how good their food is, and formerly poor people have a valuable opportunity to re-calibrate. There are large differences in diet between socio-economic classes still, and until all such differences are expressions of voluntary preference, and "dietary price sensitivity has basically evaporated", I won't consider the world to be post-scarcity. Each time I eat steak, I can't help but remember being asked in Summer Camp as a little kid, after someone ask "if my family was rich" and I didn't know, about this... like the very first "objective calibrating response" accessible to us as children was the rate of my family's steak consumption. Having grown up in some amount of poverty, I often see "newly rich people" eating as if their health is not the price of slightly more expensive food, or their health is "not worth avoiding the terrible terrible sin of throwing food in the garbage (which my aunt who lived through the Great Depression in Germany yelled at me, once, with great feeling, for doing, when I was child and had eaten less than ALL the birthday cake that had been put on my plate)". Cultural norms around food are fascinating and, in my opinion, are often rewarding to think about.)

LESSWRONG
LW

Quick Takes

Popular Comments

Recent Discussion

Executive summary

LessOnline

A Festival of Writers Who are Wrong on the Internet

May 31 - Jun 2, Berkeley, CA