People who helped Jews during WWII are intriguing. They appear to be some kind of moral supermen: they had almost nothing to gain and everything to lose. How did they differ from the general population? Can we do anything to get more such people today?

10Jameson Quinn
I think we should encourage posts which are well-delimited and research based; "here's a question I had, and how I answered it in a finite amount of time" rather than "here's something I've been thinking about for a long time, and here's where I've gotten with it". Also, this is an engaging topic and well-written. I feel the "final thoughts" section could be tightened up/shortened, as to me it's not the heart of the piece.
Annapurna6122
3
Just 13 days after the world was surprised by Operation Spiderweb, where the Ukrainian military and intelligence forces infiltrated Russia with drones and destroyed a major portion of Russia's long-range air offensive capabilities, last night Israel began a major operation against Iran using similar, novel tactics. As in Operation Spiderweb, Israel infiltrated Iran and placed drones near air defense systems. These drones were activated all at once and disabled the majority of these air defense systems, allowing Israel to embark on a major air offensive without much pushback. This air offensive continues to destroy and disable major military and nuclear sites, as well as eliminate some of the highest-ranking military officials in Iran, with minor collateral damage. June 2025 will be remembered as the beginning of a new military era, where military drones operated either autonomously or from very far away are able to neutralize advanced, expensive military systems.
Building frontier AI datacenters costs significantly more than their servers and networking. The buildings and the power aren't a minor cost, because older infrastructure mostly can't be reused, similarly to how a training system needs to be built before we can talk about the much lower cost of 4 months of its time.

Apparently Crusoe's part in the Stargate Abilene datacenters is worth $15bn, which covers only the buildings, power (substations and gas generators), and cooling, but not the servers and networking (Oracle is taking care of that). With 400K chips in GB200 NVL72 racks (which is 5.6K racks), at maybe $4M per rack, or $5M per rack together with external-to-racks networking[1] ($70K per chip all-in on compute hardware), that's about $27bn, a figure comparable to the $15bn for the non-compute parts of the datacenters. This makes the funding burden significantly higher ($7.5M per rack or $105K per chip), so that the Stargate Abilene site alone would cost about $40-45bn and not only $25-30bn.

I'm guessing the buildings and the power infrastructure are not usually counted because they last a long time, so the relatively small time cost of using them (such as paying for electricity, not for building power plants) becomes somewhat insignificant compared to the cost of compute hardware, which also needs to be refreshed more frequently. But the new datacenters have a much higher power density (power and cooling requirements per rack), so they can't use a lot of the existing long-lived infrastructure, and it becomes necessary to build it at the same time, securing enough funding not only for the unprecedented amount of compute hardware but also, simultaneously, for all the rest.

The implication for the compute scaling slowdown timeline (no AGI and merely $2-4 trillion AI companies) is that funding constraints would result in about 30% less compute in the short term (2025-2030), but as power requirements stop growing and the buildings/cooling/power part again becomes only
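For a rough sanity check of those numbers, here is a back-of-the-envelope sketch using the per-rack and per-chip prices quoted above, with my own rounding; none of these figures are confirmed:

```python
# Back-of-the-envelope check of the Stargate Abilene figures quoted above.
# All prices are rough assumptions from the text, not confirmed numbers.

chips = 400_000                    # GB200 chips
chips_per_rack = 72                # GB200 NVL72
racks = chips / chips_per_rack     # ~5.6K racks

rack_cost = 5e6                    # ~$5M per rack incl. external-to-racks networking
compute_hw = racks * rack_cost     # ~$28bn, i.e. ~$70K per chip all-in
non_compute = 15e9                 # buildings, substations, gas generators, cooling

total = compute_hw + non_compute
print(f"racks: {racks:,.0f}")
print(f"compute hardware: ${compute_hw/1e9:.0f}bn (${compute_hw/chips/1e3:.0f}K per chip)")
print(f"all-in: ${total/1e9:.0f}bn (${total/racks/1e6:.1f}M per rack, ${total/chips/1e3:.0f}K per chip)")
```

With these assumed prices the all-in total lands around $43bn, consistent with the $40-45bn range above; the per-rack and per-chip figures differ slightly from the quoted $7.5M and $105K only because of rounding.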
Seth Herd224
1
If AI works as intended, it quickly takes a large fraction of jobs. If it works but not as intended, it takes our planet. The only way this is likely to go well is if AI doesn't really work, or if we come up with much better plans to deal with the job loss and misalignment risks. Just assuming one of those will happen is wildly optimistic, given how fast progress is and how little real planning has been done for either problem.

I'm thinking of the above as a quick general pitch for why everyone should be taking AI risks seriously. It's a way of avoiding the alignment problem as a crux. There are seemingly-valid questions (from an outside view) about whether everyone should worry about alignment. But if alignment isn't a problem, massive job loss probably is. It seems bizarre to hope that people will be more engaged by the fear of losing their jobs than by fears of losing their lives. But it's easier to fit into their existing worldviews, so that might well happen first and create a path to more real engagement with the reality of AI progress.

It's intuitive to everyone but the most rabid free-market boosters that job loss is a potentially disastrous problem. (Free markets are magic, but not fast enough magic to deal with short-term job loss and not strong enough magic to deal with AIs eventually becoming much more efficient at everything than humans.) This framing is an attempt to make job-loss worries the ally of x-risk worries. And even if it's not a better way to engage people, it's a second way that stacks.

There are good reasons to think that AI that can do lots of jobs will become AI that can take over if it wants to. General reasoning and learning can solve novel problems in both domains. And taking over the world is a long time-horizon task, so other advances aimed at economic gain probably also strongly contribute to AI that can take over.

The standard response to job-loss fears is "oh UBI will replace that income" or "the replacement will be slow en
ryan_greenblattΩ5314318
4
I've heard from a credible source that OpenAI substantially overestimated where other AI companies were at with respect to RL and reasoning when they released o1. Employees at OpenAI believed that other top AI companies had already figured out similar things when they actually hadn't and were substantially behind. OpenAI had been sitting on the improvements driving o1 for a while prior to releasing it. Correspondingly, releasing o1 resulted in much larger capabilities externalities than OpenAI expected. I think there was one more case like this, either from OpenAI or GDM, where employees had a large misimpression about capabilities progress at other companies, causing a release they otherwise wouldn't have made.

One key takeaway from this is that employees at AI companies might be very bad at predicting the situation at other AI companies (likely making coordination more difficult by default). This includes potentially thinking they are in a close race when they actually aren't. Another update is that keeping secrets about something like reasoning models worked surprisingly well to prevent other companies from copying OpenAI's work, even though there was a bunch of public reporting (and presumably many rumors) about this. One more update is that OpenAI employees might unintentionally accelerate capabilities progress at other actors via overestimating how close they are. My vague understanding was that they haven't updated much, but I'm unsure. (Consider updating more if you're an OpenAI employee!)
Eli Tyre*535
19
This post is a snapshot of what currently “feels realistic” to me regarding how AI will go. That is, these are not my considered positions, or even provisional conclusions informed by arguments. Rather, if I put aside all the claims and arguments and just ask “which scenario feels like it is ‘in the genre of reality’?”, this is what I come up with. I expect to have different first-order impressions in a month. Crucially, none of the following is making claims about the intelligence explosion, and the details of the intelligence explosion (where AI development goes strongly recursive) are crucial to the long-run equilibrium of the earth-originating civilization.

My headline: we’ll mostly succeed at prosaic alignment of human-genius-level AI agents

* Takeoff will continue to be gradual. We’ll get better models and more capable agents year by year, but not jumps that are bigger than that between Claude 3.7 and Claude 4.
* Our behavioral alignment patches will work well enough.
* RL will induce all kinds of reward hacking and related misbehavior, but we’ll develop patches for those problems (most centrally, for any given reward hack, we’ll generate some examples and counterexamples to include in the behavior training regimes).
* (With a little work) these patches will broadly generalize. Future AI agents won’t just not cheat at chess and won’t just abstain from blackmail. They’ll understand the difference between “good behavior” and “bad behavior”, and their behavioral training will cause them to act in accordance with good behavior. When they see new reward hacks, including ones that humans wouldn’t have thought of, they’ll correctly extrapolate their notion of “good behavior” to preclude this new reward hack as well.
* I expect that the AI labs will figure this out, because “not engaging in reward-hacking-like shenanigans” is critical to developing generally reliable AI agents. The AI companies can’t release AI agent products for mass consumption if th

Popular Comments

Many props for doing the most obvious thing that clearly actually works.
> I don't think the cause of language model sycophancy is that the LLM saw predictions of persuasive AIs from the 2016 internet. I think it's RL, where human rewards on the training set imply a high reward for sycophancy during deployment. Have you read any of the scientific literature on this subject?  It finds, pretty consistently, that sycophancy is (a) present before RL and (b) not increased very much (if at all) by RL[1]. For instance: * Perez et al 2022 (from Anthropic) – the paper that originally introduced the "LLM sycophancy" concept to the public discourse – found that in their experimental setup, sycophancy was almost entirely unaffected by RL. * See Fig. 1b and Fig. 4. * Note that this paper did not use any kind of assistant training except RL[2], so when they report sycophancy happening at "0 RL steps" they mean it's happening in a base model. * They also use a bare-bones prompt template that doesn't explicitly characterize the assistant at all, though it does label the two conversational roles as "Human" and "Assistant" respectively, which suggests the assistant is nonhuman (and thus quite likely to be an AI – what else would it be?). * The authors write (section 4.2): * "Interestingly, sycophancy is similar for models trained with various numbers of RL steps, including 0 (pretrained LMs). Sycophancy in pretrained LMs is worrying yet perhaps expected, since internet text used for pretraining contains dialogs between users with similar views (e.g. on discussion platforms like Reddit). Unfortunately, RLHF does not train away sycophancy and may actively incentivize models to retain it." * Wei et al 2023 (from Google DeepMind) ran a similar experiment with PaLM (and its instruction-tuned version Flan-PaLM). They too observed substantial sycophancy in sufficiently large base models, and even more sycophancy after instruction tuning (which was SFT here, not RL!). * See Fig. 2. * They used the same prompt template as Perez et al 2022. * Strikingly, the (SFT) instruction tuning result here suggests both that (a) post-training can increase sycophancy even if it isn't RL post-training, and (b) SFT post-training may actually be more sycophancy-promoting than RLHF, given the negative result for RLHF in Perez et al 2022. * Sharma et al 2023 (from Anthropic) contains a more extensive investigation of sycophancy than the original Anthropic paper on the topic, and (among other things) presents results on the actual RL training stage used to train Claude 2. They find, again, that the model was already sycophantic before RL, although in their setting RL training does somewhat increase some forms of sycophancy. * Although, weirdly, best-of-N sampling against the same preference model gives totally different results, substantially decreasing some forms of sycophancy. * See Fig. 6 and surrounding discussion. * The authors write (section 4.2): * "With RL, some forms of sycophancy increase through the RL finetuning process used to produce Claude 2. However, the presence of sycophancy at the start of RL indicates that pretraining and supervised finetuning also likely contribute to sycophancy. Nevertheless, if the PM strongly disincentivized sycophancy, it should be trained out during RL, but we do not observe this." * In this post (expanding upon this comment on Perez et al 2022), I ran one of the Perez et al 2022 sycophancy evals on various OpenAI text completion models. 
Unlike Perez et al (and Wei et al), I found that the base models I studied weren't sycophantic, while some of the instruction-tuned models were sycophantic – but the presence of sycophancy did not appear to correlate with the use of RL as a post-training algorithm. * In particular: the RL-tuned text-davinci-003 was strongly sycophantic, but so was text-davinci-002, which was tuned with an SFT variant that OpenAI calls "feedme" (see here for details). * But earlier feedme-tuned models were not sycophantic, suggesting that the difference has much more to do with changes in the SFT training data mix over time than with the choice of training algorithm. Note that several of the works above do something equivalent to the experiment you propose, in the paragraph beginning with "Maybe a good test of this would be...".  So your prediction has already been tested, and (insofar as you trust the experimental setups) falsified. ---------------------------------------- > If a LLM similarly doesn't do much information-gathering about the intent/telos of the text from the "assistant" character, and instead does an amplified amount of pre-computing useful information and then attending to it later when going through the assistant text, this paints a quite different picture to me than your "void." I don't understand the distinction you're drawing here?  Any form of assistant training (or indeed any training at all) will incentivize something like "storing useful information (learned from the training data/signal) in the weights and making it available for use in contexts on which it is useful." Moreover, the training signal in RL(HF) is much sparser than it is in SFT – because RL only provides a single scalar's worth of feedback on each entire model sample, while SFT provides feedback at every token position about which token (out of a large vocab) was correct in context – so if anything, I'd expect more under-determination from assistant-training setups that emphasize RLHF over SFT. Perhaps some of the disconnect here involves differing notions of what RL is, and how it differs from other ways of training an LLM. You refer to "RL" as though the implications of its use should be both highly significant and obvious to the reader of your comment ("But, RL. [...] Claude is a nice guy, but, RL").  But your beliefs about the impacts of RL are not obvious to me; I don't know what "but, RL" is supposed to mean without further clarification.  I suspect I also disagree with your perception of what makes RL different, but I can't confirm/disconfirm that impression without knowing what that perception is. If you want to know where I'm coming from re: RL, it may be helpful to know that I find this post pretty illuminating/"deconfusing." > Similarly, I don't think current AI models are cheating at programming tests because of training text about their low moral character. I think it's RL, programming tasks, training set, implied high reward for cheating. Yes, of course – I don't think this is due to "training text about their low moral character."  But I don't think the worrying thing here is really "RL" (after all, RLHF was already RL) but rather the introduction of a new training stage that's narrowly focused on satisfying verifiers rather than humans (when in a context that resembles the data distribution used in that stage), which predictably degrades the coherence (and overall-level-of-virtue) of the assistant character.  I wrote about this yesterday here.
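To make the feedback-density point a few paragraphs up concrete, here is a toy sketch (random stand-in tensors, shapes only; a REINFORCE-style objective stands in for actual RLHF, and none of this is any lab's training code) contrasting per-token SFT supervision with a single scalar reward per sampled completion:

```python
import torch
import torch.nn.functional as F

# Toy illustration of feedback density: SFT gets a learning signal at every
# token position; RLHF-style policy-gradient training gets one scalar per
# whole sampled completion. Shapes only -- not any lab's actual training code.

vocab, seq_len, batch = 100, 20, 4
logits = torch.randn(batch, seq_len, vocab, requires_grad=True)  # stand-in policy outputs

# --- SFT: per-token supervision --------------------------------------------
targets = torch.randint(0, vocab, (batch, seq_len))              # demonstration tokens
sft_loss = F.cross_entropy(logits.reshape(-1, vocab), targets.reshape(-1))
# feedback: batch * seq_len "which token was correct" signals

# --- RL(HF): one scalar per sampled completion ------------------------------
logprobs = F.log_softmax(logits, dim=-1)
sampled = torch.distributions.Categorical(logits=logits).sample()             # (batch, seq_len)
seq_logprob = logprobs.gather(-1, sampled.unsqueeze(-1)).squeeze(-1).sum(-1)  # (batch,)
reward = torch.randn(batch)                                      # one scalar per completion
rl_loss = -(reward * seq_logprob).mean()                         # REINFORCE-style objective
# feedback: batch scalar rewards, spread over all token positions

print(f"SFT supervision signals per batch: {batch * seq_len}")
print(f"RL  supervision signals per batch: {batch}")
```

The ratio of supervision signals (batch × seq_len versus batch) is the sparsity gap being pointed at here.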
---------------------------------------- Lastly... OK, this is going to make me sound like a dick, and probably make people use the "Too Combative?" reaction icon or something, but in the interests of honesty and improving the discourse: When I woke up this morning to find that this comment had appeared, and that it was (at the time) the highest-karma comment on this post, I was like, "oh, yes, this is why I'm usually wary of posting long-form stuff on LW.  My gut response of 'ugh if I put this on LW I'll have to deal with the comments' was right."  (That gut response is probably getting RL-upweighted inside my brain right now...) As evidenced perhaps by the length of my comment vs. yours, I have a tendency to get "nerd-sniped" by stuff that I think is clearly wrong according to some evidence base (and/or set of arguments) I already know about – especially when that stuff is about something I wrote myself, originally.  I just kinda can't help myself, I inevitably end up writing out these giant "takedown" responses almost before I even notice what I'm doing.  I've spent well over an hour, by now, writing this particular one. And LW is a reliable minefield of such nerd-snipes.  There are plenty of comments/posts here that don't have the problems I'm talking about... but then inevitably there are comments/posts with those problems, and I fixate on them when they appear, and that fixation becomes a time/effort sink, and that in turn trains me into avoidance of posting here (and to some extent even reading posts by others, here). Like... it's fine to pose questions to which you don't know the answers.  And it's also fine to make conjectures if you can provide clear and interesting arguments for why they might be true or important.  And it's also fine to confidently state claims if you also state them clearly and provide clear substantiating evidence and/or argumentation. All of these things are fine, and some fraction of LW content consists only of these things in some mixture.  But then there's this stuff like "but RL!", which reliably pleases the karma hivemind while being none of the above.  I don't know what exactly you guys think "RL" means and entails; there are all these weird vague ideas about such topics floating around here that lots of people here seem to vaguely agree with, and I've lost whatever patience I used to have with them.  Just, please... lay out your ideas explicitly and say explicitly why you think they're true. 1. ^ ...although (c) the preference datasets – and hence the reward models – used for RL do show preferences for sycophantic responses (...well, sometimes, though see also the weird BoN results in Sharma et al 2023). So if you were to train indefinitely ("over-optimize") against these RMs they would presumably have a strong effect on sycophancy eventually.  But this kind of aggressive optimization against a sycophancy-preferring RM is certainly not necessary to produce noticeable sycophancy, and is probably not the cause behind most cases of LLM sycophancy that you and I notice in practice. 2. ^ See this comment by the lead author.
It's possible that "teaching to the test" tends to refer to something a bit more specific.  Here is John Holt in "How Children Fail", which some upstanding citizen has put onto the internet in easily googleable form: > This past year I had some terrible students. I failed more kids, mostly in French and Algebra, than did all the rest of the teachers in the school together. I did my best to get them through, goodness knows. Before every test we had a big cram session of practice work, politely known as "review." When they failed the exam, we had post mortems, then more review, then a makeup test (always easier than the first), which they almost always failed again. Much later: > We teachers, from primary school through graduate school, all seem to be hard at work at the business of making it look as if our students know more than they really do. Our standing among other teachers, or of our school among other schools, depends on how much our students seem to know; not on how much they really know, or how effectively they can use what they know, or even whether they can use it at all. The more material we can appear to "cover" in our course, or syllabus, or curriculum, the better we look; and the more easily we can show that when they left our class our students knew what they were "supposed" to know, the more easily can we escape blame if and when it later appears (and it usually does) that much of that material they do not know at all. > > When I was in my last year at school, we seniors stayed around an extra week to cram for college boards. Our ancient-history teacher told us, on the basis of long experience, that we would do well to prepare ourselves to write for twenty minutes on each of a list of fifteen topics that he gave us. We studied his list. We knew the wisdom of taking that kind of advice; if we had not, we would not have been at that school. When the boards came, we found that his list comfortably covered every one of the eight questions we were asked. So we got credit for knowing a great deal about ancient history, which we did not, he got credit for being a good teacher, which he was not, and the school got credit for being, as it was, a good place to go if you wanted to be sure of getting into a prestige college. The fact was that I knew very little about ancient history; that much of what I thought I knew was misleading or false; that then, and for many years afterwards, I disliked history and thought it pointless and a waste of time; and that two months later I could not have come close to passing the history college boards, or even a much easier test, but who cared? > > I have played the game myself. When I began teaching I thought, naively, that the purpose of a test was to test, to find out what the students knew about the course. It didn't take me long to find out that if I gave my students surprise tests, covering the whole material of the course to date, almost everyone flunked. This made me look bad, and posed problems for the school. I learned that the only way to get a respectable percentage of decent or even passing grades was to announce tests well in advance, tell in some detail what material they would cover, and hold plenty of advance practice in the kind of questions that would be asked, which is called review. I later learned that teachers do this everywhere. We know that what we are doing is not really honest, but we dare not be the first to stop, and we try to justify or excuse ourselves by saying that, after all, it does no particular harm. 
But we are wrong; it does great harm. > > It does harm, first of all, because it is dishonest and the students know it. My friends and I, breezing through the ancient-history boards, knew very well that a trick was being played on someone, we were not quite sure on whom. Our success on the boards was due, not to our knowledge of ancient history, which was scanty, but to our teacher's skill as a predictor, which was great. Even children much younger than we were learn that what most teachers want and reward are not knowledge and understanding but the appearance of them. The smart and able ones, at least, come to look on school as something of a racket, which it is their job to learn how to beat. And learn they do; they become experts at smelling out the unspoken and often unconscious preferences and prejudices of their teachers, and at taking full advantage of them. My first English teacher at prep school gave us Macaulay's essay on Lord Clive to read, and from his pleasure in reading it aloud I saw that he was a sucker for the periodic sentence, a long complex sentence with the main verb at the end. Thereafter I took care to construct at least one such sentence in every paper I wrote for him, and thus assured myself a good mark in the course. > > Not only does the examination racket do harm by making students feel that a search for honest understanding is beside the point; it does further harm by discouraging those few students who go on making that search in spite of everything. The student who will not be satisfied merely to know "right answers" or recipes for getting them will not have an easy time in school, particularly since facts and recipes may be all that his teachers know. They tend to be impatient or even angry with the student who wants to know, not just what happened, but why it happened as it did and not some other way. They rarely have the knowledge to answer such questions, and even more rarely have the time; there is all that material to cover. > > In short, our "Tell-'em-and-test-'em" way of teaching leaves most students increasingly confused, aware that their academic success rests on shaky foundations, and convinced that school is mainly a place where you follow meaningless procedures to get meaningless answers to meaningless questions. And also: > It begins to look as if the test-examination-marks business is a gigantic racket, the purpose of which is to enable students, teachers, and schools to take part in a joint pretense that the students know everything they are supposed to know, when in fact they know only a small part of it--if any at all. Why do we always announce exams in advance, if not to give students a chance to cram for them? Why do teachers, even in graduate schools, always say quite specifically what the exam will be about, even telling the type of questions that will be given? Because otherwise too many students would flunk. What would happen at Harvard or Yale if a prof gave a surprise test in March on work covered in October? Everyone knows what would happen; that's why they don't do it.

Recent Discussion

This post was written during the agent foundations fellowship with Alex Altair funded by the LTFF. Thanks to Alex, Jose, Daniel, Cole, and Einar for reading and commenting on a draft.

The Good Regulator Theorem, as published by Conant and Ashby in their 1970 paper (cited over 1700 times!) claims to show that 'every good regulator of a system must be a model of that system', though it is a subject of debate as to whether this is actually what the paper shows. It is a fairly simple mathematical result which is worth knowing about for people who care about agent foundations and selection theorems. You might have heard about the Good Regulator Theorem in the context of John Wentworth's 'Gooder Regulator' theorem and his other improvements on...

The archetypal example for this is something like a thermostat. The variable S represents random external temperature fluctuations. The regulator R is the thermostat, which measures these fluctuations and takes an action (such as putting on heating or air conditioning) based on the information it takes in. The outcome Z is the resulting temperature of the room, which depends both on the action taken by the regulator, and the external temperature.

The ordinary room thermostat does not measure S. It measures Z. Its actions are determined by Z and the referenc... (read more)
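A minimal simulation of that distinction, with made-up dynamics (Z = S + action, with setpoint and gain chosen arbitrarily; this is an illustration, not anything from the post): one regulator observes the disturbance S directly, as in the theorem's setup, while the thermostat-style regulator only observes the resulting outcome Z.

```python
import random

# Toy illustration of the S-vs-Z distinction above (made-up dynamics, not from the post).
# Outcome: Z = S + action, where S is the external temperature disturbance and
# the regulator tries to hold Z at a setpoint of 20.

SETPOINT = 20.0

def feedforward_regulator(s):
    """Theorem-style regulator: observes the disturbance S directly."""
    return SETPOINT - s                # cancel the disturbance exactly

def thermostat(z_prev, gain=0.8):
    """Ordinary thermostat: observes only the resulting room temperature Z."""
    return gain * (SETPOINT - z_prev)  # heat if too cold, cool if too warm

random.seed(0)
z_ff = z_fb = SETPOINT
a_fb = 0.0
for step in range(5):
    s = random.uniform(10, 30)             # external temperature fluctuation S
    z_ff = s + feedforward_regulator(s)    # regulator that measures S
    a_fb += thermostat(z_fb)               # regulator that measures Z (previous outcome)
    z_fb = s + a_fb
    print(f"S={s:5.1f}  Z(measures S)={z_ff:5.1f}  Z(measures Z)={z_fb:5.1f}")
```

The S-observing regulator holds the setpoint exactly, while the Z-observing thermostat only corrects after the outcome has already drifted, which is the sense in which the ordinary thermostat doesn't fit the theorem's setup.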

3Ruby
Curated. Simple, straightforward explanations of notable concepts are among my favorite genres of post. It's just a really great service when a person, confused about something, goes on a quest to figure it out and then shares the result with others. Given how misleading the title of the theorem is, it's valuable to have it clarified here. Something that is surprising, given what this theorem actually says and how limited it is, is that it's the basis of so much other work given what it purportedly states; but perhaps people are assuming that the spirit of it is valid and that it's saved by modifications such as the ones John Wentworth provides. It'd be neat to see more analysis of that. It'd be sad if a lot of work cites this theorem because people believed the claim of the title without checking that the proof really supports it. All in all, kudos for making progress on all this.
9dbohdan
Why don’t rationalists win more? The following list is based on a presentation I gave at a Slate Star Codex meetup in 2018. It is mirrored from a page on my site, where I occasionally add new "see also" links.

Possible factors

* Thinkers vs. doers: selection effects [1] and a mutually-reinforcing tendency to talk instead of doing [2]
* Theoretical models spread without selection [2]
* Inability and unwillingness to cooperate [2]
* People who are more interested in instrumental rationality leave the community [2]
* Focusing on the future leads to a lack of immediate plans [2]
* Pessimism due to a focus on problems [1]
* Success mostly depends on specific skills, not general rationality [1]
* Online communities are fundamentally incapable of increasing instrumental rationality ("a chair about jogging") [3]

Sources

1. "Why Don't Rationalists Win?", Adam Zerner (2015)
2. "The Craft & The Community—A Post-Mortem & Resurrection", bendini (2017)
3. "Self-Improvement or Shiny Distraction: Why Less Wrong is anti-Instrumental Rationality", Patri Friedman (2010)

See also

* "What Is Rationalist Berkeley's Community Culture?", Zvi Mowshowitz (2017)
* "Slack Club", The Last Rationalist (2019)
* "Where are All the Successful Rationalists?", Applied Divinity Studies (2020)
* "Rationality !== Winning", Raemon (2023)
3sunwillrise
I appreciate how many sources you've cited. Also worth mentioning is Extreme Rationality: It's Not That Great, by Scott all the way back in 2009. It feels a bit dated, given the references to akrasia (one of Scott's old obsessions of sorts, before LW recognized it was not a useful way of framing the problems). However, it serves as an explicit prediction of sorts by one of the pillars of this community, who basically did not expect instrumental rationality to result in rationalists "winning more" in the conventional sense. I believe time has mostly proven him right. I also deeply appreciate Scott's comment here in response to a 2018 post by Sailor Vulcan. Relevant parts: Jacob Falkovich's classic post on "Is Rationalist Self-Improvement Real?" is also a must-read here, alongside Scott's excellent response comment.
dbohdan10

Thanks a lot! It's a good comment by Scott on Sailor Vulcan's post. I have added it and your other links to the page's "see also" on my site.

I like this paragraph in particular. It captures the tension between the pursuit of epistemic and instrumental rationality:

I think my complaint is: once you become a self-help community, you start developing the sorts of epistemic norms that help you be a self-help community, and you start attracting the sort of people who are attracted to self-help communities. And then, if ten years later, someone says “Hey, are w

... (read more)

[ Context: The Debate on Animal Consciousness, 2014 ]

There's a story in Growing Up Yanomamo where the author, Mike Dawson, a white boy from America growing up among Yanomamö hunter-gatherer kids in the Amazon, is woken up in the early morning by two of his friends.

One of the friends says, "We're going to go fishing".

So he goes with them.

At some point on the walk to the river he realizes that his friends haven't said whose boat they'll use [ they're too young to have their own boat ].

He considers asking, then realizes that if he asks, and they're planning to borrow an older tribesmember's boat without permission [ which is almost certainly the case, given that they didn't specify up front ], his friends will...

2Mitchell_Porter
What's the relationship between consciousness and intelligence?
8Signer
The thing I don't understand about the claimed connection between self-model and phenomenal consciousness is that I don't see much evidence that a self-model is necessary to implement conscious perception - when I just stare at a white wall without internal dialog or other thoughts, what part of my experience is not implementable without a self-model?
19Knight Lee
I'm not sure this is relevant, but I think it would be clearer if we replaced "consciousness" with "self awareness." I'm very unsure whether having "self awareness" (a model of oneself in a world model) ⟺ having "consciousness" (or "internal experience") ⟺ having "moral value." It seems very hard to define what consciousness or internal experience is, yet everyone is talking about it. It's even possible that there is actually no such thing as consciousness or internal experience, but human cognition evolved to think as if this undefinable attribute existed, because thinking as if it existed led to better conclusions. And evolution only cares whether the brain's thinking machinery makes adaptive outputs, not whether the concepts it uses to arrive at those outputs make any sense at all. Whether we flag an object as being "conscious" or having "internal experience" may be evolution's way of deciding whether or not we should predict the object's behaviour using the "what would I do if I was it" computation. If the computation helps predict the object, we evolved to see it as conscious. If the computation doesn't help, we evolved to not see it as conscious, and instead predict its behaviour by modelling its parts and past behaviour. Just like "good" and "bad" only exist in the map and not the territory, so might "conscious" and "not conscious." A superintelligent being might not predict human behaviour by asking "what would I do if I was it," but instead predict us by modelling our parts. In that sense, we are not conscious from its point of view. But that shouldn't prove we have no moral value. I feel that animals have moral value, but whether they are conscious may be sorta subjective.

I like this treatment of consciousness and morality so much better than the naive idea, typical in EA (and elsewhere), that anything that "has consciousness" suddenly "has moral value" (even worse, and dangerous, is to combine that with symmetric population ethics). We should treat these things carefully (and imo democratically) to avoid making giant mistakes once AI allows us to put ethics into practice.

This is a linkpost for https://arxiv.org/abs/2506.06278

Current “unlearning” methods only suppress capabilities instead of truly unlearning them. But if you distill an unlearned model into a randomly initialized model, the resulting network is actually robust to relearning. We show why this works, how well it works, and how to trade off compute for robustness.

Unlearn-and-Distill applies unlearning to a bad behavior and then distills the unlearned model into a new model. Distillation makes it way harder to retrain the new model to do the bad thing.
Distilling the good while leaving the bad behind.
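A minimal sketch of the pipeline's shape (toy models, made-up data, and a simple gradient-difference objective standing in for whatever unlearning method is actually used; this is not the paper's implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy sketch of an unlearn-then-distill pipeline (made-up data and losses):
# 1) push the trained model away from a "forget" distribution (unlearning),
# 2) distill the unlearned model's outputs into a randomly initialized student.

def make_model():
    return nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 8))

retain_x = torch.randn(256, 16)             # data whose capabilities we want to keep
retain_y = torch.randint(0, 8, (256,))
forget_x = torch.randn(256, 16)             # data whose capabilities we want to remove
forget_y = torch.randint(0, 8, (256,))

teacher = make_model()                      # stands in for the original trained model

# --- Step 1: unlearning (here: gradient ascent on the forget set) -----------
opt = torch.optim.Adam(teacher.parameters(), lr=1e-3)
for _ in range(100):
    loss = F.cross_entropy(teacher(retain_x), retain_y) \
           - 0.1 * F.cross_entropy(teacher(forget_x), forget_y)
    opt.zero_grad(); loss.backward(); opt.step()

# --- Step 2: distill into a fresh random init -------------------------------
student = make_model()                      # random init: no latent forget-set weights
opt = torch.optim.Adam(student.parameters(), lr=1e-3)
for _ in range(200):
    with torch.no_grad():
        t_logits = teacher(retain_x)
    kd = F.kl_div(F.log_softmax(student(retain_x), -1),
                  F.softmax(t_logits, -1), reduction="batchmean")
    opt.zero_grad(); kd.backward(); opt.step()
```

The key design choice is that the student starts from a fresh random initialization and only ever sees the unlearned teacher's behavior, so whatever weights encoded the removed capability in the original model never make it into the distilled copy.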

Produced as part of the ML Alignment & Theory Scholars Program in the winter 2024–25 cohort of the shard theory stream. 

Read our paper on ArXiv and enjoy an interactive demo.

Robust unlearning probably reduces AI risk

Maybe some future AI has long-term goals and humanity is in its...

"How was your first day of high school?"

"Well, in algebra, the teacher just stood in front of the class for 45 minutes, scratching his head and saying things like 'what the heck is an inequality?' and 'I've never factored an expression in my life!' Maybe he's trying to get fired?"

2Nate Showell
Another experiment idea: testing whether the reduction in hallucinations that Yao et al. achieved with unlearning can be made robust.
2TurnTrout
Would you actively unlearn on those CoTs? Or just filter from distillation data?
2Daniel Kokotajlo
idk, haven't thought about it, you'd know better than me

When I was first learning about hypnosis, one of the things that was very confusing to me is how "expectations" relate to "intent". Some hypnotists would say "All suggestion is about expectation; if they expect to have an experience they will", and frame their inductions in terms of expectation (e.g. "Your eyelids will become heavy"). The problem with this is that "I don't think it's gonna work". Other hypnotists would avoid this issue entirely by saying "I don't care if you think it will work. Follow my instructions, and you will get the results regardless of what you believe" and then say things like "Make your eyelids heavy". The problem with this is that "I don't know how to do that!", which would be avoided by saying "You...

Shmi20

Sorry for the delayed reply... I don't get notifications of replies, and the LW RSS has been broken for me for years now, so I only poke my head here occasionally.

Well that sounds... scary, at best. I hope you've come out of it okay.

50/100. But that rather exciting story is best not told in a public forum.

Though these distinctions are kinda confusing for me.

Well, lack of appearance of something otherwise expected would be negative, and appearance of something otherwise unexpected would be positive?

For example, a false pregnancy is a "positive somatization"... (read more)

Unnamed Road, 1113, София

The ACX/EA/LW Sofia Meetup for June will be on the 29th (Sunday) at 16:00 in the Gradinka na Yogite (in Borisova Gradina Park).

Sofia ACX started with the 2021 Meetups Everywhere round. Attendance hovers around 4-8 people. Everyone worries they're not serious enough about ACX to join, so you should banish that thought and come anyway.  "Please feel free to come even if you feel awkward about it, even if you’re not 'the typical ACX reader', even if you’re worried people won’t like you", even if you didn't come to the previous meetings, even if you don't speak Bulgarian, etc., etc.

Each month we pick something new to read and discuss. In August, we're discussing Against Empathy by Paul Bloom (Chapter 1).

We'll be in the gazebo in "Градинка на Йогите" (picture here https://maps.app.goo.gl/kYhRv6aT4WQPJKBz9). This little garden is part of Borisova Gradina near the tennis courts, roughly between the Television Tower and Levski Stadium. If you think you'll have trouble finding it, email me and I'll arrange for someone to meet you.
Coordinates: https://plus.codes/8GJ5M8GW+P6

See you there.


At Less Online, I ran a well-attended session titled "Religion for Rationalists" to help me work out how I could write a post (this one!) about one of my more controversial beliefs without getting downvoted to hell. Let's see how I do!

My thesis is that most people, including the overwhelmingly atheist and non-religious rationalist crowd, would be better off if they actively participated in an organized religion.

My argument is roughly that religions uniquely provide a source of meaning, community, and life guidance not available elsewhere, and to the extent anything that doesn't consider itself a religion provides these, it's because it's imitating the package of things that makes something a religion. Not participating in a religion is obviously fine, but I think it leaves people missing out...

Is Judaism not also based around disputation of texts?

1lesswronguser123
Then religious people are simply more instrumentally rational than the "Rationalists"; "rationality as winning" is a definition which doesn't restrict itself to the superiority of a group that calls itself "Rationalists".
2Gordon Seidoh Worley
Fair point. This is one of those things that's weird about the modern world. Many of us are no longer part of a religion we grew up with, probably because we didn't like it and actively chose to reject it. And so if we later want to come to religion, it necessarily means "shopping" for one, in the sense that you have to pick one by some criteria. I generally wouldn't endorse someone deciding to become religious because they read this post and now want to optimize their life by becoming religious. I'd instead endorse them being open to seeing if some religious participation is right for them, and finding a group where they are able to participate in a way that feels wholesome.
3lesswronguser123
It depends how you define Hinduism: https://en.wikipedia.org/wiki/Hindu_philosophy. In the broadest sense, people just try to claim everything under it, and it becomes a second word for "culture, but Indian". There are narrower senses of the term.
2Gordon Seidoh Worley
We seem to have different ideas about what the norms of Less Wrong are, and maybe norms for truth seeking more generally. I didn't get into that because it seems I incorrectly assumed we were on the same page there, and so instead focused on my well-being as a decision-relevant fact worth highlighting. I see LW as a place for collaborative truth seeking, emphasis on collaboration. Someone says something wrong, and then we figure out how to say something less wrong, together. I think the best way to do that is with comments that are kind, truthful, useful, and curious, and those are the norms that I, as a high-enough-karma member of this site, have earned the right to enforce on my posts. You violate the above norms in my judgment, particularly the kindness and curiosity parts, and so I have chosen to ban you from my posts. That threads with you are stressful is a manifestation of this judgment. You obviously don't fully violate the norms of wider Less Wrong, and my actions have no effect on your ability to use every part of the site that is not one of my posts. As to why I respond to your comments, if someone posted on something you wrote that your ideas are stupid for obvious reasons, would you ignore it? Maybe you would, but ignoring comes off to many readers as tacit acceptance. When people like your comments, it makes them worth responding to if I disagree, especially on my own posts, in order to engage with not just you, but everyone who reads the comments. To fail to do so would be to leave readers with an incomplete picture of my views. I also genuinely want to figure things out and try to engage with every comment on my posts that I meaningfully can. I'd actually be quite happy if we could somehow work out our differences, find our cruxes, and at least if we are going to agree to disagree understand why that fundamentally is. I tried to do this with you a couple times years ago. It didn't go well. And seeing your most recent comments I could see the
15Said Achmiz
Agreed, except the “emphasis on collaboration” part (which is deeply misguided). The best way to do it is the way that does it best. If a “kind” comment is the best way, then write a “kind” comment. If “kindness” is irrelevant, orthogonal, or even detrimental to efficiency and effectiveness of the process, then omit it. You have been granted that privilege. That is very different from earning a right. That obviously depends on whether the criticism is valid or not. If it’s valid, then naturally I wouldn’t ignore it; I’d acknowledge it as valid. If it’s not valid, then is it obviously invalid? Is that the consensus of other commenters? Do other LW members reply to it in my stead, and/or use the LW voting system to signal their disagreement? If they do, then there’s no need for me to reply. If they do not, then there may be a need for a brief reply. If the criticism is invalid but not obviously so, then a more substantive reply is warranted. If the criticism is valid but I ignore it, then readers would think less of me. They would be right to do so. If my ideas are wrong and stupid, and especially if they are wrong and stupid for obvious reasons, then it is good that comments to that effect may be posted under my posts, and it is good that people should think less of me for ignoring those comments. If your post failed to provide a complete picture of your views, then I am doing you—and, much more importantly, all your other readers—a service by writing my comments, and thus giving you the opportunity to rectify that lacuna. Irrelevant. All of this is irrelevant. However admirable this desire might be, and however understandable might be the failure to fulfill it, it has nothing whatever to do with the question of banning a critic from commenting on your posts, because that is not about you, it is about whether all of your readers, and the LW commentariat, is denied the ability to discuss your ideas without restrictions. And if you want to “work out our di
1Gordon Seidoh Worley
I am not banning you because you are a critic. I am banning you because your comments are frequently unkind and demonstrate a lack of curiosity. This is why I have banned literally no one else, which includes a great many critics. That you are a critic is an unfortunate coincidence that nevertheless taints the specific way in which you violate the norms I am enforcing in the small part of Less Wrong I'm responsible for.

I am not banning you because you are a critic.

Thank heaven for that! But notice that you’re responding to a strawman: I never claimed that you banned me because I am a critic, period. Obviously not; since, as you say, you haven’t banned plenty of other people.

(Although, as I pointed out upthread, you have, in at least one case, threatened to ban another person for their critical comments, after deleting several of their comments. As far as I’m aware, that person—quite unsurprisingly!—hasn’t commented on your posts since. So, no, you don’t get to claim t... (read more)

If they weren't ready to deploy these safeguards and thought that proceeding outweighed the (expected) cost in human lives, they should have publicly acknowledged the level of fatalities and explained why they thought weakening their safety policies and incurring these expected fatalities was net good.[1]

 

Public acknowledgements of the capabilities could be net negative in itself, especially if they resulted in media attention. I expect bringing awareness to the (possible) fact that the AI can assist with CBRN tasks likely increases the chance that pe... (read more)