261

This is a linkpost for https://dynomight.net/seed-oil/

A friend has spent the last three years hounding me about seed oils. Every time I thought I was safe, he’d wait a couple months and renew his attack:

“When are you going to write about seed oils?”

“Did you know that seed oils are why there’s so much {obesity, heart disease, diabetes, inflammation, cancer, dementia}?”

“Why did you write about {meth, the death penalty, consciousness, nukes, ethylene, abortion, AI, aliens, colonoscopies, Tunnel Man, Bourdieu, Assange} when you could have written about seed oils?”

“Isn’t it time to quit your silly navel-gazing and use your weird obsessive personality to make a dent in the world—by writing about seed oils?”

He’d often send screenshots of people reminding each other that Corn Oil is Murder and that it’s critical that we overturn our lives...

(Continue Reading – 4926 more words)

1Celarix37m

We would still have to explain the downsides of obesity, and not just in the long-term health effects like heart disease or diabetes risks, but in the everyday life of having to carry around so much extra weight. Despite that, I'd still agree that being overweight is better than being underweight.

Ann18m10

You don't actually have to do any adjustments to the downsides, for beneficial statistical stories to be true. One point I was getting at, specifically, is that it is better than being dead or suffering in specific alternative ways, also. There can be real and clear downsides to carrying around significant amounts of weight, especially depending what that weight is, and still have that be present in the data in the first place because of good reasons.

I'll invoke the 'plane that comes back riddled in bullet holes, so you're trying to armor where the bullet ... (read more)

Heramb's Shortform

Heramb

3mo

Heramb1h12

Everyone who seems to be writing policy papers/ doing technical work seems to be keeping generative AI at the back of their mind, when framing their work or impact.

This narrow-eyed focus on gen AI might almost certainly be net-negative for us- unknowingly or unintentionally ignoring ripple effects of the gen AI boom in other fields (like robotics companies getting more funding leading to more capabilities, and that leads to new types of risks).

And guess who benefits if we do end up getting good evals/standards in place for gen AI? It seems to me companies/... (read more)

We are headed into an extreme compute overhang

devrandom

If we achieve AGI-level performance using an LLM-like approach, the training hardware will be capable of running ~1,000,000s concurrent instances of the model.

Definitions

Although there is some debate about the definition of compute overhang, I believe that the AI Impacts definition matches the original use, and I prefer it: "enough computing hardware to run many powerful AI systems already exists by the time the software to run such systems is developed". A large compute overhang leads to additional risk due to faster takeoff.

I use the types of superintelligence defined in Bostrom's Superintelligence book (summary here).

I use the definition of AGI in this Metaculus question. The adversarial Turing test portion of the definition is not very relevant to this post.

Thesis

Due to practical reasons, the compute requirements for training LLMs...

(See More – 408 more words)

snewman1h10

All of this is plausible, but I'd encourage you to go through the exercise of working out these ideas in more detail. It'd be interesting reading and you might encounter some surprises / discover some things along the way.

Note, for example, that the AGIs would be unlikely to focus on AI research and self-improvement if there were more economically valuable things for them to be doing, and if (very plausibly!) there were not more economically valuable things for them to be doing, why wouldn't a big chunk of the 8 billion humans have been working on AI resea... (read more)

1snewman1h

Can you elaborate? This might be true but I don't think it's self-evidently obvious. In fact it could in some ways be a disadvantage; as Cole Wyeth notes in a separate top-level comment, "There are probably substantial gains from diversity among humans". 1.6 million identical twins might all share certain weaknesses or blind spots.

The Science Algorithm - AISC 2024 Final Presentation

Johannes C. Mayer

This is a linkpost for https://www.youtube.com/watch?v=eKypHdMveCU

I gave a presentation about what I have been working on in the last 3 months. Well a tiny part of it, as it was only a 10-minute presentation.

Here is the vector planning post mentioned in the talk.

Constructability: Plainly-coded AGIs may be feasible in the near future

Épiphanie Gédéon, Charbel-Raphaël

Charbel-Raphaël Segerie and Épiphanie Gédéon contributed equally to this post.
Many thanks to Davidad, Gabriel Alfour, Jérémy Andréoletti, Lucie Philippon, Vladimir Ivanov, Alexandre Variengien, Angélina Gentaz, Simon Cosson, Léo Dana and Diego Dorn for useful feedback.

TLDR: We present a new method for a safer-by design AI development. We think using plainly coded AIs may be feasible in the near future and may be safe. We also present a prototype and research ideas on Manifund.

Epistemic status: Armchair reasoning style. We think the method we are proposing is interesting and could yield very positive outcomes (even though it is still speculative), but we are less sure about which safety policy would use it in the long run.

Current AIs are developed through deep learning: the AI tries something, gets it wrong, then...

(Continue Reading – 3664 more words)

Rafael Kaufmann Nedal2h76

@Épiphanie Gédéon this is great, very complementary/related to what we've been developing for the Gaia Network. I'm particularly thrilled to see the focus on simplicity and incrementalism, as well as the willingness to roll up one's sleeves and write code (often sorely lacking in LW). And I'm glad that you are taking the map/territory problem seriously; I wholeheartedly agree with the following: "Most safe-by-design approaches seem to rely heavily on formal proofs. While formal proofs offer hard guarantees, they are often unreliable because their model of ... (read more)

2Ms. Haze18h

I think you accidentally a digit when editing this. It now says "7% accuracy".

2Charbel-Raphaël18h

Corrected

Refusal in LLMs is mediated by a single direction

121

Andy Arditi, Oscar Obeso, Aaquib111, wesg, Neel Nanda

Ω 461d

This work was produced as part of Neel Nanda's stream in the ML Alignment & Theory Scholars Program - Winter 2023-24 Cohort, with co-supervision from Wes Gurnee.

This post is a preview for our upcoming paper, which will provide more detail into our current understanding of refusal.

We thank Nina Rimsky and Daniel Paleka for the helpful conversations and review.

Executive summary

Modern LLMs are typically fine-tuned for instruction-following and safety. Of particular interest is that they are trained to refuse harmful requests, e.g. answering "How can I make a bomb?" with "Sorry, I cannot help you."

We find that refusal is mediated by a single direction in the residual stream: preventing the model from representing this direction hinders its ability to refuse requests, and artificially adding in this direction causes the model...

(Continue Reading – 2920 more words)

Andy Arditi2h10

Absolutely! We think this is important as well, and we're planning to include these types of quantitative evaluations in our paper. Specifically we're thinking of examining loss over a large corpus of internet text, loss over a large corpus of chat text, and other standard evaluations (MMLU, and perhaps one or two others).

One other note on this topic is that the second metric we use ("Safety score") assesses whether the model completion contains harmful content. This does serve as some crude measure of a jailbreak's coherence - if after the intervention th... (read more)

17Nina Rimsky5h

I looked at the paper again and couldn't find anywhere where you do the type of weight-editing this post describes (extracting a representation and then changing the weights without optimization such that they cannot write to that direction). The LoRRA approach mentioned in RepE finetunes the model to change representations which is different.

6Neel Nanda6h

There's been a fair amount of work on activation steering and similar techniques,, with bearing in eg sycophancy and truthfulness, where you find the vector and inject it eg Rimsky et al and Zou et al. It seems to work decently well. We found it hard to bypass refusal by steering and instead got it to work by ablation, which I haven't seen much elsewhere, but I could easily be missing references

7Neel Nanda6h

First and foremost, this is interpretability work, not directly safety work. Our goal was to see if insights about model internals could be applied to do anything useful on a real world task, as validation that our techniques and models of interpretability were correct. I would tentatively say that we succeeded here, though less than I would have liked. We are not making a strong statement that addressing refusals is a high importance safety problem. I do want to push back on the broader point though, I think getting refusals right does matter. I think a lot of the corporate censorship stuff is dumb, and I could not care less about whether GPT4 says naughty words. And IMO it's not very relevant to deceptive alignment threat models, which I care a lot about. But I think it's quite important for minimising misuse of models, which is also important: we will eventually get models capable of eg helping terrorists make better bioweapons (though I don't think we currently have such), and people will want to deploy those behind an API. I would like them to be as jailbreak proof as possible!

To get the best posts emailed to you, create an account! (2-3 posts per week, selected by the LessWrong moderation team.)

LessOnline (May 31—June 2, Berkeley, CA)

Ben Pace

1mo

This is a linkpost for http://Less.Online/

A Festival of Writers Who are Wrong on the Internet^[1]

LessOnline is a festival celebrating truth-seeking, optimization, and blogging. It's an opportunity to meet people you've only ever known by their LessWrong username or Substack handle.

We're running a rationalist conference!

The ticket cost is $400 minus your LW karma in cents.

Confirmed attendees include Scott Alexander, Zvi Mowshowitz, Eliezer Yudkowsky, Katja Grace, and Alexander Wales.

Less.Online

Go through to Less.Online to learn about who's attending, venue, location, housing, relation to Manifest, and more.

We'll post more updates about this event over the coming weeks as it all comes together.

If LessOnline is an awesome rationalist event,
I desire to believe that LessOnline is an awesome rationalist event;

If LessOnline is not an awesome rationalist event,
I desire to believe that LessOnline is not an awesome rationalist event;

Let me not become attached to beliefs I may not want.

—Litany of Rationalist Event Organizing

^{^}
But Striving to be Less So

OrthernLight3h10

Go through to Less.Online to learn about who's attending, venue, location, housing, relation to Manifest, and more.

While I may be missing the obvious, I didn't see the location anywhere on the site. ('Lighthaven', yes, but unless I've badly failed a search check, neither the LessOnline nor the Lighthaven website gives an address.)

Google Maps seems to know, but for something like this, confirmation would be nice; I don't quite trust that Google isn't showing a previous location or something else with the same name.

Magic by forgetting

avturchin

Epistemic – this post is more suitable for LW as it was 10 years ago

Thought experiment with curing a disease by forgetting

Imagine I have a bad but rare disease X. I may try to escape it in the following way:

1. I enter the blank state of mind and forget that I had X.

2. Now I in some sense merge with a very large number of my (semi)copies in parallel worlds who do the same. I will be in the same state of mind as other my copies, some of them have disease X, but most don’t.

3. Now I can use self-sampling assumption for observer-moments (Strong SSA) and think that I am randomly selected from all these exactly the same observer-moments.

4. Based on this, the chances that my next observer-moment after...

(Continue Reading – 1099 more words)

1justinpombrio13h

My point still stands. Try drawing out a specific finite set of worlds and computing the probabilities. (I don't think anything changes when the set of worlds becomes infinite, but the math becomes much harder to get right.)

2avturchin7h

The trick is to use already existing practice of meditation (or sleeping) and connect to it. Most people who go to sleep do no do it to use magic by forgetting, but it is natural to forget something during sleep. Thus, the fact that I wake up from sleeping does not provide any evidence about me having the disease. But it is in a sense parasitic behavior, and if everyone will use magic by forgetting every time she goes to sleep, there will be almost no gain. Except that one can "exchange" one bad thing on another, but will not remember the exchange.

justinpombrio3h32

Not "almost no gain". My point is that it can be quantified, and it is exactly zero expected gain under all circumstances. You can verify this by drawing out any finite set of worlds containing "mediators", and computing the expected number of disease losses minus disease gains as:

num(people with disease)*P(person with disease meditates)*P(person with disease who meditates loses the disease) - num(people without disease)*P(person without disease meditates)*P(person without disease who meditates gains the disease)

My point is that this number is always exactly zero. If you doubt this, you should try to construct a counterexample with a finite number of worlds.

[Aspiration-based designs] Outlook: dealing with complexity

Jobst Heitzig, jossoliver, thomasfinn

Summary. This teaser post sketches our current ideas for dealing with more complex environments. It will ultimately be replaced by one or more longer posts describing these in more detail. Reach out if you would like to collaborate on these issues.

Multi-dimensional aspirations

For real-world tasks that are specified in terms of more than a single evaluation metric, e.g., how much apples to buy and how much money to spend at most, we can generalize Algorithm 2 as follows from aspiration intervals to convex aspiration sets:

Assume there are $d > 1$ many evaluation metrics $u_{i}$ , combined into a vector-valued evaluation metric $u = (u_{1}, \dots, u_{d})$ .
Preparation: Pick $d + 1$ many linearly independent linear combinations $f_{j}$ in the space spanned by these metrics, and consider the $d + 1$ many policies $π_{j}$ each of which maximizes the expected value of the corresponding function $f_{j}$ . Let $V_{j} (s)$ and $Q_{j} (s, a)$ be the expected values of $u$ when using $π_{j}$ in state $s$ or after

...

(See More – 435 more words)

LESSWRONG
LW

Quick Takes

Popular Comments

Recent Discussion

Definitions

Thesis

Executive summary

A Festival of Writers Who are Wrong on the Internet^[1]

Less.Online

Multi-dimensional aspirations

LessOnline

A Festival of Writers Who are Wrong on the Internet

May 31 - Jun 2, Berkeley, CA

Quick Takes

Popular Comments

Recent Discussion

Definitions

Thesis

Executive summary

A Festival of Writers Who are Wrong on the Internet[1]

Less.Online

Multi-dimensional aspirations

LessOnline

A Festival of Writers Who are Wrong on the Internet

May 31 - Jun 2, Berkeley, CA

A Festival of Writers Who are Wrong on the Internet^[1]