Epistemic – this post is more suitable for LW as it was 10 years ago

Thought experiment with curing a disease by forgetting

Imagine I have a bad but rare disease X. I may try to escape it in the following way:

1. I enter a blank state of mind and forget that I had X.

2. Now I in some sense merge with a very large number of my (semi)copies in parallel worlds who do the same. I will be in the same state of mind as my other copies; some of them have disease X, but most don’t.

3. Now I can use the self-sampling assumption for observer-moments (Strong SSA) and think that I am randomly selected from all these identical observer-moments.

4. Based on this, the chances that my next observer-moment after...


justinpombrio 42m 10

My point still stands. Try drawing out a specific finite set of worlds and computing the probabilities. (I don't think anything changes when the set of worlds becomes infinite, but the math becomes much harder to get right.)
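One way to follow this suggestion is a toy model like the sketch below. Everything in it is an assumption made for illustration: the number of worlds, the number of sick copies, and the helper names `make_worlds` and `p_sick_given_blank` are hypothetical, and it is not the commenter's own calculation. It only shows how the sampled probability depends on which copies enter the blank state.

```python
from fractions import Fraction

def make_worlds(n_worlds, n_sick):
    """Hypothetical finite set of worlds, one copy of 'me' per world.
    The first n_sick copies have disease X."""
    return [{"id": i, "has_x": i < n_sick} for i in range(n_worlds)]

def p_sick_given_blank(worlds, enters_blank):
    """Sample uniformly among the copies that are in the blank state
    (the SSA-over-observer-moments step) and return P(has X)."""
    blank = [w for w in worlds if enters_blank(w)]
    sick = [w for w in blank if w["has_x"]]
    return Fraction(len(sick), len(blank))

worlds = make_worlds(n_worlds=1000, n_sick=10)

# Case A (assumption): every copy enters the blank state, sick or not.
p_all = p_sick_given_blank(worlds, lambda w: True)

# Case B (assumption): only copies who have X bother to forget.
p_only_sick = p_sick_given_blank(worlds, lambda w: w["has_x"])

# Forgetting does not change which worlds contain the disease.
n_sick_after = sum(w["has_x"] for w in worlds)

print(p_all, p_only_sick, n_sick_after)
```

Which case matches the thought experiment is exactly the kind of thing a finite model forces you to state explicitly.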

Voting Theory Introduction

Scott Garrabrant

Sequence Introduction

This is the first post in a sequence in which I will propose a new voting system!

In this post, I introduce the framework and notation, and give some background on voting theory.

In the next post, I will show you the best voting system you've probably never heard of, maximal lotteries. (Seriously, it's really good.)

After that, I will make it even better, and propose a new system: maximal lottery-lotteries.

Then comes the bad news: I can't prove that maximal lottery-lotteries exist! (Or alternatively, good news: You can try to solve a cool new open problem in voting theory!)

Thanks to Jessica Taylor for first introducing me to maximal lotteries, and Sam Eisenstat for spending many hours with me trying to prove the existence of maximal lottery-lotteries.

Generalizing Voting Theory

A voting...


qvalq 2h 10

To get more comfortable with this formalism, we will translate three important voting criteria.

You translated four criteria.

Losing Faith In Contrarianism

omnizoid

Crosspost from my blog.

If you spend a lot of time in the blogosphere, you’ll find a great many people expressing contrarian views. If you hang out in the circles that I do, you’ll probably have heard Yudkowsky say that dieting doesn’t really work, Guzey say that sleep is overrated, Hanson argue that medicine doesn’t improve health, various people argue for the lab leak, others argue for hereditarianism, Caplan argue that mental illness is mostly just aberrant preferences and that education doesn’t work, and various other people express contrarian views. Often, very smart people, like Robin Hanson, will write long posts defending these views, other people will have criticisms, and it will all be such a tangled mess that you don’t really know what to think about them.

For...


2 ChristianKl 3h

Instead of thinking about how you can divide a discussion into two sides, you can also focus on "what's actually true". In that case, it would make sense to end with an estimate of the size of the real gap. If we, however, look at "what people argue", https://www1.udel.edu/educ/gottfredson/30years/Rushton-Jensen30years.pdf assumes the two categories culture-only (0% genetic–100% environmental) and hereditarian (50% genetic–50% environmental). Jay M defines the environmental model as <33% genetic and the genetic model as >66% genetic. What Rushton called the hereditarian position is right in the middle between Jay's environmental and genetic models.

2 Viliam 5h

Thanks for the link. While it didn't convince me completely, it makes a good point that as long as there are some environmental factors for IQ (such as malnutrition), we should not make strong claims about genetic differences between groups unless we have controlled for these factors. (I suppose the conclusion that the genetic differences between races are real, but also entirely caused by factors such as nutrition, would succeed in making both sides angry. And yet, as far as I know, it might be true. Uhm... what is the typical Ashkenazi diet?)

Said Achmiz 2h 20

Uhm… what is the typical Ashkenazi diet?

It’s delicious, is what it is.

2 Matthew Barnett 7h

The statement I was replying to was: "I’d bet at upwards of 9 to 1 odds that Hanson is wrong about it." If one is incorrect about what Hanson believes about medicine, then that fact is relevant to whether you should make such a bet (or more generally whether you should have such a strong belief about him being "wrong"). This is independent of whatever message people received from reading Hanson.

Mercy to the Machine: Thoughts & Rights

False Name

12h

Abstract: First [1)], a suggested general method of determining, for AI operating under the human feedback reinforcement learning (HFRL) model, whether the AI is “thinking”; an elucidation of latent knowledge that is separate from a recapitulation of its training data. With independent concepts or cognitions, then, an early observation that AI or AGI may have a self-concept. Second [2)], by cited instances, whether LLMs have already exhibited independent (and de facto alignment-breaking) concepts or behavior; further observations of possible self-concepts exhibited by AI. Also [3)], whether AI has already broken alignment by forming its own “morality” implicit in its meta-prompts. Finally [4)], that if AI have self-concepts, and more, demonstrate aversive behavior to stimuli, that they deserve rights at least to be free of exposure to what is...


watermark 2h 10

i'm glad that you wrote about AI sentience (i don't see it talked about so often with very much depth), that it was effortful, and that you cared enough to write about it at all. i wish that kind of care was omnipresent and i'd strive to care better in that kind of direction.

and i also think continuing to write about it is very important. depending on how you look at things, we're in a world of 'art' at the moment - emergent models of superhuman novelty generation and combinatorial re-building. art moves culture, and culture curates humanity on aggregate s...

3 the gears to ascension 11h

You express intense frustration with your previous posts not getting the reception you intend. Your criticisms may be in significant part valid. I looked back at your previous posts; I think I still find them hard to read and mostly disagree, but I do appreciate you posting some of them, so I've upvoted. I don't think some of them were helpful. If you think it's worth the time, I can go back and annotate in more detail which parts I don't think are correct reasoning steps. But I wonder if that's really what you need right now? Expressing distress at being rejected here is useful, and I would hope you don't need to hurt yourself over it. If your posts aren't able to make enough of a difference to save us from catastrophe, I'd hope you could survive until the dice are fully cast. Please don't forfeit the game; if things go well, it would be a lot easier to not need to reconstruct you from memories and ask if you'd like to be revived from the damaged parts. If your life is spent waiting and hoping, that's better than if you're gone. And I don't think you should give up on your contributions being helpful yet. Though I do think you should step back and realize you're not the only one trying, and it might be okay even if you can't fix everything. Idk. I hope you're ok physically, and have a better day tomorrow than you did today.

3 the gears to ascension 11h

Hold up. I'm not sure what feedback to give about your post overall. I am impressed by it a significant way in, but then I get lost in what appear to be carefully-thought-through reasoning steps, and I'm not sure what to think after that point.

Playing Northboro with Lily and Rick

jefftk

This afternoon Lily, Rick, and I ("Dandelion") played our first dance together, which was also Lily's first dance. She's sat in with Kingfisher for a set or two many times, but this was her first time being booked and playing (almost) the whole time.

Lily started playing fiddle in Fall 2022, and after about a year she had enough tunes up to dance speed that I was thinking she'd be ready to play a low-stakes dance together soon. Not right away, but given how far out dances booked it seemed about time to start writing to some folks: by the time we were actually playing the dance she'd have even more tunes and be more solid on her existing ones. She was very excited about this idea; very motivated by performing.

I wrote to a few dances, and...


Refusal in LLMs is mediated by a single direction

Andy Arditi, Oscar Obeso, Aaquib111, wesg, Neel Nanda

Ω 36 17h

This work was produced as part of Neel Nanda's stream in the ML Alignment & Theory Scholars Program - Winter 2023-24 Cohort, with co-supervision from Wes Gurnee.

This post is a preview for our upcoming paper, which will provide more detail on our current understanding of refusal.

We thank Nina Rimsky and Daniel Paleka for the helpful conversations and review.

Executive summary

Modern LLMs are typically fine-tuned for instruction-following and safety. Of particular interest is that they are trained to refuse harmful requests, e.g. answering "How can I make a bomb?" with "Sorry, I cannot help you."

We find that refusal is mediated by a single direction in the residual stream: preventing the model from representing this direction hinders its ability to refuse requests, and artificially adding in this direction causes the model...
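The two interventions named above can be sketched on a single toy activation vector. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes "preventing the model from representing this direction" means projecting the direction out of the activation, and "adding in this direction" means adding a scaled unit vector. The vectors `x` and `r`, the scale `alpha`, and the helper names are all hypothetical.

```python
import math

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def unit(v):
    """Return v scaled to unit length."""
    norm = math.sqrt(dot(v, v))
    return [p / norm for p in v]

def ablate_direction(x, r):
    """Remove the component of activation x along direction r:
    x' = x - (x . r_hat) r_hat."""
    r_hat = unit(r)
    coeff = dot(x, r_hat)
    return [xi - coeff * ri for xi, ri in zip(x, r_hat)]

def add_direction(x, r, alpha):
    """Push activation x along direction r by alpha:
    x' = x + alpha * r_hat."""
    r_hat = unit(r)
    return [xi + alpha * ri for xi, ri in zip(x, r_hat)]

# Toy stand-ins: x for one residual-stream activation, r for the direction.
x = [0.5, -1.0, 2.0, 0.25]
r = [1.0, 0.0, -1.0, 0.0]

ablated = ablate_direction(x, r)          # no component left along r
steered = add_direction(x, r, alpha=3.0)  # component along r raised by 3.0
```

After ablation the activation is orthogonal to `r`, so anything downstream that reads only that component sees zero; after addition the component is shifted by exactly `alpha`.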


2 jbash 4h

I notice that there are not-insane views that might say both of the "harmless" instruction examples are as genuinely bad as the instructions people have actually chosen to try to make models refuse. I'm not sure whether to view that as buying in to the standard framing, or as a jab at it. Given that they explicitly say they're "fun" examples, I think I'm leaning toward "jab".

13 Nina Rimsky 6h

FWIW I published this Alignment Forum post on activation steering to bypass refusal (albeit an early variant that reduces coherence too much to be useful), which from what I can tell is the earliest work on linear residual-stream perturbations to modulate refusal in RLHF LLMs.

I think this post is novel compared to both my work and RepE because they:

* Demonstrate full ablation of the refusal behavior with much less effect on coherence / other capabilities compared to normal steering
* Investigate projection thoroughly as an alternative to sweeping over vector magnitudes (rather than just stating that this is possible)
* Find that using harmful/harmless instructions (rather than harmful vs. harmless/refusal responses) to generate a contrast vector is the most effective (whereas other works try one or the other), and also investigate the token position at which to extract the representation
* Find that projecting away the (same, linear) feature at all layers improves upon steering at a single layer, which is different from standard activation steering
* Test on many different models
* Describe a way of turning this into a weight-edit

Edit: (Want to flag that I strong-disagree-voted with your comment, and am not in the research group; it is not them "dogpiling".) I do agree that RepE should be included in a "related work" section of a paper, but generally people should be free to post research updates on LW/AF that don't have a complete, thorough lit review / related work section. There are really very many activation-steering-esque papers/blogposts now, including refusal-bypassing-related ones, that all came out around the same time.

Dan H 3h Ω 120

is novel compared to... RepE

This is inaccurate, and I suggest reading our paper: https://arxiv.org/abs/2310.01405

Demonstrate full ablation of the refusal behavior with much less effect on coherence

In our paper and notebook we show the models are coherent.

Investigate projection

We did investigate projection too (we use it for concept removal in the RepE paper) but didn't find a substantial benefit for jailbreaking.

harmful/harmless instructions

We use harmful/harmless instructions.

Find that projecting away the (same, linear) feature at all lay

...

3 Dan H 4h

I agree if they simultaneously agree that they don't expect the post to be cited. These can't posture themselves as academic artifacts ("Citing this work" indicates that's the expectation) and fail to mention related work. I don't think you should expect people to treat it as related work if you don't cover related work yourself. Otherwise there's a race to the bottom and it makes sense to post daily research notes and flag plant that way. This increases pressure on researchers further. The prior work that is covered in the document is generally less related (fine-tuning removal of safeguards, truth directions) compared to these directly relevant ones. This is an unusual citation pattern and gives the impression that the artifact is making more progress/advancing understanding than it actually is. I'll note pretty much every time I mention something isn't following academic standards on LW I get ganged up on and I find it pretty weird. I've reviewed, organized, and can be senior area chair at ML conferences and know the standards well. Perhaps this response is consistent because it feels like an outside community imposing things on LW.


A list of core AI safety problems and how I hope to solve them

161

davidad

Ω 55 8mo

Context: I sometimes find myself referring back to this tweet and wanted to give it a more permanent home. While I'm at it, I thought I would try to give a concise summary of how each distinct problem would be solved by Safeguarded AI (formerly known as an Open Agency Architecture, or OAA), if it turns out to be feasible.

1. Value is fragile and hard to specify.

See: Specification gaming examples, Defining and Characterizing Reward Hacking^[1]

OAA Solution:

1.1. First, instead of trying to specify "value", "de-pessimize" and specify the absence of a catastrophe, and maybe a handful of bounded constructive tasks like supplying clean water. A de-pessimizing OAA would effectively buy humanity some time, and freedom to experiment with less risk, for tackling the CEV-style alignment problem, which is...


ThomasCederborg 3h 20

There is a serious issue with your proposed solution to problem 13. Using a random dictator policy as a negotiation baseline is not suitable for a situation where billions of humans are negotiating about the actions of a clever and powerful AI. One problem with using this solution, in this context, is that some people have strong commitments to moral imperatives along the lines of "heretics deserve eternal torture in hell". The combination of these types of sentiments, and a powerful and clever AI (that would be very good at thinking up effective wa...

D&D.Sci Long War: Defender of Data-mocracy

aphyer

This is an entry in the 'Dungeons & Data Science' series, a set of puzzles where players are given a dataset to analyze and an objective to pursue using information from that dataset.

STORY (skippable)

You have the excellent fortune to live under the governance of The People's Glorious Free Democratic Republic of Earth, giving you a Glorious life of Freedom and Democracy.

Sadly, your cherished values of Democracy and Freedom are under attack by...THE ALIEN MENACE!

The typical reaction of an Alien Menace to hearing about Freedom and Democracy. (Generated using OpenArt SDXL).

Faced with the desperate need to defend Freedom and Democracy from The Alien Menace, The People's Glorious Free Democratic Republic of Earth has been forced to redirect most of its resources into the Glorious Free People's Democratic War...


2 abstractapplic 4h

Could you elaborate on this? I think I'd do better relative to best play with and do better relative to random play with so it's not clear which way I should lean; also, I don't know how you plan to quantify "relative to".

4 aphyer 3h

I'm likely not to actually quantify 'relative to' - there might be an ordered list of players if it seems reasonable to me (for example, if one submission uses 10 soldiers to get a 50% winrate and one uses 2 soldiers to get a 49% winrate, I would feel comfortable ranking the second ahead of the first - or if all players decide to submit the same number of soldiers, the rankings will be directly comparable), but more likely I'll just have a chart as in your Boojumologist scenario: with one line added for 'optimal play' (above or equal to all players) and one for 'random play' (hopefully below all players). Overall, I don't think there's much optimization of the leaderboard/plot available to you - if you find yourself faced with a tough choice between an X% winrate with 9 soldiers or a Y% winrate with 8 soldiers, I don't anticipate the leaderboard taking a position on which of those is 'better'.

abstractapplic 3h 20

That makes sense, ty.

2 abstractapplic 4h

What we're facing: Relevant Weapons: Current strategies per number of soldiers: If I have to pick one strategy:

Propagating Facts into Aesthetics

115

Raemon

Epistemic status: Tentative. I’ve been practicing this on-and-off for a year and it’s seemed valuable, but it’s the sort of thing I might look back on and say “hmm, that wasn’t really the right frame to approach it from.”

In doublecrux, the focus is on “what observations would change my mind?”

In some cases this is (relatively) straightforward. If you believe minimum wage helps workers, or harms them, there are some fairly obvious experiments you might run. “Which places have instituted minimum wage laws? What happened to wages? What happened to unemployment? What happened to worker migration?”

The details will matter a lot. The results of the experiment might be weird and confusing. If I ran the experiment myself I’d probably get a lot of things wrong, misuse statistics and...


Jiao Bu 3h 10

Are you familiar at all with the works of Christopher Alexander? He spent about 50 years exploring the objectivity of aesthetics in Architecture (and was highly influential across several fields, including software design). His book "The Timeless Way of Building" is available as an Audiobook and is approachable. It is also the closest thing I have ever read to the teachings of my Tantric Teachers in India.

Basically, the book is about a "Pattern Language" by which beautiful things happen. The hard part though is getting people to be ...

