When it comes to coordinating people around a goal, you don't get limitless communication bandwidth for conveying arbitrarily nuanced messages. Instead, the "amount of words" you get to communicate depends on how many people you're trying to coordinate. Once you have enough people... you don't get many words.

Ask LLMs for feedback on "the" rather than "my" essay/response/code to get more critical feedback. This seems true anecdotally, and prompting GPT-4 to give a score between 1 and 5 for ~100 poems/stories/descriptions yielded an average score of 4.26 when prompted with "Score my ..." versus an average score of 4.0 when prompted with "Score the ..." (code).
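A minimal sketch of the experimental setup, assuming the only variable is the possessive: build both prompt variants and parse a numeric score out of each model reply. The function names (`make_prompt`, `parse_score`) are illustrative, not from the linked code.

```python
import re

def make_prompt(kind: str, text: str, possessive: bool) -> str:
    """Build the scoring prompt; 'my' vs 'the' is the only difference."""
    det = "my" if possessive else "the"
    return f"Score {det} {kind} from 1 to 5. Reply with a single number.\n\n{text}"

def parse_score(reply: str):
    """Pull the first 1-5 digit out of the model's reply, else None."""
    m = re.search(r"[1-5]", reply)
    return int(m.group()) if m else None
```

You would send each prompt variant to the model via whatever API client you use, then compare the mean of `parse_score` over replies for the "my" versus "the" condition.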
New Kelsey Piper article and twitter thread on OpenAI equity & non-disparagement. It has lots of little things that make OpenAI look bad. It further confirms that OpenAI threatened to revoke equity unless employees signed the non-disparagement agreements. It also shows Altman's signature on documents giving the company broad power over employees' equity — perhaps he doesn't read every document he signs, but this one seems quite important.

This is all in tension with Altman's recent tweet that "vested equity is vested equity, full stop" and "i did not know this was happening." Likewise, "we have never clawed back anyone's vested equity, nor will we do that if people do not sign a separation agreement (or don't agree to a non-disparagement agreement)" is misleading, given that they apparently regularly threatened to do so (or something equivalent — letting the employee nominally keep their PPUs but disallowing them from selling them) whenever an employee left.

Great news:

> OpenAI told me that "we are identifying and reaching out to former employees who signed a standard exit agreement to make it clear that OpenAI has not and will not cancel their vested equity and releases them from nondisparagement obligations"

(Unless "employees who signed a standard exit agreement" is doing a lot of work — maybe a substantial number of employees technically signed nonstandard agreements.) I hope to soon hear from various people that they have been freed from their nondisparagement obligations.

Update: OpenAI says:

> As we shared with employees today, we are making important updates to our departure process. We have not and never will take away vested equity, even when people didn't sign the departure documents. We're removing nondisparagement clauses from our standard departure paperwork, and we're releasing former employees from existing nondisparagement obligations unless the nondisparagement provision was mutual. We'll communicate this message to former employees. We're incredibly sorry that we're only changing this language now; it doesn't reflect our values or the company we want to be.

[Low-effort post; might have missed something important.] [Substantively edited after posting.]
Had a dream last night in which I was having a conversation on LessWrong - unfortunately, I can't remember most of the details of my dreams unless I deliberately concentrate on what happened as soon as I wake up, so I don't know what the conversation was about. But I do remember that I realized halfway through the conversation that I had been clicking on the wrong buttons - clicking "upvote" & "downvote" instead of "agree" and "disagree", and vice versa. In my dream, the first and second pairs of buttons looked identical - both of them were just the < and > signs. I suggested to the LW team that they put something to clarify which buttons were which - maybe write the words "upvote", "downvote", "agree", and "disagree" above the buttons. They thought that putting the words there would look really ugly and clutter up the UI too much. But when I woke up, it turned out that the actual site has a checkmark and an X for the second pair of buttons! And it also displays what each one means when you hover over it! So thanks for retroactively solving my problem, LW team!
Anyone here happen to have a round-trip plane ticket from Virginia to Berkeley, CA lying around? I managed to get reduced-price tickets to LessOnline, but I can't reasonably afford to fly there, given my current financial situation. This is a (really) long shot, but thought it might be worth asking lol.
The Scaling Monosemanticity paper doesn't do a good job comparing feature clamping to steering vectors.

> To better understand the benefit of using features, for a few case studies of interest, we obtained linear probes using the same positive / negative examples that we used to identify the feature, by subtracting the residual stream activity in response to the negative example(s) from the activity in response to the positive example(s). We experimented with (1) visualizing the top-activating examples for probe directions, using the same pipeline we use for our features, and (2) using these probe directions for steering.

1. These vectors are not "linear probes" (which are generally optimized via SGD on a logistic regression task for a supervised dataset of yes/no examples); they are differences-in-means of activation vectors.
   1. So call them "steering vectors"!
   2. As a side note, using actual linear probe directions tends to not steer models very well (see e.g. Inference-Time Intervention, table 3 on page 8).
2. In my experience, steering vectors generally require averaging over at least 32 contrast pairs. Anthropic only compares to 1-3 contrast pairs, which is inappropriate.
   1. Since feature clamping needs fewer prompts for some tasks, that is a real benefit, but you have to amortize that benefit over the huge SAE effort needed to find those features.
   2. Also note that you can generate synthetic data for the steering vectors using an LLM; it isn't too hard.
3. For steering on a single task, then, steering vectors still win out in terms of amortized sample complexity (assuming the steering vectors are effective given ~32/128/256 contrast pairs, which I doubt will always be true).

> In all cases, we were unable to interpret the probe directions from their activating examples. In most cases (with a few exceptions) we were unable to adjust the model's behavior in the expected way by adding perturbations along the probe directions, even in cases where feature steering was successful (see this appendix for more details).
>
> ...
>
> We note that these negative results do not imply that linear probes are not useful in general. Rather, they suggest that, in the "few-shot" prompting regime, they are less interpretable and effective for model steering than dictionary learning features.

I totally expect feature clamping to still win out in a bunch of comparisons, it's really cool, but Anthropic's actual comparisons don't seem good and predictably underrate steering vectors. The fact that the Anthropic paper gets the comparison (and especially terminology) meaningfully wrong makes me more wary of their results going forwards.
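The difference-in-means construction being argued over is simple enough to sketch directly. Here `pos_acts` and `neg_acts` stand in for residual-stream activations collected at the same layer and position for each side of the contrast pairs; all names are illustrative, not from either paper's code.

```python
def steering_vector(pos_acts, neg_acts):
    """Difference-in-means steering vector from contrast-pair activations.

    pos_acts, neg_acts: lists of activation vectors (one per prompt),
    each a list of floats of length d_model.
    """
    d = len(pos_acts[0])
    pos_mean = [sum(v[i] for v in pos_acts) / len(pos_acts) for i in range(d)]
    neg_mean = [sum(v[i] for v in neg_acts) / len(neg_acts) for i in range(d)]
    return [p - n for p, n in zip(pos_mean, neg_mean)]

def apply_steering(resid, vec, alpha):
    """Add the scaled steering vector to a residual-stream activation."""
    return [r + alpha * v for r, v in zip(resid, vec)]
```

Averaging over many contrast pairs (the ~32+ recommended above, rather than the 1-3 in Anthropic's comparison) reduces the variance of the mean-difference estimate, which is the crux of the sample-complexity complaint.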

Recent Discussion

If it’s worth saying, but not worth its own post, here's a place to put it.

If you are new to LessWrong, here's the place to introduce yourself. Personal stories, anecdotes, or just general comments on how you found us and what you hope to get from the site and community are invited. This is also the place to discuss feature requests and other ideas you have for the site, if you don't want to write a full top-level post.

If you're new to the community, you can start reading the Highlights from the Sequences, a collection of posts about the core ideas of LessWrong.

If you want to explore the community more, I recommend reading the Library, checking recent Curated posts, seeing if there are any meetups in your area, and checking out the Getting Started section of the LessWrong FAQ. If you want to orient to the content on the site, you can also check out the Concepts section.

The Open Thread tag is here. The Open Thread sequence is here.

Tao Lin
lol Paul is a very non-disparaging person. He always makes his criticism constructive; I don't know if there's any public evidence of him disparaging anyone, regardless of NDAs.

Wow, good point. I've never considered that aspect. 

Also, there's now a second detected human case, this one in Michigan instead of Texas. Both had a surprising-to-me "pinkeye" symptom profile. Weird! The dairy worker in Michigan had various "compartments" tested, and their nasal compartment (and the people they lived with) were all negative. Hopeful? Apparently, and also hopefully, this virus is NOT freakishly good at infecting humans and also weirdly many other animals (like covid was with human ACE2, in precisely the ways people talked about when discussing gain-of-function in the years prior to covid).

If we're being foolishly mechanical in our inferences, "n=2 with 2 survivors" could get rule-of-succession treatment. In that case we pseudocount 1 for each category of interest (hence if n=0 we say 50% survival chance based on nothing but pseudocounts), and now we have 3 survivors (2 real) versus 1 dead (0 real), and guess that at worst the mortality rate here would be maybe 1/4 == 25% (as an ass number), which is pleasantly lower than overall observed base rates for avian flu mortality in humans! :-)

Naive impressions: a natural virus, with pretty clear reservoirs (first birds and now dairy cows), on the maybe slightly less bad side of "potentially killing millions of people"? I haven't heard anything about sequencing yet (hopefully in a BSL4 (or homebrew BSL5, even though official BSL5s don't exist yet), but presumably they might not bother to treat this as super dangerous by default until they verify that it is positively safe), but I also haven't personally looked for sequencing work on this new thing.

When people did very dangerous gain-of-function research with a cousin of this, in ferrets, over 10 years ago (causing a great uproar among some), the supporters argued that it was worth creating especially horrible diseases on purpose in labs in order to see the details, like a bunch of geeks who would Be As Gods And Know Good From Evil... and they confirmed back then that a handful of mutations separated
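The pseudocount arithmetic above is just Laplace's rule of succession, (k+1)/(n+2); a quick sketch makes the 50% and 25% figures explicit (nothing here is specific to this outbreak):

```python
from fractions import Fraction

def rule_of_succession(successes: int, trials: int) -> Fraction:
    """Laplace's rule of succession: one pseudocount for each outcome."""
    return Fraction(successes + 1, trials + 2)

# n=0: with no data at all, the pseudocounts alone give a 1/2 estimate.
no_data = rule_of_succession(0, 0)       # 1/2

# n=2 with 2 survivors: estimated survival 3/4, so mortality 1/4 == 25%.
survival = rule_of_succession(2, 2)      # 3/4
mortality = 1 - survival                 # 1/4
```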
I haven't followed closely - from outside, it seems like pretty standard big-growth-tech behavior. One thing to keep in mind is that "vested equity" is pretty inviolable. These are grants that have been fully earned and delivered to the employee, and are theirs forever. It's the "unvested" or "semi-vested" equity that's usually in question - these are shares that are conditionally promised to employees, which will vest at some specified time or event - usually some combination of time in good standing and liquidity events (for a non-public company). It's quite possible (and VERY common) that employees who leave are offered "accelerated vesting" on some of their granted-but-not-vested shares in exchange for signing agreements and making things easy for the company. I don't know if that's what OpenAI is doing, but I'd be shocked if they somehow took away any vested shares from departing employees. It would be pretty sketchy to consider unvested grants to be part of one's net worth - certainly banks won't lend on it. Vested shares are just shares; they're yours like any other asset.

 I don't know if that's what OpenAI is doing, but I'd be shocked if they somehow took away any vested shares from departing employees.

Consider yourself shocked.

James Payor
So I'm guessing this covers like 2-4 recent departures, and not Paul, Dario, or the others that split earlier.
This could be true for most cases though

In the last year, we’ve seen a surge of interest in AI safety. Many young professionals and aspiring researchers are attempting or seriously considering a career shift into this and related fields. We’ve seen a corresponding rise in months-long technical bootcamps and research programs, but mentorship has failed to keep pace with this rise. This is a staggering gap, and we intend to fill it - starting now. 

Enter the Mentorship in AGI Safety (MAGIS) program, a joint initiative between AI Safety Quest and Sci.STEPS.

The pilot program will recruit mentors from the community and pair them with mentees according to self-reported background and professional goals, including technical experience, career advice, and soft skills. Mentors will meet with mentees 6 times over 3 months to provide guidance tailored to their specific...

tldr: I conducted 17 semi-structured interviews of AI safety experts about their big-picture strategic view of the AI safety landscape: how human-level AI will play out, how things might go wrong, and what the AI safety community should be doing. While many respondents held "traditional" views (e.g. the main threat is misaligned AI takeover), there was more opposition to these standard views than I expected, and the field seems more split on many important questions than someone outside the field might infer.

What do AI safety experts believe about the big picture of AI risk? How might things go wrong, what should we do about it, and how have we done so far? Does everybody in AI safety agree on the fundamentals? Which views are consensus, which...


What do AI safety experts believe about the big picture of AI risk?

I would be careful not to implicitly claim that these 17 people are a "representative sample" of the AI safety community. Or, if you do want to make that claim, I think it's important to say a lot more about how these particular participants were chosen and why you think they are representative.

At first glance, it seems to me like this pool of participants overrepresents some worldviews and under-represents others. For example, it seems like the vast majority of the participants either work fo... (read more)

FWIW one thing that jumps out to me is that it feels like this list comes in two halves each complaining about the other: one that thinks AI safety should be less theoretical, less insular, less extreme, and not advocate pause; and one that thinks that it should be more independent, less connected to leading AGI companies, and more focussed on policy. They aren't strictly opposed (e.g. one could think people overrate pause but underrate policy more broadly), but I would strongly guess that the underlying people making some of these complaints are thinking of the underlying people making others.

(In the following I am talking about "love" towards human beings only, not love of other things (such as music or food or God).)

A pet topic of mine is that the term love is so ambiguous as to be nigh-useless in rational discourse. But whenever I bring up the topic, people tend to dismiss and ignore it. Let us see if Less Wrong will do likewise.

Modern western culture (and maybe also other cultures) is obsessed with the ideal of love. Love is pretty much by definition the best thing in life which everyone should strive for.

The problem is that people don't agree on what love means. 

Everyone will acknowledge that love can mean different things. But my claim is that most people do not truly understand this, even...

Answer by Ustice

“Love” is just a broad category of feelings. In English, if you need to be specific, there are specifiers, but most of the time context is enough. For instance, if I say, “I love my nephew,” you’re probably not thinking that I have romantic feelings towards him, but you might think that his presence makes me happy or that I’d be willing to sacrifice more for his benefit than typical for humans in general.

Are you going to have a perfect model of my feelings? No. You can never be specific enough for that. But you’ll likely be 9/10 right. Usually, that’s good enough.

Answer by RamblinDash
Another aspect of Love that's not really addressed here I tend to think of as a sense of 'being on the same team.' When I relate to people I love, I might help them or do something nice for them for the same reasons that Draymond Green passes the ball to Steph Curry - because when Steph makes a 3, the team's score increases and that's what they are trying to do. Draymond doesn't (or at least shouldn't) hold onto the ball and try to score himself unless he has a better shot (he usually doesn't) - points are points. Whereas when interacting with someone I don't love, I might help them to the extent it advances my own goals, broadly defined (which includes things like 'being well liked', 'getting helped in the future', 'the feeling of doing a good deed').
It's articles like these that make it clear that trying to extend rationalism to every aspect of the human condition is doomed to fail, and not only to fail, but to make anybody who makes the attempt seem like an alien to normal people. People have been talking about the different types of love and what love actually means for thousands of years. The Greeks talked about the difference between Eros and Agape. Today on poly forums, you can see people talking about all the different types of love they have for their partners, throwing around words like "new relationship energy" and "limerence." Well-known biblical quotes like "Greater love hath no man than this, that a man lay down his life for his friends" make it clear that there are different types of love.

Most people are just comfortable with context clues instead of working down a flow diagram to make sure they are using the perfect word in the moment. For example, if someone asks me how I'm feeling and I say "bad," the fact that I have the flu vs. just got a divorce is enough to clue most people in that the words "feel" and "bad" refer to the physical and the emotional respectively. Most people and situations don't need more clarity than that for human relationships to progress.
Answer by Dagon
Most people aren't confused, because they're not trying to be clear and rational. They are definitely confusing to people who prefer specificity and operationally useful descriptions of experiences. "Love" is used to mean a very wide range of positive feelings, and should generally be taken as poetry rather than communication. Another framing is that the ambiguity makes it very useful for signaling affection and value, without being particularly binding in specific promises. Which is NOT to say it isn't "real". Affection, caring, enjoyment, and willingness-to-sacrifice-for someone are all things that many individuals experience for other individuals. The exact thresholds and mix of qualia that gets described as "love" varies widely, but the underlying feelings seem near-universal.

This is the third of three posts summarizing what I learned when I interviewed 17 AI safety experts about their "big picture" of the existential AI risk landscape: how AGI will play out, how things might go wrong, and what the AI safety community should be doing. See here for a list of the participants and the standardized list of questions I asked.

This post summarizes the responses I received from asking “Are there any big mistakes the AI safety community has made in the past or are currently making?”

A rough decomposition of the main themes brought up. The figures omit some less popular themes, and double-count respondents who brought up more than one theme.

Yeah, probably most things people are doing are mistakes. This is just some random group


9 respondents were concerned about an overreliance or overemphasis on certain kinds of theoretical arguments underpinning AI risk

I agree with this, but that "the horsepower of AI is instead coming from oodles of training data" is not a fact that seems relevant to me, except in the sense that this is driving up AI-related chip manufacturing (which, however, wasn't mentioned). The reason I argue it's not otherwise relevant is that the horsepower of ASI will not, primarily, come from oodles of training data. To the contrary, it will come from being able to re... (read more)



01.AI dropped a model on LMSYS that is doing fairly well, briefly overtaking Claude Opus before slipping a bit. Just another reminder that, as we wring our hands about dodgy behavior by OpenAI, these Chinese firms are apparently getting compute (despite our efforts to restrict this) and releasing powerful and competitive models.

Dear Lesser Wrongers and Admirers of Eliezer Yudkowsky's Insightful Work on AI Safety,

I am delighted to share with you my latest (non-AI generated) creative endeavor—a short comic titled "The Button." This piece delves into the intricate and often unsettling alignment problem, presenting it through the sharp and sardonic perspective of a female android. The narrative unfolds as a dark comedy, capturing the android's keen observations of humanity's relentless pursuit of advanced AI, and revealing some unexpected and thought-provoking consequences.

I invite you to read and enjoy the comic here: Substack Link
(Don't be deceived by the so-called "chibi" style at the start.)

As a special treat, the comic includes a brief cameo of Eliezer himself, adding an extra layer of depth and homage to his influential work.

Thank you for your time and consideration. I look forward to hearing your thoughts and reflections.

Warm regards, Milan Rosko

The art is certainly striking. Nice job!

I'm a little unclear on the message you're trying to send. I think the button maybe would be better labeled "shut down"? Your presentation matches Eliezer Yudkowsky's and others' logic about the problem with shutdown buttons.

And the comment that this was inevitable seems pretty dark and not-helpful. Even those of us who think the odds aren't good are looking for ways to improve them. I'd hate to have your beautiful artwork just making people depressed if they could still help prevent this from happening. And I think ... (read more)

LessOnline Festival

May 31st to June 2nd, Berkeley CA