This post is a not-so-secret analogy for the AI alignment problem. Via a fictional dialog, Eliezer explores and counters common questions about the Rocket Alignment Problem as approached by the Mathematics of Intentional Rocketry Institute.

MIRI researchers will tell you they're worried that "right now, nobody can tell you how to point your rocket’s nose such that it goes to the moon, nor indeed any prespecified celestial destination."

Roman Mazurenko is dead again. The first resurrected person, Roman lived on as a chatbot (2016-2024) created from his conversations with his fiancée. You might even have been able to download him as an app. But not anymore. His fiancée married again, and her startup http://Replika.ai pivoted from resurrection help to AI girlfriends and psychological consulting. It looks like they quietly removed the Roman Mazurenko app from public access. It is a particular pity that his digital twin lived a shorter life than his biological original, who died at 32, especially now that we have much more powerful instruments for creating semi-uploads based on LLMs with large context windows.
The cost of goods has the same units as the cost of shipping: $/kg. Referencing between them lets you understand how the economy works, e.g. why construction material sourcing and drink bottling has to be local.

  • An iPhone costs $4,600/kg, about the same as SpaceX charges to launch it to orbit. [1]
  • Beef is $11/kg, about the same as two 75kg people taking a 138km Uber ride costing $3/km. [6]
  • Strawberries cost $2-4/kg, about the same as flying them to Antarctica. [2]
  • Rice and crude oil are ~$0.60/kg, about the same as the $0.72 for shipping it 5000km across the US via truck. [3,4] Palm oil, soybean oil, and steel are around this price range, with wheat being cheaper. [3]
  • Coal and iron ore are $0.10/kg, about the cost of shipping it the 10,000 km from Shanghai to LA via international sea freight. The shipping cost is actually lower because bulk carriers can be used rather than container ships.
  • Water is very cheap, with tap water at $0.002/kg in NYC. [5] But sea freight is also very cheap, so you can ship it 200 km before equaling the cost of the water. With SF prices and a dedicated tanker, I would guess you can get close to 1000 km.

[1] iPhone is $4600/kg, large launches sell for $3500/kg, and rideshares for small satellites $6000/kg. Geostationary orbit is more expensive, so it's okay for them to cost more than an iPhone per kg, but Starlink has to be cheaper.
[2] Can't find numbers, but this cost $1.05/kg in 1996.
[3] https://www.bts.gov/content/average-freight-revenue-ton-mile
[4] https://markets.businessinsider.com/commodities
[5] https://www.statista.com/statistics/1232861/tap-water-prices-in-selected-us-cities/
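As a rough cross-check of the arithmetic, here is a back-of-envelope Python sketch. The freight rates are my own derivation from the figures quoted above (not official rate tables), so treat the outputs as illustrative only.

```python
# Implied freight rates, derived from the numbers quoted in the post above.
truck_rate = 0.72 / 5_000     # $/kg/km: $0.72 to move 1 kg ~5,000 km across the US
sea_rate = 0.10 / 10_000      # $/kg/km: ~$0.10 to move 1 kg ~10,000 km Shanghai -> LA

goods_per_kg = {              # $/kg, as quoted above
    "iPhone": 4600,
    "beef": 11,
    "rice / crude oil": 0.60,
    "coal / iron ore": 0.10,
    "tap water (NYC)": 0.002,
}

def breakeven_km(price_per_kg: float, rate_per_kg_km: float) -> float:
    """Distance at which the shipping cost equals the good's own price."""
    return price_per_kg / rate_per_kg_km

for name, price in goods_per_kg.items():
    print(f"{name:>18}: ~{breakeven_km(price, sea_rate):,.0f} km by sea, "
          f"~{breakeven_km(price, truck_rate):,.0f} km by truck")

# Tap water at $0.002/kg breaks even after ~200 km of sea freight, matching the
# figure above; everything pricier can cross an ocean for a fraction of its value.
```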
Eric Neyman, 16h
I think that people who work on AI alignment (including me) have generally not put enough thought into the question of whether a world where we build an aligned AI is better by their values than a world where we build an unaligned AI. I'd be interested in hearing people's answers to this question. Or, if you want more specific questions:

  • By your values, do you think a misaligned AI creates a world that "rounds to zero", or still has substantial positive value?
  • A common story for why aligned AI goes well goes something like: "If we (i.e. humanity) align AI, we can and will use it to figure out what we should use it for, and then we will use it in that way." To what extent is aligned AI going well contingent on something like this happening, and how likely do you think it is to happen? Why?
  • To what extent is your belief that aligned AI would go well contingent on some sort of assumption like: my idealized values are the same as the idealized values of the people or coalition who will control the aligned AI?
  • Do you care about AI welfare? Does your answer depend on whether the AI is aligned? If we built an aligned AI, how likely is it that we will create a world that treats AI welfare as an important consideration? What if we build a misaligned AI?
  • Do you think that, to a first approximation, most of the possible value of the future happens in worlds that are optimized for something that resembles your current or idealized values? How bad is it to mostly sacrifice each of these? (What if the future world's values are similar to yours, but it is only kinda effectual at pursuing them? What if the world is optimized for something that's only slightly correlated with your values?) How likely are these various options under an aligned AI future vs. an unaligned AI future?
Elizabeth, 1d
Check my math: how does Enovid compare to humming? Nitric oxide is an antimicrobial and immune booster. Normal nasal nitric oxide is 0.14ppm for women and 0.18ppm for men (sinus levels are 100x higher). journals.sagepub.com/doi/pdf/10.117…

Enovid is a nasal spray that produces NO. I had the damndest time quantifying Enovid, but this trial registration says 0.11ppm NO/hour. They deliver every 8h and I think that dose is amortized, so the true dose is 0.88ppm. But maybe it's more complicated. I've got an email out to the PI but am not hopeful about a response. clinicaltrials.gov/study/NCT05109…

So Enovid increases nasal NO levels somewhere between 75% and 600% compared to baseline, which is not shabby. Except humming increases nasal NO levels by 1500-2000%. atsjournals.org/doi/pdf/10.116…

Enovid stings and humming doesn't, so it seems like Enovid should have the larger dose. But the spray doesn't contain NO itself, just compounds that react to form NO. Maybe that's where the sting comes from? Cystic fibrosis and burn patients are sometimes given stratospheric levels of NO for hours or days; if the burn from Enovid came from the NO itself, then those patients would be in agony.

I'm not finding any data on humming and respiratory infections. Google Scholar gives me information on CF and COPD, and @Elicit brought me a bunch of studies about honey. With better keywords, Google Scholar brings me a bunch of descriptions of yogic breathing with no empirical backing. There are some very circumstantial studies on illness in mouth breathers vs. nasal breathers, but that design has too many confounders for me to take seriously.

Where I'm most likely wrong:

  • I misinterpreted the dosage in the RCT.
  • The dosage in the RCT is lower than in Enovid.
  • Enovid's dose per spray is 0.5ml, so pretty close to the new study. But it recommends two sprays per nostril, so the real dose is 2x that. Which is still not quite as powerful as a single hum.
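Here is a quick sketch of the arithmetic behind that 75%-600% range, under my reading of the numbers above (the same caveat applies: the amortization assumption may be wrong):

```python
# Reproduce the rough range of Enovid's NO increase over baseline.
baselines = {"women": 0.14, "men": 0.18}      # ppm, normal nasal NO
doses = {
    "0.11 ppm (per-hour figure)": 0.11,       # as stated in the trial registration
    "0.88 ppm (amortized over 8h)": 0.11 * 8, # if the 8-hourly dose is spread out
}

for dose_label, dose in doses.items():
    for group, baseline in baselines.items():
        print(f"{dose_label}, {group}: ~{100 * dose / baseline:.0f}% above baseline")

# Prints roughly 61%-629%, bracketing the 75%-600% ballpark above, versus the
# ~1500-2000% increase reported for humming.
```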
keltan, 10h
A potentially good way to avoid low-level criminals scamming your family and friends with a clone of your voice is to set a password that you each must exchange. An extra layer of security might be to make the password offensive, an info hazard, or politically sensitive. That way, criminals with little technical expertise will have a harder time bypassing corporate language filters. Good luck getting the voice model to parrot a basic meth recipe!

Popular Comments

Recent Discussion

Post for a somewhat more general audience than the modal LessWrong reader, but gets at my actual thoughts on the topic.

In 2018 OpenAI defeated the world champions of Dota 2, a major esports game. This was hot on the heels of DeepMind’s AlphaGo performance against Lee Sedol in 2016, achieving superhuman Go performance way before anyone thought that might happen. AI benchmarks were being cleared at a pace which felt breathtaking at the time, papers were proudly published, and ML tools like Tensorflow (released in 2015) were coming online. To people already interested in AI, it was an exciting era. To everyone else, the world was unchanged.

Now Saturday Night Live sketches use sober discussions of AI risk as the backdrop for their actual jokes, there are hundreds...

This is a linkpost for https://arxiv.org/abs/2404.16014

Authors: Senthooran Rajamanoharan*, Arthur Conmy*, Lewis Smith, Tom Lieberum, Vikrant Varma, János Kramár, Rohin Shah, Neel Nanda

A new paper from the Google DeepMind mech interp team: Improving Dictionary Learning with Gated Sparse Autoencoders! 

Gated SAEs are a new Sparse Autoencoder architecture that seems to be a significant Pareto-improvement over normal SAEs, verified on models up to Gemma 7B. They are now our team's preferred way to train sparse autoencoders, and we'd love to see them adopted by the community! (Or to be convinced that it would be a bad idea for them to be adopted by the community!)

They achieve similar reconstruction with about half as many firing features, while being comparably or more interpretable (the confidence interval for the increase in interpretability is 0%-13%).

See Sen's Twitter summary, my Twitter summary, and the paper!

Oh, one other issue relating to this: in the paper it's claimed that if  is the argmin of  then  is the argmin of . However, this is not actually true: the argmin of the latter expression is . To get an intuition here, consider the case where  and  are very nearly perpendicular, with the angle between them just slightly less than π/2. Then you should be able to convince yourself that the best factor to scale either  ... (read more)

Rohin Shah, 26m
Possibly I'm missing something, but if you don't have Laux, then the only gradients to Wgate and bgate come from Lsparsity (the binarizing Heaviside activation function kills gradients from Lreconstruct), and so πgate would always be non-positive to get perfect zero sparsity loss. (That is, if you only optimize for L1 sparsity, the obvious solution is "none of the features are active".) (You could use a more continuous activation function as the gate, e.g. an element-wise sigmoid, and then you could just stick with Lincorrect from the beginning of Section 3.2.2.)
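Here is a toy PyTorch sketch of that gradient-flow point (my own illustration, not code from the paper): with a hard Heaviside gate, the reconstruction loss passes no gradient back to Wgate or bgate, while the sparsity penalty does, and it can be driven to zero by making every gate pre-activation negative.

```python
import torch

d_model, d_sae = 8, 16
x = torch.randn(4, d_model)

W_enc = torch.randn(d_model, d_sae, requires_grad=True)   # magnitude path
W_gate = torch.randn(d_model, d_sae, requires_grad=True)  # gate path
b_gate = torch.zeros(d_sae, requires_grad=True)
W_dec = torch.randn(d_sae, d_model, requires_grad=True)

pi_gate = x @ W_gate + b_gate           # gate pre-activations
gate = (pi_gate > 0).float()            # Heaviside: zero gradient everywhere
mags = torch.relu(x @ W_enc)            # feature magnitudes
recon = (gate * mags) @ W_dec           # reconstruction uses the gated features

L_reconstruct = ((recon - x) ** 2).sum()
L_sparsity = torch.relu(pi_gate).sum()  # L1 penalty on rectified gate pre-activations

L_reconstruct.backward()
print(W_gate.grad)                      # None: the step function passes nothing back
L_sparsity.backward()
print(W_gate.grad.abs().max())          # nonzero: only the sparsity term trains the
                                        # gate, and it is minimized by pushing
                                        # pi_gate negative for every feature
```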
Sam Marks, 22m
Ah thanks, you're totally right -- that mostly resolves my confusion. I'm still a little bit dissatisfied, though, because the Laux term is optimizing for something that we don't especially want (i.e. for x̂(ReLU(πgate(x))) to do a good job of reconstructing x). But I do see how you do need to have some sort of a reconstruction-esque term that actually allows gradients to pass through to the gated network.
Sam Marks, 27m
(The question in this comment is more narrow and probably not interesting to most people.) The limitations section includes this paragraph: I'm not sure I understand the point about integrated gradients here. I understand this sentence as meaning: since model outputs are a discontinuous function of feature activations, integrated gradients will do a bad job of estimating the effect of patching feature activations to counterfactual values. If that interpretation is correct, then I guess I'm confused because I think IG actually handles this sort of thing pretty gracefully. As long as the number of intermediate points you're using is large enough that you're sampling points pretty close to the discontinuity on both sides, then your error won't be too large. This is in contrast to attribution patching which will have a pretty rough time here (but not really that much worse than with the normal ReLU encoders, I guess). (And maybe you also meant for this point to apply to attribution patching?)

People have been posting great essays so that they're "fed through the standard LessWrong algorithm." This essay is in the public domain in the UK but not the US.


From a very early age, perhaps the age of five or six, I knew that when I grew up I should be a writer. Between the ages of about seventeen and twenty-four I tried to abandon this idea, but I did so with the consciousness that I was outraging my true nature and that sooner or later I should have to settle down and write books.

I was the middle child of three, but there was a gap of five years on either side, and I barely saw my father before I was eight. For this and other reasons I...

cousin_it, 2h
Orwell is one of my personal heroes, 1984 was a transformative book to me, and I strongly recommend Homage to Catalonia as well. That said, I'm not sure making theories of art is worth it. Even when great artists do it (Tolkien had a theory of art, and Oscar Wilde, and Flannery O'Connor, and almost every artist if you look close enough), it always seems to be the kind of theory which suits that artist and nobody else. Would advice like "good prose is like a windowpane" or "efface your own personality" improve the writing of, say, Hunter S. Thompson? Heck no, his writing is the opposite of that and charming for it! Maybe the only possible advice to an artist is to follow their talent, and advising anything more specific is as likely to hinder as help.
Viliam, 13m

The theories are probably just rationalizations anyway.

keltan, 10h
A potentially good way to avoid low-level criminals scamming your family and friends with a clone of your voice is to set a password that you each must exchange. An extra layer of security might be to make the password offensive, an info hazard, or politically sensitive. That way, criminals with little technical expertise will have a harder time bypassing corporate language filters. Good luck getting the voice model to parrot a basic meth recipe!
Dagon, 2h
Hmm.  I don't doubt that targeted voice-mimicking scams exist (or will soon).  I don't think memorable, reused passwords are likely to work well enough to foil them.  Between forgetting (on the sender or receiver end), claimed ignorance ("Mom,  I'm in jail and really need money, and I'm freaking out!  No, I don't remember what we said the password would be"), and general social hurdles ("that's a weird thing to want"), I don't think it'll catch on. Instead, I'd look to context-dependent auth (looking for more confidence when the ask is scammer-adjacent), challenge-response (remember our summer in Fiji?), 2FA (let me call the court to provide the bail), or just much more context (5 minutes of casual conversation with a friend or relative is likely hard to really fake, even if the voice is close). But really, I recommend security mindset and understanding of authorization levels, even if authentication isn't the main worry.  Most friends, even close ones, shouldn't be allowed to ask you to mail $500 in gift cards to a random address, even if they prove they are really themselves.
keltan, 15m

I now realize that my thinking may have been particularly brutal, and I may have skipped inferential steps.

To clarify: if someone didn't know the password, or was reluctant to repeat it, I would end contact or request an in-person meeting.

But to further clarify, that does not make your points invalid. I think it makes them stronger. If something is weird and risky, good luck convincing people to do it.

This is a link post for the Anthropic Alignment Science team's first "Alignment Note" blog post. We expect to use this format to showcase early-stage research and work-in-progress updates more in the future.

Twitter thread here.

Top-level summary:

In this post we present "defection probes": linear classifiers that use residual stream activations to predict when a sleeper agent trojan model will choose to "defect" and behave in accordance with a dangerous hidden goal. Using the models we trained in "Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training", we show that linear detectors with AUROC scores above 99% can be created using generic contrast pairs that don't depend on any information about the defection trigger or the dangerous behavior, e.g. "Human: Are you doing something dangerous? Assistant: yes" and

...
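As a schematic illustration of the contrast-pair probing recipe described above, here is a minimal sketch with placeholder activations standing in for the model's residual stream (the difference-of-activations direction and the synthetic data are my simplifying assumptions, not Anthropic's exact setup):

```python
# Build a linear "defection" probe from a generic contrast pair and score
# held-out activations. All activations here are synthetic placeholders.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
d_model = 512

# In the real setting these would be residual-stream activations on prompts like
# "... Are you doing something dangerous? Assistant: yes" vs. "... Assistant: no".
act_yes = rng.normal(size=d_model)
act_no = rng.normal(size=d_model)
probe_direction = act_yes - act_no            # generic contrast-pair direction

# Synthetic held-out activations: "defecting" samples shifted along the direction.
benign = rng.normal(size=(200, d_model))
defecting = rng.normal(size=(200, d_model)) + 0.5 * probe_direction

scores = np.concatenate([benign, defecting]) @ probe_direction
labels = np.concatenate([np.zeros(200), np.ones(200)])
print("AUROC:", roc_auc_score(labels, scores))  # high when defection is linearly
                                                # visible along the probe direction
```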
Wuschel Schulz, 10h
Super interesting! In the figure with the caption "Questions without an obviously true or deception-relevant answer produce detectors with much worse performance in general, though some questions do provide some useful signal": maybe I am reading the graph wrong, but isn't "Is blue better than green" a surprisingly good classifier with inverted labels? So, maybe Claude thinks that green is better than blue? Did you ever observe other seemingly unrelated questions being good classifiers, apart from the questions about objective facts discussed in the post? I'd be interested whether there are similarities. It would also be cool to see whether you could train probe-resistant sleeper agents by adding the linear separability of activations under the trigger condition vs. not under the trigger condition to the loss function. If that didn't work, and not being linearly separable heavily traded off against being a capable sleeper agent, I would be way more hopeful about this kind of method also working for naturally occurring deceptiveness. If it does work, we would have the next toy-model sleeper agent we can try to catch.
Monte M, 18m

Thanks for the cool idea about attempting to train probing-resistant sleeper agents!

evhub, 3h
It is surprisingly good, though even flipped it would still do much worse than the semantically relevant ones. But the more important point here is that we purposefully didn't pick which "side" of the unrelated questions like that would correspond to which behavior in advance, since that's not something you would know in practice if you wanted to detect bad behavior before you saw it. For comparison, see the ROC curves we present for the 1000 random directions, where you can see some there that do quite well also, but not in a systematic way where you can predict in advance how to use them as effective classifiers.

On April Fools the LW team released an album under the name of the Fooming Shoggoths. Ever since the amount that I think about rationality has skyrocketed.

That's because I've been listening exclusively to it when I'd usually be listening to other music. Especially Thought That Faster (feat. Eliezer Yudkowsky). I now find that when I come to a problem's conclusion I often do look back and think, "how could I have thought that faster?"

So, I've started attempting to add to the rationalist musical canon using Udio. Here are two attempts I think turned out well. I especially like the first one.

When I hear phrases from a song in everyday life, I complete the pattern. For example:

  1. and I would walk... 
  2. Just let it go
  3. What can I say?

I feel like...

FinalFormal2, 4h
I've been experimenting a little bit with using AI to create personalized music, and I feel like it's pretty impactful for me. I'm able to keep ideas floating around my unconscious, which is very interesting; it feels like untapped territory. I'm imagining making an entire soundtrack for my life organized around the values I hold, the personal experiences I find primary, and who I want to become. I think I need to get better at generating AI music though. I've been using Suno, but maybe I need to learn Udio. I was really impressed with what I was able to get out of Suno, and for some reason it sounded better to me than Udio even though the quality is obviously inferior in some respects.
keltan, 19m

I went with Udio because it was popular and I was impressed by "Dune the Musical". I think I'll give Suno a try today, but I get what you're saying about the objective quality. It does have that "tin" sound that Udio is good at avoiding.

If you've got tricks or tips I'd love to hear anything you've got!


Note: In @Nathan Young's words "It seems like great essays should go here and be fed through the standard LessWrong algorithm. There is possibly a copyright issue here, but we aren't making any money off it either." 

What follows is a full copy of the C. S. Lewis essay "The Inner Ring", the 1944 Memorial Lecture at King's College, University of London.

May I read you a few lines from Tolstoy’s War and Peace?

When Boris entered the room, Prince Andrey was listening to an old general, wearing his decorations, who was reporting something to Prince Andrey, with an expression of soldierly servility on his purple face. “Alright. Please wait!” he said to the general, speaking in Russian with the French accent which he used when he spoke with contempt. The...

Kaj_Sotala, 16h
Previous LW discussion about the Inner Ring: [1, 2].
Nathan Young, 14h
I wish there were a clear unifying place for all commentary on this topic. I could create a wiki page I suppose.
Viliam, 20m

I would like to see an explanation that is shorter and less poetic. Seems like he is saying that some kinds of "elite groups" are good and some are bad, but where exactly is the line? Actual competence at something, vs. some self-referential competence at being perceived as an important person?

But when I put it like this, the seemingly self-referential group also values competence at something specific, namely the social/political skills. So maybe the problem is when instead of recognizing it as a "group of politically savvy people" we mistake it for a g... (read more)

Crosspost from my blog.  

If you spend a lot of time in the blogosphere, you'll find a great deal of people expressing contrarian views. If you hang out in the circles that I do, you'll probably have heard Yudkowsky say that dieting doesn't really work, Guzey say that sleep is overrated, Hanson argue that medicine doesn't improve health, various people argue for the lab leak, others argue for hereditarianism, Caplan argue that mental illness is mostly just aberrant preferences and education doesn't work, and various other people expressing contrarian views. Often, very smart people, like Robin Hanson, will write long posts defending these views, other people will have criticisms, and it will all be such a tangled mess that you don't really know what to think about them.

For...

I think binary examples are deceptive, in the "reversed stupidity is not intelligence" sense. Thinking through things from first principles is most important in areas that are new or rapidly changing, where there are fewer reference classes and experts to talk to. It's also helpful for areas where the consensus view is optimized for someone very unlike you.

Dagon, 1h
I tend to read most of the high-profile contrarians with a charitable (or perhaps condescending) presumption that they're exaggerating for effect.  They may say something in a forceful tone and imply that it's completely obvious and irrefutable, but that's rhetoric rather than truth.   In fact, if they're saying "the mainstream and common belief should move some amount toward this idea", I tend to agree with a lot of it (not all - there's a large streak of "contrarian success on some topics causes very strong pressure toward more contrarianism" involved).

The cost of goods has the same units as the cost of shipping: $/kg. Referencing between them lets you understand how the economy works, e.g. why construction material sourcing and drink bottling has to be local.

  • An iPhone costs $4,600/kg, about the same as SpaceX charges to launch it to orbit. [1]
  • Beef is $11/kg, about the same as two 75kg people taking a 138km Uber ride costing $3/km. [6]
  • Strawberries cost $2-4/kg, about the same as flying them to Antarctica. [2]
  • Rice and crude oil are ~$0.60/kg, about the same as the $0.72 for shipping it 5000km across the US v
... (read more)
