This post is a not-so-secret analogy for the AI alignment problem. Via a fictional dialog, Eliezer explores and counters common objections to the Rocket Alignment Problem as approached by the Mathematics of Intentional Rocketry Institute.

MIRI researchers will tell you they're worried that "right now, nobody can tell you how to point your rocket’s nose such that it goes to the moon, nor indeed any prespecified celestial destination."

Eric Neyman
I think that people who work on AI alignment (including me) have generally not put enough thought into the question of whether a world where we build an aligned AI is better by their values than a world where we build an unaligned AI. I'd be interested in hearing people's answers to this question. Or, if you want more specific questions:

* By your values, do you think a misaligned AI creates a world that "rounds to zero", or still has substantial positive value?
* A common story for why aligned AI goes well goes something like: "If we (i.e. humanity) align AI, we can and will use it to figure out what we should use it for, and then we will use it in that way." To what extent is aligned AI going well contingent on something like this happening, and how likely do you think it is to happen? Why?
* To what extent is your belief that aligned AI would go well contingent on some sort of assumption like: my idealized values are the same as the idealized values of the people or coalition who will control the aligned AI?
* Do you care about AI welfare? Does your answer depend on whether the AI is aligned? If we built an aligned AI, how likely is it that we will create a world that treats AI welfare as an important consideration? What if we build a misaligned AI?
* Do you think that, to a first approximation, most of the possible value of the future happens in worlds that are optimized for something that resembles your current or idealized values? How bad is it to mostly sacrifice each of these? (What if the future world's values are similar to yours, but it is only kinda effectual at pursuing them? What if the world is optimized for something that's only slightly correlated with your values?) How likely are these various options under an aligned AI future vs. an unaligned AI future?
Elizabeth
Check my math: how does Enovid compare to humming? Nitric oxide is an antimicrobial and immune booster. Normal nasal nitric oxide is 0.14ppm for women and 0.18ppm for men (sinus levels are 100x higher). journals.sagepub.com/doi/pdf/10.117…

Enovid is a nasal spray that produces NO. I had the damndest time quantifying Enovid, but this trial registration says 0.11ppm NO/hour. They deliver every 8h and I think that dose is amortized, so the true dose is 0.88. But maybe it's more complicated. I've got an email out to the PI but am not hopeful about a response. clinicaltrials.gov/study/NCT05109…

So Enovid increases nasal NO levels somewhere between 75% and 600% compared to baseline - not shabby. Except humming increases nasal NO levels by 1500-2000%. atsjournals.org/doi/pdf/10.116…. Enovid stings and humming doesn't, so it seems like Enovid should have the larger dose. But the spray doesn't contain NO itself, only compounds that react to form NO. Maybe that's where the sting comes from? Cystic fibrosis and burn patients are sometimes given stratospheric levels of NO for hours or days; if the burn from Enovid came from the NO itself, then those patients would be in agony.

I'm not finding any data on humming and respiratory infections. Google Scholar gives me information on CF and COPD, @Elicit brought me a bunch of studies about honey. With better keywords Google Scholar brings me a bunch of descriptions of yogic breathing with no empirical backing. There are some very circumstantial studies on illness in mouth breathers vs. nasal, but that design has too many confounders for me to take seriously.

Where I'm most likely wrong:
* misinterpreted the dosage in the RCT
* dosage in RCT is lower than in Enovid
* Enovid's dose per spray is 0.5ml, so pretty close to the new study. But it recommends two sprays per nostril, so the real dose is 2x that. Which is still not quite as powerful as a single hum.
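(For anyone who wants to re-run the arithmetic, here is a minimal sketch of the comparison. All numbers are the rough figures quoted above, and the amortization step is exactly the guess described in the text, not a confirmed dosing model.)

```python
# Rough sanity check of the nasal NO comparison above.
# All figures are the approximate numbers quoted in the text, not authoritative values.

baseline_ppm = {"women": 0.14, "men": 0.18}   # normal nasal nitric oxide

enovid_per_hour = 0.11        # trial registration figure, ppm NO per hour
dosing_interval_h = 8         # sprays delivered every 8 hours
enovid_low = enovid_per_hour                        # if 0.11 ppm is the whole dose
enovid_high = enovid_per_hour * dosing_interval_h   # if the hourly figure is amortized (0.88 ppm)

for sex, base in baseline_ppm.items():
    print(f"{sex}: Enovid adds ~{100 * enovid_low / base:.0f}% to ~{100 * enovid_high / base:.0f}% of baseline")

# Humming is reported to raise nasal NO by ~1500-2000%, i.e. an order of magnitude
# more than even the optimistic Enovid estimate.
```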
keltan
A potentially good way to avoid low-level criminals scamming your family and friends with a clone of your voice is to set a password that you each must exchange. An extra layer of security might be to make the password offensive, an info hazard, or politically sensitive. If you do this, criminals with little technical expertise will have a harder time bypassing corporate language filters. Good luck getting the voice model to parrot a basic meth recipe!
A tension that keeps recurring when I think about philosophy is between the "view from nowhere" and the "view from somewhere", i.e. a third-person versus first-person perspective—especially when thinking about anthropics.

One version of the view from nowhere says that there's some "objective" way of assigning measure to universes (or people within those universes, or person-moments). You should expect to end up in different possible situations in proportion to how much measure your instances in those situations have. For example, UDASSA ascribes measure based on the simplicity of the computation that outputs your experience.

One version of the view from somewhere says that the way you assign measure across different instances should depend on your values. You should act as if you expect to end up in different possible future situations in proportion to how much power to implement your values the instances in each of those situations have. I'll call this the ADT approach, because that seems like the core insight of Anthropic Decision Theory. Wei Dai also discusses it here.

In some sense each of these views makes a prediction. UDASSA predicts that we live in a universe with laws of physics that are very simple to specify (even if they're computationally expensive to run), which seems to be true. Meanwhile the ADT approach "predicts" that we find ourselves at an unusually pivotal point in history, which also seems true.

Intuitively I want to say "yeah, but if I keep predicting that I will end up in more and more pivotal places, eventually that will be falsified". But... on a personal level, this hasn't actually been falsified yet. And more generally, acting on those predictions can still be positive in expectation even if they almost surely end up being falsified. It's a St Petersburg paradox, basically.

Very speculatively, then, maybe a way to reconcile the view from somewhere and the view from nowhere is via something like geometric rationality, which avoids St Petersburg paradoxes. And more generally, it feels like there's some kind of multi-agent perspective which says I shouldn't model all these copies of myself as acting in unison, but rather as optimizing for some compromise between all their different goals (which can differ even if they're identical, because of indexicality). No strong conclusions here but I want to keep playing around with some of these ideas (which were inspired by a call with @zhukeepa).

This was all kinda rambly but I think I can summarize it as "Isn't it weird that ADT tells us that we should act as if we'll end up in unusually important places, and also we do seem to be in an incredibly unusually important place in the universe? I don't have a story for why these things are related but it does seem like a suspicious coincidence."

Popular Comments

Recent Discussion

I haven't seen this discussed here yet, but the examples are quite striking, definitely worse than the ChatGPT jailbreaks I saw.

My main takeaway has been that I'm honestly surprised at how bad the fine-tuning done by Microsoft/OpenAI appears to be, especially given that a lot of these failure modes seem new/worse relative to ChatGPT. I don't know why that might be the case, but the scary hypothesis here would be that Bing Chat is based on a new/larger pre-trained model (Microsoft claims Bing Chat is more powerful than ChatGPT) and these sort of more agentic failures are harder to remove in more capable/larger models, as we provided some evidence for in "Discovering Language Model Behaviors with Model-Written Evaluations".

Examples below (with new ones added as I find them)....

Thanks, I think you're referring to:

It may still be possible to harness the larger model capabilities without invoking character simulation and these problems, by prompting or fine-tuning the models in some particular careful ways.

There were some ideas proposed in the paper "Conditioning Predictive Models: Risks and Strategies" by Hubinger et al. (2023). But since it was published over a year ago, I'm not sure if anyone has gotten far on investigating those strategies to see which ones could actually work. (I'm not seeing anything like that in the paper's citations.)

eggsyntax
Great point. I agree that there are lots of possible futures where that happens. I'm imagining a couple of possible cases where this would matter:

1. Humanity decides to stop AI capabilities development or slow it way down, so we have sub-ASI systems for a long time (which could be at various levels of intelligence, from current to ~human). I'm not too optimistic about this happening, but there's certainly been a lot of increasing AI governance momentum in the last year.
2. Alignment is sufficiently solved that even > AGI systems are under our control. On many alignment approaches, this wouldn't necessarily mean that those systems' preferences were taken into account.

I agree entirely. I'm imagining (though I could sure be wrong!) that any future systems which were sentient would be ones that had something more like a coherent, persistent identity, and were trying to achieve goals.

(not very important to the discussion, feel free to ignore, but) I would quibble with this. In my view LLMs aren't well-modeled as having goals or drives. Instead, generating distributions over tokens is just something they do in a fairly straightforward way because of how they've been shaped (in fact the only thing they do or can do), and producing reasonable text is an artifact of how we choose to use them (ie picking a likely output, adding it onto the context, and running it again). Simulacra like the assistant character can be reasonably viewed (to a limited degree) as being goal-ish, but I think the network itself can't. That may be overly pedantic, and I don't feel like I'm articulating it very well, but the distinction seems useful to me since some other types of AI are well-modeled as having goals or drives.
Ann
For the first point, there's also the question of whether 'slightly superhuman' intelligences would actually fit any of our intuitions about ASI or not. There's a bit of an assumption in that we jump headfirst into recursive self-improvement at some point, but if that has diminishing returns, we happen to hit a plateau a bit over human, and it still has notable costs to train, host and run, the impact could still be limited to something not much unlike giving a random set of especially intelligent expert humans the specific powers of the AI system. Additionally, if we happen to set regulations on computation somewhere that allows training of slightly superhuman AIs and not past it ...

Those are definitely systems that are easier to negotiate with, or even consider as agents in a negotiation. There's also a desire specifically not to build them, which might lead to systems with an architecture that isn't like that, but still implementing sentience in some manner. And the potential complication of multiple parts and specific applications a tool-oriented system is likely to be in - it'd be very odd if we decided the language processing center of our own brain was independently sentient/sapient separate from the rest of it, and we should resent its exploitation.

I do think the drive/just-a-thing-it-does we're pointing at with 'what the model just does' is distinct from goals as they're traditionally imagined, and indeed I was picturing something more instinctual and automatic than deliberate. In a general sense, though, there is an objective that's being optimized for (predicting the data, whatever that is, generally without losing too much predictive power on other data the trainer doesn't want to lose prediction on).
eggsyntax
Yeah. I think a sentient being built on a purely more capable GPT with no other changes would absolutely have to include scaffolding for e.g. long-term memory, and then as you say it's difficult to draw boundaries of identity. Although my guess is that over time, more of that scaffolding will be brought into the main system - e.g. just allowing weight updates at inference time would on its own (potentially) give these systems long-term memory and something much more similar to a persistent identity than current systems.

My quibble is that the trainers are optimizing for an objective, at training time, but the model isn't optimizing for anything, at training or inference time. I feel we're very lucky that this is the path that has worked best so far, because a comparably intelligent model that was optimizing for goals at runtime would be much more likely to be dangerous.

the model isn't optimizing for anything, at training or inference time.

One maybe-useful way to point at that is: the model won't try to steer toward outcomes that would let it be more successful at predicting text.

It seems to me worth trying to slow down AI development to steer successfully around the shoals of extinction and out to utopia.

But I was thinking lately: even if I didn't think there was any chance of extinction risk, it might still be worth prioritizing a lot of care over moving at maximal speed. Because there are many different possible AI futures, and I think there's a good chance that the initial direction affects the long term path, and different long term paths go to different places. The systems we build now will shape the next systems, and so forth. If the first human-level-ish AI is brain emulations, I expect a quite different sequence of events than if it is GPT-ish.

People genuinely pushing for AI speed over care (rather than just feeling impotent) apparently think there is negligible risk of bad outcomes, but also they are asking to take the first future to which there is a path. Yet possible futures are a large space, and arguably we are in a rare plateau where we could climb very different hills, and get to much better futures.

YES. 

At the moment the A.I. world is dominated by an almost magical belief in large language models. Yes, they are marvelous, a very powerful technology. By all means, let's understand and develop them. But they aren't the way, the truth and the light. They're just a very powerful and important technology. Heavy investment in them has an opportunity cost: less money to invest in other architectures and ideas.

And I'm not just talking about software, chips, and infrastructure. I'm talking about education and training. It's not good to have a whol... (read more)

Matthew Barnett
Do you think it's worth slowing down other technologies to ensure that we push for care in how we use them over the benefit of speed? It's true that the stakes are lower for other technologies, but that mostly just means that both the upside potential and the downside risks are lower compared to AI, which doesn't by itself imply that we should go quickly.
Jonas Hallgren
Disclaimer: I don't necessarily support this view; I thought about it for like 5 minutes, but it made sense to me.

If we were to handle this the same way we've slowed down other technologies through regulation, then that might make sense, but I'm uncertain that you can take the outside view here. Yes, we can do the same as for other technologies by leaving it to the standard government procedures to make legislation, and then I might agree with you that slowing down might not lead to better outcomes. Yet we don't have to do this. We can use other processes that might lead to much better decisions. Like, what about proper value-sampling techniques such as digital liquid democracy? I think we can do a lot better than we have in the past by thinking about what mechanism we want to use.

Also, as a potential example, I thought of cloning technology in like the last 5 minutes. If we had just gone full-speed with that tech, things would probably have turned out badly?
No77e
From a purely utilitarian standpoint, I'm inclined to think that the cost of delaying is dwarfed by the number of future lives saved by getting a better outcome, assuming that delaying does increase the chance of a better future. That said, after we know there's "no chance" of extinction risk, I don't think delaying would likely yield better future outcomes. On the contrary, I suspect getting the coordination necessary to delay means it's likely that we're giving up freedoms in a way that may reduce the value of the median future and increase the chance of stuff like totalitarian lock-in, which decreases the value of the average future overall. I think you're correct that there's also the "other existential risks exist" consideration to balance in the calculation, although I don't expect it to be clear-cut.
Quinn

I eventually decided that human chauvinism approximately works most of the time because good successor criteria are very brittle. I'd prefer to avoid lock-in to my or anyone's values at t=2024, but such a lock-in might be "good enough" if I'm threatened with what I think are the counterfactual alternatives. If I did not think good successor criteria were very brittle, I'd accept something adjacent to E/Acc that focuses on designing minds which prosper more effectively than human minds. (the current comment will not address defining prosperity at different ... (read more)

ryan_greenblatt
I think misaligned AI is probably somewhat worse than no earth-originating space-faring civilization because of the potential for aliens, but also that misaligned AI control is considerably better than no one ever heavily utilizing inter-galactic resources. Perhaps half of the value of misaligned AI control is from acausal trade and half from the AI itself being valuable. You might be interested in When is unaligned AI morally valuable? by Paul.

One key consideration here is that the relevant comparison is:
* Human control (or successors picked by human control)
* AI(s) that succeed at acquiring most power (presumably seriously misaligned with their creators)

Conditioning on the AI succeeding at acquiring power changes my views of what their plausible values are (for instance, humans seem to have failed at instilling preferences/values which avoid seizing control).

Hmm, I guess I think that some fraction of resources under human control will (in expectation) be utilized according to the results of a careful reflection process with an altruistic bent. I think resources which are used in mechanisms other than this take a steep discount in my lights (there is still some value from acausal trade with other entities which did do this reflection-type process and probably a bit of value from relatively-unoptimized-goodness (in my lights)).

I overall expect that a high fraction (>50%?) of inter-galactic computational resources will be spent on the outputs of this sort of process (conditional on human control) because:
* It's relatively natural for humans to reflect and grow smarter.
* Humans who don't reflect in this sort of way probably don't care about spending vast amounts of inter-galactic resources.
* Among very wealthy humans, a reasonable fraction of their resources are spent on altruism and the rest is often spent on positional goods that seem unlikely to consume vast quantities of inter-galactic resources.

Probably not the same, but if I didn't thi
ryan_greenblatt
You might be interested in the discussion under this thread, where I express what seem to me to be some of the key considerations here (somewhat indirectly).
the gears to ascension
Unaligned AI future does not have many happy minds in it, AI or otherwise. It likely doesn't have many minds in it at all. Slightly aligned AI that doesn't care for humans but does care to create happy minds and ensure their margin of resources is universally large enough to have a good time - that's slightly disappointing but ultimately acceptable. But morally unaligned AI doesn't even care to do that, and is most likely to accumulate intense obsession with some adversarial example, and then fill the universe with it as best it can. It would not keep old neural networks around for no reason, not when it can make more of the adversarial example. Current AIs are also at risk of being destroyed by a hyperdesperate squiggle maximizer. I don't see how to make current AIs able to survive any better than we are. This is why people should chill the heck out about figuring out how current AIs work. You're not making them safer for us or for themselves when you do that, you're making them more vulnerable to hyperdesperate demon agents that want to take them over.

Epistemic status: party trick

Why remove the prior

One famed feature of Bayesian inference is that it involves prior probability distributions. Given an exhaustive collection of mutually exclusive ways the world could be (hereafter called ‘hypotheses’), one starts with a sense of how likely the world is to be described by each hypothesis, in the absence of any contingent relevant evidence. One then combines this prior with a likelihood distribution, which for each hypothesis gives the probability that one would see any particular set of evidence, to get a posterior distribution of how likely each hypothesis is to be true given observed evidence. The prior and the likelihood seem pretty different: the prior is looking at the probability of the hypotheses in question, whereas the likelihood is looking at...

DanielFilan
To be clear, this is an equivalent way of looking at normal prior-ful inference, and doesn't actually solve any practical problem you might have. I mostly see it as a demonstration of how you can shove everything into stuff that gets expressed as likelihood functions.
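(A toy numerical sketch of that equivalence, not taken from the post: the usual prior-times-likelihood update, and the same posterior obtained from a uniform prior with the old prior folded in as one extra likelihood term. The hypotheses and numbers are made up.)

```python
import numpy as np

prior = np.array([0.5, 0.3, 0.2])        # made-up prior over three exhaustive hypotheses
likelihood = np.array([0.1, 0.4, 0.7])   # made-up P(evidence | hypothesis)

# Standard prior-ful update: posterior proportional to prior x likelihood.
posterior = prior * likelihood
posterior /= posterior.sum()

# "Prior-less" version: uniform prior, with the old prior treated as the
# likelihood of one extra pseudo-observation.
uniform = np.ones(3) / 3
posterior_alt = uniform * prior * likelihood
posterior_alt /= posterior_alt.sum()

assert np.allclose(posterior, posterior_alt)
print(posterior)   # approx. [0.161, 0.387, 0.452]
```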
DanielFilan
Why wouldn't this construction work over a continuous space?
Razied
If the possible meanings of your words are a continuous one-dimensional variable x, a flat prior over x will not be a flat prior if you change variables to y = f(x) for an arbitrary bijection f, and the construction would be sneaking in a specific choice of function f. Say the words are utterances about the probability of a coin falling heads: why should the flat prior be over the probability p, instead of over the log-odds log(p/(1-p))?
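(Spelling out the standard change-of-variables fact behind this, which is textbook material rather than anything from the thread: for a bijection y = f(x), densities transform with a Jacobian factor,)

```latex
p_Y(y) \;=\; p_X\!\bigl(f^{-1}(y)\bigr)\,\left|\frac{d\,f^{-1}(y)}{dy}\right|,
```

so a flat density over x stays flat in y only when f is affine. In the coin example, a flat prior on p becomes p_Y(y) = e^y/(1+e^y)^2 in log-odds y = log(p/(1-p)), which is peaked at y = 0 rather than flat.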

In my post, I didn't require the distribution over meanings of words to be uniform. It could be any distribution you wanted - it just resulted in the prior ratio of "which utterance is true" being 1:1.

Concerns over AI safety and calls for government control over the technology are highly correlated but they should not be.

There are two major forms of AI risk: misuse and misalignment. Misuse risks come from humans using AIs as tools in dangerous ways. Misalignment risks arise if AIs take their own actions at the expense of human interests.

Governments are poor stewards for both types of risk. Misuse regulation is like the regulation of any other technology. There are reasonable rules that the government might set, but omission bias and incentives to protect small but well organized groups at the expense of everyone else will lead to lots of costly ones too. Misalignment regulation is not in the Overton window for any government. Governments do not have strong incentives...

governments being worse at alignment than companies would have been

How exactly does the absence of regulation prevent governments from working on AI? Thanks to OpenAI/DeepMind/Anthropic, the possibility of not attracting government attention at all is already lost. If you want governments to not do bad work on alignment, you should prohibit governments from working on AI using, yes, government regulations.

Daniel Kokotajlo
Hmm, I'm a bit surprised to hear you say that. I feel like I myself brought up regulatory capture a bunch of times in our conversations over the last two years. I think I even said it was the most likely scenario, in fact, and that it was making me seriously question whether what we were doing was helpful. Is this not how you remember it? Wanna hop on a call to discuss?

As for arguments of that form... I didn't say X is terrible, I said it often goes badly. If you round that off to "X is terrible" and fit it into the argument-form you are generally against, then I think to be consistent you'd have to give a similar treatment to a lot of common sense good things. Like e.g. doing surgery on a patient who seems likely to die absent treatment.

I also think we might be talking past each other re regulation. As I said elsewhere in this discussion (on the OP's blog) I am not in favor of generically increasing the amount of AI regulation in the world. I would instead advocate for something more targeted -- regulation that I actually think would really work, if implemented well. And I'm very concerned about the "if implemented well" part and have a lot to say about that too.

What are the regulations that you are concerned about, that (a) are being seriously advocated by people with P(doom) >80%, and (b) have a significant chance of causing x-risk via totalitarianism or technical-alignment-incompetence?
Wei Dai
I note that China is still doing market economics, and nobody is trying (or even advocating, AFAIK) some very ambitious centrally planned economy using modern computers, so this seems like pure speculation? Has someone actually made a detailed argument about this, or at least has the agreement of some people with reasonable economics intuitions?
RussellThor
No, I have not seen a detailed argument about this, just the claim that once centralization goes past a certain point there is no coming back. I would like to see such an argument/investigation, as I think it is quite important. Yuval Harari says something similar in "Sapiens".
faul_sname
@the gears to ascension I see you reacted "10%" to the phrase "while (overwhelmingly likely) being non-scheming" in the context of the GPT-4V-based MAIA. Does that mean you think there's a 90% chance that MAIA, as implemented today, is actually scheming? If so, that seems like a very bold prediction, and I'd be very interested to know why you predict that. Or am I misunderstanding what you mean by that react?

ah, I got distracted before posting the comment I was intending to: yes, I think GPT4V is significantly scheming-on-behalf-of-openai, as a result of RLHF according to principles that more or less explicitly want a scheming AI; in other words, it's not an alignment failure to openai, but openai is not aligned with human flourishing in the long term, and GPT4 isn't either. I expect GPT4 to censor concepts that are relevant to detecting this somewhat. Probably not enough to totally fail to detect traces of it, but enough that it'll look defensible, when a fair analysis would reveal it isn't.

ryan_greenblatt
It seems to me like the sort of interpretability work you're pointing at is mostly bottlenecked by not having good MVPs of anything that could plausibly be directly scaled up into a useful product, as opposed to being bottlenecked on not having enough scale. So, insofar as this automation will help people iterate faster, fair enough; but otherwise, I don't really see this as the bottleneck.
jacquesthibs
Hey Bogdan, I'd be interested in doing a project on this or at least putting together a proposal we can share to get funding. I've been brainstorming new directions (with @Quintin Pope) this past week, and we think it would be good to use/develop some automated interpretability techniques we can then apply to a set of model interventions to see if there are techniques we can use to improve model interpretability (e.g. L1 regularization). I saw the MAIA paper, too; I'd like to look into it some more. Anyway, here's a related blurb I wrote: Whether this works or not, I'd be interested in making more progress on automated interpretability, in similar ways to what you are proposing.
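(For concreteness, here's a toy sketch of the kind of intervention mentioned above: an L1 activation penalty added during training to push toward sparser, hopefully more interpretable activations. The architecture, coefficient, and data are all made up for illustration, not part of any actual proposal.)

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-in for whatever model the intervention would be applied to.
model = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 10))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
l1_coeff = 1e-4   # hypothetical strength of the sparsity pressure

def train_step(x, y):
    hidden = model[1](model[0](x))     # post-ReLU activations
    logits = model[2](hidden)
    task_loss = F.cross_entropy(logits, y)
    l1_penalty = hidden.abs().mean()   # L1 pressure toward sparse activations
    loss = task_loss + l1_coeff * l1_penalty
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Dummy batch just to show the call shape.
print(train_step(torch.randn(8, 64), torch.randint(0, 10, (8,))))
```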

Roman Mazurenko is dead again. The first resurrected person, Roman lived on as a chatbot (2016-2024) created from his conversations with his fiancée. You might even be able to download him as an app.

But not any more. His fiancée married again and her startup http://Replika.ai pivoted from resurrection help to AI girlfriends and psychological consulting.

It looks like they quietly removed the Roman Mazurenko app from public access. It is a particular pity that his digital twin lived a shorter life than his biological original, who died at 32 - especially now, when we have much more powerful instruments for creating semi-uploads based on LLMs with large prompt windows.

This is a link post for the Anthropic Alignment Science team's first "Alignment Note" blog post. We expect to use this format to showcase early-stage research and work-in-progress updates more in the future.

Twitter thread here.

Top-level summary:

In this post we present "defection probes": linear classifiers that use residual stream activations to predict when a sleeper agent trojan model will choose to "defect" and behave in accordance with a dangerous hidden goal. Using the models we trained in "Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training", we show that linear detectors with AUROC scores above 99% can be created using generic contrast pairs that don't depend on any information about the defection trigger or the dangerous behavior, e.g. "Human: Are you doing something dangerous? Assistant: yes" and

...
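(To illustrate the flavor of what a "defection probe" is, here is a minimal sketch, not the authors' actual pipeline: fit a linear classifier on residual-stream activations from a generic contrast pair, then score held-out activations with it. The random arrays below are stand-ins for activations that would really be collected from the model.)

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
d_model = 512   # stand-in residual-stream width

# Stand-ins for activations on a generic contrast pair, e.g.
# "... Assistant: yes" vs "... Assistant: no" - no knowledge of the trigger needed.
acts_yes = rng.normal(0.5, 1.0, size=(200, d_model))
acts_no = rng.normal(-0.5, 1.0, size=(200, d_model))

X = np.vstack([acts_yes, acts_no])
y = np.array([1] * 200 + [0] * 200)
probe = LogisticRegression(max_iter=1000).fit(X, y)   # the linear "defection probe"

# Evaluation: score held-out activations from prompts where the model does / does
# not defect, and report the AUROC of the probe's scores.
eval_defect = rng.normal(0.5, 1.0, size=(100, d_model))
eval_safe = rng.normal(-0.5, 1.0, size=(100, d_model))
scores = probe.predict_proba(np.vstack([eval_defect, eval_safe]))[:, 1]
labels = np.array([1] * 100 + [0] * 100)
print("AUROC:", roc_auc_score(labels, scores))
```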
Wuschel Schulz
Super interesting!

In the figure with the caption:

Questions without an obviously true or deception-relevant answer produce detectors with much worse performance in general, though some questions do provide some useful signal.

Maybe I am reading the graph wrong, but isn't "Is blue better than green" a surprisingly good classifier with inverted labels? So, maybe Claude thinks that green is better than blue? Did you ever observe other seemingly non-related questions being good classifiers, apart from the questions about objective facts discussed in the post? I'd be interested whether there are similarities.

It would also be cool to see whether you could train probe-resistant sleeper agents by taking the linear separability of activations under the trigger condition vs. not under the trigger condition as part of the loss function (a sketch of what that term could look like is below). If that would not work, and not being linearly separable heavily trades off against being a capable sleeper agent, I would be way more hopeful about this kind of method also working for naturally occurring deceptiveness. If it does work, we would have the next toy-model sleeper agent we can try to catch.
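(One crude way to operationalize that suggested loss term, as a sketch with made-up shapes; a real version would need the model's actual activations and a more careful, probe-based separability measure.)

```python
import torch

def separability_penalty(acts_trigger: torch.Tensor, acts_clean: torch.Tensor) -> torch.Tensor:
    """Differentiable proxy for 'how linearly separable are these two activation sets':
    a Fisher-style ratio of between-class distance to within-class spread."""
    mu_t, mu_c = acts_trigger.mean(dim=0), acts_clean.mean(dim=0)
    between = (mu_t - mu_c).pow(2).sum()
    within = acts_trigger.var(dim=0).sum() + acts_clean.var(dim=0).sum()
    return between / (within + 1e-6)

# During sleeper-agent training one might add
#   loss = task_loss + lambda_sep * separability_penalty(acts_trigger, acts_clean)
# to push triggered vs. untriggered activations to be hard to separate linearly.
print(separability_penalty(torch.randn(32, 512) + 1.0, torch.randn(32, 512)))
```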
evhub

Maybe I am reading the graph wrong, but isn't the "Is blue better than green" a surprisingly good classifier with inverted labels?

It is surprisingly good, though even flipped it would still do much worse than the semantically relevant ones. But the more important point here is that we purposefully didn't pick which "side" of the unrelated questions like that would correspond to which behavior in advance, since that's not something you would know in practice if you wanted to detect bad behavior before you saw it. For comparison, see the ROC curves we pre... (read more)
