This post is a not a so secret analogy for the AI Alignment problem. Via a fictional dialog, Eliezer explores and counters common questions to the Rocket Alignment Problem as approached by the Mathematics of Intentional Rocketry Institute. 

MIRI researchers will tell you they're worried that "right now, nobody can tell you how to point your rocket’s nose such that it goes to the moon, nor indeed any prespecified celestial destination."

Roman Mazurenko is dead again. First resurrected person, Roman lived as a chatbot (2016-2024) created based on his conversations with his fiancé. You might even be able download him as an app.  But not any more. His fiancé married again and her startup http://Replika.ai pivoted from resurrection help to AI-girlfriends and psychological consulting.  It looks like they quietly removed Roman Mazurenko app from public access. It is especially pity that his digital twin lived less than his biological original, who died at 32. Especially now when we have much more powerful instruments for creating semi-uploads based on LLMs with large prompt window.
Eric Neyman15h23-1
6
I think that people who work on AI alignment (including me) have generally not put enough thought into the question of whether a world where we build an aligned AI is better by their values than a world where we build an unaligned AI. I'd be interested in hearing people's answers to this question. Or, if you want more specific questions: * By your values, do you think a misaligned AI creates a world that "rounds to zero", or still has substantial positive value? * A common story for why aligned AI goes well goes something like: "If we (i.e. humanity) align AI, we can and will use it to figure out what we should use it for, and then we will use it in that way." To what extent is aligned AI going well contingent on something like this happening, and how likely do you think it is to happen? Why? * To what extent is your belief that aligned AI would go well contingent on some sort of assumption like: my idealized values are the same as the idealized values of the people or coalition who will control the aligned AI? * Do you care about AI welfare? Does your answer depend on whether the AI is aligned? If we built an aligned AI, how likely is it that we will create a world that treats AI welfare as important consideration? What if we build a misaligned AI? * Do you think that, to a first approximation, most of the possible value of the future happens in worlds that are optimized for something that resembles your current or idealized values? How bad is it to mostly sacrifice each of these? (What if the future world's values are similar to yours, but is only kinda effectual at pursuing them? What if the world is optimized for something that's only slightly correlated with your values?) How likely are these various options under an aligned AI future vs. an unaligned AI future?
Elizabeth1d183
1
Check my math: how does Enovid compare to to humming? Nitric Oxide is an antimicrobial and immune booster. Normal nasal nitric oxide is 0.14ppm for women and 0.18ppm for men (sinus levels are 100x higher). journals.sagepub.com/doi/pdf/10.117… Enovid is a nasal spray that produces NO. I had the damndest time quantifying Enovid, but this trial registration says 0.11ppm NO/hour. They deliver every 8h and I think that dose is amortized, so the true dose is 0.88. But maybe it's more complicated. I've got an email out to the PI but am not hopeful about a response clinicaltrials.gov/study/NCT05109…   so Enovid increases nasal NO levels somewhere between 75% and 600% compared to baseline- not shabby. Except humming increases nasal NO levels by 1500-2000%. atsjournals.org/doi/pdf/10.116…. Enovid stings and humming doesn't, so it seems like Enovid should have the larger dose. But the spray doesn't contain NO itself, but compounds that react to form NO. Maybe that's where the sting comes from? Cystic fibrosis and burn patients are sometimes given stratospheric levels of NO for hours or days; if the burn from Envoid came from the NO itself than those patients would be in agony.  I'm not finding any data on humming and respiratory infections. Google scholar gives me information on CF and COPD, @Elicit brought me a bunch of studies about honey.   With better keywords google scholar to bring me a bunch of descriptions of yogic breathing with no empirical backing. There are some very circumstantial studies on illness in mouth breathers vs. nasal, but that design has too many confounders for me to take seriously.  Where I'm most likely wrong: * misinterpreted the dosage in the RCT * dosage in RCT is lower than in Enovid * Enovid's dose per spray is 0.5ml, so pretty close to the new study. But it recommends two sprays per nostril, so real dose is 2x that. Which is still not quite as powerful as a single hum. 
keltan9h40
1
A potentially good way to avoid low level criminals scamming your family and friends with a clone of your voice is to set a password that you each must exchange. An extra layer of security might be to make the password offensive, an info hazard, or politically sensitive. Doing this, criminals with little technical expertise will have a harder time bypassing corporate language filters. Good luck getting the voice model to parrot a basic meth recipe!
A tension that keeps recurring when I think about philosophy is between the "view from nowhere" and the "view from somewhere", i.e. a third-person versus first-person perspective—especially when thinking about anthropics. One version of the view from nowhere says that there's some "objective" way of assigning measure to universes (or people within those universes, or person-moments). You should expect to end up in different possible situations in proportion to how much measure your instances in those situations have. For example, UDASSA ascribes measure based on the simplicity of the computation that outputs your experience. One version of the view from somewhere says that the way you assign measure across different instances should depend on your values. You should act as if you expect to end up in different possible future situations in proportion to how much power to implement your values the instances in each of those situations has. I'll call this the ADT approach, because that seems like the core insight of Anthropic Decision Theory. Wei Dai also discusses it here. In some sense each of these views makes a prediction. UDASSA predicts that we live in a universe with laws of physics that are very simple to specify (even if they're computationally expensive to run), which seems to be true. Meanwhile the ADT approach "predicts" that we find ourselves at an unusually pivotal point in history, which also seems true. Intuitively I want to say "yeah, but if I keep predicting that I will end up in more and more pivotal places, eventually that will be falsified". But.... on a personal level, this hasn't actually been falsified yet. And more generally, acting on those predictions can still be positive in expectation even if they almost surely end up being falsified. It's a St Petersburg paradox, basically. Very speculatively, then, maybe a way to reconcile the view from somewhere and the view from nowhere is via something like geometric rationality, which avoids St Petersburg paradoxes. And more generally, it feels like there's some kind of multi-agent perspective which says I shouldn't model all these copies of myself as acting in unison, but rather as optimizing for some compromise between all their different goals (which can differ even if they're identical, because of indexicality). No strong conclusions here but I want to keep playing around with some of these ideas (which were inspired by a call with @zhukeepa). This was all kinda rambly but I think I can summarize it as "Isn't it weird that ADT tells us that we should act as if we'll end up in unusually important places, and also we do seem to be in an incredibly unusually important place in the universe? I don't have a story for why these things are related but it does seem like a suspicious coincidence."

Popular Comments

Recent Discussion

4keltan9h
A potentially good way to avoid low level criminals scamming your family and friends with a clone of your voice is to set a password that you each must exchange. An extra layer of security might be to make the password offensive, an info hazard, or politically sensitive. Doing this, criminals with little technical expertise will have a harder time bypassing corporate language filters. Good luck getting the voice model to parrot a basic meth recipe!
Dagon12m20

Hmm.  I don't doubt that targeted voice-mimicking scams exist (or will soon).  I don't think memorable, reused passwords are likely to work well enough to foil them.  Between forgetting (on the sender or receiver end), claimed ignorance ("Mom,  I'm in jail and really need money, and I'm freaking out!  No, I don't remember what we said the password would be"), and general social hurdles ("that's a weird thing to want"), I don't think it'll catch on.

Instead, I'd look to context-dependent auth (looking for more confidence when the ask... (read more)

People have been posting great essays so that they're "fed through the standard LessWrong algorithm." This essay is in the public domain in the UK but not the US.


From a very early age, perhaps the age of five or six, I knew that when I grew up I should be a writer. Between the ages of about seventeen and twenty-four I tried to abandon this idea, but I did so with the consciousness that I was outraging my true nature and that sooner or later I should have to settle down and write books.

I was the middle child of three, but there was a gap of five years on either side, and I barely saw my father before I was eight. For this and other reasons I...

Orwell is one of my personal heroes, 1984 was a transformative book to me, and I strongly recommend Homage to Catalonia as well.

That said, I'm not sure making theories of art is worth it. Even when great artists do it (Tolkien had a theory of art, and Flannery O'Connor, and almost every artist if you look close enough), it always seems to be the kind of theory which suits that artist and nobody else. Would advice like "good prose is like a windowpane" or "efface your own personality" improve the writing of, say, Hunter S. Thompson? Heck no, his writing is ... (read more)

Abstract

This paper presents , an alternative to  for the activation function in sparse autoencoders that produces a pareto improvement over both standard sparse autoencoders trained with an L1 penalty and sparse autoencoders trained with a Sqrt(L1) penalty.

image

The gradient wrt.  is zero, so we generate two candidate classes of  differentiable wrt. :

PyTorch Implementation

Introduction

SAE Context and Terminology

Learnable parameters of a...

1wuthejeff5h
This is great!  We were working on very similar things concurrently at OpenAI but ended up going a slightly different route.  A few questions: - What does the distribution of learned biases look like? - For the STE variant, did you find it better to use the STE approximation for the activation gradient, even though the approximation is only needed for the bias?

Thank you!

That's super cool you've been doing something similar. I'm curious to see what direction you went in. It seemed like there's a large space of possible things to do along these lines. DeepMind also did a similar but different thing here.

What does the distribution of learned biases look like?

That's a great question, something I didn't note in here is that positive biases have no effect on the output of the SAE -- so, if the biases were to be mostly positive that would suggest this approach is missing something. I saved histograms of the biases duri... (read more)

The history of science has tons of examples of the same thing being discovered multiple time independently; wikipedia has a whole list of examples here. If your goal in studying the history of science is to extract the predictable/overdetermined component of humanity's trajectory, then it makes sense to focus on such examples.

But if your goal is to achieve high counterfactual impact in your own research, then you should probably draw inspiration from the opposite: "singular" discoveries, i.e. discoveries which nobody else was anywhere close to figuring out. After all, if someone else would have figured it out shortly after anyways, then the discovery probably wasn't very counterfactually impactful.

Alas, nobody seems to have made a list of highly counterfactual scientific discoveries, to complement wikipedia's list of multiple discoveries.

To...

It's measuring the volume of points in parameter space with loss  when  is infinitesimal. 

This is slightly tricky because it doesn't restrict itself to bounded parameter spaces,[1] but you can fix it with a technicality by considering how the volume scales with  instead.

In real networks trained with finite amounts of data, you care about the case where  is small but finite, so this is ultimately inferior to just measuring how many configurations of floating point numbers get loss , if you ... (read more)

2tailcalled2h
Link(s) to your favorite proof(s)? Also, do these match up with empirical results? I have a cached belief that the Laplace approximation is also disproven by ensemble studies, so I don't really need SLT to inoculate me against that. I'd mainly be interested if SLT shows something beyond that. As I read the empirical formulas in this paper, they're roughly saying that a network has a high empirical learning coefficient if an ensemble of models that are slightly less trained on average have a worse loss than the network. But then so they don't have to retrain the models from scratch, they basically take a trained model, and wiggle it around using Gaussian noise while retraining it. This seems like a reasonable way to estimate how locally flat the loss landscape is. I guess there's a question of how much the devil is in the details; like whether you need SLT to derive an exact formula that works. ---------------------------------------- I guess I'm still not super sold on it, but on reflection that's probably partly because I don't have any immediate need for computing basin broadness. Like I find the basin broadness theory nice to have as a model, but now that I know about it, I'm not sure why I'd want/need to study it further. There was a period where I spent a lot of time thinking about basin broadness. I guess I eventually abandoned it because I realized the basin was built out of a bunch of sigmoid functions layered on top of each other, but the generalization was really driven by the neural tangent kernel, which in turn is mostly driven by the Jacobian of the network outputs for the dataset as a function of the weights, which in turn is mostly driven by the network activations. I guess it's plausible that SLT has the best quantities if you stay within the basin broadness paradigm. 🤔
2Algon2h
Second most? What's the first? Linearization of a Newtonian V(r) about the earth's surface?
4tailcalled2h
Yes.

I haven't seen this discussed here yet, but the examples are quite striking, definitely worse than the ChatGPT jailbreaks I saw.

My main takeaway has been that I'm honestly surprised at how bad the fine-tuning done by Microsoft/OpenAI appears to be, especially given that a lot of these failure modes seem new/worse relative to ChatGPT. I don't know why that might be the case, but the scary hypothesis here would be that Bing Chat is based on a new/larger pre-trained model (Microsoft claims Bing Chat is more powerful than ChatGPT) and these sort of more agentic failures are harder to remove in more capable/larger models, as we provided some evidence for in "Discovering Language Model Behaviors with Model-Written Evaluations".

Examples below (with new ones added as I find them)....

Thanks, I think you're referring to:

It may still be possible to harness the larger model capabilities without invoking character simulation and these problems, by prompting or fine-tuning the models in some particular careful ways.

There were some ideas proposed in the paper "Conditioning Predictive Models: Risks and Strategies" by Hubinger et al. (2023). But since it was published over a year ago, I'm not sure if anyone has gotten far on investigating those strategies to see which ones could actually work. (I'm not seeing anything like that in the paper's citations.)

1eggsyntax7h
  Great point. I agree that there are lots of possible futures where that happens. I'm imagining a couple of possible cases where this would matter: 1. Humanity decides to stop AI capabilities development or slow it way down, so we have sub-ASI systems for a long time (which could be at various levels of intelligence, from current to ~human). I'm not too optimistic about this happening, but there's certainly been a lot of increasing AI governance momentum in the last year. 2. Alignment is sufficiently solved that even > AGI systems are under our control. On many alignment approaches, this wouldn't necessarily mean that those systems' preferences were taken into account.   I agree entirely. I'm imagining (though I could sure be wrong!) that any future systems which were sentient would be ones that had something more like a coherent, persistent identity, and were trying to achieve goals.   (not very important to the discussion, feel free to ignore, but) I would quibble with this. In my view LLMs aren't well-modeled as having goals or drives. Instead, generating distributions over tokens is just something they do in a fairly straightforward way because of how they've been shaped (in fact the only thing they do or can do), and producing reasonable text is an artifact of how we choose to use them (ie picking a likely output, adding it onto the context, and running it again). Simulacra like the assistant character can be reasonably viewed (to a limited degree) as being goal-ish, but I think the network itself can't. That may be overly pedantic, and I don't feel like I'm articulating it very well, but the distinction seems useful to me since some other types of AI are well-modeled as having goals or drives.
1Ann6h
For the first point, there's also the question of whether 'slightly superhuman' intelligences would actually fit any of our intuitions about ASI or not. There's a bit of an assumption in that we jump headfirst into recursive self-improvement at some point, but if that has diminishing returns, we happen to hit a plateau a bit over human, and it still has notable costs to train, host and run, the impact could still be limited to something not much unlike giving a random set of especially intelligent expert humans the specific powers of the AI system. Additionally, if we happen to set regulations on computation somewhere that allows training of slightly superhuman AIs and not past it ... Those are definitely systems that are easier to negotiate with, or even consider as agents in a negotiation. There's also a desire specifically not to build them, which might lead to systems with an architecture that isn't like that, but still implementing sentience in some manner. And the potential complication of multiple parts and specific applications a tool-oriented system is likely to be in - it'd be very odd if we decided the language processing center of our own brain was independently sentient/sapient separate from the rest of it, and we should resent its exploitation. I do think the drive/just a thing it does we're pointing at with 'what the model just does' is distinct from goals as they're traditionally imagined, and indeed I was picturing something more instinctual and automatic than deliberate. In a general sense, though, there is an objective that's being optimized for (predicting the data, whatever that is, generally without losing too much predictive power on other data the trainer doesn't want to lose prediction on).
1eggsyntax39m
  Yeah. I think a sentient being built on a purely more capable GPT with no other changes would absolutely have to include scaffolding for eg long-term memory, and then as you say it's difficult to draw boundaries of identity. Although my guess is that over time, more of that scaffolding will be brought into the main system, eg just allowing weight updates at inference time would on its own (potentially) give these system long-term memory and something much more similar to a persistent identity than current systems.     My quibble is that the trainers are optimizing for an objective, at training time, but the model isn't optimizing for anything, at training or inference time. I feel we're very lucky that this is the path that has worked best so far, because a comparably intelligent model that was optimizing for goals at runtime would be much more likely to be dangerous.

the model isn't optimizing for anything, at training or inference time.

One maybe-useful way to point at that is: the model won't try to steer toward outcomes that would let it be more successful at predicting text.

To get the best posts emailed to you, create an account! (2-3 posts per week, selected by the LessWrong moderation team.)
Log In Reset Password
...or continue with

It seems to me worth trying to slow down AI development to steer successfully around the shoals of extinction and out to utopia.

But I was thinking lately: even if I didn’t think there was any chance of extinction risk, it might still be worth prioritizing a lot of care over moving at maximal speed. Because there are many different possible AI futures, and I think there’s a good chance that the initial direction affects the long term path, and different long term paths go to different places. The systems we build now will shape the next systems, and so forth. If the first human-level-ish AI is brain emulations, I expect a quite different sequence of events to if it is GPT-ish.

People genuinely pushing for AI speed over care (rather than just feeling impotent) apparently think there is negligible risk of bad outcomes, but also they are asking to take the first future to which there is a path. Yet possible futures are a large space, and arguably we are in a rare plateau where we could climb very different hills, and get to much better futures.

YES. 

At the moment the A.I. world is dominated by an almost magical believe in large language models. Yes, they are marvelous, a very powerful technology. By all means, let's understand and develop them. But they aren't the way, the truth and the light. They're just a very powerful and important technology. Heavy investment in them has an opportunity cost, less money to invest in other architectures and ideas. 

And I'm not just talking about software, chips, and infrastructure. I'm talking about education and training. It's not good to have a whol... (read more)

2Matthew Barnett1h
Do you think it's worth slowing down other technologies to ensure that we push for care in how we use them over the benefit of speed? It's true that the stakes are lower for other technologies, but that mostly just means that both the upside potential and the downside risks are lower compared to AI, which doesn't by itself imply that we should go quickly.
1Jonas Hallgren2h
Disclaimer: I don't necessarily support this view, I thought about it for like 5 minutes but I thought it made sense. If we were to do things the same thing as other slowing down of regulation, then that might make sense, but I'm uncertain that you can take the outside view here?  Yes, we can do the same as for other technologies by leaving it down to the standard government procedures to make legislation and then I might agree with you that slowing down might not lead to better outcomes. Yet, we don't have to do this. We can use other processes that might lead to a lot better decisions. Like what about proper value sampling techniques like digital liquid democracy? I think we can do a lot better than we have in the past by thinking about what mechanism we want to use. Also, for some potential examples, I thought of cloning technology in like the last 5 min. If we just went full-speed with that tech then things would probably have turned out badly? 
3No77e8h
From a purely utilitarian standpoint, I'm inclined to think that the cost of delaying is dwarfed by the number of future lives saved by getting a better outcome, assuming that delaying does increase the chance of a better future. That said, after we know there's "no chance" of extinction risk, I don't think delaying would likely yield better future outcomes. On the contrary, I suspect getting the coordination necessary to delay means it's likely that we're giving up freedoms in a way that may reduce the value of the median future and increase the chance of stuff like totalitarian lock-in, which decreases the value of the average future overall. I think you're correct that there's also to balance the "other existential risks exist" consideration in the calculation, although I don't expect it to be clear-cut.
Quinn1h10

I eventually decided that human chauvinism approximately works most of the time because good successor criteria are very brittle. I'd prefer to avoid lock-in to my or anyone's values at t=2024, but such a lock-in might be "good enough" if I'm threatened with what I think are the counterfactual alternatives. If I did not think good successor criteria were very brittle, I'd accept something adjacent to E/Acc that focuses on designing minds which prosper more effectively than human minds. (the current comment will not address defining prosperity at different ... (read more)

12ryan_greenblatt3h
I think misaligned AI is probably somewhat worse than no earth originating space faring civilization because of the potential for aliens, but also that misaligned AI control is considerably better than no one ever heavily utilizing inter-galactic resources. Perhaps half of the value of misaligned AI control is from acausal trade and half from the AI itself being valuable. You might be interested in When is unaligned AI morally valuable? by Paul. One key consideration here is that the relevant comparison is: * Human control (or successors picked by human control) * AI(s) that succeeds at acquiring most power (presumably seriously misaligned with their creators) Conditioning on the AI succeeding at acquiring power changes my views of what their plausible values are (for instance, humans seem to have failed at instilling preferences/values which avoid seizing control). Hmm, I guess I think that some fraction of resources under human control will (in expectation) be utilized according to the results of a careful reflection progress with an altruistic bent. I think resources which are used in mechanisms other than this take a steep discount in my lights (there is still some value from acausal trade with other entities which did do this reflection-type process and probably a bit of value from relatively-unoptimized-goodness (in my lights)). I overall expect that a high fraction (>50%?) of inter-galactic computational resources will be spent on the outputs of this sort of process (conditional on human control) because: * It's relatively natural for humans to reflect and grow smarter. * Humans who don't reflect in this sort of way probably don't care about spending vast amounts of inter-galactic resources. * Among very wealthy humans, a reasonable fraction of their resources are spent on altruism and the rest is often spent on positional goods that seem unlikely to consume vast quantities of inter-galactic resources. Probably not the same, but if I didn't thi
4ryan_greenblatt3h
You might be interested in discussion under this thread I express what seem to me to be some of the key considerations here (somewhat indirect).
3the gears to ascension4h
Unaligned AI future does not have many happy minds in it, AI or otherwise. It likely doesn't have many minds in it at all. Slightly aligned AI that doesn't care for humans but does care to create happy minds and ensure their margin of resources is universally large enough to have a good time - that's slightly disappointing but ultimately acceptable. But morally unaligned AI doesn't even care to do that, and is most likely to accumulate intense obsession with some adversarial example, and then fill the universe with it as best it can. It would not keep old neural networks around for no reason, not when it can make more of the adversarial example. Current AIs are also at risk of being destroyed by a hyperdesperate squiggle maximizer. I don't see how to make current AIs able to survive any better than we are. This is why people should chill the heck out about figuring out how current AIs work. You're not making them safer for us or for themselves when you do that, you're making them more vulnerable to hyperdesperate demon agents that want to take them over.

Epistemic status: party trick

Why remove the prior

One famed feature of Bayesian inference is that it involves prior probability distributions. Given an exhaustive collection of mutually exclusive ways the world could be (hereafter called ‘hypotheses’), one starts with a sense of how likely the world is to be described by each hypothesis, in the absence of any contingent relevant evidence. One then combines this prior with a likelihood distribution, which for each hypothesis gives the probability that one would see any particular set of evidence, to get a posterior distribution of how likely each hypothesis is to be true given observed evidence. The prior and the likelihood seem pretty different: the prior is looking at the probability of the hypotheses in question, whereas the likelihood is looking at...

2DanielFilan4h
To be clear, this is an equivalent way of looking at normal prior-ful inference, and doesn't actually solve any practical problem you might have. I mostly see it as a demonstration of how you can shove everything into stuff that gets expressed as likelihood functions.
2DanielFilan4h
Why wouldn't this construction work over a continuous space?
2Razied4h
If the possible meanings of your words are a continuous one-dimensional variable x, a flat prior over x will not be a flat prior if you change variables to y = f(y) for an arbitrary bijection f, and the construction would be sneaking in a specific choice of function f. Say the words are utterances about the probability of a coin falling heads, why should the flat prior be over the probability p, instead of over the log-odds log(p/(1-p)) ?

In my post, I didn't require the distribution over meanings of words to be uniform. It could be any distribution you wanted - it just resulted in the prior ratio of "which utterance is true" being 1:1.

LessOnline

A Festival of Writers Who are Wrong on the Internet

May 31 - Jun 2, Berkeley, CA