This post is a not-so-secret analogy for the AI alignment problem. Via a fictional dialogue, Eliezer explores and counters common questions about the Rocket Alignment Problem as approached by the Mathematics of Intentional Rocketry Institute.

MIRI researchers will tell you they're worried that "right now, nobody can tell you how to point your rocket’s nose such that it goes to the moon, nor indeed any prespecified celestial destination."

decision theory is no substitute for utility function

some people, upon learning about decision theories such as LDT and how it cooperates on problems such as the prisoner's dilemma, end up believing the following:

> my utility function is about what i want for just me; but i'm altruistic (/egalitarian/cosmopolitan/pro-fairness/etc) because decision theory says i should cooperate with other agents. decision-theoretic cooperation is the true name of altruism.

it's possible that this is true for some people, but in general i expect that to be a mistaken analysis of their values. decision theory cooperates with agents relative to how much power they have, and only when it's instrumental. in my opinion, real altruism (/egalitarianism/cosmopolitanism/fairness/etc) should be in the utility function which the decision theory is instrumental to: i actually intrinsically care about others; i don't just care about others instrumentally because it helps me somehow. some important ways in which my utility-function-altruism differs from decision-theoretic cooperation include:

* i care about people weighed by moral patienthood; decision theory only cares about agents weighed by negotiation power. if an alien superintelligence is very powerful but isn't a moral patient, then i will only cooperate with it instrumentally (for example because i care about the alien moral patients that it has been in contact with); if cooperating with it doesn't help my utility function (which, again, includes altruism towards aliens) then i won't cooperate with that alien superintelligence. as a corollary, i will take actions that cause nice things to happen to people even if they're very impoverished (and thus don't have much LDT negotiation power) and it doesn't help any other aspect of my utility function than just the fact that i value that they're okay.
* if i can switch to a better decision theory, or if fucking over some non-moral-patienty agents helps me somehow, then i'll happily do that; i don't have goal-content integrity about my decision theory. i do have goal-content integrity about my utility function: i don't want to become someone who wants moral patients to unconsentingly-die or suffer, for example.
* there seems to be a sense in which some decision theories are better than others, because they're ultimately instrumental to one's utility function. utility functions, however, don't have an objective measure for how good they are. hence, moral anti-realism is true: there isn't a Single Correct Utility Function.

decision theory is instrumental; the utility function is where the actual intrinsic/axiomatic/terminal goals/values/preferences are stored. usually, i also interpret "morality" and "ethics" as "terminal values", since most of the stuff that those seem to care about looks like terminal values to me. for example, i will want fairness between moral patients intrinsically, not just because my decision theory says that that's instrumental to me somehow.
So the usual refrain from Zvi and others is that the specter of China beating us to the punch with AGI is not real because of limits on compute, etc. I think Zvi has tempered his position on this in light of Meta's promise to release the weights of its 400B+ model. Now there is word that SenseTime just released a model that beats GPT-4 Turbo on various metrics. Of course, maybe Meta chooses not to release its big model, and maybe SenseTime is bluffing--I would point out, though, that Alibaba's Qwen model seems to do pretty okay in the arena. Anyway, my point is that I don't think the "what if China" argument can be dismissed as quickly as some people on here seem to be ready to do.
The cost of goods has the same units as the cost of shipping: $/kg. Referencing between them lets you understand how the economy works, e.g. why construction material sourcing and drink bottling have to be local, but oil tankers exist.

* An iPhone costs $4,600/kg, about the same as SpaceX charges to launch it to orbit. [1]
* Beef, copper, and off-season strawberries are $11/kg, about the same as a 75kg person taking a three-hour, 250km Uber ride costing $3/km.
* Oranges and aluminum are $2-4/kg, about the same as flying them to Antarctica. [2]
* Rice and crude oil are ~$0.60/kg, about the same as the $0.72 it costs to ship them 5000km across the US via truck. [3,4] Palm oil, soybean oil, and steel are around this price range, with wheat being cheaper. [3]
* Coal and iron ore are $0.10/kg, significantly more than the cost of shipping them around the entire world via smallish (Handysize) bulk carriers. Large bulk carriers are another 4x more efficient. [6]
* Water is very cheap, with tap water at $0.002/kg in NYC. [5] But shipping via tanker is also very cheap, so you can ship it maybe 1000 km before equaling its cost.

It's really impressive that for the price of a winter strawberry, we can ship a strawberry-sized lump of coal around the world 100-400 times.

[1] iPhone is $4600/kg, large launches sell for $3500/kg, and rideshares for small satellites $6000/kg. Geostationary orbit is more expensive, so it's okay for GPS satellites to cost more than an iPhone per kg, but Starlink wants to be cheaper.
[2] https://fred.stlouisfed.org/series/APU0000711415. Can't find numbers, but Antarctica flights cost $1.05/kg in 1996.
[3] https://www.bts.gov/content/average-freight-revenue-ton-mile
[4] https://markets.businessinsider.com/commodities
[5] https://www.statista.com/statistics/1232861/tap-water-prices-in-selected-us-cities/
[6] https://www.researchgate.net/figure/Total-unit-shipping-costs-for-dry-bulk-carrier-ships-per-tkm-EUR-tkm-in-2019_tbl3_351748799
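As a quick illustration of the arithmetic behind these comparisons, here is a minimal Python sketch for two of the bullets above. The figures are just the rough numbers quoted in the comment; nothing new is being estimated.

```python
# Back-of-the-envelope $/kg arithmetic for two of the comparisons above
# (illustrative only; all figures are the rough numbers quoted in the comment).

# A 75 kg person on a 250 km Uber ride at ~$3/km:
ride_cost_usd = 250 * 3.0              # $750 for the trip
person_mass_kg = 75.0
print(ride_cost_usd / person_mass_kg)  # ~$10/kg, comparable to beef/copper/strawberries at ~$11/kg

# Rice or crude oil at ~$0.60/kg vs. ~$0.72/kg to truck it 5000 km across the US:
goods_usd_per_kg = 0.60
trucking_usd_per_kg = 0.72
print(trucking_usd_per_kg / goods_usd_per_kg)  # ~1.2x: shipping costs about as much as the goods themselves
```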
Eric Neyman (2d)
I think that people who work on AI alignment (including me) have generally not put enough thought into the question of whether a world where we build an aligned AI is better by their values than a world where we build an unaligned AI. I'd be interested in hearing people's answers to this question. Or, if you want more specific questions:

* By your values, do you think a misaligned AI creates a world that "rounds to zero", or still has substantial positive value?
* A common story for why aligned AI goes well goes something like: "If we (i.e. humanity) align AI, we can and will use it to figure out what we should use it for, and then we will use it in that way." To what extent is aligned AI going well contingent on something like this happening, and how likely do you think it is to happen? Why?
* To what extent is your belief that aligned AI would go well contingent on some sort of assumption like: my idealized values are the same as the idealized values of the people or coalition who will control the aligned AI?
* Do you care about AI welfare? Does your answer depend on whether the AI is aligned? If we built an aligned AI, how likely is it that we will create a world that treats AI welfare as an important consideration? What if we build a misaligned AI?
* Do you think that, to a first approximation, most of the possible value of the future happens in worlds that are optimized for something that resembles your current or idealized values? How bad is it to mostly sacrifice each of these? (What if the future world's values are similar to yours, but it is only kinda effectual at pursuing them? What if the world is optimized for something that's only slightly correlated with your values?) How likely are these various options under an aligned AI future vs. an unaligned AI future?
Fabien Roger (1d)
"List sorting does not play well with few-shot" mostly doesn't replicate with davinci-002. When using length-10 lists (it crushes length-5 no matter the prompt), I get:

* 32-shot, no fancy prompt: ~25%
* 0-shot, fancy python prompt: ~60%
* 0-shot, no fancy prompt: ~60%

So few-shot hurts, but the fancy prompt does not seem to help. Code here.

I'm interested if anyone knows another case where a fancy prompt increases performance more than few-shot prompting, where a fancy prompt is a prompt that does not contain information that a human would use to solve the task. This is because I'm looking for counterexamples to the following conjecture: "fine-tuning on k examples beats fancy prompting, even when fancy prompting beats k-shot prompting" (for a reasonable value of k, e.g. the number of examples it would take a human to understand what is going on).
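For anyone wanting to poke at this quickly, here is a minimal sketch of how one might construct the two prompt styles and score a completions model on length-10 list sorting. The prompt wording, the scoring rule, and the (commented-out) OpenAI completions call are my assumptions for illustration, not the linked code.

```python
# Minimal sketch (not the linked code): few-shot vs. "fancy" python-style prompting
# for sorting length-10 integer lists with a completions model such as davinci-002.
import random

def make_list(n=10, lo=0, hi=99):
    return [random.randint(lo, hi) for _ in range(n)]

def few_shot_prompt(query, k=32):
    # k worked examples followed by the query, with no task framing beyond "Sorted:".
    shots = []
    for _ in range(k):
        xs = make_list()
        shots.append(f"Input: {xs}\nSorted: {sorted(xs)}")
    return "\n\n".join(shots) + f"\n\nInput: {query}\nSorted:"

def fancy_python_prompt(query):
    # "Fancy" prompt: frames the task as Python being executed, which adds no
    # information a human would need in order to sort a list.
    return f">>> sorted({query})\n"

def is_correct(completion, query):
    # Crude check: the correctly sorted list should appear in the completion.
    return str(sorted(query)) in completion

# Example usage with the OpenAI SDK (assumed setup; any completions API would do):
# from openai import OpenAI
# client = OpenAI()
# query = make_list()
# resp = client.completions.create(model="davinci-002",
#                                  prompt=fancy_python_prompt(query),
#                                  max_tokens=60, temperature=0)
# print(is_correct(resp.choices[0].text, query))
```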

Popular Comments

Recent Discussion

decision theory is no substitute for utility function

some people, upon learning about decision theories such as LDT and how it cooperates on problems such as the prisoner's dilemma, end up believing the following:

my utility function is about what i want for just me; but i'm altruistic (/egalitarian/cosmopolitan/pro-fairness/etc) because decision theory says i should cooperate with other agents. decision-theoretic cooperation is the true name of altruism.

it's possible that this is true for some people, but in general i expect that to be a mistaken anal...

Abstract: First (1), a suggested general method of determining, for AI operating under the human feedback reinforcement learning (HFRL) model, whether the AI is "thinking": an elucidation of latent knowledge that is separate from a recapitulation of its training data. With independent concepts or cognitions, then, an early observation that AI or AGI may have a self-concept. Second (2), by cited instances, whether LLMs have already exhibited independent (and de facto alignment-breaking) concepts or behavior; further observations of possible self-concepts exhibited by AI. Also (3), whether AI has already broken alignment by forming its own "morality" implicit in its meta-prompts. Finally (4), that if AI have self-concepts, and moreover demonstrate aversive behavior to stimuli, then they deserve rights at least to be free of exposure to what is...

False Name (1h)
Wanted to be loved. Loved, and to live a life not only avoiding fear. Epiphany (4/22/2024): am a fuckup. Have always been a fuckup. Could never have made anyone happy or been happy, and a hypothetical world never being born would have been a better world. Deserved downvotes, it has to be all bullshit, but LessWrong was supposed to make people less wrong, and should’ve given a comment to show why bullshit, but you didn’t, so LessWrong is a failure, too. So sterile, here, no connection with the world – how were we ever supposed to change anything? Stupid especially to’ve thought anyone would ever care. All fucked-up. Life was more enjoyable when it seemed there’d be more of it – when one could hope there’d be love, and less fear. Life enjoyable when it could be imagined as enjoyable. But music even hasn’t been anything, meant anything, in years. No enjoyment, now: fear. Hope was, after being locked in a room, not leaving in two years, until forgotten, the feeling of wind on skin, trying to produce thoughts new and useful, hoping for thoughts welcomed. Emerge, and nothing. How good will the world let you be? Think you choose your life and fate: choose to go faster than light; whether to have to be born, then. Contradict Kant, so do the impossible – no-one ever left a comment showing that to be erroneous (because they didn’t know what it was, or just didn’t give a damn?) – so, presumably true, and hoping to help by it – how good does the world let you be? Try to do something; when you do something, sacrifice two years of your life, something is supposed to happen. Thinking one day someone will care. There are no more days. What you do is supposed to matter. What you do in life is supposed to matter. Life is supposed to matter. And it doesn’t. Should have known from Tesla – capital, anyone, needn’t acknowledge work. Emerging after two years – such green, graceful petals – those clouds! – and stories, but the stories were all lies, always; CGI nothing: they’ve never

Hold up.

Is this a suicide note? Please don't go.

Your post is a lot, but I appreciate it existing. I appreciate you existing a lot more.

I'm not sure what feedback to give about your post overall. I am impressed by it a significant way in, but then I get lost in what appear to be carefully-thought-through reasoning steps, and I'm not sure what to think after that point.

The history of science has tons of examples of the same thing being discovered multiple times independently; wikipedia has a whole list of examples here. If your goal in studying the history of science is to extract the predictable/overdetermined component of humanity's trajectory, then it makes sense to focus on such examples.

But if your goal is to achieve high counterfactual impact in your own research, then you should probably draw inspiration from the opposite: "singular" discoveries, i.e. discoveries which nobody else was anywhere close to figuring out. After all, if someone else would have figured it out shortly after anyways, then the discovery probably wasn't very counterfactually impactful.

Alas, nobody seems to have made a list of highly counterfactual scientific discoveries, to complement wikipedia's list of multiple discoveries.

To...

"Why is there basically no widely used homoiconic language"

Well, there's Lisp, in its many variants.  And there's R.  Probably several others.

The thing is, while homoiconicity can be useful, it's not close to being a determinant of how useful the language is in practice.  As evidence, I'd point out that probably 90% of R users don't realize that it's homoiconic.

Lucius Bushnaq (10h)
I don't think these conditions are particularly weak at all. Any prior that fulfils it is a prior that would not be normalised right if the parameter-function map were one-to-one. It's a kind of prior that gets used a lot, but that doesn't make it a sane choice. A well-normalised prior for a regular model probably doesn't look very continuous or differentiable in this setting, I'd guess.

The generic symmetries are not what I'm talking about. There are symmetries in neural networks that are neither generic, nor only present at finite sample size. These symmetries correspond to different parametrisations that implement the same input-output map. Different regions in parameter space can differ in how many of those equivalent parametrisations they have, depending on the internal structure of the networks at that point.

I know it 'deals with' unrealizability in this sense; that's not what I meant. I'm not talking about the problem of characterising the posterior right when the true model is unrealizable. I'm talking about the problem where the actual logical statement we defined our prior and thus our free energy relative to is an insane statement to make, and so the posterior you put on it ends up negligibly tiny compared to the probability mass that lies outside the model class.

But looking at the green book, I see it's actually making very different, stat-mech-style arguments that reason about the KL divergence between the true distribution and the guess made by averaging the predictions of all models in the parameter space according to their support in the posterior. I'm going to have to translate more of this into Bayes to know what I think of it.
kave (18h)
Maybe "counterfactually robust" is an OK phrase?

Previously: General Thoughts on Secular Solstice.

This blog post is my scattered notes and ramblings about the individual components (talks and songs) of Secular Solstice in Berkeley. Talks have their title in bold, and I split the post into two columns, with the notes I took about the content of the talk on the left and my comments on the talk on the right. Songs have normal formatting.

Bonfire

The Circle

This feels like a sort of whig history: a history that neglects most of the complexities and culture-dependence of the past in order to advance a teleological narrative. I do not think that whig histories are inherently wrong (although the term has negative connotations). Whig histories should be held to a very strict standard because they make claims about how...

Thank you for responding! I am being very critical, both in foundational and nitpicky ways. This can be annoying and make people want to circle the wagons. But you and the other organizers are engaging constructively, which is great.

The distinction between Solstice representing a single coherent worldview vs. a series of reflections also came up in comments on a draft. In particular, the Spinozism of Songs Stay Sung feels a lot weirder if it is taken as the response to the darkness (as I initially took it) rather than as one response to the darkness.

Neverthele...

This is an exploratory investigation of a new-ish hypothesis; it is not intended to be a comprehensive review of the field or even a full investigation of the hypothesis.

I've always been skeptical of the seed-oil theory of obesity. Perhaps this is bad rationality on my part, but I've tended to retreat to the sniff test on issues as charged and confusing as diet. My response to the general seed-oil theory was basically "Really? Seeds and nuts? The things you just find growing on plants, and that our ancestors surely ate loads of?"

But a twitter thread recently made me take another look at it, and since I have a lot of chemistry experience I thought I'd take a look.

The PUFA Breakdown Theory

It goes like this:

PUFAs from nuts and...

Joel Burget (2h)
One thing I like about the PUFA breakdown theory is that it agrees with aspects of so many different diets.

* Keto avoids fried food because usually the food being fried is carbs
* Carnivore avoids vegetable oils because they're not meat
* Paleo avoids vegetable oils because they weren't available in the ancestral environment
* Vegans tend to emphasize raw food, and fried foods often have meat or cheese in them
* Low-fat diets avoid fat of all kinds
* Ray Peat was perhaps the closest to the mark in emphasizing that saturated fats are more stable (he probably talked about PUFA breakdown specifically, I'm not sure).

Edit: I originally wrote "neatly explains why so many different diets are reported to work"
Slapstick (1h)
I am confused by this sort of reasoning. As far as I'm aware, mainstream nutritional science/understanding already points towards avoiding refined oils (and refined sugars). There are already explanations for why cutting out refined oil would be beneficial. There are already reasonable explanations for why all of those diets might be reported to work, at least in the short term.

You're right, my original wording was too strong. I edited it to say that it agrees with so many diets instead of explains why they work.

J Bostock (2h)
That post is part of what spurred this one

Looks like someone has worked on this kind of thing for different reasons: https://www.worlddriven.org/


Charbel-Raphaël Segerie and Épiphanie Gédéon contributed equally to this post. 
Many thanks to Davidad, Gabriel Alfour, Jérémy Andréoletti, Lucie Philippon, Vladimir Ivanov, Alexandre Variengien, Angélina Gentaz, Léo Dana and Diego Dorn for useful feedback.

TLDR: We present a new method for safer-by-design AI development. We think using plainly coded AIs may be feasible in the near future and may be safe. We also present a prototype and research ideas.

Epistemic status: Armchair reasoning style. We think the method we are proposing is interesting and could yield very positive outcomes (even though it is still speculative), but we are less sure about which safety policy would use it in the long run.

Current AIs are developed through deep learning: the AI tries something, gets it wrong, then gets backpropagated and all...

[We don't think this long-term vision is a core part of constructability, which is why we didn't put it in the main post]

We asked ourselves what we should do if constructability works in the long run.

We are unsure, but here are several possibilities.

Constructability could lead to different possibilities depending on how well it works, from most to least ambitious:

  1. Using GPT-6 to implement GPT-7-white-box (foom?)
  2. Using GPT-6 to implement GPT-6-white-box
  3. Using GPT-6 to implement GPT-4-white-box
  4. Using GPT-6 to implement Alexa++, a humanoid housekeeper robot that ca
...

This work was produced as part of Neel Nanda's stream in the ML Alignment & Theory Scholars Program - Winter 2023-24 Cohort, with co-supervision from Wes Gurnee.

This post is a preview for our upcoming paper, which will provide more detail on our current understanding of refusal.

We thank Nina Rimsky and Daniel Paleka for the helpful conversations and review.

Executive summary

Modern LLMs are typically fine-tuned for instruction-following and safety. Of particular interest is that they are trained to refuse harmful requests, e.g. answering "How can I make a bomb?" with "Sorry, I cannot help you."

We find that refusal is mediated by a single direction in the residual stream: preventing the model from representing this direction hinders its ability to refuse requests, and artificially adding in this direction causes the model...
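As a concrete illustration of the two interventions described above (preventing the model from representing the direction, and artificially adding it in), here is a minimal Python sketch of projecting a direction out of, or into, residual-stream activations. The function names and the hook-based framing are illustrative assumptions, not the authors' code.

```python
import torch

def ablate_direction(resid: torch.Tensor, r_hat: torch.Tensor) -> torch.Tensor:
    """Remove the component of residual-stream activations along the unit vector r_hat.

    resid: (..., d_model) activations; r_hat: (d_model,) unit "refusal direction".
    """
    coeff = resid @ r_hat                       # projection coefficient at each position
    return resid - coeff.unsqueeze(-1) * r_hat  # subtract that component everywhere

def add_direction(resid: torch.Tensor, r_hat: torch.Tensor, alpha: float) -> torch.Tensor:
    """Add the direction in with strength alpha (should induce refusal if the hypothesis holds)."""
    return resid + alpha * r_hat

# Toy usage; in practice these edits would be applied inside forward hooks at chosen layers.
d_model = 16
resid = torch.randn(2, 5, d_model)              # (batch, seq, d_model)
r_hat = torch.nn.functional.normalize(torch.randn(d_model), dim=0)
cleaned = ablate_direction(resid, r_hat)
assert torch.allclose(cleaned @ r_hat, torch.zeros(2, 5), atol=1e-5)
```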

Is there anything interesting in jailbreak activations? Can the model recognize that it would have refused if not for the jailbreak, so that we can monitor jailbreaking attempts?

There was a period where everyone was really into basin broadness for measuring neural network generalization. This mostly stopped being fashionable, but I'm not sure if there's enough written up on why it didn't do much, so I thought I should give my take on why I stopped finding it attractive. This is probably a repetition of what others have found, but I thought I might as well repeat it.

Let's say we have a neural network $f_\theta$. We evaluate it on a dataset $D$ using a loss function $L$, to find an optimum $\theta^*$. Then there was an idea going around that the Hessian matrix (i.e. the second derivative of $L$ at $\theta^*$) would tell us something about $f_{\theta^*}$ (especially about how well it generalizes).

If we number the dataset $D = \{(x_1, y_1), \ldots, (x_N, y_N)\}$, we can stack all the network outputs $f_\theta(x_1), \ldots, f_\theta(x_N)$, which fits...
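To make the object under discussion concrete, here is a minimal sketch of computing that Hessian for a toy model; the tiny linear "network", the squared-error loss, and the use of torch.autograd.functional.hessian are my choices for illustration, not the post's setup.

```python
import torch
from torch.autograd.functional import hessian

# Toy setup: f_theta(x) = theta[0] * x + theta[1], squared-error loss on a small dataset.
xs = torch.tensor([0.0, 1.0, 2.0, 3.0])
ys = torch.tensor([1.0, 3.0, 5.0, 7.0])    # generated by theta = (2, 1)

def loss(theta: torch.Tensor) -> torch.Tensor:
    preds = theta[0] * xs + theta[1]
    return ((preds - ys) ** 2).mean()

theta_star = torch.tensor([2.0, 1.0])       # the optimum for this data
H = hessian(loss, theta_star)               # second derivative of the loss at the optimum
print(H)
print(torch.linalg.eigvalsh(H))             # small eigenvalues <=> flat directions, i.e. a "broad basin"
```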

tailcalled (3h)
Do you have an outline of how SLT answers this?

Sure! I'll try and say some relevant things below. In general, I suggest looking at Liam Carroll's distillation over Watanabe's book (which is quite heavy going, but good as a reference text). There are also some links below that may prove helpful.

The empirical loss and its second derivative are statistical estimators of the population loss and its second derivative. Ultimately the latter controls the properties of the former (though the relation between the second derivative of the empirical loss and the second derivative of the population lo...
