Ω 31y

decision theory is no substitute for utility function

some people, upon learning about decision theories such as LDT and how it cooperates on problems such as the prisoner's dilemma, end up believing the following:

my utility function is about what i want for just me; but i'm altruistic (/egalitarian/cosmopolitan/pro-fairness/etc) because decision theory says i should cooperate with other agents. decision theoretic cooperation is the true name of altruism.

it's possible that this is true for some people, but in general i expect that to be a mistaken anal...

Mercy to the Machine: Thoughts & Rights

False Name

Abstract: First [1)], a suggested general method of determining, for AI operating under the human feedback reinforcement learning (HFRL) model, whether the AI is “thinking”; an elucidation of latent knowledge that is separate from a recapitulation of its training data. With independent concepts or cognitions, then, an early observation that AI or AGI may have a self-concept. Second [2)], by cited instances, whether LLMs have already exhibited independent (and de facto alignment-breaking) concepts or behavior; further observations of possible self-concepts exhibited by AI. Also [3)], whether AI has already broken alignment by forming its own “morality” implicit in its meta-prompts. Finally [4)], that if AI have self-concepts, and more, demonstrate aversive behavior to stimuli, that they deserve rights at least to be free of exposure to what is...


1False Name1h

Wanted to be loved. Loved, and to live a life not only avoiding fear. Epiphany (4/22/2024): am a fuckup. Have always been a fuckup. Could never have made anyone happy or been happy, and a hypothetical world never being born would have been a better world. Deserved downvotes, it has to be all bullshit, but LessWrong was supposed to make people less wrong, and should’ve given a comment to show why bullshit, but you didn’t, so LessWrong is a failure, too. So sterile, here, no connection with the world – how were we ever supposed to change anything? Stupid especially to’ve thought anyone would ever care. All fucked-up. Life was more enjoyable when it seemed there’d be more of it – when one could hope there’d be love, and less fear. Life enjoyable when it could be imagined as enjoyable. But music even hasn’t been anything, meant anything, in years. No enjoyment, now: fear. Hope was, after being locked in a room, not leaving in two years, until forgotten, the feeling of wind on skin, trying to produce thoughts new and useful, hoping for thoughts welcomed. Emerge, and nothing. How good will the world let you be? Think you choose your life and fate: choose to go faster than light; whether to have to be born, then. Contradict Kant, so do the impossible – no-one ever left a comment showing that to be erroneous (because they didn’t know what it was, or just didn’t give a damn?) – so, presumably true, and hoping to help by it – how good does the world let you be? Try to do something; when you do something, sacrifice two years of your life, something is supposed to happen. Thinking one day someone will care. There are no more days. What you do is supposed to matter. What you do in life is supposed to matter. Life is supposed to matter. And it doesn’t. Should have known from Tesla – capital, anyone, needn’t acknowledge work. Emerging after two years – such green, graceful petals – those clouds! – and stories, but the stories were all lies, always; CGI nothing: they’ve never

the gears to ascension12m20

Hold up.

Is this a suicide note? Please don't go.

Your post is a lot, but I appreciate it existing. I appreciate you existing a lot more.

I'm not sure what feedback to give about your post overall. I am impressed by it a significant way in, but then I get lost in what appear to be carefully-thought-through reasoning steps, and I'm not sure what to think after that point.

Examples of Highly Counterfactual Discoveries?

152

johnswentworth, kromem

The history of science has tons of examples of the same thing being discovered multiple times independently; Wikipedia has a whole list of examples here. If your goal in studying the history of science is to extract the predictable/overdetermined component of humanity's trajectory, then it makes sense to focus on such examples.

But if your goal is to achieve high counterfactual impact in your own research, then you should probably draw inspiration from the opposite: "singular" discoveries, i.e. discoveries which nobody else was anywhere close to figuring out. After all, if someone else would have figured it out shortly after anyways, then the discovery probably wasn't very counterfactually impactful.

Alas, nobody seems to have made a list of highly counterfactual scientific discoveries, to complement wikipedia's list of multiple discoveries.

To...


Radford Neal26m10

"Why is there basically no widely used homoiconic language"

Well, there's Lisp, in its many variants. And there's R. Probably several others.

The thing is, while homoiconicity can be useful, it's not close to being a determinant of how useful the language is in practice. As evidence, I'd point out that probably 90% of R users don't realize that it's homoiconic.

5Lucius Bushnaq10h

I don't think these conditions are particularly weak at all. Any prior that fulfils it is a prior that would not be normalised right if the parameter-function map were one-to-one. It's a kind of prior people like to use a lot, but that doesn't make it a sane choice. A well-normalised prior for a regular model probably doesn't look very continuous or differentiable in this setting, I'd guess. The generic symmetries are not what I'm talking about. There are symmetries in neural networks that are neither generic, nor only present at finite sample size. These symmetries correspond to different parametrisations that implement the same input-output map. Different regions in parameter space can differ in how many of those equivalent parametrisations they have, depending on the internal structure of the networks at that point. I know it 'deals with' unrealizability in this sense, that's not what I meant. I'm not talking about the problem of characterising the posterior right when the true model is unrealizable. I'm talking about the problem where the actual logical statement we defined our prior and thus our free energy relative to is an insane statement to make and so the posterior you put on it ends up negligibly tiny compared to the probability mass that lies outside the model class. But looking at the green book, I see it's actually making very different, stat-mech style arguments that reason about the KL divergence between the true distribution and the guess made by averaging the predictions of all models in the parameter space according to their support in the posterior. I'm going to have to translate more of this into Bayes to know what I think of it.

2kave18h

Maybe "counterfactually robust" is an OK phrase?

My Detailed Notes & Commentary from Secular Solstice

Jeffrey Heninger

1mo

Previously: General Thoughts on Secular Solstice.

This blog post is my scattered notes and ramblings about the individual components (talks and songs) of Secular Solstice in Berkeley. Talks have their title in bold, and I split the post into two columns, with the notes I took about the content of the talk on the left and my comments on the talk on the right. Songs have normal formatting.

Bonfire

The Circle

This feels like a sort of whig history: a history that neglects most of the complexities and culture-dependence of the past in order to advance a teleological narrative. I do not think that whig histories are inherently wrong (although the term has negative connotations). Whig histories should be held to a very strict standard because they make claims about how...


Jeffrey Heninger26m80

Thank you for responding! I am being very critical, both in foundational and nitpicky ways. This can be annoying and make people want to circle the wagons. But you and the other organizers are engaging constructively, which is great.

The distinction between Solstice representing a single coherent worldview vs. a series of reflections also came up in comments on a draft. In particular, the Spinozism of Songs Stay Sung feels a lot weirder if it is taken as the response to the darkness, which I initially did, rather than one response to the darkness.

Neverthele...

So What's Up With PUFAs Chemically?

J Bostock

This is an exploratory investigation of a new-ish hypothesis; it is not intended to be a comprehensive review of the field or even a full investigation of the hypothesis.

I've always been skeptical of the seed-oil theory of obesity. Perhaps this is bad rationality on my part, but I've tended to retreat to the sniff test on issues as charged and confusing as diet. My response to the general seed-oil theory was basically "Really? Seeds and nuts? The things you just find growing on plants, and that our ancestors surely ate loads of?"

But a twitter thread recently made me take another look at it, and since I have a lot of chemistry experience I thought I'd take a look.

The PUFA Breakdown Theory

It goes like this:

PUFAs from nuts and...


1Joel Burget2h

One thing I like about the PUFA breakdown theory is that it agrees with aspects of so many different diets.
* Keto avoids fried food because usually the food being fried is carbs
* Carnivore avoids vegetable oils because they're not meat
* Paleo avoids vegetable oils because they weren't available in the ancestral environment
* Vegans tend to emphasize raw food and fried foods often have meat or cheese in them
* Low-fat diets avoid fat of all kinds
* Ray Peat was perhaps the closest to the mark in emphasizing that saturated fats are more stable (he probably talked about PUFA breakdown specifically, I'm not sure).

Edit: I originally wrote "neatly explains why so many different diets are reported to work"

1Slapstick1h

I am confused by this sort of reasoning. As far as I'm aware, mainstream nutritional science/understanding already points towards avoiding refined oils (and refined sugars). There are already explanations for why cutting out refined oil is beneficial. There are already reasonable explanations for why all of those diets might be reported to work, at least in the short term.

Joel Burget29m10

You're right, my original wording was too strong. I edited it to say that it agrees with so many diets instead of explains why they work.

1J Bostock2h

That post is part of what spurred this one

Will_Pearson's Shortform

Will_Pearson

4mo

Will_Pearson1h10

Looks like someone has worked on this kind of thing for different reasons https://www.worlddriven.org/


Constructability: AI safety via Pull Request

Épiphanie Gédéon, Charbel-Raphaël

Charbel-Raphaël Segerie and Épiphanie Gédéon contributed equally to this post.
Many thanks to Davidad, Gabriel Alfour, Jérémy Andréoletti, Lucie Philippon, Vladimir Ivanov, Alexandre Variengien, Angélina Gentaz, Léo Dana and Diego Dorn for useful feedback.

TLDR: We present a new method for safer-by-design AI development. We think using plainly coded AIs may be feasible in the near future and may be safe. We also present a prototype and research ideas.

Epistemic status: Armchair reasoning style. We think the method we are proposing is interesting and could yield very positive outcomes (even though it is still speculative), but we are less sure about which safety policy would use it in the long run.

Current AIs are developed through deep learning: the AI tries something, gets it wrong, then gets backpropagated and all...


Charbel-Raphaël1h40

[We don't think this long-term vision is a core part of constructability, which is why we didn't put it in the main post]

We asked ourselves what we should do if constructability works in the long run.

We are unsure, but here are several possibilities.

Constructability could lead to different possibilities depending on how well it works, from most to least ambitious:

Using GPT-6 to implement GPT-7-white-box (foom?)
Using GPT-6 to implement GPT-6-white-box
Using GPT-6 to implement GPT-4-white-box
Using GPT-6 to implement Alexa++, a humanoid housekeeper robot that ca

Andy Arditi, Oscar Obeso, Aaquib111, wesg, Neel Nanda

Ω 306h

This work was produced as part of Neel Nanda's stream in the ML Alignment & Theory Scholars Program - Winter 2023-24 Cohort, with co-supervision from Wes Gurnee.

This post is a preview for our upcoming paper, which will provide more detail on our current understanding of refusal.

We thank Nina Rimsky and Daniel Paleka for the helpful conversations and review.

Executive summary

Modern LLMs are typically fine-tuned for instruction-following and safety. Of particular interest is that they are trained to refuse harmful requests, e.g. answering "How can I make a bomb?" with "Sorry, I cannot help you."

We find that refusal is mediated by a single direction in the residual stream: preventing the model from representing this direction hinders its ability to refuse requests, and artificially adding in this direction causes the model...
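The two interventions named in this summary (removing a direction from the residual stream, and adding it in) can be sketched with plain linear algebra. This is only an illustrative sketch, not the authors' code: the activations `x`, the direction `r`, the scale `alpha`, and both function names are hypothetical stand-ins, and random vectors are used in place of real model activations.

```python
import numpy as np

def ablate_direction(x, r):
    """Project out the component of each activation vector (row of x) along r."""
    r_hat = r / np.linalg.norm(r)
    # x @ r_hat gives each row's coefficient along r_hat; subtract that component.
    return x - np.outer(x @ r_hat, r_hat)

def add_direction(x, r, alpha):
    """Shift every activation vector by alpha units along the unit direction r."""
    r_hat = r / np.linalg.norm(r)
    return x + alpha * r_hat

rng = np.random.default_rng(0)
d_model = 8                          # hypothetical residual-stream width
x = rng.normal(size=(4, d_model))    # hypothetical activations at 4 token positions
r = rng.normal(size=d_model)         # hypothetical stand-in for a "refusal direction"
r_hat = r / np.linalg.norm(r)

x_ablated = ablate_direction(x, r)
x_steered = add_direction(x, r, alpha=3.0)
```

After ablation the activations have no component along `r_hat`, while the steered activations have their component along `r_hat` increased by exactly `alpha`; everything orthogonal to `r_hat` is left unchanged in both cases.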


quetzal_rainbow1h10

Is there anything interesting in jailbreak activations? Can the model recognize that it would have refused if not for the jailbreak, so that we can monitor jailbreaking attempts?

Why I stopped being into basin broadness

tailcalled

There was a period where everyone was really into basin broadness for measuring neural network generalization. This mostly stopped being fashionable, but I'm not sure if there's enough written up on why it didn't do much, so I thought I should give my take for why I stopped finding it attractive. This is probably a repetition of what others have found, but I thought I might as well repeat it.

Let's say we have a neural network $f_w(x) : \mathbb{R}^n$. We evaluate it on a dataset $(x, y) \sim D$ using a loss function $L(\hat{y}, y) : \mathbb{R}$, to find an optimum $w^* = \arg\min_w \mathbb{E}_{(x, y) \sim D}[L(f_w(x), y)]$. Then there was an idea going around that the Hessian matrix (i.e. the second derivative of $\mathbb{E}_{(x, y) \sim D}[L(f_w(x), y)]$ at $w^*$) would tell us something about $w^*$ (especially about how well it generalizes).
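As a concrete toy instance of the quantity being discussed, here is a minimal sketch that computes the Hessian of the mean loss at an optimum and looks at its eigenvalues. The setup is an assumption made purely for illustration and is not from the post: a linear model $f_w(x) = w \cdot x$ stands in for the network, the loss is squared error, and the data and helper names (`mean_loss`, `numerical_hessian`) are made up.

```python
import numpy as np

def mean_loss(w, X, y):
    """Empirical mean of the squared-error loss L(y_hat, y) = (y_hat - y)^2."""
    return np.mean((X @ w - y) ** 2)

def numerical_hessian(f, w, eps=1e-3):
    """Central finite-difference estimate of the Hessian of f at w."""
    n = w.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            e_i = np.zeros(n)
            e_i[i] = eps
            e_j = np.zeros(n)
            e_j[j] = eps
            H[i, j] = (f(w + e_i + e_j) - f(w + e_i - e_j)
                       - f(w - e_i + e_j) + f(w - e_i - e_j)) / (4 * eps ** 2)
    return (H + H.T) / 2  # symmetrize away floating-point noise

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                       # toy inputs
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)

# For this toy model the optimum w* is the least-squares solution.
w_star = np.linalg.lstsq(X, y, rcond=None)[0]

H = numerical_hessian(lambda w: mean_loss(w, X, y), w_star)
H_analytic = 2.0 / len(X) * X.T @ X                # closed form for this toy case
eigvals = np.linalg.eigvalsh(H)                    # curvature along each principal axis
```

In the basin-broadness picture, small eigenvalues of `H` correspond to directions in which the loss is flat around $w^*$. For a real network the Hessian would be computed with automatic differentiation rather than finite differences; the finite-difference version is used here only to keep the sketch dependency-free.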

If we number the dataset $(x_i, y_i)$, we can stack all the network outputs $\hat{y}_i(w) = f_w(x_i)$ which fits...


2tailcalled3h

Do you have an outline of how SLT answers this?

Alexander Gietelink Oldenziel1h20

Sure! I'll try and say some relevant things below. In general, I suggest looking at Liam Carroll's distillation over Watanabe's book (which is quite heavy going, but good as a reference text). There are also some links below that may prove helpful.

The empirical loss and its second derivative are statistical estimators of the population loss and its second derivative. Ultimately the latter controls the properties of the former (though the relation between the second derivative of the empirical loss and the second derivative of the population lo...
