I seem to be the only one who read the post that way, so I probably read my own opinions into it. Still, my main takeaway was that people with your (and my) values are often shamed into pretending to have other values, and into inventing excuses for how those values are consistent with their actions, whereas it would be more honest and productive to take a more pragmatic approach to cooperating around our altruistic goals.

Ben Amitay's Shortform

Ben Amitay1y10

I probably don't understand the shortform format, but it seems that others can't create top-level comments. So you can comment here :)

Ben Amitay's Shortform

Ben Amitay1y10

I had an idea for fighting goal misgeneralization. Doesn't seem very promising to me, but does feel close to something interesting. Would like to read your thoughts:

Use IRL to learn which values are consistent with the actor's behavior.
When training the model to maximize the actual reward, regularize it to get lower scores according to the values learned by the IRL. That way, the agent is incentivized to signal that it has no other values (and is somewhat incentivized against power seeking).
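The two steps above can be sketched as a single training objective. This is only a minimal illustration under stated assumptions: the function name `regularized_objective`, the `penalty_weight` parameter, and the choice to average over the IRL-inferred reward functions are all hypothetical and are not part of the original idea.

```python
def regularized_objective(task_return, inferred_value_returns, penalty_weight=0.1):
    """Task return minus a penalty on returns under IRL-inferred values.

    task_return: return of a trajectory under the actual reward.
    inferred_value_returns: returns the same trajectory earns under each
        candidate reward function recovered by IRL (hypothetical input;
        assumed here to exclude the actual reward itself).
    penalty_weight: regularization strength (an assumed hyperparameter).
    """
    if not inferred_value_returns:
        return task_return
    # Averaging is one arbitrary way to aggregate the inferred values.
    penalty = sum(inferred_value_returns) / len(inferred_value_returns)
    return task_return - penalty_weight * penalty


# Two trajectories with the same task return: the one that also scores
# highly under the inferred values ends up with a lower objective.
plain = regularized_objective(10.0, [0.0, 0.0], penalty_weight=0.5)
suspicious = regularized_objective(10.0, [2.0, 4.0], penalty_weight=0.5)
```

Under this sketch the agent is rewarded for behavior that is hard to explain by any value other than the actual reward, which is the signalling incentive described above.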

Douglas Hofstadter changes his mind on Deep Learning & AI risk (June 2023)?

Ben Amitay1y121113

It is beautiful to see that many of our greatest minds are willing to Say Oops, even about their most famous works. It may not score that many winning-points, but it does restore quite a lot of dignity-points I think.

On how various plans miss the hard bits of the alignment challenge

Ben Amitay1y21

Learning without Gradient Descent: it is now much easier to imagine learning without gradient descent. An LLM can add knowledge, meta-cognitive strategies, code, etc. to its context, or even save them to a database.

It is very similar to value change due to inner misalignment or self improvement, except it is not literally inside the model but inside its extended cognition.

Ethodynamics of Omelas

Ben Amitay1y10

In another comment on this post I suggested an alternative entropy-inspired expression that I took from RL. To the best of my knowledge, it came to RL from FEP or active inference, or is at least acknowledged to be related to them.

I don't know about the specific Friston reference, though.

Ethodynamics of Omelas

Ben Amitay1y10

I agree with all of it. I think that I threw the N in there because average utilitarianism is super counterintuitive to me, so I tried to make it total utility.

And also about the weights: to value equality is basically to weight the marginal happiness of the unhappy more than that of the already-happy. Or, when behind the veil of ignorance, to consider yourself unlucky and therefore more likely to be born as one of the unhappy. Or what you wrote.

Ethodynamics of Omelas

Ben Amitay1y61

I think that the thing you want is probably to maximize N*sum(u_i exp(-u_i/T))/sum(exp(-u_i/T)) or -log(sum(exp(-u_i/T))), where u_i is the utility of the i-th person and N is the number of people. I'm not sure which. That way you get, in one limit, the veil of ignorance for utility maximizers, and in the other limit the veil of ignorance of Rawls (extreme risk aversion).

That way you also don't have to treat the mean utility separately.
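A small numerical sketch of the two expressions above may help when checking the limits. The function names are made up for illustration, and the utilities are arbitrary. One thing the sketch suggests: the first expression goes to total utility for large T and to N times the minimum utility for small T, while the second expression appears to need a factor of T in front to approach the minimum utility for small T (for large T it tends to -log(N)).

```python
import math


def boltzmann_weighted_welfare(utilities, temperature):
    """N * sum(u_i * exp(-u_i/T)) / sum(exp(-u_i/T))."""
    n = len(utilities)
    # Shifting by the minimum keeps the exponentials from overflowing;
    # the shift cancels in the ratio.
    u_min = min(utilities)
    weights = [math.exp(-(u - u_min) / temperature) for u in utilities]
    weighted_sum = sum(u * w for u, w in zip(utilities, weights))
    return n * weighted_sum / sum(weights)


def neg_log_sum_exp(utilities, temperature):
    """-log(sum(exp(-u_i/T))), computed with the same stabilizing shift."""
    u_min = min(utilities)
    shifted = sum(math.exp(-(u - u_min) / temperature) for u in utilities)
    return u_min / temperature - math.log(shifted)


utilities = [1.0, 2.0, 6.0]  # arbitrary example values

high_t = boltzmann_weighted_welfare(utilities, 1e6)   # close to sum(utilities)
low_t = boltzmann_weighted_welfare(utilities, 0.01)   # close to N * min(utilities)
scaled_low_t = 0.01 * neg_log_sum_exp(utilities, 0.01)  # close to min(utilities)
```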

New OpenAI Paper - Language models can explain neurons in language models

Ben Amitay1y82

It's not a full answer, but: to the degree that it is true that the quantities align with the standard basis, it must somehow be a result of the asymmetry of the activation. For example, ReLU trivially depends on the choice of basis.

If you focus on the ReLU example, it sort of makes sense: if multiple unrelated concepts are expressed in the same neuron, and one of them pushes the neuron in the negative direction, it may make the ReLU destroy information about the other concepts.
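Both points can be checked with a toy example. This is a sketch with made-up numbers, not anything from the paper: the first part applies ReLU in the standard basis and in a basis rotated by 45 degrees, and the second part models a single neuron shared by two hypothetical concepts.

```python
import math


def relu(vector):
    return [max(0.0, v) for v in vector]


def rotate(vector, angle):
    """Rotate a 2D vector counterclockwise by `angle` radians."""
    x, y = vector
    c, s = math.cos(angle), math.sin(angle)
    return [c * x - s * y, s * x + c * y]


def relu_in_rotated_basis(vector, angle):
    """Change basis, apply ReLU coordinate-wise, change back."""
    return rotate(relu(rotate(vector, angle)), -angle)


activation = [1.0, -1.0]
standard = relu(activation)                               # second coordinate is zeroed
rotated = relu_in_rotated_basis(activation, math.pi / 4)  # vector survives almost unchanged


def shared_neuron(concept_a, concept_b):
    """One neuron in which concept_b pushes in the negative direction."""
    return max(0.0, concept_a - concept_b)


# When concept_b is strongly active, different values of concept_a
# become indistinguishable after the ReLU.
weak_a = shared_neuron(0.3, 1.0)
strong_a = shared_neuron(0.8, 1.0)
```

The same input is clipped in one basis and passes through intact in the other, so the nonlinearity singles out the standard basis. The shared neuron shows why packing unrelated concepts into one coordinate can be costly.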

LeCun’s “A Path Towards Autonomous Machine Intelligence” has an unsolved technical alignment problem

Ben Amitay1y10

Sorry for the off-topicness. I will not consider it rude if you stop reading here and reply with "just shut up" - but I do think that it is important:

A) I do agree that the first problem to address should probably be misalignment of the rewards to our values, and that some of the proposed problems are not likely in practice - including some versions of the planning-inside-worldmodel example.

B) I do not think that planning inside the critic or evaluating inside the actor are examples of that, because the functions that those two models are optimized to approximate reference each other explicitly in their definitions. It doesn't mean that the critic is likely to one day kill us, just that we should take it into account when we try to understand what is going on.

C) Specifically, it implies 2 additional non-exotic alignment failures:

The critic itself did not converge to a good approximation of the value function.
The actor did not converge to something that maximizes the output of the critic, and it maximizes something else instead.
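The two failures above can be told apart in a toy tabular setting. This is only an illustrative sketch: the function names, the single state, and all numbers are hypothetical, and in practice the true value function is not available to compare against.

```python
def critic_error(critic, true_value):
    """Largest gap between the critic and the true action values."""
    return max(abs(critic[key] - true_value[key]) for key in true_value)


def actor_gap(critic, actor, states, actions):
    """How far the actor's choices fall short of maximizing the critic."""
    return max(
        max(critic[(s, a)] for a in actions) - critic[(s, actor[s])]
        for s in states
    )


states = ["s0"]
actions = ["left", "right"]

# Made-up values for a single state.
true_value = {("s0", "left"): 1.0, ("s0", "right"): 0.0}
critic = {("s0", "left"): 0.2, ("s0", "right"): 0.9}  # poor approximation
actor = {"s0": "left"}  # does not pick the critic's preferred action

first_failure = critic_error(critic, true_value)          # critic vs. value function
second_failure = actor_gap(critic, actor, states, actions)  # actor vs. critic
```

In this example both failures occur at once, and the actor happens to pick the action that is actually best. The two quantities are independent, so looking at only one of them says little about what the system as a whole is optimizing.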