
Ben Amitay

A math and computer science graduate interested in machine and animal cognition, philosophy of language, interdisciplinary ideas, etc.

Comments

Would You Work Harder In The Least Convenient Possible World?
Ben Amitay · 2y

I seem to be the only one who read the post that way, so I probably read my own opinions into it, but my main takeaway was pretty much this: people with your (and my) values are often shamed into pretending to have other values and into inventing excuses for how their values are consistent with their actions, while it would be more honest and productive to take a more pragmatic approach to cooperating around our altruistic goals.

Ben Amitay's Shortform
Ben Amitay · 2y

I probably don't understand the shortform format, but it seems like others can't create top-level comments. So you can comment here :)

Ben Amitay's Shortform
Ben Amitay · 2y

I had an idea for fighting goal misgeneralization. Doesn't seem very promising to me, but does feel close to something interesting. Would like to read your thoughts:

  1. Use IRL to learn which values are consistent with the actor's behavior.
  2. When training the model to maximize the actual reward, regularize it to get lower scores according to the values learned by the IRL. That way, the agent is incentivized to signal not having any other values (and is somewhat incentivized against power seeking). A rough sketch of what this could look like follows below.
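A minimal PyTorch sketch of how that regularizer could look in a policy-gradient setting (the names true_reward_fn, irl_reward_fns and the penalty weight beta are illustrative assumptions of mine, not part of the original idea):

```python
import torch

def regularized_policy_loss(log_probs, trajectories, true_reward_fn, irl_reward_fns, beta=0.1):
    # REINFORCE-style surrogate for the idea above: reward the agent for the actual
    # task reward, and penalize it for scoring well under the reward functions that
    # IRL inferred as consistent with its past behavior.
    true_r = torch.stack([true_reward_fn(t) for t in trajectories])
    # Mean score of each trajectory under the IRL-inferred value candidates.
    irl_r = torch.stack([torch.stack([f(t) for f in irl_reward_fns]).mean()
                         for t in trajectories])
    # Maximize the true reward while minimizing the score under the inferred values.
    advantage = true_r - beta * irl_r
    return -(log_probs * advantage.detach()).mean()
```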
Douglas Hofstadter changes his mind on Deep Learning & AI risk (June 2023)?
Ben Amitay · 2y

It is beautiful to see that many of our greatest minds are willing to Say Oops, even about their most famous works. It may not score that many winning-points, but it does restore quite a lot of dignity-points I think.

On how various plans miss the hard bits of the alignment challenge
Ben Amitay · 2y

Learning without Gradient Descent - It is now much easier to imagine learning without gradient descent. An LLM can add knowledge, meta-cognitive strategies, code, etc. into its context, or even save them into a database.

It is very similar to value change due to inner misalignment or self improvement, except it is not literally inside the model but inside its extended cognition.

Ethodynamics of Omelas
Ben Amitay · 2y

In another comment on this post I suggested an alternative entropy-inspired expression that I took from RL. To the best of my knowledge, it came to the RL context from FEP or active inference, or is at least acknowledged to be related.

I don't know about the specific Friston reference, though.

Ethodynamics of Omelas
Ben Amitay · 2y

I agree with all of it. I think I threw the N in there because average utilitarianism is super counterintuitive to me, so I tried to make it total utility.

And also about the weights - to value equality is basically to weight the marginal happiness of the unhappy more than that of the already-happy. Or, when behind the veil of ignorance, to consider yourself unlucky and therefore more likely to be born as the unhappy. Or what you wrote.

Ethodynamics of Omelas
Ben Amitay · 2y

I think that the thing you want is probably to maximize N*sum(u_i exp(-u_i/T))/sum(exp(-u_i/T)) or -log(sum(exp(-u_i/T))), where u_i is the utility of the i-th person and N is the number of people - not sure which. That way, in one limit you get the veil of ignorance for utility maximizers, and in the other limit the veil of ignorance of Rawls (extreme risk aversion).

That way you also don't have to treat the mean utility separately.
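For concreteness, a small numpy sketch of the two candidate expressions (my own illustration; T controls the degree of risk aversion):

```python
import numpy as np

def soft_aggregates(u, T):
    # u: array of individual utilities u_i; T: "temperature".
    w = np.exp(-u / T)
    weighted = len(u) * np.sum(u * w) / np.sum(w)  # N * sum(u_i exp(-u_i/T)) / sum(exp(-u_i/T))
    log_form = -np.log(np.sum(w))                  # -log(sum(exp(-u_i/T)))
    return weighted, log_form

u = np.array([1.0, 2.0, 10.0])
print(soft_aggregates(u, T=100.0))  # high T: the first expression is close to total utility
print(soft_aggregates(u, T=0.01))   # low T: dominated by the worst-off person (the Rawlsian limit)
```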

New OpenAI Paper - Language models can explain neurons in language models
Ben Amitay · 2y

It's not a full answer, but: to the degree that it is true that the quantities align with the standard basis, it must somehow be a result of the asymmetry of the activation. For example, ReLU trivially depends on the choice of basis.

If you focus on the ReLU example, it sort of makes sense: if multiple unrelated concepts are expressed in the same neuron, and one of them pushes the neuron in the negative direction, it may make the ReLU destroy information about the other concepts.
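A quick numpy check of that basis-dependence (my own illustration, not from the paper): applying ReLU in a randomly rotated basis and rotating back does not reproduce ReLU in the standard basis.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 8))   # pretend pre-activation values

relu = lambda z: np.maximum(z, 0.0)

# Random orthogonal change of basis.
q, _ = np.linalg.qr(rng.normal(size=(8, 8)))

standard = relu(x)               # ReLU applied in the standard basis
rotated = relu(x @ q) @ q.T      # ReLU applied in the rotated basis, mapped back

# A basis-independent operation would give identical results; ReLU does not.
print(np.abs(standard - rotated).mean())
```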

LeCun’s “A Path Towards Autonomous Machine Intelligence” has an unsolved technical alignment problem
Ben Amitay · 2y

Sorry for the off-topicness. I will not consider it rude if you stop reading here and reply with "just shut up" - but I do think that it is important:

A) I do agree that the first problem to address should probably be misalignment of the rewards to our values, and that some of the proposed problems are not likely in practice - including some versions of the planning-inside-worldmodel example.

B) I do not think that planning inside the critic or evaluating inside the actor are examples of that, because the functions that those two models are optimized to approximate reference each other explicitly in their definitions (a minimal sketch of this mutual reference follows below). It doesn't mean that the critic is likely to one day kill us, just that we should take it into account when we try to understand what is going on.

C) Specifically, it implies 2 additional non-exotic alignment failures:

  • The critic itself did not converge to be a good approximation of the value function.
  • The actor did not converge to be a thing that maximizes the output of the critic, and it maximizes something else instead.
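Regarding point B, here is a minimal sketch (my own illustration in the style of deterministic actor-critic methods, not LeCun's architecture) of how the two objectives reference each other:

```python
import torch

def actor_critic_losses(actor, critic, s, a, r, s_next, gamma=0.99):
    # The critic's regression target is defined through the actor's policy...
    with torch.no_grad():
        target = r + gamma * critic(s_next, actor(s_next))
    critic_loss = (critic(s, a) - target).pow(2).mean()

    # ...and the actor's objective is defined through the critic's estimate.
    actor_loss = -critic(s, actor(s)).mean()
    return critic_loss, actor_loss
```

The two failure modes listed in point C correspond to either of these objectives not being reached in training.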

Posts

3 · Ben Amitay's Shortform · 2y · 2
10 · Writing this post as rationality case study · 2y · 8
2 · Semantics, Syntax and Pragmatics of the Mind? [Question] · 2y · 0
12 · Agents synchronization · 2y · 1
4 · Training for corrigability: obvious problems? [Question] · 3y · 6
4 · A learned agent is not the same as a learning agent · 3y · 5
1 · A Short Intro to Humans · 3y · 1