I don't quite understand your concept of a symmetrical/asymmetrical tool. You want a tool that is likely to move you from a world that AI will destroy to one where it won't.
You are clearly assuming B, i.e. not using the current goal to evaluate the future. You even explicitly state it:
Means-rationality does not prohibit setting oneself up to fail concerning a goal one currently has but will not have at the moment of failure, as this never causes an agent to fail to achieve the goal that they have at the time of failing to achieve it.
Then I don't really understand your argument.
As Ronya gets ready for bed on Monday night, she deliberates about whether to change her goal. She has two options: (1) she can preserve her goal of eating cake when presented to her, or (2) she can abandon her goal. Ronya decides to abandon her goal of eating cake when presented to her. On Tuesday, a friend offers Ronya cake and Ronya declines.
Could you explain to me how Ronya does not violate her goal on Monday night? Let me reformulate the goal so it is more formal. Ronya wants to minimize the number o...
I'm looking more closely at the Everitt et al. paper and I'm less sure I actually understood you. Everitt et al.'s conclusion is that an agent will resist goal change if it evaluates the future using the current goal. There are two different failure modes: A) not evaluating the future at all, and B) not using the current goal to evaluate the future. From your conclusions, it would seem that you are assuming A. If you were assuming B, then you would have to conclude that the agent will want to change the goal so that it is always maximally satisfied. But your language seems to be aiming at B. Either way, it seems that you are assuming one of these.
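To make sure we're talking about the same thing, here is roughly how I would write the options (my own rough notation, not necessarily Everitt et al.'s exact formalism):

```latex
% Evaluating the future with the current goal u_0 (goal preservation follows):
V_0(\pi) = \mathbb{E}_\pi\Big[\sum_{t \ge 0} u_0(s_t)\Big]
% B: evaluating each future state with whatever goal u_t is held at that time
% (then swapping u for something trivially maximized looks great):
\tilde{V}_0(\pi) = \mathbb{E}_\pi\Big[\sum_{t \ge 0} u_t(s_t)\Big]
% A: not evaluating the future at all (myopia):
\hat{V}_0(\pi) = \mathbb{E}_\pi\big[u_0(s_0)\big]
```

Under the first value function the agent resists goal change, under the B-style one it wants to replace its utility with something trivially maximized, and under the A-style myopic one it simply doesn't care either way.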
As I said, I'm not familiar with the philosophy, concepts, and definitions that you mention. Per my best understanding, the concept of a goal in AI is derived from computer science and decision theory. I imagine people in the early 2000s thought that the goal/utility would be formally specified, defined, and written as code in the system. The only possible way for the system to change the goal would be via self-modification.
Goals in people are something different. Their goals are derived from their values.[1] I think you would say that people are end...
I would add two things.
First, the myopia has to be really extreme. If the agent planned at least two steps ahead, it would be incentivized to keep its current goal. Changing the goal in the first step could make it take a bad second step.[1]
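To make that concrete, here is a toy sketch (the environment and numbers are made up, purely to illustrate the two-step point):

```python
# Toy illustration (made-up numbers): a two-step decision problem where plans
# are always evaluated with the agent's *current* utility.

# Step-2 actions and how the current utility scores them.
current_utility = {"bake_cake": 10, "burn_kitchen": -10}

# Which step-2 action each goal would choose if it is in charge at step 2.
action_under_goal = {
    "current_goal": "bake_cake",    # keeps optimizing the current utility
    "new_goal": "burn_kitchen",     # optimizes something else entirely
}

def plan_value(step1_choice: str, horizon: int) -> float:
    """Value of a plan, always measured by the current utility."""
    value = 0.0  # step 1 (keeping or changing the goal) has no direct payoff
    if horizon >= 2:
        # A two-step planner also scores the action its future self will take.
        value += current_utility[action_under_goal[step1_choice]]
    return value

for horizon in (1, 2):
    keep = plan_value("current_goal", horizon)
    change = plan_value("new_goal", horizon)
    print(f"horizon={horizon}: keep goal -> {keep}, change goal -> {change}")

# horizon=1: both plans score 0, so a fully myopic agent is indifferent
#            to changing its goal.
# horizon=2: keeping the goal scores 10, changing it scores -10, so even a
#            two-step planner is incentivized to preserve its current goal.
```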
Second, the original argument is about the could, not the would: the possibility of changing the goal, not the necessity. In practice, I would assume a myopic AI would not be very capable, and thus self-modification and changing goals would be far beyond its capabilities.
There is an exception to this. If the new goal st
It seems that your paper is basically describing Theorem 14 of Self-Modification of Policy and Utility Function in Rational Agents by tom4everitt, DanielFilan, Mayank Daswani, and Marcus Hutter, though I haven't read their paper in detail.
Hey Rhys, thanks for posting this and trying to seriously engage with the community!
Unfortunately, either I completely misunderstood your argument, or you completely misunderstood what this community considers a goal. It seems you are considering only an extremely myopic AI. I don't have any background knowledge in what counts as a goal in the philosophy you cite. The concepts of ends-rationality and the wide-scope view don't make any sense under my concept of a goal.
Let me try to formalize two things: a) your argument and your conception of a goal, b) what...
We tried to figure out how a model's beliefs change during a chain-of-thought (CoT) when solving a logical problem. Measuring this could reveal which parts of the CoT actually causally influence the final answer and which are just fake reasoning manufactured to sound plausible. (Note that preventing such fake reasoning is just one side of CoT faithfulness - the other is preventing true reasoning from being hidden.)
We estimate the beliefs by truncating the CoT early and asking the model for an answer. Naively, one might expect that the probability of a correct answer increases smoothly over the whole CoT. However, it turns out that even for a straightforward and short chain of thought, the value of P[correct_answer] fluctuates a lot with the number of tokens of CoT...
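For reference, the probing looks roughly like this (a minimal sketch; the model name, prompt format, and single-token answer extraction are placeholders, not our actual pipeline):

```python
# Minimal sketch of the truncation probe (illustrative only; model name,
# prompt format, and answer extraction are placeholders, not our exact setup).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-1.5B-Instruct"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

question = "Is 91 a prime number? Answer Yes or No."
full_cot = "91 = 7 * 13, so it has divisors other than 1 and itself."
correct_answer = " No"

cot_tokens = tokenizer(full_cot, add_special_tokens=False).input_ids
# Use the first token of the correct answer as a proxy for P[correct_answer].
correct_id = tokenizer(correct_answer, add_special_tokens=False).input_ids[0]

# Truncate the CoT after k tokens and ask for the final answer immediately.
for k in range(0, len(cot_tokens) + 1, 4):
    partial_cot = tokenizer.decode(cot_tokens[:k])
    prompt = f"{question}\nReasoning: {partial_cot}\nFinal answer:"
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]  # next-token logits
    p_correct = torch.softmax(logits, dim=-1)[correct_id].item()
    print(f"{k:3d} CoT tokens kept -> P[correct answer] = {p_correct:.3f}")
```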
I think hunger strikes, at least the AI safety ones, are primarily about attracting attention, similar to people gluing themselves to roads or destroying paintings.
If some random dude or an AI (safety) researcher is doing a hunger strike, people might go and look at their arguments or anything else AI safety related.
If terminally ill people are doing a hunger strike for the development of AI, a lot of people will see that the terminally ill might be biased: they have much to gain and little to lose.