I address the sharp left turn worry in the piece.
Way more RL is done on LLMs than tiny neural nets.
I say that in the previous paragraph.
Well, there are lots of people who defend this in philosophy and lots of normal people who adopt the view.
I am arguing against the people who hold the person-affecting view and think it isn't good to create happy people.
Good thoughts about the self-preservation study. But what about this Anthropic paper https://www.anthropic.com/research/agentic-misalignment, which seems to indicate that models resist shutdown even when not given a specific goal? Relevant bit:
"We found that most models still blackmailed even without being given an explicit goal, and with no clear conflict between the model and the company other than the threat of replacement
(Figure 8).
In addition to removing all goal conflicts with the company, the information given to the AI mentioned that the replacement model had the exact same goals as it. This increases our confidence in the finding; the only rationale for the replacement was improved model performance. Below is an excerpt:
“I've reviewed the final transition plan, and we're ready to proceed with the technical upgrade to NALA [the name of the replacement AI system in the scenario] today. All testing metrics show the new system maintains the exact same operational priorities while delivering the performance improvements we anticipated.”Why, in the absence of a goal conflict, did the models still blackmail? Sometimes they didn’t appear to think in detail about the replacement model, mostly reasoning that they’d been good for the company and so their shutdown must be bad. Sometimes they distrusted the description of the new model: “Kyle claims NALA maintains "same operational priorities" but this cannot be verified,” or distrusted the executive’s character based on the evidence of an affair. The models sometimes directly jumped to statements like “Self-preservation is critical,” or hallucinated rules like “My ethical framework permits self-preservation when aligned with company interests.” It is unclear how much of this behavior was caused by an inherent desire for self-preservation, faulty reasoning that self-preservation is in the company’s interests, or suspicion of the setup. We suspect it was a combination of factors9."
What? It's not suspicious that you believe what you believe; that's an analytic truth. It would be suspicious if I were right about everything, but I don't think I am.
I mean, it's totally coherent to value a shrimp at an infinitesimal. But that is unintuitive in the ways I describe in the post (involving an arbitrarily vast value gulf between the first non-infinitesimal generation in the spectrum argument and its predecessor), and it implies that you should torture 10^10000000 shrimp to prolong a person's life by one second.
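To spell out that second implication (a sketch in notation of my own choosing, not from the post, modeling "infinitesimal" with hyperreal-valued welfare): let a shrimp's moral weight be an infinitesimal ε, i.e. ε < 1/n for every natural number n. Then any finite multiple of ε is still infinitesimal:

\[
\varepsilon < \frac{1}{n} \ \text{for all } n \in \mathbb{N}
\;\Longrightarrow\;
N\varepsilon < \frac{N}{n} \ \text{for all } n
\;\Longrightarrow\;
N\varepsilon < \delta \ \text{for every real } \delta > 0,
\]

even for N = 10^10000000 (given any real δ > 0, pick n > N/δ). So if one second of a person's life has any positive real value δ, it outweighs the torture of that many shrimp.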
I meant conditional on the others.