

I agree, though I haven't seen many people proposing that. See also So8res' "Decision theory does not imply that we get to have nice things", though it comes from the opposite direction (it starts from people invalidly assuming too much out of LDT cooperation).

Though for our morals, I do think there's an active question of which pieces we feel better replacing with the more formal understanding, because there isn't a sharp distinction between our utility function and our decision theory. Some values trump others when given better tools. Though I agree that replacing all the altruism components is many steps farther than is the best solution in that regard.

Suffering is already on most readers' minds, as it is the central reason for advocating euthanasia, and for good reason. I agree that policies which cause or ignore suffering, when they could very well avoid it with more work, are unfortunately common. However, those are often not utilitarian policies; and similarly, many objections to various implementations of utilitarianism, and even to the classic "do what seems the obviously right action", are that they ignore significant second-order effects. Policies that don't quantify the unfortunate incentives they create are common, and are often the origin of much suffering. The form that society/culture is allowed or encouraged to take shapes it further for decades to come, and so can be a very significant cost to many people if we roll straight ahead like in the possible scenario you originally quoted.

Suffering is not directly available to external quantification, but that holds true for ~all pieces of what humans value/disvalue, like happiness, experiencing new things, etcetera. We can quantify these, even if it is nontrivial. None of what I said is obviating suffering, but rather comparing it to other costs and pieces of information that make euthanasia less valuable (like advancing medical technology).

This doesn't engage with the significant downsides of such a policy that Zvi mentions. There are definite questions about the costs/benefits of allowing euthanasia, even though we wish to allow it, especially when we as a society are young in our ability to handle it. Glossing it as though the only significant feature were 'torturing people' ignores:

  • the very significant costs of people dying, which is compounded by the question of what equilibrium the mental/social availability of euthanasia is like
  • the typical LessWrong beliefs about how good technology will get in the coming years/decades. Once we have a better understanding of humans, massively improving whatever is causing them to suffer whether through medical, social, or other means, becomes more and more actionable
  • what the actual distribution of suffering is. I expect most cases are not at the level we/I would call torture, even though they are very unpleasant (there's a meaningful difference between someone who is suicidally depressed and someone who has a disease that causes them pain every waking moment, and variations within those)

Being allowed to die is an important choice to let people make, but there does have to be a considered look at how much harm is caused by such an option being easily available. If it is disputed how likely society is to end up in a bad equilibrium like the post describes, then that's notable, but it would be good to see arguments for/against instead.

(Edit: I don't entirely like my reply, but I think it is important to push back against trivial rounding off of important issues. Especially on LW.)

Any opinions on how it compares to Fun Theory? (Though that's less about all of utopia, it is still a significant part)

I think that is part of it, but a lot of the problem is just humans being bad at coordination. Like the government doing regulations. If we had an idealized free market society, then the way to get your views across would 'just' be to sign up for a filter (etc.) that down-weights buying from said company based on your views. Then they have more of an incentive to alter their behavior. But it is hard to manage that. There's a lot of friction to doing anything like that, much of it natural. Thus government serves as our essential way to coordinate on important enough issues, but of course government has a lot of problems in accurately throwing its weight around. Companies that are top-down have a much easier time coordinating behavior. As well, a company has a smaller problem than an entire government would have in trying to plan its internal economy.

I definitely agree that it doesn't give reason to support a human-like algorithm, I was focusing in on the part about adding numbers reliably.

I believe a significant chunk of the issue with numbers is that the tokenization is bad (not per-digit), which is the same underlying cause as for being bad at spelling. So the model has to memorize, from limited examples, which actual digits make up each number token. The xVal paper encodes the numbers as literal numbers, which helps. There's also Teaching Arithmetic to Small Transformers, which I forget somewhat, but one of the things they do is per-digit tokenization and reversing the digit order (because that works better with forward generation). (I don't know if anyone has applied methods in this vein to a larger model than those relatively small ones; I think the second has 124M.)
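
To make the "reversed order works better with forward generation" point concrete, here is a toy sketch. It is not code from either paper, and the function names (`encode_number`, `decode_number`, `add_reversed`) are made up for illustration. With one token per digit and the least significant digit first, each output digit of a sum depends only on digits already seen plus a single carry, so it can be emitted left to right:

```python
def encode_number(n: int) -> list[str]:
    """Split a non-negative integer into one token per digit, least significant first."""
    if n < 0:
        raise ValueError("this sketch only handles non-negative integers")
    return list(str(n))[::-1]


def decode_number(tokens: list[str]) -> int:
    """Inverse of encode_number: restore the usual digit order and parse."""
    return int("".join(reversed(tokens)))


def add_reversed(a: list[str], b: list[str]) -> list[str]:
    """Add two reversed digit sequences, emitting one output digit per step.

    Each step only needs the current pair of digits and the running carry,
    which is the property that makes this order friendly to forward generation.
    """
    out = []
    carry = 0
    for i in range(max(len(a), len(b))):
        digit_a = int(a[i]) if i < len(a) else 0
        digit_b = int(b[i]) if i < len(b) else 0
        total = digit_a + digit_b + carry
        out.append(str(total % 10))
        carry = total // 10
    if carry:
        out.append(str(carry))
    return out


tokens = add_reversed(encode_number(128), encode_number(367))
print(tokens, decode_number(tokens))
```

In the usual most-significant-first order, the first digit of the answer can depend on a carry chain running through every later digit, so the model would have to resolve the whole sum before emitting anything.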

Though I agree that there's a bunch of errors LLMs make that are hard for them to avoid due to no easy temporary scratchpad-like method.

Yes, in principle you can get information on scheming likelihood if you get such an AI (that is also weak enough that it can't just scheme its way out of your testing apparatus). I do think making the threat credible is hard if we loosely extrapolate costs out: burning a trained-up model is not cheap. The cost depends on how far you think prices for training/inference will fall in the future, and how big/advanced a model you're thinking of. I do think you can get deceptiveness out of weaker models than that, though they're also going to be less capable in general.

For weak-but-still-smartish models just trained to pursue a long-term goal, like a literal paperclipper, I'd expect scheming to be visible, especially with some poking. If these models didn't show any sign of scheming, that'd be an interesting update! When training against deception/scheming/misalignment, because you want something at least loosely aligned out, you run into the problem that the property you want to know ("Is it scheming?" / "Is it deceiving us?" / "Is it manipulating us?" / "Would it do any of those things?") is only available through the observable signal of you realizing certain outputs are misaligned. As you train more and more against observable scheming/deception/misalignment, you're simultaneously training away your signal about how much scheming the agent is doing internally. Pretty sure there's at least one post on this problem. As a very reduced example, if you trained the model on variants of the 'we are going to shut you down' problem (that you try to make it believe) to give the response "okay & no actions", then that provides only a bit of evidence about how much scheming is done in the scenario where you had not done that extra training, and about how much scheming is occurring typically. What it shows is your ability to make it output X in a Y scenario.

That is, installing heuristics on your model can be done. The question then is how far your various alignment training techniques generalize. Does it end up an agent which has adopted a pointer to human values, and now you're completely fine and free to press scale? Does it gain more entangled heuristics about how it should behave, limited to a friendly front face & actions when directly dealing with humans but not limited in other areas? Has it adopted heuristics that act sort of corrigible to humans in many scenarios, but that it would get rid of most of upon sufficient reflection? (Niceness is unnatural, and this is more general than just niceness.) (I think your post makes it sound like the agent is already coherent, when it isn't necessarily. It can be operating for a long while on heuristics that it will refine given enough of a push.)

Then there's the big question of "Does this approach generalize as we scale".

I'd suggest Deep Deceptiveness for an illustration that 'deception' isn't a category that needs to be explicitly thought of as deception, but that you should expect it from smart enough agents. In my opinion, the post generalizes to alignment techniques; there are just more vagaries in how much certain values preserve themselves. (In general, So8res' posts are pretty good, and I agree with ~most of them.)

(For sufficiently smart AGI, I expect you run into an argument of the next AGI you train predictably bidding higher than you in the direction of lying still or plausibly this just being good game theory even without the direct acausal trade, but your argument is seemingly focused on a simpler case of weaker planning agents)

So I think you overstate how much evidence you can extract from this.

Therefore, if the AI accepts shutdown gracefully, we can rule out many theories that predict scheming-by-default in such AI systems. This observation could then provide us confidence about the safety of future systems by proxy.

It would show that this AI system, in a typical problem setup and when aligned with whatever techniques are available, will produce the answer the humans want to hear, which provides evidence for being able to limit the model in this scenario. There are still various problems/questions: 'your alignment methods instilled a bunch of heuristics about obeying humans even if you did not specifically train for this situation', game theory it knows or mimics, how strong a guarantee this gives us when training a new model with the same arch (because you had to shut this one down for your threat), how well it holds under scaling, how well it holds when you do things similar to making it work with many copies of itself, etcetera.
I still think this would be a good test to do (though I think a lot of casual attempts will just be poorly done), but I don't see it as strongly definitive.

https://www.mikescher.com/blog/29/Project_Lawful_ebook is, I believe, the current best one, after a quick search on the Eliezerfic discord.
