Academic philosophers are better than average at evaluating object-level arguments for a claim. They seem less good at thinking about what motivated search and rationalization imply about which arguments surface in the first place. Compared to academic philosophers, rationalists strike me as especially appreciative of filtered evidence and its significance for your world model.
If you find an argument for a claim easily, then even if that argument is strong, this implies (depending on some other things) that similarly strong arguments on the other side may turn up with not much more work. Given that, you won't want to update dramatically in favor of the claim: powerful evidence to the contrary could, you infer, be unearthed just as cheaply. You learn something about the other side of the issue from how quickly or slowly the world yields evidence in the other direction. If it's considered a social faux pas to give strong arguments for one side of a claim, then few such arguments will reach you whatever their actual availability, so your prior about how hard they are to find will be doing a lot of the heavy lifting in fixing your world model. And so on, for the evidential consequences of other kinds of motivated search and rationalization.
In brief, you can do epistemically better than ignoring how much search power went into finding all the evidence. You can do better than only evaluating the object-level evidential considerations! You can take expended search into account, in order to model what evidence is likely hiding, where, behind how much search debt.
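One way to make "take expended search into account" concrete is a toy Bayesian model. This is only a sketch under assumptions of my own: every candidate argument someone examines independently turns out to be a strong pro-claim argument with probability `p_if_true` if the claim is true and `p_if_false` if it is false, and all you observe is that the searcher surfaced at least one strong argument after examining `n_searched` candidates. The function names and the numbers 0.05 and 0.01 are hypothetical, chosen only for illustration.

```python
def likelihood_ratio(n_searched: int, p_if_true: float, p_if_false: float) -> float:
    """Evidence carried by "a strong pro argument was found after n_searched tries".

    Returns P(at least one hit | claim true) / P(at least one hit | claim false).
    """
    found_if_true = 1 - (1 - p_if_true) ** n_searched
    found_if_false = 1 - (1 - p_if_false) ** n_searched
    return found_if_true / found_if_false


def posterior(prior: float, lr: float) -> float:
    """Update a prior probability by a likelihood ratio, via odds."""
    odds = prior / (1 - prior) * lr
    return odds / (1 + odds)


# The same strong argument, discounted by how hard someone had to dig for it.
for n in (1, 10, 100, 1000):
    lr = likelihood_ratio(n, p_if_true=0.05, p_if_false=0.01)
    print(f"candidates searched: {n:5d}  "
          f"likelihood ratio: {lr:.3f}  "
          f"posterior from 0.5: {posterior(0.5, lr):.3f}")
```

In this toy model the likelihood ratio shrinks toward 1 as the search effort grows: an argument that turned up on the first try is real evidence, while the same argument dredged up by a long motivated search tells you almost nothing. The model leaves out everything about the other side's search, so it covers only one of the effects described above.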
Modest spoilers for planecrash (Book 9 -- null action act II).
Nex and Geb had each INT 30 by the end of their mutual war. They didn't solve the puzzle of Azlant's IOUN stones... partially because they did not find and prioritize enough diamonds to also gain Wisdom 27. And partially because there is more to thinkoomph than Intelligence and Wisdom and Splendour, such as Golarion's spells readily do enhance; there is a spark to inventing notions like probability theory or computation or logical decision theory from scratch, that is not directly measured by Detect Thoughts nor by tests of legible ability at using existing math. (Keltham has slightly above-average intelligence for dath ilan, reflectivity well below average, and an ordinary amount of that spark.)
But most of all, Nex and Geb didn't solve IOUN stones because they didn't come from a culture that had already developed digital computation and analog signal processing. Or on an even deeper level - because those concepts can't really be that hard at INT 30, even if your WIS is much lower and you are missing some sparks - they didn't come from a culture which said that inventing things like that is what the Very Smart People are supposed to do with their lives, nor that Very Smart People are supposed to recheck what their society told them were the most important problems to solve.
Nex and Geb came from a culture which said that incredibly smart wizards were supposed to become all-powerful and conquer their rivals; and invent new signature spells that would be named after them forever after; and build mighty wizard-towers, and raise armies, and stabilize impressively large demiplanes; and fight minor gods, and surpass them; and not, particularly, question society's priorities for wizards. Nobody ever told Nex or Geb that it was their responsibility to be smarter than the society they grew up in, or use their intelligence better than common wisdom said to use it. They were not prompted to look in the direction of analog signal processing; and, more importantly in the end, were not prompted to meta-look around for better directions to look, or taught any eld-honed art of meta-looking.
What sequence of characters could I possibly, actually type out into a computer that would appreciably reduce the probability that everything dies?
Framed like this, writing to save the world sounds impossibly hard! Almost everything written has no appreciable effect on our world's AI trajectory. I'm sure the "savior sequence" exists mathematically, but finding it is a whole different ballgame.
In the beginning God created four dimensions. They were all alike and indistinguishable from one another. And then God embedded atoms of energy (photons, leptons, etc.) in the four dimensions. By virtue of their energy, these atoms moved through the four dimensions at the speed of light, the only spacetime speed. Thus, as perceived by any one of these atoms, space contracted in, and only in, the direction of that particular atom's motion. As the atoms moved at the speed of light, space contracted so much in the direction of the atom's motion that the dimension in that direction vanished. That left only three dimensions of space -- all perpendicular to the atom's direction of motion -- and the ghost of the lost fourth dimension, which makes itself felt as the current of time. Now atoms moving in different directions cannot share the same directional flow of time. Each takes on the particular current it perceives as the proper measure of time.
You measure only... as projected on your time and space dimensions.
--Lewis Carroll Epstein, Relativity Visualized (1997)
Historical experience and brainstorming about human social orders probably barely scratch the surface of the possibility space. If the CEV were to weigh in on possible posthuman social orders, optimizing in part for how cool that social order is, I'd bet what it describes blows what we've seen out of the water in terms of cool factor.
(Presumably posthumans will end up reflectively endorsing interactions with one another of some description.)
Don't translate your values into just a loss function. Rather, translate them into a loss function and all the rest of a training story. Use all the tools at your disposal in your impossible task; don't tie one hand behind your back by assuming the loss function is your only lever over the AGI's learned values.
This post crystallized some thoughts that have been floating in my head, inchoate, since I read Zvi's stuff on slack and Valentine's "Here's the Exit."
Part of the reason that it's so hard to update on these 'creative slack' ideas is that we make deals among our momentary mindsets to work hard when it's work-time. (And when it's literally the end of the world at stake, it's always work-time.) "Being lazy" is our label for someone who hasn't established that internal deal between their varying mindsets, and so is flighty and hasn't precommitted to getting stuff done even if they currently aren't excited about work.
Once you've installed that internal flinch away from not working/precommitment to work anyways, though, it's hard to accept that hard work is ever a mistake, because that seems like your current mindset trying to rationalize its way out of cooperating today!
I think I finally got past this flinch/got out of running that one particular internal status race, thanks to this and the aforementioned posts.
A model I picked up from Eric Schwitzgebel.
The humanities used to be highest-status in the intellectual world!
But then, scientists quite visibly exploded fission weapons and put someone on the moon. It's easy to coordinate to ignore some unwelcome evidence, but not evidence that blatant. So, begrudgingly, science has been steadily accorded more and more status, from the postwar period on.
The Character of an Epistemic Prisoner's Dilemma
Say there are two tribes. The tribes hold fundamentally different values, but they also model the world in different terms. Each thinks members of the other tribe are mistaken, and that some of their apparent value disagreement would be resolved if the others' mistakes were corrected.
Keeping this in mind, let's think about inter-tribe cooperation and defection.
Ruling by Reference Classes, Rather Than Particulars
In the worst equilibrium, actors from each tribe evaluate political questions in favor of their own tribe, against the outgroup. In their world model, this is to a great extent for the benefit of the outgroup members as well.
But this is a shitty regime to live under when it's done back to you too, so rival tribes can sometimes come together to implement an impartial judiciary. The natural way to do this is to have one judicial classifier rule on whole reference classes of situations, and to have a separate impartial classifier sort particular situations into those reference classes.
You're locally worse off this way, but are globally much better off.
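The two-classifier arrangement can be sketched in a few lines. This is a hypothetical illustration, not a description of any real legal system: the `Case` type, the reference-class names, and the rulings table are all invented for the example. The only point it shows is structural. The sorter sees the facts but never the tribe, and the ruler sees the reference class but never the particular case.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Case:
    party_tribe: str   # which tribe the accused belongs to
    facts: frozenset   # tribe-neutral description of what happened


def sort_into_reference_class(facts: frozenset) -> str:
    """Impartial sorter: looks only at the facts, never at the tribe."""
    if "broke_contract" in facts:
        return "breach"
    if "took_property" in facts:
        return "theft"
    return "no_violation"


# The judiciary rules once per reference class, in advance of any particular case.
RULINGS = {
    "breach": "pay damages",
    "theft": "return property",
    "no_violation": "dismissed",
}


def rule(case: Case) -> str:
    """Verdict depends on the case only through its reference class."""
    return RULINGS[sort_into_reference_class(case.facts)]


ours = Case(party_tribe="A", facts=frozenset({"took_property"}))
theirs = Case(party_tribe="B", facts=frozenset({"took_property"}))
print(rule(ours), "|", rule(theirs))
```

Because `party_tribe` never reaches either function's decision, neither tribe can tilt a particular verdict its own way. That is the local cost each side accepts in exchange for the global benefit.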