i'm tammy, alignment researcher and founder of the Orthogonal alignment research organization.
i mostly work on the QACI alignment plan.
check out my alignment research here, as well as my blog and my twitter.
terrible news, sucks to see even more people speedrunning the extinction of everything of value.
reminder that if you destroy everything first you don't gain anything. everything is just destroyed earlier.
"just too chaotic" is covered by "entire past lightcone". no matter how chaotic, the past lightcone is one whole computation which you can just run naively, if you've got room. you get all decimal places, just like an agent in conway's game of life can build a complete exact simulation of its past if it's got room in the future.
(yes, maybe quantum breaks this)
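as a minimal sketch of the game-of-life point above: the rules are deterministic, so anything that knows the initial state and has enough room can replay the whole history and get exactly the same states, glider by glider. everything named here (`step`, `run`, `GLIDER`, the 8x8 wrap-around grid) is an assumption made for illustration; it only shows the determinism part, and says nothing about how an embedded agent would find room for the simulation or learn the initial state.

```python
from typing import Dict, FrozenSet, List, Tuple

Cell = Tuple[int, int]
State = FrozenSet[Cell]


def step(live: State, width: int, height: int) -> State:
    """Advance one generation on a wrap-around (toroidal) grid."""
    counts: Dict[Cell, int] = {}
    for (x, y) in live:
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                if dx == 0 and dy == 0:
                    continue
                neighbour = ((x + dx) % width, (y + dy) % height)
                counts[neighbour] = counts.get(neighbour, 0) + 1
    # A cell is live next generation with exactly 3 live neighbours,
    # or with 2 live neighbours if it is already live.
    return frozenset(
        cell for cell, k in counts.items()
        if k == 3 or (k == 2 and cell in live)
    )


def run(initial: State, steps: int, width: int, height: int) -> List[State]:
    """Return the full history: the initial state plus `steps` generations."""
    history = [initial]
    for _ in range(steps):
        history.append(step(history[-1], width, height))
    return history


# Hypothetical starting pattern: a single glider.
GLIDER: State = frozenset({(1, 0), (2, 1), (0, 2), (1, 2), (2, 2)})

original = run(GLIDER, 20, 8, 8)  # "what actually happened"
replay = run(GLIDER, 20, 8, 8)    # a later naive re-run from the same start

# No approximation is involved: every generation matches exactly.
assert original == replay
```

the replay is exact rather than approximate because there is no noise source anywhere in `step`; chaos in the sense of sensitivity to initial conditions doesn't matter when the initial conditions are reproduced exactly.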
i've done some work towards building that machinery (see eg here), but yes, there are still a bunch of things to be figured out, though i'm making progress in that direction (see the posts about blob location).
My own thinking would be that the counterfactual reasoning should be responsive to the system's overall estimates of how-humans-would-want-it-to-reason, in the same way that its prior needs to be an estimate of the human-endorsed prior, and values should approximate human-endorsed values.
are you saying this in the prescriptive sense, i.e. we should want that property? i think if implemented correctly, accuracy is all we would really need right? carrying human intent in those parts of the reasoning seems difficult and wonky and plausibly not necessary to me, where straightforward utility maximization should work.
the counterfactuals might be defined wrong but they won't be "under-defined". but yes, they might locate the blob somewhere we don't intend to (or insert the counterfactual question in a way we don't intend to); i've been thinking a bunch about ways this could fail and how to overcome them (1, 2, 3).
on the other hand, if you're talking about the blob-locating math pointing to the right thing but the AI not making accurate guesses early enough as to what the counterfactuals would look like, i do think getting only eventual alignment is one of the potential problems, but i'm hopeful it gets there eventually, and maybe there are ways to check that it'll make good enough guesses even before we let it loose.
(cross-posted as a top-level post on my blog)
QACI and plausibly PreDCA rely on a true name of phenomena in the real world using solomonoff induction, and thus talk about locating them in a theoretical giant computation of the universe, from the beginning. it's reasonable to be concerned that there isn't enough compute for an aligned AI to actually do this. however, i have two responses:
sounds maybe kinda like a utopia design i've previously come up with, where you get your private computational garden and all interactions are voluntary.
that said, some values need to interfere in people's gardens: you can't create arbitrarily suffering moral patients, you might have to in some way be stopped from partaking in some molochianisms, etc.
i don't think determinism is incompatible with making decisions, just like nondeterminism doesn't mean my decisions are "up to randomness"; from my perspective, i can either choose to do action A or action B, and from my perspective i actually get to steer the world towards what those actions lead to.
put another way, i'm a compatibilist; i implement embedded agency.
put another way, yes i LARP, and this is a world that gets steered towards the values of agents who LARP, so yay.
what i mean here is "with regards to how much moral-patienthood we attribute to things in it (eg whether they're suffering), rather than secondary stuff we might care about like how much diversity we gain from those worlds".
gwern's clippy story does a reasonable amount of the work, i think. to me, part of why convincing people of AI doom is something that often needs to be done interactively rather than having a comprehensive explanation, is that people get stuck in a variety of different ways rather than always in the same way. that and numerous cognitive biases and stuff.
to me the argument is pretty straightforward:
i have a lot more ideas about how foom will happen but alas sharing them would be a capability exfohazard.