i blog at carado.moe.
the counterfactuals might be defined wrong but they won't be "under-defined". but yes, they might locate the blob somewhere we don't intend to (or insert the counterfactual question in a way we don't intend to); i've been thinking a bunch about ways this could fail and how to overcome them (1, 2, 3).
on the other hand, if you're talking about the blob-locating math pointing to the right thing but the AI not making accurate guesses early enough as to what the counterfactuals would look like: i do think getting only eventual alignment is one of the potential problems. but i'm hopeful it gets there eventually, and maybe there are ways to check that it'll make good enough guesses even before we let it loose.
(cross-posted as a top-level post on my blog)
QACI and plausibly PreDCA rely on finding a true name for phenomena in the real world using solomonoff induction, and thus talk about locating those phenomena in a theoretical giant computation of the universe, run from the beginning. it's reasonable to be concerned that there isn't enough compute for an aligned AI to actually do this. however, i have two responses:
sounds maybe kinda like a utopia design i've previously come up with, where you get your private computational garden and all interactions are voluntary.
that said, some values do need to interfere with people's gardens: you can't create arbitrarily suffering moral patients, you might have to be stopped in some way from partaking in some molochianisms, etc.
i don't think determinism is incompatible with making decisions, just like nondeterminism doesn't mean my decisions are "up to randomness"; from my perspective, i can choose to do either action A or action B, and from my perspective i actually get to steer the world towards what those actions lead to.
put another way, i'm a compatibilist; i implement embedded agency.
put another way, yes i LARP, and this is a world that gets steered towards the values of agents who LARP, so yay.
what i mean here is "with regards to how much moral-patienthood we attribute to things in it (eg whether they're suffering), rather than secondary stuff we might care about, like how much diversity we gain from those worlds".
(this answer is cross-posted on my blog)
here is a list of problems which i seek to either resolve or get around, in order to implement my formal alignment plans, especially QACI:
i would love a world-saving-plan that isn't "a clever scheme" with "many moving parts", but alas, i don't expect that's what we get. as clever schemes with many moving parts go, this one seems not particularly complex compared to other things i've heard of.
Only a single training example needed through use of hypotheticals.
(to be clear, the question and answer serve less as "training data" meant to represent the user, and more as "IDs" or "coordinates" meant to locate the user in the past lightcone.)
We need good inner alignment. (And with this, we also need to understand hypotheticals).
this is true, though i think we might not need a super complex framework for hypotheticals. i have some simple math ideas that i explore a bit here, and about which i might write a bunch more.
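(for concreteness, here's a minimal sketch of the shape of math i have in mind; the names γ, q and q′ are just placeholders for this comment, not the actual formalism. the idea is that a blob location is some function which, fed the original question blob, reconstructs the world, and fed a counterfactual question, yields the counterfactual world:)

```latex
% illustrative notation only; gamma, q, q' are placeholder names, not the final formalism
\begin{align*}
  \gamma &: \mathrm{Blob} \to \mathrm{World}
    && \text{a candidate blob location} \\
  \gamma(q) &= w
    && \text{it recovers the factual world when fed the actual question blob } q \\
  w' &:= \gamma(q')
    && \text{the counterfactual world where } q \text{ is swapped for } q'
\end{align*}
```

(in this sketch, the hypothetical answer is then whatever answer-blob gets located in w′, which is the thing the AI has to guess at.)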
for failure modes like the user getting hit by a truck or spilling coffee, we can do things such as, at each step, asking not 1 cindy the question but asking 1000 cindys 1000 slight variations on the question, and then maybe having some kind of convolutional network curate their answers (such as ignoring garbled or missing output) and pass them to the next step, without ever relying on a small number of cindys except at the very start of this process.
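(here's a toy sketch of just the redundancy-and-curation shape of that step; query_cindy, looks_garbled and aggregate are hypothetical stand-ins for machinery the actual plan would have to provide, and the real thing would be math over counterfactuals rather than literal function calls:)

```python
def run_step(question, query_cindy, looks_garbled, n=1000):
    """one step: ask n cindys n slight variations of the question, then curate."""
    # query_cindy and looks_garbled are hypothetical stand-ins, not real machinery
    variants = [f"{question} (variation {i})" for i in range(n)]
    answers = [query_cindy(v) for v in variants]
    # drop garbled or missing output so no single cindy is load-bearing
    return [a for a in answers if a is not None and not looks_garbled(a)]

def run_process(initial_question, query_cindy, looks_garbled, aggregate, steps=10):
    """chain many such steps; only the very first depends on a single question."""
    question = initial_question
    for _ in range(steps):
        curated = run_step(question, query_cindy, looks_garbled)
        # aggregate stands in for whatever network curates/combines the answers
        # into the next step's question
        question = aggregate(curated)
    return question
```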
it is true that weird memes could take over the graph of cindys; i don't have an answer to that, apart from it seeming sufficiently unlikely to me that i still think this plan has promise.
Chaos theory. Someone else develops a paperclip maximizer many iterations in, and the paperclip maximizer realizes it's in a simulation, hacks into the answer channel and returns "make as many paperclips as possible" to the AI.
hmm. that's possible. i guess i have to hope this never happens on the question-interval, on any simulation day. alternatively, maybe the mutually-checking graph of 1000 cindys can help with this? (but probly not; clippy can just hack the cindys).
So all the virtual humans get saved on disk, and then can live in the utopia. Hey, we need loads of people to fill up the dyson sphere anyway.
yup. or, if the QACI user is me, i'm probly also just fine with those local deaths; not a big deal compared to an increased chance of saving the world. alternatively, instead of being saved on disk, they can also just be recomputed later since the whole process is deterministic.
I am not confident that your "make it complicated and personal data" approach at the root really stops all the aliens doing weird acausal stuff.
yup, i'm not confident either. i think there could be other schemes, possibly involving cryptography in some ways, to entangle the answer with a unique randomly generated signature key or something like that.
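(as a rough illustration of the kind of scheme i mean, here's a sketch using a standard HMAC; how the key would actually be kept out of reach of weird acausal stuff is left completely open, so this is just the shape of "entangle the answer with a random key", not a solution:)

```python
import hashlib
import hmac
import secrets

# generate a unique random key up front; it becomes part of what makes
# the question/answer blobs hard for anything else to counterfeit or guess
signature_key = secrets.token_bytes(32)

def tag_answer(answer: bytes, key: bytes = signature_key) -> bytes:
    """bind an answer blob to the key with an HMAC tag."""
    return hmac.new(key, answer, hashlib.sha256).digest()

def verify_answer(answer: bytes, tag: bytes, key: bytes = signature_key) -> bool:
    """constant-time check that an answer carries a valid tag for this key."""
    return hmac.compare_digest(tag, tag_answer(answer, key))
```

(of course this does nothing against an adversary that can see or simulate the key itself, which is kind of the whole problem; it only illustrates the entangling part.)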
i've done some work towards building that machinery (see eg here), but yes, there are still a bunch of things to be figured out, though i'm making progress in that direction (see the posts about blob location).
are you saying this in the prescriptive sense, i.e. that we should want that property? i think if implemented correctly, accuracy is all we would really need, right? carrying human intent in those parts of the reasoning seems difficult and wonky and plausibly not necessary to me, whereas straightforward utility maximization should work.