Flexible. When an S-process round starts, there's an estimate of how much will be allocated in total, but the funders (usually Jaan, sometimes others) might ultimately decide to give more or less, depending on both the quality of the applications and the quality of the analysis by the recommenders in the S-process.
I also had this initial misreading, but I think I figured out what you meant on my first reread.
Thanks, fixed.
I work part time for SFF.
It seems plausible that donations right now could actually cause counterfactual matching
Can you elaborate on what you mean by this?
These matching funds are intended to be counterfactual, and I think they are pretty counterfactual.
If MIRI doesn't fundraise to match the SFF matching dollars, the SFF matching dollars set aside for MIRI are just returned to Jaan.
There's a more complicated question about how much SFF would have donated to MIRI if MIRI had not requested matching funds. My personal guess is "less, but not a lot less", for this round (though this will probably be different in future rounds—SFF, as an institution, wants to set up a system that rewards applicants for asking for matching funds, because that allows it to partially defer to the judgement of other funders, and to invest in a stronger, more diversified funding ecosystem.)
Also, I think that the negative signal of people being uninterested in supporting MIRI will tend to make SFF less enthusiastic about granting to MIRI in the future, though the size of that effect is unclear.
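To make the return-to-funder mechanics above concrete, here's a toy sketch in Python (one-for-one matching is assumed purely for illustration; the actual terms of any given pool may differ):

def match_payout(outside_donations, matching_pool):
    # Toy model: the pool pays out against outside donations raised by the applicant;
    # whatever isn't matched goes back to the funder rather than to the applicant.
    matched = min(outside_donations, matching_pool)
    returned_to_funder = matching_pool - matched
    return matched, returned_to_funder

# e.g. match_payout(0, pool) returns (0, pool): with no outside fundraising, the whole pool goes back.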
Though we're not actually done yet, because we can invert these probabilities to get my estimate for the classic paperclip scenario where nothing of value is retained:
>>> (1 - conscious) * (1 - each_other) * (1 - care_other) * (1 - bored) * (1 - avoid_wireheading) * (1 - active_learning)
3.999999999999999e-07
Hold on, that's seven zeroes, right?
>>> 0.00000039
3.9e-07
The basic point of Value Is Fragile is that if you lose any one of these essential properties, the value of the resulting universe drops to near-0.
Value isn’t just complicated, it’s fragile. There is more than one dimension of human value, where if just that one thing is lost, the Future becomes null. A single blow and all value shatters. Not every single blow will shatter all value—but more than one possible “single blow” will do so.
Personally, I don't actually buy that. (For starters, I think a universe of trillions of identical beings replaying the same hyperoptimized meaning/bliss state is at least better than nothing. I think I could be convinced that it's better than the transhuman galactic economy of delightfully strange beings that Eliezer is imagining.)
But, insofar as one buys the "value is fragile" thesis, the probability of the conjunction of each of these dimensions "going the wrong way" isn't the relevant thing to estimate. You should care about the probability of the disjunction instead.
Given that assumption, and your own estimates, the probability of a valueless or near-valueless AI future is 0.5381056 (i.e., 1 minus the probability that all of the dimensions "go right").
(I'll also observe that the simple multiplication assumes that these are all independent, which seems unlikely, but is probably fine for the granularity of analysis that you're doing here?)
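For concreteness, here's a minimal sketch of both calculations. The per-dimension values below are illustrative guesses on my part (only the products are pinned down by the figures above), so don't read anything into the particular assignments:

import math
# Illustrative probabilities of each dimension "going right" -- my guesses,
# chosen only so the products reproduce the figures quoted above.
p_right = {"conscious": 0.9, "each_other": 0.9, "care_other": 0.9,
           "bored": 0.8, "avoid_wireheading": 0.8, "active_learning": 0.99}
# Conjunction: every dimension goes wrong (the classic paperclipper) -- about 4e-07.
all_wrong = math.prod(1 - p for p in p_right.values())
# Disjunction: at least one dimension goes wrong, which is what matters if value is fragile -- about 0.538.
any_wrong = 1 - math.prod(p_right.values())
print(all_wrong, any_wrong)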
So the most hopeful thing that could happen is for nature to throw up a big wall that says "nope, you need to thoroughly solve reward hacking before you can have generalist agents" which requires us to solve the problem first. The worst thing that could happen is for the problem to not appear in the kind of way where you have to robustly solve it until after the agent is superintelligent and you suddenly get a lethal sharp left turn. So even if everything I've just said about how to solve the problem is very wrong, that it's encountered as an otherwise impassable barrier so early... [emphasis mine]
It sounds like you're saying "in order to make functional agents at all, we need to solve the Goodhart/wireheading problem in a robust and general way, not just with kludgy patches." I don't get how this follows from the rest of the section, though.
Couldn't AI developers implement a series of non-generalizing "baby gates", each patching a specific early reward-hacking behavior, until an AI is sophisticated enough to thoroughly hide that it's strategically pursuing a misaligned goal?
Is the claim that if we took that approach, the resulting agents would not be effective, because there are too many available reward hacks derailing the agent, and so we can't realistically patch them all?
If AI successors are intrinsically motivated to "have fun" like we are, then it seems obvious that they would instrumentally want to have fun together.
Is this right?
Is the implied premise that beings can have more fun if they cooperate at it? Multiplayer games are more fun than single-player games?
Since most human behaviors are the product of instrumental convergence (especially in modernity, which is out of distribution to the ancestral environment), our null hypothesis should be that a given behavior is instrumentally convergent unless we have a good reason to suspect it isn't
This doesn't seem quite right to me.
Even taking the argument on its own terms (I'm not sure how much of behavior in the modern world I expect to be due to instrumental convergence, as opposed to just "not adaptive"), it seems like we should say that most behavior is either instrumentally convergent or a spandrel of some other instrumentally convergent behavior.
If we're trying to figure out what properties future AIs will have, that second part of the disjunction matters a lot, since it seems likely that future AIs will get functionally similar behavior via different mechanisms that will not produce the same spandrels.
Informative comment for me. Thank you.
I found both of these posts helpful, despite being ~10 years older than you guys. Reading how people are engaging with the situation emotionally somehow supports my own emotional engagement with what's going on.