I think this is a reasonable perception and opinion. We’ve written a little bit about how heuristic estimators might help with ELK (MAD and ELK and finding gliders), but that writing is not particularly clear and doesn’t present a complete picture.
We’ve mostly been focused on finding heuristic estimators, because I am fairly convinced they would be helpful and think that designing them is our key technical risk. But now that we are hiring again I think it’s important for us to explain publicly why they would be valuable, and to generally motivate and situa...
Discussing the application of heuristic estimators to adversarial training:
Suppose you have a trusted specification C for catastrophic behavior, and you want to train a model that never violates C. For simplicity, I’ll assume that your model M maps observations x to actions y, and C takes an (x, y) pair and determines whether it’s catastrophic. So we want a model for which C(x, M(x)) is very rarely true on the deployment distribution.
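As a minimal sketch of this setup (every name below is a hypothetical stand-in, not a real API): the naive approach is to estimate how often C(x, M(x)) holds by sampling, which is exactly what fails for rare catastrophes and motivates adversarial training or heuristic estimation instead.

```python
import random

def catastrophe_rate(model, spec, sample_inputs, n=10_000):
    """Estimate how often C(x, M(x)) holds on a distribution.

    `model` maps an observation x to an action y; `spec` is the trusted
    catastrophe check C(x, y) -> bool. Both are illustrative stand-ins.
    Plain sampling like this only bounds the average-case rate; it tells
    you almost nothing about inputs where C fires with tiny probability,
    which is the regime adversarial training is supposed to address.
    """
    xs = [sample_inputs() for _ in range(n)]
    return sum(spec(x, model(x)) for x in xs) / n

# Toy usage: a "model" that behaves catastrophically on ~1% of inputs.
model = lambda x: "bad" if x < 0.01 else "ok"
spec = lambda x, y: y == "bad"
rate = catastrophe_rate(model, spec, random.random)
```

Sampling finds a ~1% failure rate easily; if the bad region had measure 10⁻⁹ instead, no feasible number of samples would detect it.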
You could care about this if you have a trusted weak model which can check for catastrophic behavior given enough...
Sorry, I meant "scope-insensitive," and really I just meant an even broader category of like "doesn't care 10x as much about getting 10x as much stuff." I think discount rates or any other terminal desire to move fast would count (though for options like "survive in an unpleasant environment for a while" or "freeze and revive later" the required levels of kindness may still be small).
(A month seems roughly right to me as the cost of not trashing Earth's environment to the point of uninhabitability.)
I'd guess "most humans survive" vs. "most humans die" probabilities don't correspond super closely to "presence of small pseudo-kindness". Because of how other preferences could outweigh that, and because cooperation/bargaining is a big reason for why humans might survive aside from intrinsic preferences.
Yeah, I think that:
Yeah, I think "no control over future, 50% you die" is like 70% as alarming as "no control over the future, 90% you die." Even if it were only 50% as concerning, all of these differences seem tiny in practice compared to other sources of variation in "do people really believe this could happen?" or other inputs into decision-making. I think it's correct to summarize as "practically as alarming."
I'm not sure what you want engagement with. I don't think the much worse outcomes are closely related to unaligned AI so I don't think they seem super relevant to my...
I would also call this one for Eliezer. I think we mostly just retrain AI systems without reusing anything. I think that's what you'd guess on Eliezer's model, and very surprising on Robin's model. The extent to which we throw things away is surprising even to a very simple common-sense observer.
I would have called "Human content is unimportant" for Robin---it seems like the existing ML systems that are driving current excitement (and are closest to being useful) lean extremely heavily on imitation of human experts and mostly don't make new knowledge thems...
My objection is that the simplified message is wrong, not that it's too alarming. I think "misaligned AI has a 50% chance of killing everyone" is practically as alarming as "misaligned AI has a 95% chance of killing everyone," while being a much more reasonable best guess. I think being wrong is bad for a variety of reasons. It's unclear if you should ever be in the business of telling lies-told-to-children to adults, but you certainly shouldn't be doubling down on them in the course of an argument.
I don't think misaligned AI drives the majority of s-risk (...
I regret mentioning "lie-to-children" as it seems a distraction from my main point. (I was trying to introspect/explain why I didn't feel as motivated to express disagreement with the OP as you, not intending to advocate or endorse anyone going into "the business of telling lies-told-to-children to adults".)
My main point is that I think "misaligned AI has a 50% chance of killing everyone" isn't alarming enough, given what I think happens in the remaining 50% of worlds, versus what a typical person is likely to infer from this statement, especially after se...
As I said:
I’m not talking about whether the AI has spite or other strong preferences that are incompatible with human survival, I’m engaging specifically with the claim that AI is likely to care so little one way or the other that it would prefer to just use the humans for atoms.
I think it's totally plausible for the AI to care about what happens with humans in a way that conflicts with our own preferences. I just don't believe it's because AI doesn't care at all one way or the other (such that you should make predictions based on instrumental reasoning like "the AI will kill humans because it's the easiest way to avoid future conflict" or other relatively small considerations).
To the extent the second parable has this kind of intuitive force I think it comes from: (i) the fact that the resulting values still sound really silly and simple (which I think is mostly deliberate hyperbole), (ii) the fact that the AI kills everyone along the way.
I think a closer summary is:
...Humans and AI systems probably want different things. From the human perspective, it would be better if the universe was determined by what the humans wanted. But we shouldn't be willing to pay huge costs, and shouldn't attempt to create a slave society where AI systems do humans' bidding forever, just to ensure that human values win out. After all, we really wouldn't want that outcome if our situations had been reversed. And indeed we are the beneficiary of similar values-turnover in the past, as our ancestors have been open (p
We're not talking about practically building minds right now, we are talking about humans.
We're not talking about "extrapolating volition" in general. We are talking about whether---in attempting to help a creature with preferences about as coherent as human preferences---you end up implementing an outcome that creature considers as bad as death.
For example, we are talking about what would happen if humans were trying to be kind to a weaker species that they had no reason to kill, that could nevertheless communicate clearly and had preferences about ...
Hypothesis 1 is closer to the mark, though I'd highlight that it's actually fairly unclear what you mean by "cosmopolitan values" or exactly what claim you are making (and that ambiguity is hiding most of the substance of disagreements).
I'm raising the issue of pico-pseudokindness here because I perceive it as (i) an important undercurrent in this post, (ii) an important part of the actual disagreements you are trying to address. (I tried to flag this at the start.)
More broadly, I don't really think you are engaging productively with people who disagree wi...
I disagree with this but am happy your position is laid out. I'll just try to give my overall understanding and reply to two points.
Like Oliver, it seems like you are implying:
Humans may be nice to other creatures in some sense. But if the fish were to look at the future that we'd achieve for them using the 1/billionth of resources we spent on helping them, it would be as objectionable to them as "murder everyone" is to us.
I think that normal people being pseudokind in a common-sensical way would instead say:
...If we are trying to help some creatures, but tho
For the first half, can you elaborate on what 'actual emotional content' there is in this post, as opposed to perceived emotional content?
I mean that if you tell a story about the AI or aliens killing everyone, then the valence of the story is really tied up with the facts that (i) they killed everyone, and weren't merely "not cosmopolitan," (ii) this is a reasonably likely event rather than a possibility.
...My best guess for the second half is that maybe the intended meaning was: 'this particular post looks wrong in an important way (relating to the 'actual
I think some of the confusion here comes from my using "kind" to refer to "respecting the preferences of existing weak agents." I don't have a better handle but could have just used a made up word.
I don't quite understand your objection to my summary---it seems like you are saying that notions like "kindness" (that might currently lead you to respect the preferences of existing agents) will come apart and change in unpredictable ways as agents deliberate. The result is that smart minds will predictably stop respecting the preferences of existing agents, up...
Eliezer has a longer explanation of his view here.
My understanding of his argument is: there are a lot of contingencies that reflect how and whether humans are kind. Because there are so many contingencies, it is somewhat unlikely that aliens would go down a similar route, and essentially impossible for ML. So maybe aliens have a 5% probability of being nice and ML systems have ~0% probability of being nice. I think this argument is just talking about why we shouldn't update too much from humans, and there is an important background assumption that ki...
Some more less-important meta, that is in part me writing out of frustration from how the last few exchanges have gone:
I'm not quite sure what argument you're trying to have here. Two explicit hypotheses follow, that I haven't managed to distinguish between yet.
Background context, for establishing common language etc.:
Short version: I don't buy that humans are "micro-pseudokind" in your sense; if you say "for just $5 you could have all the fish have their preferences satisfied" I might do it, but not if I could instead spend $5 on having the fish have their preferences satisfied in a way that ultimately leads to them ascending and learning the meaning of friendship, as is entangled with the rest of my values.
Meta:
...Note: I believe that AI takeover has a ~50% probability of killing billions and should be strongly avoided, and would be a serious and irreversible decisio
Is this a fair summary?
Humans might respect the preferences of weak agents right now, but if they thought about it for longer they'd pretty robustly just want to completely destroy the existing agents (including a hypothetical alien creator) and replace them with something better. No reason to honor that kind of arbitrary path dependence.
If so, it seems like you wouldn't be making an argument about AI or aliens at all, but rather an empirical claim about what would happen if humans were to think for a long time (and become more the people we wished to be a...
I want to keep picking a fight about “will the AI care so little about humans that it just kills them all?” This is different from a broader sense of cosmopolitanism, and moreover I'm not objecting to the narrow claim "doesn't come for free." But it’s directly related to the actual emotional content of your parables and paragraphs, and it keeps coming up recently with you and Eliezer, and I think it’s an important way that this particular post looks wrong even if the literal claim is trivially true.
(Note: I believe that AI takeover has a ~50% probability o...
Paul, this is very thought provoking, and has caused me to update a little. But:
I loathe factory-farming, and I would spend a large fraction of my own resources to end it, if I could.
I believe that makes me unusually kind by human standards, and by your definition.
I like chickens, and I wish them well.
And yet I would not bat an eyelid at the thought of a future with no chickens in it.
I would not think that a perfect world could be improved by adding chickens.
And I would not trade a single happy human soul for an infinity of happy chickens.
I think that your single known example is not as benevolent as you think.
Might write a longer reply at some point, but the reason why I don't expect "kindness" in AIs (as you define it here) is that I don't expect "kindness" to be the kind of concept that is robust to cosmic levels of optimization pressure applied to it, and I expect will instead come apart when you apply various reflective principles and eliminate any status-quo bias, even if it exists in an AI mind (and I also think it is quite plausible that it is completely absent).
Like, different versions of kindness might or might not put almost all of their conside...
I agree that it seems very bad if we build AI systems that would "prefer" to tamper with sensors (including killing humans if necessary) but are prevented from doing so by physical constraints.
I currently don't see how to approach value learning (in the worst case) without solving something like ELK. If you want to take a value learning perspective, you could view ELK as a subproblem of the easy goal inference problem. If there's some value learning approach that routes around this problem I'm interested in it, but I haven't seen any candidates and have spent a long time talking with people about it.
I think it was confusing for me to use "correlation" to refer to a particular source of correlation. I probably should have called it something like "similarity." But I think the distinction is very real and very important, and crisp enough to be a natural category.
More precisely, I think that:
Alice and Bob are correlated because Alice is similar to Bob (produced by similar process, running similar algorithm, downstream of the same basic truths about the universe...)
is qualitatively and crucially different from:
...Alice and Bob are correlated because Alice is
I think you have to care about what happens to other agents. That might be "other paperclippers."
If you only care about what happens to you personally, then I think the size of the universe isn't relevant to your decision.
If safety literally came down to sensor hardening, I do think cryptographic mechanisms (particularly tamper-proof hardware with cryptographic secrets that destroys itself if it detects trouble) seem like a relevant tool, and it's quite plausible that you could harden sensors even against wildly superhuman attackers.
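The core of that hardening idea can be sketched in a few lines (this is illustrative, not a concrete proposal): the sensor holds a secret key inside a tamper-responding module that erases the key if opened, and authenticates each reading, so readings fabricated without the key are detectable.

```python
import hmac
import hashlib
import os

# Secret that, in the imagined hardware, lives only inside the sensor's
# tamper-responding module. Here the verifier shares it (a MAC scheme);
# a real design would more plausibly use public-key signatures so that
# verifiers never hold the secret at all.
SECRET_KEY = os.urandom(32)

def sign_reading(reading: bytes) -> bytes:
    """Sensor side: authenticate a raw reading with the protected key."""
    return hmac.new(SECRET_KEY, reading, hashlib.sha256).digest()

def verify_reading(reading: bytes, tag: bytes) -> bool:
    """Consumer side: reject any reading whose tag doesn't check out."""
    expected = hmac.new(SECRET_KEY, reading, hashlib.sha256).digest()
    return hmac.compare_digest(expected, tag)

reading = b"temperature=21.5"
tag = sign_reading(reading)
assert verify_reading(reading, tag)
assert not verify_reading(b"temperature=99.9", tag)  # forged data fails
```

The cryptography is the easy part; the hard part, and the reason this is only "quite plausibly" workable, is ensuring a superhuman attacker can't extract the key or feed the intact sensor a fake physical environment.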
It's an insane-feeling scenario---holistically I doubt it will matter for a variety of reasons, and from a worst-case perspective it's still not something you can rely on---but I do think there's some value in pinning these things down.
(In this parti...
I'm most convinced by the second sentence:
They would have to be specifically designed to do so.
Which definitely seems to be dismissing the possibility of alignment failures.
My guess would be that he would back off of this claim if pushed on it explicitly, but I'm not sure. And it is at any rate indicative of his attitude.
We tried talking about AI Alignment, and that’s also not going so great.
Eliezer defined AI alignment as "the overarching research topic of how to develop sufficiently advanced machine intelligences such that running them produces good outcomes in the real world."
If you try to use this definition, and other people use AI alignment to refer to things that they think are relevant to whether advanced AI produces good outcomes, you can't really object as a matter of linguistics. I have no sympathy and am annoyed at the complaining. You can see the last LW discu...
That we're probably bottlenecked on search algorithms, rather than compute power/model-size. This would have policy implications.
If a model can't carry out good enough reasoning to solve IMO problems, then I think you should expect a larger gap between the quality of LM thinking and the quality of human thinking. This suggests that we need bigger models to have a chance of automating challenging tasks, even in domains with reasonably good supervision.
Why would failure to solve the IMO suggest that search is the bottleneck?
"Prove or disprove X" is only like 2x harder than "Prove X." Sometimes the gap is larger for humans because of psychological difficulties, but a machine can literally just pursue both in parallel. (That said, research math involves a ton of problems other than prove or disprove.)
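The "pursue both in parallel" point can be made precise by dovetailing the two searches with an increasing budget (the search procedures here are hypothetical placeholders): you spend at most about twice the work of whichever single search would have succeeded.

```python
from itertools import count

def decide(statement, try_prove, try_refute):
    """Interleave a proof search and a disproof search.

    `try_prove(s, budget)` and `try_refute(s, budget)` are stand-ins for
    search procedures that return a certificate within the budget, or
    None. Alternating them with growing budgets costs at most ~2x the
    successful search alone, which is the "only like 2x harder" claim.
    """
    for budget in count(1):
        if try_prove(statement, budget) is not None:
            return "proved"
        if try_refute(statement, budget) is not None:
            return "disproved"

# Toy instance: "n is composite?" decided by trial division.
# Proof of compositeness: a divisor. Refutation: divisors exhausted up
# to sqrt(n) without success.
try_prove = lambda n, b: next(
    (d for d in range(2, min(b + 2, n)) if n % d == 0), None
)
try_refute = lambda n, b: (
    True if b * b > n and try_prove(n, b) is None else None
)
```

For example, `decide(91, try_prove, try_refute)` terminates as soon as the budget reaches the factor 7, while `decide(97, ...)` terminates once trial division has been exhausted; neither direction has to be chosen in advance.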
I basically agree that IMO problems are significantly easier than research math or other realistic R&D tasks. However I think that they are very much harder than the kinds of test questions that machines have solved so far. I'm not sure the difference is about high school math ...
No. You should think of reversible computers as quantitatively more efficient than normal computers, but basically the same thing. If you have polynomial time and energy then reversible computers still implement P.
My post observes that if the universe had a long enough lifetime, then a reversible computer would be able to compute PSPACE even if you only started out with polynomially much negentropy (whereas a normal computer would quickly burn up all the negentropy and leave you with a universe too hot to compute). So time and space and durability might be...
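The negentropy accounting behind this is Landauer's principle: it is erasure, not computation per se, that has a thermodynamic cost.

```latex
% Landauer's principle: erasing one bit of information at temperature T
% dissipates at least
\[
  E_{\mathrm{erase}} \;\ge\; k_B T \ln 2 .
\]
% An irreversible computer erases bits on essentially every step, so t
% steps consume \Omega(t) negentropy; a reversible computer erases
% (almost) nothing, so a fixed polynomial negentropy budget bounds its
% memory rather than its running time. Given unbounded time, that is
% exactly the resource profile of PSPACE rather than P.
```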
Yeah, I can believe the problem is NP-hard if you need to distinguish eigenvalues in time rather than . I don't understand exact arithmetic very well but it seems wild (e.g. my understanding is that no polynomial time algorithm is known for testing ).
Agree about the downside of fragmented discussion, I left some time for MO discussion before posting the bounty but I don't know how much it helped.
I agree that some approach like that seems like it could work for . I'm not familiar with those techniques and so don't know a technique that needs iterations rather than or . For Newton iteration it seems like you need some kind of bound on the condition number.
But I haven't thought about it much, since it seems like something fundamentally different is ne...
It can be decided in time by semidefinite programming, and I'd guess that most researchers would expect it can be done in time (which is the runtime of PSD testing, i.e. the case when ).
More generally, there's a large literature on positive semidefinite completions which is mostly concerned with finding completions with particular properties. Sorry for not better flagging the state of this problem. Those techniques don't seem super promising for getting to .
If you are given the dot products exactly, that's equivalent to being given the distances exactly.
(I think this reduction goes the wrong way though, if I wanted to solve the problem about distances I'd translate it back into dot products.)
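The equivalence both ways is just the polarization identity $\|a-b\|^2 = \langle a,a\rangle - 2\langle a,b\rangle + \langle b,b\rangle$, which can be checked numerically (the points and dimensions below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))  # 5 points in R^3

G = X @ X.T                      # Gram matrix of exact dot products
sq_norms = np.diag(G)            # the diagonal, i.e. squared norms

# Distances from dot products: ||a-b||^2 = <a,a> - 2<a,b> + <b,b>
D2 = sq_norms[:, None] - 2 * G + sq_norms[None, :]

# Dot products back from distances, given the squared norms
# (this direction is where knowing the diagonal matters):
G_back = (sq_norms[:, None] + sq_norms[None, :] - D2) / 2

assert np.allclose(
    D2, np.square(np.linalg.norm(X[:, None] - X[None, :], axis=-1))
)
assert np.allclose(G_back, G)
```

Both translations are exact and cheap, which is why the two formulations of the completion problem are interchangeable.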
I changed the statement to assume the diagonal is known, I agree this way is cleaner and prevents us from having to think about weird corner cases.
I don't think that anyone has ever mentioned problem #2 in the literature and so I would at least consider that answer somewhat misleading.
I strongly suspect that if we put in a similar-sounding but very easy problem GPT-4 would be just as likely to say that it's an open question.
I agree with your example. I think you can remove the row/column if there are no known 0s on the diagonal, and I also think you can first WLOG remove any rows and columns with 0s on the diagonal (since if there are any other known non-zero entries in their row or column then the matrix has no PSD completions).
I'm also happy to just assume the diagonal entries are known. I think these are pretty weird corner cases.
I agree you need feedback from the world; you need to do experiments. If you wanted to get a 50% chance of launching a rocket successfully on the first time (at any reasonable cost) you would need to do experiments.
The equivocation between "no opportunity to experiment" and "can't retry if you fail" is doing all the work in this argument.
A more reasonable guess is that expected odds of the first Starship launch failure go down logarithmically with budget and time.
That's like saying that it takes 10 people to get 90% reliability, 100 people to get to 99% reliability, and a hundred million people to get to 99.99% reliability. I don't think it's a reasonable model though I'm certainly interested in examples of problems that have worked out that way.
Linear is a more reasonable best guess. I have quibbles, but I don't think it's super relevant to this discussion. I expect the starship first failure probability was >>90%, and we're talking about the difficulty of getting out of that regime.
There are many industries where it is illegal to do things in the fastest or easiest way. I'm not exactly sure what you are saying here.
I think Eliezer's tweet is wrong even if you grant the rocket <> alignment analogy (unless you grant some much more extreme background views about AI alignment).
Assume that "deploy powerful AI with no takeover" is exactly as hard as "build a rocket that flies correctly the first time even though it has 2x more thrust than anything anyone has tested before." Assume further that an organization is able to do one of those tasks if and only if it can do the other.
Granting the analogy, the relevant question is how much harder it would be to successfully la...
It's unclear whether some people being cautious and some people being incautious leads to an AI takeover.
In this hypothetical, I'm including AI developers selling AI systems to law enforcement and militaries, which are used to enforce the law and win wars against competitors using AI. But I'm assuming we wouldn't pass a bunch of new anti-AI laws (and that AI developers don't become paramilitaries).
I think the substance of my views can be mostly summarized as:
I don't think my credences add very much except as a way of quantifying that basic stance. I largely made this post to avoid confusion after quoting a few numbers on a podcast and seeing some people misinterpret them.
Consider a competent policy that wants paperclips in the very long run. It could reason "I should get a low loss to get paperclips," and then get a low loss. As a result, it could be selected by gradient descent.
I think almost all the cumulative takeover probability is within 10 years of building powerful AI. Didn't draw the distinction here, but my views aren't precise enough to distinguish.
Actually I think my view is more like 50% from AI systems built by humans (compared to 15% unconditionally), if there is no effort to avoid takeover.
If you continue assuming "no effort to avoid takeover at all" into the indefinite future then I expect eventual takeover is quite likely, maybe more like 80-90% conditioned on nothing else going wrong, though in all these questions it really matters a lot what exactly you mean by "no effort" and it doesn't seem like a fully coherent counterfactual.
These predictions are not very related to any alignment research that is currently occurring. I think it's just quite unclear how hard the problem is, e.g. does deceptive alignment occur, do models trained to honestly answer easy questions generalize to hard questions, how much intellectual work are AI systems doing before they can take over, etc.
I know people have spilled a lot of ink over this, but right now I don't have much sympathy for confidence that the risk will be real and hard to fix (just as I don't have much sympathy for confidence that the problem isn't real or will be easy to fix).
These are all-things-considered estimates, including the fact that we will do our best to prevent AI risk.
The probability of human survival is primarily driven by AI systems caring a small amount about humans (whether due to ECL, commonsense morality, complicated and messy values, acausal trade, or whatever---I find all of those plausible).
I haven't thought deeply about this question, because a world where AI systems don't care very much about humans seems pretty bad for humans in expectation. I do think it matters whether the probability we all literally die is 10% or 50% or 90%, but it doesn't matter very much to my personal prioritization.
I think these questions are all still ambiguous, just a little bit less ambiguous.
I gave a probability for "most" humans killed, and I intended P(>50% of humans killed). This is fairly close to my estimate for E[fraction of humans killed].
I think if humans die it is very likely that many non-human animals die as well. I don't have a strong view about the insects and really haven't thought about it.
In the final bullet I implicitly assumed that the probability of most humans dying for non-takeover reasons shortly after building AI was very similar to the probability of human extinction; I was being imprecise, I think that's kind of close to true but am not sure exactly what my view is.
I'm somewhat optimistic that AI takeover might not happen (or might be very easy to avoid) even given no policy interventions whatsoever, i.e. that the problem is easily enough addressed that it can be done by firms in the interests of making a good product and/or based on even a modest amount of concern from their employees and leadership. Perhaps I'd give a 50% chance of takeover with no policy effort whatsoever to avoid it, compared to my 22% chance of takeover with realistic efforts to avoid it.
I think it's pretty hard to talk about "no policy effort w...