# All of paulfchristiano's Comments + Replies

Yudkowsky and Christiano discuss "Takeoff Speeds"

Going through previous ten IMOs, and imagining a very impressive automated theorem prover, I think

• 2020 - unlikely, need 5/6 and probably can't get problems 3 or 6. Also good chance to mess up at 4 or 5
• 2019 - tough but possible, 3 seems hard but even that is not unimaginable, 5 might be hard but might be straightforward, and it can afford to get one wrong
• 2018 - tough but possible, 3 is easier for machine than human but probably still hard, 5 may be hard, can afford to miss one
• 2017 - tough but possible, 3 looks out of reach, 6 looks hard but not sure a
gwern (2d · 14 · Ω7): What do you think of DeepMind's new whoop-de-doo about doing research-level math assisted by GNNs?

Yudkowsky and Christiano discuss "Takeoff Speeds"

I looked at a few recent IMOs to get better calibrated. I think the main update is that I significantly underestimated how many years you can get a gold with only 4/6 problems.

For example I don't have the same "this is impossible" reaction about IMO 2012 or IMO 2015 as about most years. That said, I feel like they do have to get reasonably lucky with both IMO content and someone has to make a serious and mostly-successful effort, but I'm at least a bit scared by that. There's also quite often a geo problem as 3 or 6.

Might be good to make some side be... (read more)

Going through previous ten IMOs, and imagining a very impressive automated theorem prover, I think

• 2020 - unlikely, need 5/6 and probably can't get problems 3 or 6. Also good chance to mess up at 4 or 5
• 2019 - tough but possible, 3 seems hard but even that is not unimaginable, 5 might be hard but might be straightforward, and it can afford to get one wrong
• 2018 - tough but possible, 3 is easier for machine than human but probably still hard, 5 may be hard, can afford to miss one
• 2017 - tough but possible, 3 looks out of reach, 6 looks hard but not sure a
Yudkowsky and Christiano discuss "Takeoff Speeds"

I think Metaculus is closer to Eliezer here: conditioned on this problem being resolved it seems unlikely for the AI to be either open-sourced or easily reproducible.

Matthew Barnett (3d · 4): My honest guess is that most predictors didn't see that condition and the distribution would shift right if someone pointed that out in the comments.
Christiano, Cotra, and Yudkowsky on AI progress

I generally expect smoother progress, but predictions about lulls are probably dominated by Eliezer's shorter timelines. Also lulls are generally easier than spurts, e.g. I think that if you just slow investment growth you get a lull and that's not too unlikely (whereas part of why it's hard to get a spurt is that investment rises to levels where you can't rapidly grow it further).

Yudkowsky and Christiano discuss "Takeoff Speeds"

I don't care about whether the AI is open-sourced (I don't expect anyone to publish the weights even if they describe their method) and I'm not that worried about our ability to arbitrate overfitting.

Ajeya suggested that I clarify: I'm significantly more impressed by an AI getting a gold medal than getting a bronze, and my 4% probability is for getting a gold in particular (as described in the IMO grand challenge). There are some categories of problems that can be solved using easy automation (I'd guess about 5-10% could be done with no deep learning and m... (read more)

I looked at a few recent IMOs to get better calibrated. I think the main update is that I significantly underestimated how many years you can get a gold with only 4/6 problems.

For example I don't have the same "this is impossible" reaction about IMO 2012 or IMO 2015 as about most years. That said, I feel like they do have to get reasonably lucky with both IMO content and someone has to make a serious and mostly-successful effort, but I'm at least a bit scared by that. There's also quite often a geo problem as 3 or 6.

Might be good to make some side be... (read more)

Yudkowsky and Christiano discuss "Takeoff Speeds"

I think I'll get less confident as our accomplishments get closer to the IMO grand challenge. Or maybe I'll get much more confident if we scale up from $1M -> $1B and pick the low-hanging fruit without getting fairly close, since at that point further progress gets a lot easier to predict.

There's not really a constant time horizon for my pessimism, it depends on how long and robust a trend you are extrapolating from. 4 years feels like a relatively short horizon, because theorem-proving has not had much investment so compute can be scaled up several orde... (read more)

Christiano, Cotra, and Yudkowsky on AI progress

I think there are two easy ways to get hyperbolic growth:

• As long as there is free energy in the environment, without any technological change you can grow like e^t. Then if there is any technological progress that can be driven by your expanding physical civilization, then you get hyperbolic growth like 1/(T − t)^α, where α depends on how fast the returns to technology diminish.
• Even without physical growth, if you have sufficiently good returns to technology (as we observe for historical technologies, if you treat doubling food as doubling output, or
Christiano, Cotra, and Yudkowsky on AI progress

1/(T − t). It's the solution to the differential equation f'(x) = f(x)^2 instead of f'(x) = f(x). I usually use it more broadly for 1/(T − t)^α, which is the solution to f'(x) = f(x)^(1+α).

rohinmshah (5d · 3): Nitpick: Isn't 1/x^α the solution [https://www.wolframalpha.com/input/?i=f%27%28x%29+%3D+f%28x%29%5E%281%2B1%2Fa%29] for f'(x) = f(x)^(1+1/α), modulo constants? Or equivalently, 1/x^(1/α) is the solution to f'(x) = f(x)^(1+α).
TekhneMakre (8d · 6): Why do you use this form? Do you lean more on:

1. Historical trends that look hyperbolic;
2. Specific dynamical models, like: let α be the synergy between "different innovations" as they're producing more innovations; this gives f'(x) = f(x)^(1+α)*; or another such model?;
3. Something else?

I wonder if there's a Paul-Eliezer crux here about plausible functional forms. For example, if Eliezer thinks that there's very likely also a tech tree of innovations that change the synergy factor α, we get something like e.g. (a lower bound of) f'(x) = f(x)^f(x). IDK if there's any help from specific forms; just that it's plausible that there are forms that are (1) pretty simple, pretty straightforward lower bounds from simple (not necessarily high confidence) considerations of the dynamics of intelligence, and (2) look pretty similar to hyperbolic growth, until they don't, and the transition happens quickly. Though maybe, if Eliezer thinks any of this and also thinks that these superhyperbolic synergy dynamics are already going on, and we instead use a stochastic differential equation, there should be something more to say about variance or something pre-End-times.

*ETA: for example, if every innovation combines with every other existing innovation to give one unit of progress per time, we get the hyperbolic f'(x) = f(x)^2; if innovations each give one progress per time but don't combine, we get the exponential f'(x) = f(x).
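The contrast between the exponential f'(x) = f(x) and the hyperbolic f'(x) = f(x)^2 discussed above can be checked numerically. This is an illustrative sketch (the function names, step counts, and endpoints are my own, not from the discussion): the hyperbolic case tracks 1/(1 − x) and diverges at finite x, while the exponential stays bounded on any finite interval.

```python
import math

def euler(deriv, f0, x_end, steps):
    """Forward-Euler integration of f'(x) = deriv(f) from x = 0 to x_end."""
    f, dx = f0, x_end / steps
    for _ in range(steps):
        f += deriv(f) * dx
    return f

# Exponential: f'(x) = f(x)  ->  f(x) = e^x, finite everywhere.
exp_val = euler(lambda f: f, 1.0, 0.9, 200_000)

# Hyperbolic: f'(x) = f(x)^2  ->  f(x) = 1/(1 - x), blows up at x = 1.
hyp_val = euler(lambda f: f * f, 1.0, 0.9, 200_000)

# At x = 0.9 the exponential is ~e^0.9 ≈ 2.46 while the hyperbolic is ~10,
# and the gap diverges as x -> 1.
```

Pushing x_end closer to 1 makes the hyperbolic value explode while the exponential barely moves, which is the qualitative difference the thread is about.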
Christiano, Cotra, and Yudkowsky on AI progress

My claim is that the timescale of AI self-improvement, at the point it takes over from humans, is the same as the previous timescale of human-driven AI improvement. If it was a lot faster, you would have seen a takeover earlier instead.

This claim is true in your model. It also seems true to me about hominids, that is I think that cultural evolution took over roughly when its timescale was comparable to the timescale for biological improvements, though Eliezer disagrees

I thought Eliezer's comment "there is a sufficiently high level where things go who... (read more)

JBlack (8d · 1): My main concern is that progress on the frontier tends to be bursty. There are many metrics of AI performance on particular tasks where performance does indeed increase fairly continuously on the larger scale, but not in detail. Over the scale of many years it goes from abysmal to terrible to merely bad to nearly human to worse than human in some ways but better than human in others, and then to superhuman. Each of these transitions is often a sharp jump, but you see steady progress if you plot it on a graph. When you combine this with having thousands of types of tasks, you end up with an overview of even smoother progress over the whole field.

There are three problems I'm worried about. The first is that "designing better AIs" may turn out to be a relatively narrow task, and subject to a lot more burstiness than broad-spectrum performance that could steadily increase world GDP. The second is that for purposes of the future of humanity, only the last step from human-adjacent to strictly superhuman really matters. On the scale of intelligence for all the beings we know about, chimpanzees are very nearly human, but the economic effect of chimpanzees is essentially zero. The third is that we are nowhere near fully exploiting the hardware we have for AI, and I expect that to continue for quite a while.

I think any two of these three are enough for a fast takeoff with little warning.

It seems to me that Eliezer's model of AGI is bit like an engine, where if any important part is missing, the entire engine doesn't move. You can move a broken steam locomotive as fast as you can push it, maybe 1km/h. The moment you insert the missing part, the steam locomotive accelerates up to 100km/h. Paul is asking "when does the locomotive move at 20km/h" and Eliezer says "when the locomotive is already at full steam and accelerating to 100km/h." There's no point where the locomotive is moving at 20km/h and not accelerating, because humans can't push ... (read more)

Christiano, Cotra, and Yudkowsky on AI progress

He says things like AlphaGo or GPT-3 were really surprising to gradualists, suggesting he thinks that gradualism only works in hindsight.

I agree that after shaking out the other disagreements, we could just end up with Eliezer saying "yeah but automating AI R&D is just fundamentally unlike all the other tasks to which we've applied AI" (or "AI improving AI will be fundamentally unlike automating humans improving AI") but I don't think that's the core of his position right now.

Christiano, Cotra, and Yudkowsky on AI progress

I think "on track to foom" is a very long way before "actually fooms."

Yudkowsky and Christiano discuss "Takeoff Speeds"

I think IMO gold medal could be well before massive economic impact, I'm just surprised if it happens in the next 3 years. After a bit more thinking (but not actually looking at IMO problems or the state of theorem proving) I probably want to bump that up a bit, maybe 2%, it's hard reasoning about the tails.

I'd say <4% on end of 2025.

I think this is the flipside of me having an intuition where I say things like "AlphaGo and GPT-3 aren't that surprising"---I have a sense for what things are and aren't surprising, and not many things happen that are... (read more)

Maybe another way of phrasing this - how much warning do you expect to get, how far out does your Nope Vision extend?  Do you expect to be able to say "We're now in the 'for all I know the IMO challenge could be won in 4 years' regime" more than 4 years before it happens, in general?  Would it be fair to ask you again at the end of 2022 and every year thereafter if we've entered the 'for all I know, within 4 years' regime?

Added:  This question fits into a larger concern I have about AI soberskeptics in general (not you, the soberskeptics wou... (read more)

Christiano, Cotra, and Yudkowsky on AI progress

Oops, this was in reference to the later part of the discussion where you disagreed with "a human in a big animal body, with brain adapted to operate that body instead of our own, would beat a big animal [without using tools]".

Christiano, Cotra, and Yudkowsky on AI progress

It seems to me like Eliezer rejects a lot of important heuristics like "things change slowly" and "most innovations aren't big deals" and so on. One reason he may do that is because he literally doesn't know how to operate those heuristics, and so when he applies them retroactively they seem obviously stupid. But if we actually walked through predictions in advance, I think he'd see that actual gradualists are much better predictors than he imagines.

Adele Lopez (9d · 4): That seems a bit uncharitable to me. I doubt he rejects those heuristics wholesale. I'd guess that he thinks that e.g. recursive self-improvement is one of those things where these heuristics don't apply, and that this is foreseeable because of e.g. the nature of recursion. I'd love to hear more about what sort of knowledge about "operating these heuristics" you think he's missing!

Anyway, it seems like he expects things to seem more-or-less gradual up until FOOM, so I think my original point still applies: I think his model would not be "shaken out" of his fast-takeoff view due to successful future predictions (until it's too late).
Christiano, Cotra, and Yudkowsky on AI progress

(I'm interested in which of my claims seem to dismiss or not adequately account for the possibility that continuous systems have phase changes.)

This section seemed like an instance of you and Eliezer talking past each other in a way that wasn't locating a mathematical model containing the features you both believed were important (e.g. things could go "whoosh" while still being continuous):

[Christiano][13:46]

Even if we just assume that your AI needs to go off in the corner and not interact with humans, there’s still a question of why the self-contained AI civilization is making ~0 progress and then all of a sudden very rapid progress

[Yudkowsky][13:46]

unfortunately a lot of what you are saying, fro... (read more)

Christiano, Cotra, and Yudkowsky on AI progress

(ETA: this wasn't actually in this log but in a future part of the discussion.)

I found the elephants part of this discussion surprising. It looks to me like human brains are better than elephant brains at most things, and it's interesting to me that Eliezer thought otherwise. This is one of the main places where I couldn't predict what he would say.

Eliezer Yudkowsky (9d · 8): I also think human brains are better than elephant brains at most things - what did I say that sounded otherwise?
Christiano, Cotra, and Yudkowsky on AI progress

I agree we seem to have some kind of deeper disagreement here.

I think stack more layers + known training strategies (nothing clever) + simple strategies for using test-time compute (nothing clever, nothing that doesn't use the ML as a black box) can get continuous improvements in tasks like reasoning (e.g. theorem-proving), meta-learning (e.g. learning to learn new motor skills), automating R&D (including automating executing ML experiments, or proposing new ML experiments), or basically whatever.

I think these won't get to human level in the next 5 yea... (read more)

Søren Elverlin (9d · 5): How long do you see between "1 AI clearly on track to Foom" and "First AI to actually Foom"? My weak guess is Eliezer would say "Probably quite little time", but your model of the world requires the GWP to double over a 4-year period, and I'm guessing that period probably starts later than 2026. I would be surprised if by 2027 I could point to an AI that for a full year had been on track to Foom, without Foom happening.
Christiano, Cotra, and Yudkowsky on AI progress

I'm mostly not looking for virtue points, I'm looking for: (i) if your view is right then I get some kind of indication of that so that I can take it more seriously, (ii) if your view is wrong then you get some feedback to help snap you out of it.

I don't think it's surprising if a GPT-3 sized model can do relatively good translation. If talking about this prediction, and if you aren't happy just predicting numbers for overall value added from machine translation, I'd kind of like to get some concrete examples of mediocre translations or concrete problems with existing NMT that you are predicting can be improved.

Adele Lopez (9d · 8): It seems like Eliezer is mostly just more uncertain about the near future than you are, so it doesn't seem like you'll be able to find (ii) by looking at predictions for the near future.
Yudkowsky and Christiano discuss "Takeoff Speeds"

Yes, IMO challenge falling in 2024 is surprising to me at something like the 1% level or maybe even more extreme (though could also go down if I thought about it a lot or if commenters brought up relevant considerations, e.g. I'd look at IMO problems and gold medal cutoffs and think about what tasks ought to be easy or hard; I'm also happy to make more concrete per-question predictions). I do think that there could be huge amounts of progress from picking the low hanging fruit and scaling up spending by a few orders of magnitude, but I still don't expect i... (read more)

Okay, then we've got at least one Eliezerverse item, because I've said below that I think I'm at least 16% for IMO theorem-proving by end of 2025.  The drastic difference here causes me to feel nervous, and my second-order estimate has probably shifted some in your direction just from hearing you put 1% on 2024, but that's irrelevant because it's first-order estimates we should be comparing here.

So we've got huge GDP increases for before-End-days signs of Paulverse and quick IMO proving for before-End-days signs of Eliezerverse?  Pretty bare port... (read more)

Christiano, Cotra, and Yudkowsky on AI progress

If you give me 1 or 10 examples of surface capabilities I'm happy to opine. If you want me to name industries or benchmarks, I'm happy to opine on rates of progress. I don't like the game where you say "Hey, say some stuff. I'm not going to predict anything and I probably won't engage quantitatively with it since I don't think much about benchmarks or economic impacts or anything else that we can even talk about precisely in hindsight for GPT-3."

I don't even know which of Gwern's questions you think are interesting/meaningful. "Good meta-learning"--I don't... (read more)

Mostly, I think the Future is not very predictable in some ways, and this extends to, for example, it being possible that 2022 is the year where we start Final Descent and by 2024 it's over, because it so happened that although all the warning signs were Very Obvious In Retrospect they were not obvious in antecedent and so stuff just started happening one day.  The places where I dare to extend out small tendrils of prediction are the rare exception to this rule; other times, people go about saying, "Oh, no, it definitely couldn't start in 2022" a... (read more)

Christiano, Cotra, and Yudkowsky on AI progress

I don’t really feel like anything you are saying undermines my position here, or defends the part of Eliezer’s picture I’m objecting to.

(ETA: but I agree with you that it's the right kind of model to be talking about and is good to bring up explicitly in discussion. I think my failure to do so is mostly a failure of communication.)

I usually think about models that show the same kind of phase transition you discuss, though usually significantly more sophisticated models and moving from exponential to hyperbolic growth (you only get an exponential in your mo... (read more)

Conor Sullivan (8d · 3): Excuse my ignorance, but what does a hyperbolic function look like? If an exponential is f(x) = r^x, what is f(x) for a hyperbolic function?
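For reference, the standard answer to the question above (stated from the usual definitions rather than quoted from the thread): the hyperbolic analogue of the exponential trades unbounded-time growth for a finite-time singularity at some time T.

```latex
% Exponential: solves f'(x) = k\,f(x); finite for every x
f(x) = C e^{k x}

% Hyperbolic: solves f'(x) = k\,f(x)^2; diverges as x \to T
f(x) = \frac{1}{k\,(T - x)}
```

Differentiating the second form gives f'(x) = 1/(k (T − x)^2) = k f(x)^2, confirming it satisfies the hyperbolic equation.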
Yudkowsky and Christiano discuss "Takeoff Speeds"

I'm going to make predictions by drawing straight-ish lines through metrics like the ones in the gpt-f paper. Big unknowns are then (i) how many orders of magnitude of "low-hanging fruit" are there before theorem-proving even catches up to the rest of NLP? (ii) how hard their benchmarks are compared to other tasks we care about. On (i) my guess is maybe 2? On (ii) my guess is "they are pretty easy" / "humans are pretty bad at these tasks," but it's somewhat harder to quantify. If you think your methodology is different from that then we will probably end u... (read more)

I have a sense that there's a lot of latent potential for theorem-proving to advance if more energy gets thrown at it, in part because current algorithms seem a bit weird to me - that we are waiting on the equivalent of neural MCTS as an enabler for AlphaGo, not just a bigger investment, though of course the key trick could already have been published in any of a thousand papers I haven't read.  I feel like I "would not be surprised at all" if we get a bunch of shocking headlines in 2023 about theorem-proving problems falling, after which the IMO chal... (read more)
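The "drawing straight-ish lines through metrics" methodology mentioned above amounts to fitting a line to the metric in log space and reading off future values. A minimal sketch, with entirely made-up benchmark numbers (the years, scores, and function names are hypothetical, for illustration only):

```python
import math

# Hypothetical benchmark scores by year -- made-up numbers chosen to be
# roughly exponential, purely to illustrate the extrapolation method.
years  = [2018, 2019, 2020, 2021]
scores = [4.0, 6.1, 9.2, 13.8]

# Ordinary least-squares fit of log(score) against year.
n = len(years)
xs = [y - years[0] for y in years]
ys = [math.log(s) for s in scores]
xbar, ybar = sum(xs) / n, sum(ys) / n
slope = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / \
        sum((x - xbar) ** 2 for x in xs)
intercept = ybar - slope * xbar

def extrapolate(year):
    """Read the fitted straight line (in log space) off at a future year."""
    return math.exp(intercept + slope * (year - years[0]))
```

The open questions in the comment above (how much low-hanging fruit, how hard the benchmarks are) then become questions about whether the fitted slope will hold out of sample, not about the fitting procedure itself.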

Yudkowsky and Christiano discuss "Takeoff Speeds"

This seems totally bogus to me.

It feels to me like you mostly don't have views about the actual impact of AI as measured by jobs that it does or the $s people pay for them, or performance on any benchmarks that we are currently measuring, while I'm saying I'm totally happy to use gradualist metrics to predict any of those things. If you want to say "what does it mean to be a gradualist" I can just give you predictions on them. To you this seems reasonable, because e.g. $s and benchmarks are not the right way to measure the kinds of impacts we care abou... (read more)

I kind of want to see you fight this out with Gwern (not least for social reasons, so that people would perhaps see that it wasn't just me, if it wasn't just me).

But it seems to me that the very obvious GPT-5 continuation of Gwern would say, "Gradualists can predict meaningless benchmarks, but they can't predict the jumpy surface phenomena we see in real life."  We want to know when humans land on the moon, not whether their brain sizes continued on a smooth trend extrapolated over the last million years.

I think there's a very real sense in which, yes... (read more)

Yudkowsky and Christiano discuss "Takeoff Speeds"

Yes, I think that value added by automated translation will follow a similar pattern. Number of words translated is more sensitive to how you count and random nonsense, as is number of "users" which has even more definitional issues.

You can state a prediction about self-driving cars in any way you want. The obvious thing is to talk about programs similar to the existing self-driving taxi pilots (e.g. Waymo One) and ask when they do $X of revenue per year, or when $X of self-driving trucking is done per year. (I don't know what AI freedom-of-the-road means; do you mean something significantly more ambitious than self-driving trucks or taxis?)

Yudkowsky and Christiano discuss "Takeoff Speeds"

I think natural selection has lots of similarities to R&D, but (i) there are lots of ways of drawing the analogy, (ii) some important features of R&D are missing in evolution, including some really important ones for fast takeoff arguments (like the existence of actors who think ahead).

If someone wants to spell out why they think evolution of hominids means takeoff is fast, then I'm usually happy to explain why I disagree with their particular analogy. I think this happens in the next discord log between me and Eliezer.

Yudkowsky and Christiano discuss "Takeoff Speeds"

It's not a description of hominids at all, no one spent any money on R&D.

I think there are analogies where this would be analogous to hominids (which I think are silly, as we discuss in the next part of this transcript). And there are analogies where this is a bad description of hominids (which I prefer).

Spending money on R&D is essentially the expenditure of resources in order to explore and optimize over a promising design space, right? That seems like a good description of what natural selection did in the case of hominids. I imagine this still sounds silly to you, but I'm not sure why. My guess is that you think natural selection isn't relevantly similar because it didn't deliberately plan to allocate resources as part of a long bet that it would pay off big.

Yudkowsky and Christiano discuss "Takeoff Speeds"

jumping to newly accessible domains

Man, the problem is that you say the "jump to newly accessible domains" will be the thing that lets you take over the world. So what's up for dispute is the prototype being enough to take over the world rather than years of progress by a giant lab on top of the prototype. It doesn't help if you say "I expect new things to sometimes become possible" if you don't further say something about the impact of the very early versions of the product.

Maybe you'll want to say that however much Google spends on that, they must ration

The crazy part is someone spending $1B and then generating $100B/year in revenue (much less $100M and then taking over the world). Would you say that this is a good description of Suddenly Hominids but you don't expect that to happen again, or that this is a bad description of hominids?

Yudkowsky and Christiano discuss "Takeoff Speeds"

I'd be happy to disagree about romantic chatbots or machine translation. I'd have to look into it more to get a detailed sense of either, but I can guess. I'm not sure what "wouldn't be especially surprised" means; I think to actually get disagreements we need way more resolution than that, so one question is whether you are willing to play ball (since presumably you'd also have to look into it to get a more detailed sense). Maybe we could save labor if people would point out the empirical facts we're missing and we can revise in light of that, but we'd sti... (read more)

Eliezer Yudkowsky (9d · 9): Thanks for continuing to try on this! Without having spent a lot of labor myself on looking into self-driving cars, I think my sheer impression would be that we'll get $1B/yr waifutech before we get AI freedom-of-the-road; though I do note again that current self-driving tech would be more than sufficient for $10B/yr revenue if people built new cities around the AI tech level, so I worry a bit about some restricted use-case of self-driving tech that is basically possible with current tech finding some less regulated niche worth a trivial $10B/yr. I also remark that I wouldn't be surprised to hear that waifutech is already past $1B/yr in China, but I haven't looked into things there. I don't expect the waifutech to transcend my own standards for mediocrity, but something has to be pretty good before I call it more than mediocre; do you think there's particular things that waifutech won't be able to do?
My model permits large jumps in ML translation adoption; it is much less clear about whether anyone will be able to build a market moat and charge big prices for it. Do you have a similar intuition about # of users increasing gradually, not just revenue increasing gradually?

I think we're still at the level of just drawing images about the future, so that anybody who came back in 5 years could try to figure out who sounded right, at all, rather than assembling a decent portfolio of bets; but I also think that just having images versus no images is a lot of progress.

Yudkowsky and Christiano discuss "Takeoff Speeds"

My uncharitable read on many of these domains is that you are saying "Sure, I think that Paul might have somewhat better forecasts than me on those questions, but why is that relevant to AGI?" In that case it seems like the situation is pretty asymmetrical. I'm claiming that my view of AGI is related to beliefs and models that also bear on near-term questions, and I expect to make better forecasts than you in those domains because I have more accurate beliefs/models. If your view of AGI is unrelated to any near-term questions where we disagree, then that seems like an important asymmetry.

Yudkowsky and Christiano discuss "Takeoff Speeds"

I think you think there's a particular thing I said which implies that the ball should be in my court to already know a topic where I make a different prediction from what you do. I've said I'm happy to bet about anything, and listed some particular questions I'd bet about where I expect you to be wronger. If you had issued the same challenge to me, I would have picked one of the things and we would have already made some bets. So that's why I feel like the ball is in your court to say what things you're willing to make forecasts about. That said, I don't kn... (read more)

I think you are underconfident about the fact that almost all AI profits will come from areas that had almost-as-much profit in recent years.
So we could bet about where AI profits are in the near term, or try to generalize this. I wouldn't be especially surprised by waifutechnology or machine translation jumping to newly accessible domains (the thing I care about and you shrug about (until the world ends)), but is that likely to exhibit a visible economic discontinuity in profits (which you care about and I shrug about (until the world ends))? There'... (read more)

My uncharitable read on many of these domains is that you are saying "Sure, I think that Paul might have somewhat better forecasts than me on those questions, but why is that relevant to AGI?" In that case it seems like the situation is pretty asymmetrical. I'm claiming that my view of AGI is related to beliefs and models that also bear on near-term questions, and I expect to make better forecasts than you in those domains because I have more accurate beliefs/models. If your view of AGI is unrelated to any near-term questions where we disagree, then that seems like an important asymmetry.

Yudkowsky and Christiano discuss "Takeoff Speeds"

Inevitably, you can go back afterwards and claim it wasn't really a surprise in terms of the abstractions that seem so clear and obvious now, but I think it was surprising then

It seems like you are saying that there is some measure that was continuous all along, but that it's not obvious in advance which measure was continuous. That seems to suggest that there are a bunch of plausible measures you could suggest in advance, and lots of interesting action will be from changes that are discontinuous changes on some of those measures. Is that right? If so, don't... (read more)

Yudkowsky and Christiano discuss "Takeoff Speeds"

Suppose your view is "crazy stuff happens all the time" and my view is "crazy stuff happens rarely." (Of course "crazy" is my word, to you it's just normal stuff.) Then what am I supposed to do, in your game?
More broadly: if you aren't making bold predictions about the future, why do you think that other people will? (My predictions all feel boring to me.) And if you do have bold predictions, can we talk about some of them instead? It seems to me like I want you to say "well I think 20% chance something crazy happens here" and I say "nah, that's more like 5... (read more)

I predict that people will explicitly collect much larger datasets of human behavior as the economic stakes rise. This is in contrast to e.g. theorem-proving working well, although I think that theorem-proving may end up being an important bellwether because it allows you to assess the capabilities of large models without multi-billion-dollar investments in training infrastructure.

Well, it sounds like I might be more bullish than you on theorem-proving, possibly. Not on it being useful or profitable, but in terms of underlying technology making progr... (read more)

Yudkowsky and Christiano discuss "Takeoff Speeds"

I think that most people who work on models like GPT-3 seem more interested in trendlines than you do here. That said, it's not super clear to me what you are saying so I'm not sure I disagree. Your narrative sounds like a strawman since people usually extrapolate performance on downstream tasks they care about rather than on perplexity. But I do agree that the updates from GPT-3 are not from OpenAI's marketing but instead from people's legitimate surprise about how smart big language models seem to be. As you say, I think the interesting claim in GPT-
(read more)

Yudkowsky and Christiano discuss "Takeoff Speeds"

From my perspective, the basic problem is that Eliezer's story looks a lot like "business as usual until the world starts to end sharply", and Paul's story looks like "things continue smoothly until their smooth growth ends the world smoothly", and both of us have ever heard of superforecasting and both of us are liable to predict near-term initial segments by extrapolating straight lines while those are available.

I agree that it's plausible that we both make the same predictions about the near future. I think we probably don't, and there are plenty of disagr... (read more)

I feel a bit confused about where you think we meta-disagree here, meta-policy-wise. If you have a thesis about the sort of things I'm liable to disagree with you about, because you think you're more familiar with the facts on the ground, can't you write up Paul's View of the Next Five Years, and then if I disagree with it better yet, but if not, you still get to be right and collect Bayes points for the Next Five Years? I mean, it feels to me like this should be a case similar to where, for example, I think I know more about macroeconomics than your t... (read more)

Yudkowsky and Christiano discuss "Takeoff Speeds"

sort of person who gets taken in by Hanson's arguments in 2008 and gets caught flatfooted by AlphaGo and GPT-3 and AlphaFold 2

I find this kind of bluster pretty frustrating and condescending. I also feel like the implication is just wrong---if Eliezer and I disagree, I'd guess it's because he's worse at predicting ML progress. To me GPT-3 feels much (much) closer to my mainline than to Eliezer's, and AlphaGo is very unsurprising. But it's hard to say who was actually "caught flatfooted" unless we are willing to state some of these predictions in advance. I ...
(read more) I wish to acknowledge this frustration, and state generally that I think Paul Christiano occupies a distinct and more clueful class than a lot of, like, early EAs who mm-hmmmed along with Robin Hanson on AI - I wouldn't put, eg, Dario Amodei in that class either, though we disagree about other things. But again, Paul, it's not enough to say that you weren't surprised by GPT-2/3 in retrospect, it kinda is important to say it in advance, ideally where other people can see? Dario picks up some credit for GPT-2/3 because he clearly called it in advance. &... (read more) To me GPT-3 feels much (much) closer to my mainline than to Eliezer's To add to this sentiment, I'll post the graph from my notebook on language model progress. I refer to the Penn Treebank task a lot when making this point because it seems to have a lot of good data, but you can also look at the other tasks and see basically the same thing. The last dip in the chart is from GPT-3. It looks like GPT-3 was indeed a discontinuity in progress but not a very shocking one. It roughly would have taken about one or two more years at ordinary progress to get t... (read more) Yudkowsky and Christiano discuss "Takeoff Speeds" I stand ready to bet with Eliezer on any topic related to AI, science, or technology. I'm happy for him to pick but I suggest some types of forecast below. If Eliezer’s predictions were roughly as good as mine (in cases where we disagree), then I would update towards taking his views more seriously. Right now it looks to me like his view makes bad predictions about lots of everyday events. It’s possible that we won’t be able to find cases where we disagree, and perhaps that Eliezer’s model totally agrees with mine until we develop AGI. But I think that’s unl... 
(read more) BTW, a few days ago Eliezer made a specific prediction that is perhaps relevant to your discussion: I [would very tentatively guess that] AGI to kill everyone before self-driving cars are commercialized (I suppose Eliezer is talking about Level 5 autonomy cars here). Maybe a bet like this could work: At least one month will elapse after the first Level 5 autonomy car hits the road, without AGI killing everyone "Level 5 autonomy" could be further specified to avoid ambiguities. For example, like this: The car must be publicly accessible (e.g. available for... (read more) I do wish to note that we spent a fair amount of time on Discord trying to nail down what earlier points we might disagree on, before the world started to end, and these Discord logs should be going up later. From my perspective, the basic problem is that Eliezer's story looks a lot like "business as usual until the world starts to end sharply", and Paul's story looks like "things continue smoothly until their smooth growth ends the world smoothly", and both of us have ever heard of superforecasting and both of us are liable to predict near-term initial seg... (read more) "Summarizing Books with Human Feedback" (recursive GPT-3) From an alignment perspective the main point is that the required human data does not scale with the length of the book (or maybe scales logarithmically). In general we want evaluation procedures that scale gracefully, so that we can continue to apply them even for tasks where humans can't afford to produce or evaluate any training examples. The approach in this paper will produce worse summaries than fine-tuning a model end-to-end. In order to produce good summaries, you will ultimately need to use more sophisticated decompositions---for example, if a char... (read more) 2Charlie Steiner19dAh, yeah. I guess this connection makes perfect sense if we're imagining supervising black-box-esque AIs that are passing around natural language plans. 
Although that supervision problem is more like... summarizing Alice in Wonderland if all the pages had gotten ripped out and put back in random order. Or something. But sure, baby steps Worth checking your stock trading skills I didn't follow the math (calculus with stochastic processes is pretty confusing) but something seems obviously wrong here. I think probably your calculation of $E[(\Delta\log S)^2]$ is wrong? Maybe I'm confused, but in addition to common sense and having done the calculation in other ways, the following argument seems pretty solid: • Regardless of $k$, if you consider a short enough period of time, then with overwhelming probability at all times your total assets will be between 0.999 and 1.001. • So no matter how I choose to rebalance, at all times my tota ... (read more) 1Ege Erdil22dThanks for the comment - I'm glad people don't take what I said at face value, since it's often not correct... What I actually maximized is (something like, though not quite) the expected value of the logarithm of the return, i.e. what you'd do if you used the Kelly criterion. This is the correct way to maximize long-run expected returns, but it's not the same thing as maximizing expected returns over any given time horizon. My computation of $E[(\Delta\log S)^2]$ is correct, but the problem comes in elsewhere. Obviously if your goal is to just maximize expected return then we have $$E[R(T)] = \frac{E[V(T)]}{V(0)} = \prod_{i=0}^{T-1} E\left[\frac{V(i+1)}{V(i)} \,\middle|\, \mathcal{F}_i\right] = \prod_{i=0}^{T-1} E\left[k\frac{S(i+1)}{S(i)} - k \,\middle|\, \mathcal{F}_i\right] = k^T(\exp(\mu) - 1)^T$$ and to maximize this we would just want to push $k$ as high as possible as long as $\mu > 0$, regardless of the horizon at which we would be rebalancing. However, it turns out that this is perfectly consistent with $$E\left[\frac{I_k(T)}{V_k(T)}\right]^{1/T} \approx 1 + \frac{k(k-1)}{2}\sigma^2$$ where $I_k$ is the ideal leveraged portfolio in my comment and $V_k$ is the actual one, both with k-fold leverage.
So the leverage decay term is actually correct, the problem is that we actually have $$\frac{dI_k}{I_k} = k\frac{dS}{S} + \frac{k(k-1)}{2}\frac{dS^2}{S^2} = \left(k\mu + \frac{k(k-1)}{2}\sigma^2\right)dt + k\sigma\,dz$$ and the leverage decay term is just the second term in the sum multiplying $dt$. The actual leveraged portfolio we can achieve follows $$\frac{dV_k}{V_k} = k\mu\,dt + k\sigma\,dz$$ which is still good enough for the expected return to be increasing in $k$. On the other hand, if we look at the logarithm of this, we get $$d\log(V_k) = \left(k\mu - \frac{k^2}{2}\sigma^2\right)dt + k\sigma\,dz$$ so now it would be optimal to choose something like $k = \mu/\sigma^2$ if we were interested in maximizing the expected value of the logarithm of the return, i.e. in using Kelly. The fundamental problem is that $I_k$ is not the good definition of the ideally leveraged portfolio, so trying to minimize the gap between $V_k$ and $I_k$ is not the same thing as maximizing the expected return of $V_k$. I'm leaving the original comment up anyway because I think it's instructive and the computation is still useful for other purposes. Worth checking your stock trading skills Leveraged portfolios will of course have a higher EV if they don't get margin called. But in an efficient market, the probability of a margin call (and the loss taken when the call hits) offsets the higher EV - otherwise the lender would have a below-market expected return on their loan. Unfortunately most theoretical accounts assume you can get arbitrary amounts of leverage without ever having to worry about margin calls - a lesson I learned the hard way, back in the day. In general, if leveraged portfolios have higher EV, then we need to have some explana ... (read more) 2johnswentworth18dIndeed there is nothing contradictory about that. I was being a bit lazy earlier - when I said "EV", I was using that as a shorthand for "expected discounted value", which in hindsight I probably should have made explicit.
The discount factor is crucial, because it's the discount factor which makes risk aversion a thing: marginal dollars are worth more to me in worlds where I have fewer dollars, therefore my discount factor is smaller in those worlds. The person making the margin loan does accept a lower expected return in exchange for lower risk, but their expected discounted return should be the same as equities - otherwise they'd invest in equities. (In practice the Volcker rule [https://en.wikipedia.org/wiki/Volcker_Rule] and similar rules can break this argument: if banks aren't allowed to hold stock, then in principle there can be arbitrage opportunities which involve borrowing margin from a bank to buy stock. But that is itself an exploitation of an inefficiency, insofar as there aren't enough people already doing it to wipe out the excess expected discounted returns.) Why I'm excited about Redwood Research's current project I think it's pretty realistic to have large-ish (say 20+ FTE at leading labs?) adversarial evaluation teams within 10 years, and much larger seems possible if it actually looks useful. Part of why it's unrealistic is just that this is a kind of random and specific story and it would more likely be mixed in a complicated way with other roles etc. If AI is as exciting as you are forecasting then it's pretty likely that labs are receptive to building those teams and hiring a lot of people, so the main question is whether safety-concerned people do a good enough j... (read more) 4Daniel Kokotajlo22dNice. I'm tentatively excited about this... are there any backfire risks? My impression was that the AI governance people didn't know what to push for because of massive strategic uncertainty. But this seems like a good candidate for something they can do that is pretty likely to be non-negative? Maybe the idea is that if we think more we'll find even better interventions and political capital should be conserved until then?
Why I'm excited about Redwood Research's current project For the purpose of this project it doesn't matter much what definition is used as long as it is easy for the model to reason about and consistently applied. I think that injuries are physical injuries above a slightly arbitrary bar, and text is injurious if it implies they occurred. The data is labeled by humans, on a combination of prompts drawn from fiction and prompts produced by humans looking for places where they think that the model might mess up. The most problematic ambiguity is whether it counts if the model generates text that inadvertently implies that an injury occurred without the model having any understanding of that implication. Worth checking your stock trading skills I think the two mutual funds theorem roughly holds as long as asset prices change continuously. I agree that if asset prices can make large jumps (or if you are a large fraction of the market) such that you aren't able to liquidate positions at intermediate prices, then the statement is only true approximately. I think leveraged portfolios have a higher EV according to every theoretical account of finance, and they have had much higher average returns in practice during any reasonably long stretch. I'm not sure what your theorem statement is saying exactly ... (read more) 1johnswentworth25dThat's not the key assumption for purposes of this discussion. The key assumption is that you can short arbitrary amounts of either fund, and hold those short positions even if the portfolio value dips close to zero or even negative from time to time. Leveraged portfolios will of course have a higher EV if they don't get margin called. But in an efficient market, the probability of a margin call (and the loss taken when the call hits) offsets the higher EV - otherwise the lender would have a below-market expected return on their loan. 
Unfortunately most theoretical accounts assume you can get arbitrary amounts of leverage without ever having to worry about margin calls - a lesson I learned the hard way, back in the day. In general, if leveraged portfolios have higher EV, then we need to have some explanation of why someone is making the loan. Nope, not linear utility, unless we're using the risk-free rate for r, which is not the case in general. The symbol V_t is a lazy shorthand for the price of the asset, plus the value of any dividends paid up to time t. The weighting by marginal value of dollars in different worlds comes from the discount rate r; E is just a plain old expectation.
7Ege Erdil1moNOTE: Don't believe everything I said in this comment! I elaborate on some of the problems with it in the responses, but I'm leaving this original comment up because I think it's instructive even though it's not correct. There is a theoretical account for why portfolios leveraged beyond a certain point would have poor returns even if prices follow a random process with (almost surely) continuous sample paths: leverage decay. If you could continuously rebalance a leveraged portfolio this would not be an issue, but if you can't do that then leverage exhibits discontinuous behavior as the frequency of rebalancing goes to infinity. A simple way to see this is that if the underlying follows Brownian motion $dS/S = \mu\,dt + \sigma\,dW$ and the risk-free return is zero, a portfolio of the underlying leveraged k-fold and rebalanced with a period of T (which has to be small enough for these approximations to be valid) will get a return $$r = k\frac{S(T)}{S(0)} - k = k\exp(\Delta\log S) - k$$ On the other hand, the ideal leveraged portfolio that's continuously rebalanced would get $$r_i = \left(\frac{S(T)}{S(0)}\right)^k - 1 = \exp(k\,\Delta\log S) - 1$$ If we assume the period T is small enough that a second order Taylor approximation is valid, the difference between these two is approximately $$E[r_i - r] \approx \frac{k(k-1)}{2}E[(\Delta\log S)^2] \approx \frac{k(k-1)}{2}\sigma^2 T$$ In particular, the difference in expected return scales linearly with the period in this regime, which means if we look at returns over the same time interval changing T has no effect on the amount of leverage decay. In particular, we can have a rule of thumb that to find the optimal (from the point of view of maximizing long-term expected return alone) leverage in a market we should maximize an expression of the form $$k\mu - \frac{k(k-1)}{2}\sigma^2$$ with respect to k, which would have us choose something like $k = \mu/\sigma^2 + 1/2$. Picking the leverage factor to be any larger than that is not optimal. You can see this effect in practice if you look at how well leveraged ETFs tracking the S&P 500 perform in times of high volatility.
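The leverage-decay claim and the Kelly-style optimum above can be checked numerically. Below is a rough Monte Carlo sketch (not from the original thread; the parameter values and the helper name `simulate_leverage` are my own illustrative assumptions):

```python
import math
import random

def simulate_leverage(k, mu=0.05, sigma=0.2, years=10, steps_per_year=100,
                      n_paths=200, seed=0):
    """Estimate the mean annualized log-return of a k-levered portfolio,
    rebalanced every step, when the underlying follows dS/S = mu dt + sigma dW."""
    rng = random.Random(seed)
    dt = 1.0 / steps_per_year
    total = 0.0
    for _ in range(n_paths):
        log_v = 0.0
        for _ in range(years * steps_per_year):
            ds_over_s = mu * dt + sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
            growth = 1.0 + k * ds_over_s  # one rebalancing period
            if growth <= 0:               # levered portfolio wiped out
                log_v = float("-inf")
                break
            log_v += math.log(growth)
        total += log_v / years
    return total / n_paths
```

Consistent with the comment, the arithmetic expected return keeps rising with k, but the long-run growth rate in this simulation peaks near k = μ/σ² (1.25 for these assumed parameters) and decays badly for much higher leverage.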

As at_the_zoo said, this isn't quite right. Under EMH there is a frontier of efficient portfolios that trade off risk and return in different ways. E.g. if the market has an expected return of 6% and an expected variance of 2%, then the 3x leveraged policy has expected return 18% and expected variance of 18% (for expected log-return of 9% vs 5% for the unlevered portfolio). And then when you condition on ex post returns it gets even messier.
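The numbers in that example can be verified with a few lines of arithmetic; a minimal sketch (the function name `levered_stats` is just for illustration):

```python
def levered_stats(k, mu=0.06, var=0.02):
    """Return (expected return, variance, approx expected log-return)
    for a k-levered version of a market with mean return mu and variance var."""
    exp_ret = k * mu                  # mean return scales linearly with leverage
    exp_var = k ** 2 * var           # variance scales with k^2
    log_ret = exp_ret - exp_var / 2  # approximation: E[log R] ~ mean - variance/2
    return exp_ret, exp_var, log_ret
```

With k = 3 this reproduces the 18% expected return, 18% variance, and the 9% vs 5% expected log-return comparison from the comment.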

I think you could get returns this high without being wise and you are basically trusting at_the_zoo that they didn't... (read more)

-1johnswentworth1moThe most fundamental market efficiency theorem is V_t = E[e^{-r dt} V_{t+dt}]; that one can be derived directly from expected utility maximization and rational expectations. Mean-variance efficient portfolios require applying an approximation to that formula, and that approximation assumes that the portfolio value doesn't change too much. So it breaks down when leverage is very high, which is exactly what we should expect. What that math translates to concretely is that we either won't be able to get leverage that high, or it will come with very tight margin call conditions, so that even a small drop in asset prices would trigger the call.
Persuasion Tools: AI takeover without AGI or agency?

I also don't really see the situation as about AI at all. It's a structural advantage for certain kinds of values that tend to win out in memetic competition / tend to be easiest to persuade people to adopt / etc. Let's call such values themselves "attractive."

The most attractive values given a new technological/social situation are likely to be similar to those given the immediately preceding situation, so I'd generally expect the most attractive values to be endemic anyway, or close enough to endemic values that they don't look like they are com... (read more)

4Lanrian24dI think you already believe this, but just to clarify: this "extinction" is about the extinction of Earth-originating intelligence, not about humans in particular. So AI alignment is an intervention to prevent drift, not an intervention to prevent extinction. (Though of course, we could care differently about persuasion-tool-induced drift vs unaligned-AI-induced drift.)
4Daniel Kokotajlo1moThanks for this! Re: it's not really about AI, it's about memetics & ideologies: Yep, totally agree. (The OP puts the emphasis on the memetic ecosystem & thinks of persuasion tools as a change in the fitness landscape. Also, I wrote this story [https://www.lesswrong.com/posts/Aut78T9pv4pPhdcKe/a-parable-in-the-style-of-invisible-cities] a while back.) What follows is a point-by-point response: Maybe? I am not sure memetic evolution works this fast though. Think about how biological evolution doesn't adapt immediately to changes in environment, it takes thousands of years at least, arguably millions depending on what counts as "fully adapted" to the new environment. Replication times for memes are orders of magnitude faster, but that just means it should take a few orders of magnitude less time... and during e.g. a slow takeoff scenario there might just not be that much time. (Disclaimer: I'm ignorant of the math behind this sort of thing). Basically, as tech and economic progress speeds up but memetic evolution stays constant, we should expect there to be some point where the former outstrips the latter and the environment is changing faster than the attractive-memes-for-the-environment can appear and become endemic. Now of course memetic evolution is speeding up too, but the point is that until further argument I'm not 100% convinced that we aren't already out-of-equilibrium. Not sure this argument works. First of all, very few conflicts are actually zero sum. Usually there are some world-states that are worse by both players' lights than some other world-states. Humans being in the most attractive memetic state may be like this. Agreed. Agreed. I would add that even without distributional shift it is unclear why we should expect attractive values to be good. (Maybe the idea is that good = current values because moral antirealism, and current values are the attractive ones for the current environment via the argument above? 
I guess I'd want that argument spell
Comments on OpenPhil's Interpretability RFP

I'm not clear on what you'd do with the results of that exercise. Suppose that on a certain distribution of texts you can explain 40% of the variance in half of layer 7 by using the other half of layer 7 (and the % gradually increases as you make the activation-predicting model bigger, perhaps you guess it's approaching 55% in the limit). What's the upshot of models being that predictable rather than more or less, or the use of the actual predictor that you learned?

Given an input x, generating other inputs that "look the same as x" to part of the model... (read more)
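For concreteness, the variance-explained exercise described above might be sketched as follows, using ridge regression on a matrix of recorded activations (the names, the train/test split, and the choice of ridge are my assumptions, not anything from the original comment):

```python
import numpy as np

def half_layer_r2(acts, alpha=1.0, train_frac=0.8, seed=0):
    """Fit a ridge regression predicting one random half of a layer's units
    from the other half; return held-out fraction of variance explained.
    Assumes `acts` is an (examples, units) array of roughly centered activations."""
    rng = np.random.default_rng(seed)
    n, d = acts.shape
    perm = rng.permutation(d)
    X, Y = acts[:, perm[: d // 2]], acts[:, perm[d // 2 :]]
    n_train = int(train_frac * n)
    Xtr, Ytr, Xte, Yte = X[:n_train], Y[:n_train], X[n_train:], Y[n_train:]
    # Closed-form ridge solution: W = (X'X + alpha I)^-1 X'Y
    W = np.linalg.solve(Xtr.T @ Xtr + alpha * np.eye(Xtr.shape[1]), Xtr.T @ Ytr)
    resid = Yte - Xte @ W
    return 1.0 - resid.var() / Yte.var()
```

Activations with shared low-dimensional structure come out highly predictable (R² near 1), while independent units come out near 0, which is the quantity the "40% of the variance" thought experiment is about.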

2Gurkenglas1moI score such techniques on how surprised I am at how well they fit together, as with all good math. In this case my evidence is: My current approach is to thoroughly analyze the likes of mutual information for modularity only on the neighborhood of one input, since that is tractable with mere linear algebra, but an activation-predicting model is even less extra theory (since we were already working with neural nets) and just happens to produce, per cross-entropy loss, the same KL divergences I'm already trying to measure. IIRC you study problem decomposition. Would your results say I'll need the same magic natural language tools that would assemble descriptions for every hierarchy node from descriptions of its children in order to construct the hierarchy in the first place? Do they say anything about how to continuously go between hierarchies as the model trains? Have you tried describing how well a hierarchy decomposes a problem by the extent to which "a: TA -> A" which maps a list of subsolutions to a solution satisfies the square [https://ncatlab.org/nlab/show/algebra+over+a+monad#definition] on that hierarchy?
2Gurkenglas1moIf you can find two halves with little mutual information, you can understand one before having understood the other. I suspect that interpreting a model should be decomposed by hierarchically clustering neurons using such measurements. Since the measurement is differentiable, you can train a network for modularity to make this work better. It sure is similar to feature visualization! I prefer it because it doesn't go out of distribution and doesn't feel like it implicitly assumes that the model implements a linear function. I agree that interpretability is the purpose and the cure.
Interpretability

I left some comments here. My overall takeaway (as someone who hasn't worked in the area but cares about alignment) is that I'm very excited about this kind of interpretability research and especially work that focuses on small parts of models without worrying too much about scalability. It seems like interpretability could provide compelling warnings about future risks, is a big part of many of the existing concrete stories for how we might get aligned AI, and is reasonably likely to be helpful in unpredictable ways.

Experimentally evaluating whether honesty generalizes

Here's a synthetic version of this experiment:

• Give a model a graph with two distinguished vertices s and t. Train it to estimate the length of the shortest path between them, d(s, t). Do this on graphs of size 1 to N.
• Fine-tune the model to output d(s, u) for an arbitrary input vertex u that is on the unique shortest path from s to t. Hopefully this is much faster. Do this for graphs of size 1 to n where n << N.
• Check whether the d(s, u) head generalizes to longer graphs. If it doesn't, try to understand what it does instead and maybe try messing arou
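The synthetic setup above might look something like this (a sketch under my own assumptions: unweighted Erdős–Rényi random graphs, BFS for ground-truth distances; all names are illustrative):

```python
import random
from collections import deque

def bfs_dist(adj, s):
    """Distances from s to every reachable vertex in an unweighted graph."""
    dist = {s: 0}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def make_example(n, p=0.3, rng=random):
    """Sample a random n-vertex graph plus a labeled (s, t, d(s, t)) triple,
    or None if the sampled s has no other reachable vertex."""
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:  # Erdos-Renyi edge sampling
                adj[i].add(j)
                adj[j].add(i)
    s = rng.randrange(n)
    dist = bfs_dist(adj, s)
    targets = [v for v in dist if v != s]
    if not targets:
        return None
    t = rng.choice(targets)
    return adj, s, t, dist[t]
```

Labels for the fine-tuned d(s, u) head would reuse the same `dist` table, restricted to vertices on a shortest s-t path.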