Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

This is an experiment in having an Open Thread dedicated to AI Alignment discussion, hopefully enabling researchers and upcoming researchers to ask small questions they are confused about, share very early stage ideas and have lower-key discussions.

New Comment
96 comments, sorted by Click to highlight new comments since:
Some comments are truncated due to high volume. (⌘F to expand all)Change truncation settings
[-]Wei DaiΩ7140

Has anyone seen this argument for discontinuous takeoff before? I propose that there will be a discontinuity in AI capabilities at the time that the following strategy becomes likely to succeed:

  1. Use hacking or phishing to take over a computing center belonging to someone else.
  2. Expand self (i.e., the AI executing the current strategy) into the new computing center.
  3. Repeat steps 1 & 2 on other computing centers (in increasing order of their security) using the increased capabilities of the expanded AI.
  4. Defend self and figure out how to take over or neutralize the rest of the world.

The reason for the discontinuity is that this strategy is an all-or-nothing kind of thing. There is a threshold in the chance of success in taking over other people's hardware, below which you're likely to get caught and punished/destroyed before you take over the world (and therefore almost nobody attempts it, and the few who do just quickly get caught), and above which the above strategy becomes feasible.

There's previously been the "an AI could achieve a discontinuous takeoff by exploiting a security vulnerability to copy itself into lots of other computers" argument in at least Sotala 2012 (sect 4.1.) and Sotala & Yampolskiy 2015 (footnote 15), though those don't explicitly mention the "use the additional capabilities to break into even more systems" part. (It seems reasonably implicitly there to me, but that might just be illusion of transparency speaking.)
5Jalex Stark
I think Bostrom uses the term "hardware overhang" in Superintelligence to point to a cluster of discontinuous takeoff scenarios including this one
7Wei Dai
It seems to me that there's a counter-argument available to the "hardware overhang" argument for discontinuous takeoff that doesn't apply to the "hacking" argument, namely that for any AI that achieves a high level of capability by taking advantage of hardware overhang, there will be an AI that arrives a bit earlier and achieves a somewhat lower level of capability by taking advantage of the same hardware overhang (e.g., because it has somewhat worse algorithms, or somewhat less or lower quality training data). Unlike the "hacking" scenario, in the generic "hardware overhang" scenario, there's not an apparent threshold effect that could cause a discontinuity. (Curiously, Paul Christiano's and AI Impacts's posts arguing against discontinuous takeoff both ignore "hardware overhang" and neither give this counter-argument. Neither of them mention the "hacking" argument either, AFAICT.)
Wasn't hardware overhang the argument that if AGI is more bottlenecked by software than hardware, then conceptual insights on the software side could cause a discontinuity as people suddenly figured out how to use that hardware effectively? I'm not sure how your counterargument really works there, since the AI that arrives "a bit earlier" either precedes or follows that conceptual breakthrough. If it precedes the breakthrough, then it doesn't benefit from that conceptual insight so won't be powerful enough to take advantage of the overhang, and if it follows it, then it has a discontinuous advantage over previous systems and can take advantage of hardware overhang. --- Separately, your comment also feels related to my argument that focusing on just superintelligence is a useful simplifying assumption, since a superintelligence is almost by definition capable of taking over the world. But it simplifies things a little too much, because if we focus too much on just the superintelligence case, we might miss the emergence of a “dumb” AGI which nevertheless had the "crucial capabilities" necessary for a world takeover. In those terms, "having sufficient offensive cybersecurity capability that a hacking attempt can snowball into a world takeover" would be one such crucial capability that allowed for a discontinuity.
2David Scott Krueger (formerly: capybaralet)
Yes. Not a direct response: It's been argued (e.g. I think Paul said this in his 2nd 80k podcast interview?) that this isn't very realistic, because the low-hanging fruit (of easy to attack systems) is already being picked by slightly less advanced AI systems. This wouldn't apply if you're *already* in a discontinuous regime (but then it becomes circular). Also not a direct response: It seems likely that some AIs will be much more/less cautious than humans, because they (e.g. implicitly) have very different discount rates. So AIs might take very risky gambles, which means both that we might get more sinister stumbles (good thing), but also that they might readily risk the earth (bad thing).
1Jess Smith
I wonder how plausible it is that the AI would be able to take over a second computing center before being detected in the first. (Which would then presumably be shut down)

(Short writeup for the sake of putting the idea out there)

AI x-risk people often compare coordination around AI to coordination around nukes. If we ignore military applications of AI and restrict ourselves to misalignment, this seems like a weird analogy to me:

  • With technical AI safety we're primarily thinking about accident risks, whereas nukes are deliberately weaponized.
  • Everyone can agree that we don't want nuclear accidents, so why can't everyone agree we don't want AI accidents? I think the standard response here is "everyone will trade off safety for capabilities", but did that happen with nukes?
  • I don't see any analog to mutually assured destruction, which seems like a pretty key feature with nukes.

Perhaps a more appropriate nuclear analogy for AI x-risk would be accidents like Chernobyl.

There is a nuclear analog for accident risk. A quote from Richard Hamming:

Shortly before the first field test (you realize that no small scale experiment can be done—either you have a critical mass or you do not), a man asked me to check some arithmetic he had done, and I agreed, thinking to fob it off on some subordinate. When I asked what it was, he said, "It is the probability that the test bomb will ignite the whole atmosphere." I decided I would check it myself! The next day when he came for the answers I remarked to him, "The arithmetic was apparently correct but I do not know about the formulas for the capture cross sections for oxygen and nitrogen—after all, there could be no experiments at the needed energy levels." He replied, like a physicist talking to a mathematician, that he wanted me to check the arithmetic not the physics, and left. I said to myself, "What have you done, Hamming, you are involved in risking all of life that is known in the Universe, and you do not know much of an essential part?" I was pacing up and down the corridor when a friend asked me what was bothering me. I told him. His reply was, "Never mind, Hamming, no one will ever blame you."

2Rohin Shah
I don't really know what this is meant to imply? Maybe you're answering my question of "did that happen with nukes?", but I don't think an affirmative answer means that the analogy starts to work. I think the nukes-AI analogy is used to argue "people raced to develop nukes despite their downsides, so we should expect the same with AI"; the magnitude/severity of the accident risk is not that relevant to this argument.
[-]Wei DaiΩ6150

I think the nukes-AI analogy is used to argue "people raced to develop nukes despite their downsides, so we should expect the same with AI"

If you're arguing against that, I'm still not sure what your counter-argument is. To me, the argument is: the upsides of nukes are the ability to take over the world (militarily) and to defend against such attempts. The downsides include risks of local and global catastrophe. People raced to develop nukes because they judged the upsides to be greater than the downsides, in part because they're not altruists and longtermists. It seems like people will develop potentially unsafe AI for analogous reasons: the upsides include the ability to take over the world (militarily or economically) and to defend against such attempts, and the downsides include risks of local and global catastrophe, and people will likely race to develop AI because they judge the upsides to be greater than the downsides, in part because they're not altruists and longtermists.

Where do you see this analogy breaking down?

2Rohin Shah
I'm more sympathetic to this argument (which is a claim about what might happen in the future, as opposed to what is happening now, which is the analogy I usually encounter, though possibly not on LessWrong). I still think the analogy breaks down, though in different ways: * There is a strong norm of openness in AI research (though that might be changing). (Though perhaps this was the case with nuclear physics too.) * There is a strong anti-government / anti-military ethic in the AI research community. I'm not sure what the nuclear analog is, but I'm guessing it was neutral or pro-government/military. * Governments are staying a mile away from AGI; their interest in AI is in narrow AI's applications. Narrow AI applications are diverse, and many can be done by a huge number of people. In contrast, nukes are a single technology, governments were interested in them, and only a few people could plausibly build them. (This is relevant if you think a ton of narrow AI could be used to take over the world economically.) * OpenAI / DeepMind are not adversarial towards each other. In contrast, US / Germany were definitely adversarial.
[-]Wei DaiΩ4120

Assuming you agree that people are already pushing too hard for progress in AGI capability (relative to what's ideal from a longtermist perspective), I think the current motivations for that are mostly things like money, prestige, scientific curiosity, wanting to make the world a better place (in a misguided/shorttermist way), etc., and not so much wanting to take over the world or to defend against such attempts. This seems likely to persist in the near future, but my concern is that if AGI research gets sufficiently close to fruition, governments will inevitably get involved and start pushing it even harder due to national security considerations. (Recall that Manhattan Project started 8 years before detonation of the first nuke.) Your argument seems more about what's happening now, and does not really address this concern.

2Rohin Shah
I'm uncertain, given the potential for AGI to be used to reduce other x-risks. (I don't have strong opinions on how large other x-risks are and how much potential there is for AGI to differentially help.) But I'm happy to accept this as a premise. I think what's happening now is a good guide into what will happen in the future, at least on short timelines. If AGI is >100 years away, then sure, a lot will change and current facts are relatively unimportant. If it's < 20 years away, then current facts seem very relevant. I usually focus on the shorter timelines. For min(20 years, time till AGI), for each individual trend I identified, I'd weakly predict that trend will continue (except perhaps openness, because that's already changing).

It wasn't meant as a reply to a particular thing - mainly I'm flagging this as an AI-risk analogy I like.

On that theme, one thing "we don't know if the nukes will ignite the atmosphere" has in common with AI-risk is that the risk is from reaching new configurations (e.g. temperatures of the sort you get out of a nuclear bomb inside the Earth's atmosphere) that we don't have experience with. Which is an entirely different question than "what happens with the nukes after we don't ignite the atmosphere in a test explosion".

I like thinking about coordination from this viewpoint.

For me it's because:

  • Nukes seem like an obvious Xrisk
  • People mostly seem to agree that we haven't done a good job coordinating around them
  • They seem a lot easier to coordinate around

Also, not a reason, but:

AI seems likely to be weaponized, and warfare (whether conventional or not) seems like one of the areas where we should be most worried about "unbridled competition" creating a race-to-the-bottom on safety.

5David Scott Krueger (formerly: capybaralet)
TBC, I think climate change is probably an even better analogy. And I also like to talk about international regulation, in general, like with tax havens.
2Rohin Shah
Agree that climate change is a better analogy. Disagree that nukes seem easier to coordinate around -- there are factors that suggest this (e.g. easier to track who is and isn't making nukes), but there are factors against as well (the incentives to "beat the other team" don't seem nearly as strong).
1David Scott Krueger (formerly: capybaralet)
You mean it's stronger for nukes than for AI? I think I disagree, but it's a bit nuanced. It seems to me (as someone very ignorant about nukes) like with current nuclear tech you hit diminishing returns pretty fast, but I don't expect that to be the case for AI. Also, I'm curious if weaponization of AI is a crux for us.
3Rohin Shah
I'm uncertain about weaponization of AI (and did say "if we ignore military applications" in the OP).
1David Scott Krueger (formerly: capybaralet)
Oops, missed that, sry.
I agree that the coordination games between nukes and AI are different, but I still think that nukes make for a good analogy. But not after multiple parties have developed them. Rather I think key elements of the analogy is the game changing and decisive strategic advantage that nukes/AI grant once one party develops them. There aren't too many other technologies that have that property. (maybe the bronze-iron age transition?) Where the analogy breaks down is with AI safety. If we get AI safety wrong there's a risk of large permanent negative consequences. A better analogy might be living near the end of WW2, but if you build a nuclear bomb incorrectly, it ignites the atmosphere and destroys the world. In either case, under this model, you end up with the following outcomes: * (A): Either party incorrectly develops the technology * (B): The other party successfully develops the technology * (C): My party successfully develops the technology and generally a preference ordering of A<B<C, although a sufficiently cynical actor might have B<A<C. If there's a sufficiently shallow trade-off between speed of development and the risk of error, this can lead to a dollar auction like dynamic where each party is incentivized to trade a bit more risk in order to develop the technology first. In a symmetric situation without coordination, the equilibrium nash equilibrium is all parties advancing as quickly as possible to develop the technology and throwing caution to the wind.
5Rohin Shah
Really? It seems like if I've raised my risk level to 99% and the other team has raised their risk level to 98% (they are slightly ahead), one great option for me is to commit not to developing the technology and let the other team develop the technology at risk level ~1%. This gets me an expected utility of 0.99B + 0.01A, which is probably better than the 0.01C + 0.99A that I would otherwise have gotten (assuming I developed the technology first). I am assuming common knowledge here, but I am not assuming coordination. See also OpenAI Charter.
Interesting. I had the Nash equilibrium in mind, but it's true that unlike a dollar auction, you can de-escalate, and when you take into account how your opponent will react to you changing your strategy, doing so becomes viable. But then you end up with something like a game of chicken, where ideally, you want to force your opponent to de-escalate first, as this tilts the outcomes toward option C rather than B.
5Robert Miles
Yeah, nuclear power is a better analogy than weapons, but I think the two are linked, and the link itself may be a useful analogy, because risk/coordination is affected by the dual-use nature of some of the technologies. One thing that makes non-proliferation difficult is that nations legitimately want nuclear facilities because they want to use nuclear power, but 'rogue states' that want to acquire nuclear weapons will also claim that this is their only goal. How do we know who really just wants power plants? And power generation comes with its own risks. Can we trust everyone to take the right precautions, and if not, can we paternalistically restrict some organisations or states that we deem not capable enough to be trusted with the technology? AI coordination probably has these kinds of problems to an even greater degree.
Opposition to and heavy regulation of nuclear reactors is mostly about accidents, not weapons (though at least some of the effort into tracking the material is about weapons). Everyone agrees we don't want accidents, not everyone agrees how much we should give up to prevent 100% of accidents. We have, in fact, had significant accidents. Also, accidents with weapons are definitely a thing. Human regulation and cooperation is unsolved, so even knowing the difference between accident and intent is actually somewhat hard to define for many group activities.
3Rohin Shah
I agree with this; I'm not sure what point you're trying to make? Perhaps you're suggesting that the fact that its accident risk rather than weapons risk doesn't mean that we're safe, in which case I agree. I'm only suggesting that people stop using the analogy to nukes because its misleading, I'm not saying that there's no risk as a result.
2Matthew Barnett
Perhaps the appropriate analogy here would be two teams which both say "The other team is going to get to AI first if we don't, and we prefer misalignment to losing, so we might as well push ahead." The disanalogy here is that it's not adversarial in the sense of being destructive (although it could be if they are enemies). But it's analogous in the sense that they could either both decide to do nothing, or both decide to take the action. If they decide to take the action, they will both ensure their own destruction in the case of misalignment.
3Rohin Shah
This still feels more analogous to Chernobyl? "The other team is going to get cheap nuclear energy first if we don't, and we prefer a nuclear accident to losing, so we might as well push ahead." You might argue that obviously it doesn't matter very much who gets nuclear energy first, so this wouldn't apply. I'd respond that the benefit : cost ratio here seems similar to the benefit : cost ratio for AI where the benefit is "we build a singleton" and the cost is "misaligned AGI causes extinction". Surely it's significantly better for the other team to win and build a singleton than for you to build a misaligned AGI? (Separately, I think I would argue that the "we build a singleton" case is unlikely, but that's not a crucial part of this argument.)

It seems to me that many people believe something like "We need proof-level guarantees, or something close to it, before we build powerful AI". I could interpret this in two different ways:

  • Normative claim: "Given how bad extinction is, and the plausibility of AI x-risk, it would be irresponsible of us to build powerful AI before having proof-level guarantees that it will be beneficial".
  • Empirical claim: "If we run a powerful AI system without having something like a proof of the statement 'running this AI system will be beneficial', then catastrophe is nearly inevitable".

I am uncertain on the normative claim (there might be great benefits to building powerful AI sooner, including the reduction of other x-risks), and disagree with the empirical claim.

If I had to argue briefly for the empirical claim, it would go something like this: "Since powerful AI will be world-changing, it will either be really good, or really bad -- neutral impact is too implausible. But due to fragility of value, the really bad outcomes are far more likely. The only way to get enough evidence to rule out all of the bad outcomes is to have a proof that the AI system ... (read more)

My thoughts: we can't really expect to prove something like "this ai will be beneficial". However, relying on empiricism to test our algorithms is very likely to fail, because it's very plausible that there's a discontinuity in behavior around the region of human-level generality of intelligence (specifically as we move to the upper end, where the system can understand things like the whole training regime and its goal systems). So I don't know how to make good guesses about the behavior of very capable systems except through mathematical analysis.

There are two overlapping traditions in machine learning. There's a heavy empirical tradition, in which experimental methodology is used to judge the effectiveness of algorithms along various metrics. Then, there's machine learning theory (computational learning theory), in which algorithms are analyzed mathematically and properties are proven. This second tradition seems far more applicable to questions of safety.

(But we should not act as if we only have one historical example of a successful scientific field to try and generalize from. We can also look at how other fields accomplish difficult things, especially in the face of significant risks.)

I don't think you need to posit a discontinuity to expect tests to occasionally fail.

I suspect the crux is more about how bad a single failure of a sufficiently advanced AI is likely to be.

I'll admit I don't feel like I really understand the perspective of people who seem to think we'll be able to learn how to do alignment via trial-and-error (i.e. tolerating multiple failures). Here are some guesses why people might hold that sort of view:

  • We'll develop AI in a well-designed box, so we can do a lot of debugging and stress testing.
    • counter-argument: but the concern is about what happens at deployment time
  • We'll deploy AI in a box, too then
    • counter: seems like that entails a massive performance hit (but it's not clear if that's actually the case)
  • We'll have other "AI police" to stop any "evil AIs" that "go rogue" (just like we have for people).
    • counter: where did the AI police come from, and why can't they go rogue as well?
  • The "AI police" can just be the rest of the AIs in the world ganging up on anyone who goes rogue.
    • counter: this seems to be assuming the "corrigibility as basin of attraction
... (read more)
3Rohin Shah
I hold this view; none of those are reasons for my view. The reason is much more simple -- before x-risk level failures, we'll see less catastrophic (but still potentially very bad) failures for the same underlying reason. We'll notice this, understand it, and fix the issue. (A crux I expect people to have is whether we'll actually fix the issue or "apply a bandaid" that is only a superficial fix.)

Yeah, this is why I think some kind of discontinuity is important to my case. I expect different kinds of problems to arise with very very capable systems. So I don't see why it makes sense to expect smaller problems to arise first which indicate the potential larger problems and allow people to avert them before they occur.

If a case could be made that all potential problems with very very capable systems could be expected to first arise in survivable forms in moderately capable systems, then I would see how the more empirical style of development could give rise to safe systems.

3David Scott Krueger (formerly: capybaralet)
Can you elaborate on what kinds of problems you expect to arise pre vs. post discontinuity? E.g. will we see "sinister stumbles" (IIRC this was Adam Gleave's name for half-baked treacherous turns)? I think we will, FWIW. Or do you think the discontinuity will be more in the realm of embedded agency style concerns (and how does this make it less safe, instead of just dysfunctional?) How about mesa-optimization? (I think we already see qualitatively similar phenomena, but my idea of this doesn't emphasize the "optimization" part.) Jessica's posts about MIRI vs. Paul's views made it seem like MIRI might be quite concerned about the first AGI arising via mesa-optimization. This seems likely to me, and would also be a case where I'd expect, unless ML becomes "woke" to mesa-optimization (which seems likely to happen, and not too hard to make happen, to me), we'd see something that *looks* like a discontinuity, but is *actually* more like "the same reason".
4Rob Bensinger
This in particular doesn't match my model. Quoting some relevant bits from Embedded Agency: This is also the topic of The Rocket Alignment Problem.
5David Scott Krueger (formerly: capybaralet)
Interesting. Your crux seems good; I think it's a crux for us. I expect things play out more like Eliezer predicts here: I also predict that there will be types of failure we will not notice, or will misinterpret. It seems fairly likely to me proto-AGI (i.e. AI that could autonomously learn to become AGI within <~10yrs of acting in the real world) is deployed and creates proto-AGI subagents, some of which we don't become aware of (e.g. because accidental/incidental/deliberate steganography) and/or are unable to keep track of. And then those continue to survive and reproduce, etc... I guess this only seems plausible if the proto-AGI has a hospitable environment (like the internet, human brains/memes) and/or means of reproduction in the real world. A very similar problem would be a form of longer-term "seeding", where an AI (at any stage) with a sufficiently advanced model of the world and long horizons discovers strategies for increasing the chances ("at the margin") that its values dominate in the long-term future. With my limited knowledge of physics, I imagine there might be ways of doing this just by beaming signals into space in a way calculated to influence/spur the development of life/culture in other parts of the galaxy. I notice a lot of what I said above makes less sense if you think of AIs as having a similar skill profile to humans, but I think we agree that AIs might be much more advanced than people in some respects while still falling short of AGI because of weaknesses in other areas. That observation also cuts against the argument you make about warning signs, I think, as it suggests that we might significantly underestimate an AIs (e.g. vastly superhuman) skill in some areas, if it still fails at some things we think are easy. To pull an example (not meant to be realistic) out of a hat: we might have AIs that can't
3Aleksi Liimatainen
Now I'm wondering if it makes sense to model past or present cognitive-cultural information processes in a similar fashion. Memetic and cultural evolutions are a thing and any agentlike processes that spawn could piggypack on our existing general intelligence architecture.
3David Scott Krueger (formerly: capybaralet)
Yeah, I think it totally does! (and that's a very interesting / "trippy" line of thought :D) However, it does seem to me somewhat unlikely, since it does require fairly advanced intelligence, and I don't think evolution is likely to have produced such advanced intelligence with us being totally unaware, whereas I think something about the way we train AI is more strongly selecting for "savant-like" intelligence, which is sort of what I'm imagining here. I can't think of why I have that intuition OTTMH.
2Rohin Shah
Nobody denies that AI is really good at extracting patterns out of statistical data (e.g. image classification, speech-to-text, and so on), even though AI is absolutely terrible at many "easy" things. This, and the linked comment from Eliezer, seem to be drastically underselling the competence of AI researchers. (I could imagine it happening with strong enough competitive pressures though.) All of this assumes some very good long-term planning capabilities. I expect long-term planning to be one of the last capabilities that AI systems get. If I thought they would get them early, I'd be more worried about scenarios like these.
9David Scott Krueger (formerly: capybaralet)
So I don't take EY's post as about AI researchers' competence, as much as their incentives and levels of rationality and paranoia. It does include significant competitive pressures, which seems realistic to me. I don't think I'm underestimating AI researchers, either, but for a different reason... let me elaborate a bit: I think there are waaaaaay to many skills for us to hope to have a reasonable sense of what an AI is actually good at. By skills I'm imagining something more like options, or having accurate generalized value functions (GVFs), than tasks. Regarding long-term planning, I'd factor this into 2 components: 1) having a good planning algorithm 2) having a good world model I think the way long-term planning works is that you do short-term planning in a good hierarchical world model. I think AIs will have vastly superhuman planning algorithms (arguably, they already do), so the real bottleneck is the world-model. I don't think its necessary to have a very "complete" world-model (i.e. enough knowledge to look smart to a person) in order to find "steganographic" long-term strategies like the ones I'm imagining. I also don't think it's even necessary to have anything that looks very much like a world-model. The AI can just have a few good GVFs.... (i.e. be some sort of savant).
2Rohin Shah
I don't think the only alternative to proof is empiricism. Lots of people reason about evolutionary biology/psychology with neither proof nor empiricism. The mesa optimizers paper involves neither proof nor empiricism. You can also be empirical at that point though? I suppose you couldn't be empirical if you expect an either an extremely fast takeoff (i.e. order one day or less) or an inability on our part to tell when the AI reaches human-level, but this seems overly pessimistic to me.

The mesa-optimizer paper, along with some other examples of important intellectual contributions to AI alignment, have two important properties:

  • They are part of a research program, not an end result. Rough intuitions can absolutely be a useful guide which (hopefully eventually) helps us figure out what mathematical results are possible and useful.
  • They primarily point at problems rather than solutions. Because (it seems to me) existential risk seems asymmetrically bad in comparison to potential technology upsides (large as upsides may be), I just have different standards of evidence for "significant risk" vs "significant good". IE, an argument that there is a risk can be fairly rough and nonetheless be sufficient for me to "not push the button" (in a hypothetical where I could choose to turn on a system today). On the other hand, an argument that pushing the button is net positive has to be actually quite strong. I want there to be a small set of assumptions, each of which individually seem very likely to be true, which taken together would be a guarantee against catastrophic failure.

[This is an "or" condition -- either one of those two condit... (read more)

4Rohin Shah
This is a normative argument, not an empirical one. The normative position seems reasonable to me, though I'd want to think more about it (I haven't because it doesn't seem decision-relevant). The quick version is that to the extent that the system is adversarially optimizing against you, it had to at some point learn that that was a worthwhile thing to do, which we could notice. (This is assuming that capable systems are built via learning; if not then who knows what'll happen.)
I am confused about how the normative question isn't decision-relevant here. Is it that I have a model where it is the relevant question, but you have one where it isn't? To be hopefully clear: I'm applying this normative claim to argue that proof is needed to establish the desired level of confidence. That doesn't mean direct proof of the claim "the AI will do good", but rather of supporting claims, perhaps involving the learning-theoretic properties of the system (putting bounds on errors of certain kinds) and such. It's possible that this isn't my true disagreement, because actually the question seems more complicated than just a question of how large potential downsides are if things go poorly in comparison to potential upsides if things go well. But some kind of analysis of the risks seems relevant here -- if there weren't such large downside risks, I would have lower standards of evidence for claims that things will go well. It sounds like we would have to have a longer discussion to resolve this. I don't expect this to hit the mark very well, but here's my reply to what I understand: * I don't see how you can be confident enough of that view for it to be how you really want to check. * A system can be optimizing a fairly good proxy, so that at low levels of capability it is highly aligned, but this falls apart as the system becomes highly capable and figures out "hacks" around the "usual interpretation" of the proxy. I also note that it seems like we disagree both about how useful proofs will be and about how useful empirical investigations will be (keeping in mind that those aren't the only two things in the universe). I'm not sure which of those two disagreements is more important here.
2Rohin Shah
Under my model, it's overwhelmingly likely that regardless of what we do AGI will be deployed with less than the desired level of confidence in its alignment. If I personally controlled whether or not AGI was deployed, then I'd be extremely interested in the normative claim. If I then agreed with the normative claim, I'd agree with: If I want >99% confidence, I agree that I couldn't be confident enough in that argument. Yeah, the hope here would be that the relevant decision-makers are aware of this dynamic (due to previous situations in which e.g. a recommender system optimized the fairly good proxy of clickthrough rate but this lead to "hacks" around the "usual interpretation"), and have some good reason to think that it won't happen with the highly capable system they are planning to deploy. Agreed. It also might be that we disagree on the tractability of proofs in addition to / instead of the utility of proofs.
[-]Wei DaiΩ4100

Not sure who you have in mind as people believing this, but after searching both LW and Arbital, the closest thing I've found to a statement of the empirical claim is from Eliezer's 2012 Reply to Holden on ‘Tool AI’:

I’ve re­peat­edly said that the idea be­hind prov­ing de­ter­minism of self-mod­ifi­ca­tion isn’t that this guaran­tees safety, but that if you prove the self-mod­ifi­ca­tion sta­ble the AI might work, whereas if you try to get by with no proofs at all, doom is guaran­teed.

Paul Christiano argued against this at length in Stable self-improvement as an AI safety problem, concluding as follows:

But I am not yet convinced that stable self-improvement is an especially important problem for AI safety; I think it would be handled correctly by a human-level reasoner as a special case of decision-making under logical uncertainty. This suggests that (1) it will probably be resolved en route to human-level AI, (2) it can probably be “safely” delegated to a human-level AI.

Note that the above talked about "stable self-modification" instead of ‘running this AI system will be beneficial’, and the former is a much narrower and easier to formalize concept than the latter. I haven

... (read more)
At some point, there was definitely discussion about formal verification of AI systems. At the very least, this MIRIx event seems to have been about the topic. From Safety Engineering for Artificial General Intelligence: Also, from section 2 of Agent Foundations for Aligning Machine Intelligence with Human Interests: A Technical Research Agenda: I suspect that this approach has fallen out of favor as ML algorithms have gotten more capable while our ability to prove anything useful about those algorithms has heavily lagged behind. Although deep mind and a few others are is still trying.

MIRIx events are funded by MIRI, but we don't decide the topics or anything. I haven't taken a poll of MIRI researchers to see how enthusiastic different people are about formal verification, but AFAIK Nate and Eliezer don't see it as super relevant. See and the idea of a "safety-story" in for better attempts to characterize what MIRI is looking for.

ETA: From the end of the latter dialogue,

In point of fact, the real reason the author is listing out this methodology is that he's currently trying to do something similar on the problem of aligning Artificial General Intelligence, and he would like to move past “I believe my AGI won't want to kill anyone” and into a headspace more like writing down statements such as “Although the space of potential weightings for this recurrent neural net does contain weight combinations that would figure out how to kill the programmers, I believe that gradient descent on loss function L will only access
... (read more)
3Rob Bensinger
Also the discussion of deconfusion research in and , and the sketch of 'why this looks like a hard problem in general' in and .
2Rohin Shah
I don't have particular people in mind, it's more of a general "vibe" I get from talking to people. In the past, when I've stated the empirical claim, some people agreed with it, but upon further discussion it turned out they actually agreed with the normative claim. Hence my first question, which was to ask whether or not people believe the empirical claim.
8Vanessa Kosoy
a) I believe a weaker version of the empirical claim, namely that the catastrophe is not nearly inevitable but not unlikely. That is, I can imagine different worlds in which the probability of the catastrophe is different, and I have uncertainty over in which world we actually are, s.t. in average the probability is sizable. b) I think that the argument you gave is sort of correct. We need to augment it by: the minimal requirement from the AI is, it needs to effectively block all competing dangerous AI projects, without also doing bad things (which is why you can't just give it the zero utility function). Your counterargument seems weak to me because, moving from utility maximizes to other types of AIs is just replacing something that is relatively easy to reason about with something that it is harder to reason about, thereby obscuring the problems (that are still there). I think that whatever your AI is, given that is satisfies the minimal requirement, some kind of utility-maximization-like behavior is likely to arise. Coming at it from a different angle, complicated systems often fail in unexpected ways. The way people solve this problem in practice is by a combination of mathematical analysis and empirical research. I don't think we have many examples of complicated systems where all failures were avoided by informal reasoning without either empirical or mathematical backing. In the case of superintelligent AI, empirical research alone is insufficient because, without mathematical models, we don't know how to extrapolate empirical results from current AIs to superintelligent AIs, and when superintelligent algorithms are already here it will probably be too late. c) I think what we can (and should) realistically aim for is, having a mathematical theory of AI, and having a mathematical model of our particular AI, such that in this model we can prove the AI is safe. This model will have some assumptions and parameters that will need to be verified/measured in oth
2Rohin Shah
I agree with a). c) seems to me to be very optimistic, but that's mostly an intuition, I don't have a strong argument against it (and I wouldn't discourage people who are enthusiastic about it from working on it). The argument in b) makes sense; I think the part that I disagree with is: The counterargument is "current AI systems don't look like long term planners", but of course it is possible to respond to that with "AGI will be very different from current AI systems", and then I have nothing to say beyond "I think AGI will be like current AI systems".
4Vanessa Kosoy
Well, any system that satisfies the Minimal Requirement is doing long term planning on some level. For example, if your AI is approval directed, it still needs to learn how to make good plans that will be approved. Once your system has a superhuman capability of producing plans somewhere inside, you should worry about that capability being applied in the wrong direction (in particular due to mesa-optimization / daemons). Also, even without long term planning, extreme optimization is dangerous (for example an approval directed AI might create some kind of memetic supervirus). But, I agree that these arguments are not enough to be confident of the strong empirical claim.
I believe the empirical claim. As I see it, the main issue is Goodhart: an AGI is probably going to be optimizing something, and open-ended optimization tends to go badly. The main purpose of proof-level guarantees is to make damn sure that the optimization target is safe. (You might imagine something other than a utility-maximizer, but at the end of the day it's either going to perform open-ended optimization of something, or be not very powerful.) The best analogy here is something like an unaligned wish-granting genie/demon. You want to be really careful about wording that wish, and make sure it doesn't have any loopholes. I think the difficulty of getting those proof-level guarantees is more conceptual than technical: the problem is that we don't have good ways to rigorously express many of the core ideas, e.g. the idea that physical systems made of atoms can "want" things. Once the core problems of embedded agency are resolved, I expect the relevant guarantees will not be difficult.
7Rohin Shah
Does it make a difference if the optimization target is itself being learned? What if we have intuitive arguments + tests that suggest that the optimization target is safe?
Still unsafe, in both cases. The second case is simpler. Think about it in analogy to a wish-granting genie/demon: if we have some intuitive argument that our wish-contract is safe and a few human-designed tests, do we really expect it to have no loopholes exploitable by the genie/demon? I certainly wouldn't bet on it. The problem here is that the AI is smarter than we are, and can find loopholes we will not think of. The first case is more subtle, because most of the complexity is hidden under a human-intuitive abstraction layer. If we had an unaligned genie/demon and said "I wish for you to passively study me for a year, learn what would make me most happy, and then give me that", then that might be a safe wish - assuming the genie/demon already has an appropriate understanding of what "happy" means, including things like long-term satisfaction etc. But an AI will presumably not start with such an understanding out the gate. Abstractly, the AI can learn its optimization target, but in order to do that it needs a learning target - the thing it's trying to learn. And that learning target is itself what needs to be aligned. If we want the AI to learn what makes humans "happy", in a safe way, then whatever it's using as a proxy for "happiness" needs to be a safe optimization target. On a side note, Yudkowsky's "The Hidden Complexity of Wishes" is in many ways a better explanation of what I'm getting at. The one thing it doesn't explain is how "more powerful" in the sense of "ability to grant more difficult wishes" translates into a more powerful optimizer. But that's a pretty easy jump to make: wishes require satisficing, so we use the usual approach of a two-valued utility function.
4Rohin Shah
I wasn't imagining just input-output tests in laboratory conditions, which I agree are insufficient. I was thinking of studying counterfactuals, e.g. what the optimization target would suggest doing under the hypothetical scenario where it has lots of power. Alternatively, you could imagine tests of the form "pose this scenario, and see how the AI thinks about it", e.g. to see whether the AI runs a check for whether it can deceive humans. (Yes, this assumes strong interpretability techniques that we don't yet have. But if you want to claim that only proofs will work, you either need to claim that interpretability techniques can never be developed, or that even if they are developed they won't solve the problem.) Also, I probably should have mentioned this in the previous comment, but it's not clear to me that it's accurate to model AGI as an open-ended optimizer, in the same way that that's not a great model of humans. I don't particularly want to debate that claim, because those debates never help, but it's a relevant fact to understanding my position.

I mentioned that I expect proof-level guarantees will be easy once the conceptual problems are worked out. Strong interpretability is part of that: if we know how to "see whether the AI runs a check for whether it can deceive humans", then I expect systems which provably don't do that won't be much extra work. So we might disagree less on that front than it first seemed.

The question of whether to model the AI as an open-ended optimizer is is one I figured would come up. I don't think we need to think of it as truly open-ended in order to use any of the above arguments, especially the wish-granting analogy. The relevant point is that limited optimization implies limited wish-granting ability. In order to grant more "difficult" wishes, the AI needs to steer the universe into a smaller chunk of state-space - in other words, it needs to perform stronger optimization. So AI with limited optimization capability will be safer to exactly the extent that they are unable to grant unsafe wishes - i.e. the chunks of state-space which they can access just don't contain really bad outcomes.

4Rohin Shah
Perhaps the disagreement is in how hard it is to prove things vs. test them. I pretty strongly disagree with The version based on testing has to look at a single input scenario to the AI, whereas the proof has to quantify over all possible scenarios. These seem wildly different. Compare to e.g. telling whether Alice is being manipulated by Bob by looking at interactions between Alice and Bob, vs. trying to prove that Bob will never be manipulative. The former seems possible, the latter doesn't.
Three possibly-relevant points here. First, when I say "proof-level guarantees will be easy", I mean "team of experts can predictably and reliably do it in a year or two", not "hacker can do it over the weekend". Second, suppose we want to prove that a sorting algorithm always returns sorted output. We don't do that by explicitly quantifying over all possible outputs. Rather, we do that using some insights into what it means for something to be sorted - e.g. expressing it in terms of a relatively small set of pairwise comparisons. Indeed, the insights needed for the proof are often exactly the same insights needed to design the algorithm. Once you've got the insights and the sorting algorithm in hand, the proof isn't actually that much extra work, although it will still take some experts chewing on it a bit to make sure it's correct. That's the sort of thing I expect to happen for friendly AI: we are missing some fundamental insights into what it means to be "aligned". Once those are figured out, I don't expect proofs to be much harder than algorithms. Coming back to the "see whether the AI runs a check for whether it can deceive humans" example, the proof wouldn't involve writing the checker and then quantifying over all possible inputs. Rather, it would involve writing the AI in such a way that it always passes the check, by construction - just like we write sorting algorithms so that they will always pass an is_sorted() check by construction. Third, continuing from the previous point: the question is not how hard it is to prove compared to test. The question is how hard it is to build a provably-correct algorithm, compared to an algorithm which happens to be correct even though we don't have a proof.
8Rohin Shah
This was also what I was imagining. (Well, actually, I was also considering more than two years.) It sounds like our disagreement is the one highlighted in Realism about rationality. When I say we could check whether the AI is deceiving humans, I don't mean that we have a check that succeeds literally 100% of the time because we have formalized a definition of "deception" that gives us a perfect checker. I don't think notions like "deception", "aligned", "want", "optimize", etc. have a clean formal definition that admits a 100% successful checker. I do think that these notions do tend to have extremes that can be reliably identified, even if there are edge cases where it is unclear. This makes testing easy, while proofs remain very difficult. Jumping back to the original question, it sounds like the reason that you think that if we don't have proofs we are doomed, is that conditional on us not having proofs, we must not have had any other methods of gaining confidence (such as testing), and so we must be flying blind. Is that right? If so, how do you square this with other engineering disciplines, which typically place most of the confidence in safety on comprehensive, expensive testing (think wind tunnels for rockets or crash tests for cars)? Perhaps this is also explained by realism about rationality -- maybe physical phenomena aren't amenable to crisp formal definitions, but "alignment" is.

It does sound like our disagreement is the same thing outlined in Realism about Rationality (although I disagree with almost all of the "realism about rationality" examples in that post - e.g. I don't think AGI will necessarily be an "agent", I don't think Turing machines or Kolmogorov complexity are useful foundations for epistemology, I'm not bothered by moral intuitions containing contradictions, etc).

I would also describe my "no proofs => doomed" view, not as the proofs being causally important, but as the proofs being evidence of understanding. If we don't have the proofs, it's highly unlikely that we understand the system well enough to usefully predict whether it is safe - but the proofs themselves play a relatively minor role.

I do not know of any engineering discipline which places most of the confidence in safety on comprehensive, expensive testing. Every single engineering discipline I have ever studied starts from understanding the system under design, the principles which govern its function, and designs a system which is expected to be safe based on that understanding. As long as those underlying principles are ... (read more)

3Donald Hobson
The problem with tests is that the AI behaving well when weak enough to be tested doesn't guarantee it will continue to do so. If you are testing a system, that means that you are not confidant that it is safe. If it isn't safe, then your only hope is for humans to stop it. Testing an AI is very dangerous unless you are confidant that it can't harm you. A paperclip maximizer would try to pass your tests until it was powerful enough to trick its way out and take over. Black box testing of arbitrary AI's gets you very little safety. Also some peoples intuitions think that a smile maximizing AI is a good idea. If you have a straightforward argument that appeals to the intuitions of the average Joe Blogs, and can't be easily formalized, then I would take the difficulty formalizing it as evidence that the argument is not sound. If you take a neural network and train it to recognize smiling faces, then attach that to AIXI, you get a machine that will appear to work in the lab, when the best it can do is make the scientists smile into its camera. There will be an intuitive argument about how it wants to make people smile, and people smile when they are happy. The AI will tile the universe with cameras pointed at smiley faces as soon as it escapes the lab.
2Rohin Shah
See response to johnswentworth above.
1David Scott Krueger (formerly: capybaralet)
A slightly misspecified reward function can lead to anything from perfectly aligned behavior to catastrophic failure. So I think we need much stronger and more formal arguments to believe that catastrophe is almost inevitable than EY's genie post provides.
1David Scott Krueger (formerly: capybaralet)
I think a potentially more interesting question is not about running a single AI system, but rather the overall impact of AI technology (in a world where we don't have proofs of things like beneficence). It would be easier to hold the analogue of the empirical claim there.
2Rohin Shah
I'd also argue against the empirical claim in that setting; do you agree with the empirical claim there?
1David Scott Krueger (formerly: capybaralet)
I hold a nuanced view that I believe is more similar to the empirical claim than your views. I think what we want is an extremely high level of justified confidence that any AI system or technology that is likely to become widely available is not carrying a significant and non-decreasing amount of Xrisk-per-second. And it seems incredibly difficult and likely impossible to have such an extremely high level of justified confidence. Formal verification and proof seem like the best we can do now, but I agree with you that we shouldn't rule out other approaches to achieving extreme levels of justified confidence. What it all points at to me is the need for more work on epistemology, so that we can begin to understand how extreme levels of confidence actually operate.
6Rohin Shah
This sounds like the normative claim, not the empirical one, given that you said "what we want is..."
1David Scott Krueger (formerly: capybaralet)
Yep, good catch ;) I *do* put a non-trivial weight on models where the empirical claim is true, and not just out of epistemic humility. But overall, I'm epistemically humble enough these days to think it's not reasonable to say "nearly inevitable" if you integrate out epistemic uncertainty. But maybe it's enough to have reasons for putting non-trivial weight on the empirical claim to be able to answer the other questions meaningfully? Or are you just trying to see if anyone can defeat the epistemic humility "trump card"?
2Rohin Shah
Partly (I'm surprised by how confident people generally seem to be, but that could just be a misinterpretation of their position), but also on my inside view the empirical claim is not true and I wanted to see if there were convincing arguments for it. Yeah, I'd be interested in your answers anyway.
1David Scott Krueger (formerly: capybaralet)
I'm not sure I have much more than the standard MIRI-style arguments about convergent rationality and fragility of human values, at least nothing is jumping to mind ATM. I do think we probably disagree about how strong those arguments are. I'm actually more interested in hearing your take on those lines of argument than saying mine ATM :P
6Rohin Shah
Re: convergent rationality, I don't buy it (specifically the "convergent" part). Re: fragility of human values, I do buy the notion of a broad basin of corrigibility, which presumably is less fragile. But really my answer is "there are lots of ways you can get confidence in a thing that are not proofs". I think the strongest argument against is "when you have an adversary optimizing against you, nothing short of proofs can give you confidence", which seems to be somewhat true in security. But then I think there are ways that you can get confidence in "the AI system will not adversarially optimize against me" using techniques that are not proofs. (Note the alternative to proofs is not trial and error. I don't use trial and error to successfully board a flight, but I also don't have a proof that my strategy is going to cause me to successfully board a flight.)
1David Scott Krueger (formerly: capybaralet)
Totally agree; it's an under-appreciated point! Here's my counter-argument: we have no idea what epistemological principles explain this empirical observation. Therefor we don't actually know that the confidence we achieve in these ways is justified. So we may just be wrong to be confident in our ability to successfully board flights (etc.) The epistemic/aleatory distinction is relevant here. Taking an expectation over both kinds of uncertainty, we can achieve a high level of subjective confidence in such things / via such means. However, we may be badly mistaken, and thus still extremely likely objectively speaking to be wrong. This also probably explains a lot of the disagreement, since different people probably just have very different prior beliefs about how likely this kind of informal reasoning is to give us true beliefs about advanced AI systems. I'm personally quite uncertain about that question, ATM. I tend to think we can get pretty far with this kind of informal reasoning in the "early days" of (proto-)AGI development, but we become increasingly likely to fuck up as we start having to deal with vastly super-human intelligences. And would like to see more work in epistemology aimed at addressing this (and other Xrisk-relevant concerns, e.g. what principles of "social epistemology" would allow the human community to effectively manage collective knowledge that is far beyond what any individual can grasp? I'd argue we're in the process of failing catastrophically at that)
[-]Wei DaiΩ5110

A downside of the portfolio approach to AI safety research

Given typical human biases, researchers of each AI safety approach are likely to be overconfident about the difficulty and safety of the approach they're personally advocating and pursuing, which exacerbates the problem of unilateralist's curse in AI. This should highlighted and kept in mind by practitioners of the portfolio approach to AI safety research (e.g., grant makers). In particular it may be a good idea to make sure researchers who are being funded have a good understanding of the overconfidence effect and other relevant biases, as well as the unilateralist's curse.

These biases seem very important to keep in mind! If "AI safety" refers here only to AI alignment, I'd be happy to read about how overconfidence about the difficulty/safety of one's approach might exacerbate the unilateralist's curse.

I'm posting a few research directions in my research agenda about which I haven't written much elsewhere (except maybe in the MIRIx Discord server), and for which I so far haven't got the time to make a full length essay with mathematical details. Each direction is in a separate child comment.

6Vanessa Kosoy
In last year's essay about my research agenda I wrote about an approach I call "learning by teaching" (LBT). In LBT, an AI is learning human values by trying to give advice to a human and seeing how the human changes eir behavior (without an explicit reward signal). Roughly speaking, if the human permanently changes eir behavior as a result of the advice, then one can assume the advice was useful. Partial protection against manipulative advice is provided by a delegation mechanism, which ensures the AI only produces advice that is in the space of "possible pieces of advice a human could give" in some sense. However, this protection seems insufficient since it allows for giving all arguments in favor of a position without giving any arguments against a position. To add more defense against manipulation, I propose to build on the "AI debate" idea. However, in this scheme, we don't need more than one AI. In fact, this is a general fact: for any protocol P involving multiple AIs, there is a protocol Q involving just one AI that works (at least roughly, qualitatively) just as well. Proof sketch: If we can prove that under assumptions X, the protocol P is safe/effective, then we can design a single AI Q which has assumptions X baked into its prior. Such an AI would be able to understand that simulating protocol P would lead to a safe/effective outcome, and would only choose a different strategy if it leads to an even better outcome under the same assumptions. The way we use "AI debate" is not by implementing an actual AI debate. Instead, we use it to formalize our assumptions about human behavior. In ordinary IRL, the usual assumption is "a human is a (nearly) optimal agent for eir utility function". In the original version of LBT, the assumption was of the form "a human is (nearly) optimal when receiving optimal advice". In debate-LBT the assumption becomes "a human is (nearly) optimal* when exposed to a debate between two agents at least one of which is giving optimal
5Vanessa Kosoy
A variant of Christiano's IDA amenable to learning-theoretic analysis. We consider reinforcement learning with a set of observations and a set of actions, where the semantics of the actions is making predictions about future observations. (But, as opposed to vanilla sequence forecasting, making predictions affects the environment.) The reward function is unknown and unobservable, but it is known to satisfy two assumptions: (i) If we make the null prediction always, the expected utility will be lower bounded by some constant. (ii) If our predictions sample the n-step future for a given policy π, then our expected utility will be lower bounded by some function F(u,n) of the the expected utility u of π and n. F is s.t. for sufficiently low u, F(u,n)≤u but for sufficiently high u, F(u,n)>u (in particular the constant in (i) should be high enough to lead to an increasing sequence). Also, it can be argued that it's natural to assume F(F(u,n),m)≈F(u,nm) for u>>0. The goal is proving regret bounds for this setting. Note that this setting automatically deals with malign hypotheses in the prior, bad self-fulfilling prophecies and "corrupting" predictions that cause damage just by seeing them. However, I expect that without additional assumptions the regret bound will be fairly weak, since the penalty for making wrong predictions grows with the total variation distance between the prediction distribution and the true distribution, which is quite harsh. I think this reflects a true weakness of IDA (or some versions of it, at least): without an explicit model of the utility function, we need very high fidelity to guarantee robustness against e.g. malign hypotheses. On the other hand, it might be possible to ameliorate this problem if we introduce an additional assumption of the form: the utility function is e.g. Lipschitz w.r.t some metric d. Then, total variation distance is replaced by Kantorovich-Rubinstein distance defined w.r.t. d. The question is, where do we get the m
4Vanessa Kosoy
This idea was inspired by a discussion with Discord user @jbeshir Model dynamically inconsistent agents (in particular humans) as having a different reward function at every state of the environment MDP (i.e. at every state we have a reward function that assigns values both to this state and to all other states: we have a reward matrix r(s,t)). This should be regarded as a game where a different player controls the action at every state. We can now look for value learning protocols that converge to Nash* (or other kind of) equilibrium in this game. The simplest setting would be, every time you visit a state, you learn the reward of all previous states w.r.t. the reward function of the current state. Alternatively, every time you visit a state, you can ask about the reward of one previously visited state w.r.t. the reward function of the current state. This is the analogue of classical reinforcement learning with an explicit reward channel. We can now try to prove a regret bound, which takes the form of an ϵ-Nash equilibrium condition, with ϵ being the regret. More complicated settings would be analogues of Delegative RL (where the advisor also follows the reward function of the current state) and other value learning protocols. This seems like a more elegant way to model "corruption" than as a binary or continuous one dimensional variable like I did before. *Note that although for general games, even if they are purely coorperative, Nash equilibria can be suboptimal due to coordination problems, for this type of games it doesn't happen: in the purely cooperative case, the Nash equilibrium condition becomes the Bellman equation that implies global optimality.
4Vanessa Kosoy
It is an interesting problem to write explicit regret bounds for reinforcement learning with a prior that is the Solomonoff prior or something similar. Of course, any regret bound requires dealing with traps. The simplest approach is, leaving only environments without traps in the prior (there are technical details involved that I won't go into right now). However, after that we are still left with a different major problem. The regret bound we get is very weak. This happens because the prior contains sets of hypotheses of the form "program template P augmented by a hard-coded bit string b of length n". Such a set contains 2n hypotheses, and its total probability mass is approximately 2−|P|, which is significant for short P (even when n is large). However, the definition of regret requires out agent to compete against a hypothetical agent that knows the true environment, which in this case means knowing both P and b. Such a contest is very hard since learning n bits can take much time for large n. Note that the definition of regret depends on how we decompose the prior into a convex combination of individual hypotheses. To solve this problem, I propose redefining regret in this setting by grouping the hypotheses in a particular way. Specifically, in algorithmic statistics there is the concept of sophistication. The sophistication of a bit string x is defined as follows. First, we consider the Kolmogorov complexity K(x) of x. Then we consider pairs (Q,y) where Q is a program, y is a bit string, Q(y)=x and |Q|+|y|≤K(x)+O(1). Finally, we minimize over |Q|. The minimal |Q| is called the sophistication of x. For our problem, we are interested in the minimal Q itself: I call it the "sophisticated core" of x. We now group the hypotheses in our prior by sophisticated cores, and define (Bayesian) regret w.r.t. this grouping. Coming back to our problematic set of hypotheses, most of it is now grouped into a single hypothesis, corresponding to the sophisticated core of P. Th

You are handed a hypercomputer, and allowed to run any code you like on it. You can then take 1Tb of data from your computations and attach it to a normal computer. The hypercomputer is removed. You are then handed a magic human utility function. How do you make an FAI with these resources?

The normal computer is capable of running a highly efficient super-intelligence. The hypercomputer can do a brute force search for efficient algorithms. The idea is to split FAI into building a capability module, and a value module.

Assume that given a hypercomputer and the magic utility function m, we could build an FAI F(m). Every TB of data encodes some program A(u) that takes a utility function u as input. For all A and u, ask F(u) if A(u) is aligned with F(u). (We must construct F not to vote strategically here.) Save that A' which gets approved by the largest fraction of F(u). Sanity check that this maximum fraction is very close to 1. Run A'(m).

The telos of life is to collect and preserve information. That is to say: this is the defining behavior of a living system, so it is an inherent goal. The beginning of life must have involved some replicating medium for storing information. At first, life actively preserved information by replicating, and passively collected information through the process of evolution by natural selection. Now life forms have several ways of collecting and storing information. Genetics, epigenetic, brains, immune systems, gut biomes, etc.

Obviously a system that collects a... (read more)

Is this open thread not going to be a monthly thing?

FWIW I liked reading the comment threads here, and would be inclined to participate in the future. But that's just my opinion. I'm curious if more senior people had reasons for not liking the idea?

4Rohin Shah
I expected that it would be better for me to polish ideas before posting on the forum, and treated this as an experiment to check. I think it broadly confirmed my original view, so I'm not very likely to post top-level comments on open threads in the future, and I told the admins so. I don't know what their decision process was after that. (Possibly they expected that future open threads would be much quieter, since the two biggest comment threads here were both started by my top-level comments.)
I felt a bit uncertain about doing one every month, and was planning to start another one in October. Depending on how that one goes we might go with a monthly schedule, or maybe every two months is the right way to go.

I've just been invited to this forum. How do I decide whether to put a post on the Alignment Forum vs. Less Wrong?

Basically, whether you think it's primarily related to alignment vs. rationality. (Everything on the AF is also on LW, but the reverse isn't true.) The feedback loop if you're posting too much or stuff that isn't polished enough is downvotes (or insufficient upvotes).

I saw this thread complaining about the state of peer review in machine learning. Has anyone thought about trying to design a better peer review system, then creating a new ML conference around it and also adding in a safety emphasis?

4Rohin Shah
Yes (though just the peer review, not the safety emphasis). I can send you thoughts about it if you'd like, email me at <my LW username> at gmail. I thought about the differential development point and came away thinking it would be net positive, and convinced a few other people as well, even if it's just modifying peer review without having safety researchers run the conference.
Cool! I guess another way of thinking about this is not a safety emphasis so much as a forecasting emphasis. Reminds me of our previous discussion here. If someone could invent new scientific institutions which reward accurate forecasts about scientific progress, that could be really helpful for knowing how AI will progress and building consensus regarding which approaches are safe/unsafe.
2Rohin Shah
+1, that's basically the story I have in mind. I think of it as less about forecasting and more about understanding deep learning and how it works, but I think it serves basically the same purpose: it's helpful for knowing how AI will progress and building consensus about what's safe / unsafe.
I'm vaguely worried that this might be net-negative for ML in particular, if you're worried about differential tech development.
The idea is that if the conference is run by people who are interested in safety, they can preferentially accept papers which are good from a differential technological development point of view.