Matthew Barnett

Just someone who wants to learn about the world.

I change my views often. Anything I wrote that's more than 10 days old should be treated as potentially outdated.

Matthew Barnett's Comments

The Epistemology of AI risk

It could be that the level of absolute risk is still low, even after taking this into account. I concede that estimating risks like these is very difficult.

The Epistemology of AI risk
I feel like this depends on a whole bunch of contingent facts regarding our ability to accurately diagnose and correct what could be very pernicious problems such as deceptive alignment amidst what seems quite likely to be a very quickly changing and highly competitive world.

I agree, though I tend to think the costs of failing to catch deception will be high enough that any major team will likely bear the costs of checking for it. If some team of researchers doesn't put in the effort, a sub-x-risk-level disaster would likely occur, and this would set a precedent for safety standards.

In general, I think humans tend to be very risk averse when it comes to new technologies, though there are notable exceptions (such as during wartime).

Why does being skeptical of very short timelines preclude our ability to do productive work on AI safety?

A full solution to AI safety will necessarily be contingent on the architectures used to build AIs. If we don't understand a whole lot about those architectures, this limits our ability to do concrete work. I don't find the argument entirely compelling because:

  • It seems reasonably likely that AGI will be built within more-or-less the deep learning paradigm, perhaps with a few additional insights, and therefore productive work can be done now, and
  • We can still start institutional work, and develop important theoretical insights.

But even given these qualifications, I estimate that the vast majority of productive work to make AIs safe will be completed when the AI systems are actually built, rather than before. It follows that most work during this pre-AGI period might miss important details and be less effective than we think.

And it seems to me like “probably helps somewhat” is enough when it comes to existential risk

I agree, which is why I spend a lot of my time reading and writing posts on LessWrong about AI risk.

The Epistemology of AI risk

If the old arguments were sound, why would researchers shift their arguments in order to make the case that AI poses a risk? I'd assume that if the old arguments worked, the new ones would be a refinement rather than a shift. Indeed, many old arguments were refined, but a lot of the new arguments are genuinely new rather than refinements of the old ones.

Is there any core argument in the book Superintelligence that is no longer widely accepted among AI safety researchers?

I can't speak for others, but the general notion of a single project that leaps ahead of the rest of the world, and gains superintelligent competence before any other team can even get close, seems suspicious to many researchers I've talked to. In general, the notion that there will be discontinuities in development is viewed with suspicion by a number of people (though, notably, some researchers still think that fast takeoff is likely).

The Epistemology of AI risk

[ETA: It's unfortunate that I used the word "optimism" in my comment, since my primary disagreement is over whether the traditional sources of AI risk are compelling. I'm pessimistic in a sense, since I think that by default our future civilization's values will be quite different from mine in important ways.]

My opinion is that AI is likely to be an important technology whose effects will largely determine our future civilization and the outlook for humanity. And given how large AI's impact will be, it will also largely determine whether our values go extinct or survive. That said, it's difficult to understand the threat to our values from AI without a specific threat model. I appreciate the attempt to find specific ways that AI could go wrong, but I currently think:

  • We are probably not close enough to powerful AI to have a good understanding of the primary dynamics of an AI takeoff, and therefore what type of work will help our values survive one.
  • Our values will probably go extinct in some largely unavoidable manner that's not related to the typical sources of AI risk. In other words, it's likely that general value drift and game-theoretic incentives will do more to destroy the value of the long-term future than technical AI errors.
  • The argument that continuous takeoff makes AI safe seems robust to most specific items on your list, though I can see several ways that the argument fails.

If AI does go wrong in one of the ways you have identified, it seems difficult to predict which one (though we can do our best to guess). It seems even harder to do productive work, since I'm skeptical of very short timelines.

Historically, our models of AI development have been notoriously poor. Ask someone from 10 years ago what they thought AI would look like today, and it seems unlikely that they would have predicted deep learning in a way that would have been useful for making it safer. I suspect that unless powerful AI arrives very soon, it will be very hard to do specific technical work now to make it safer.

Have epistemic conditions always been this bad?
For example would you endorse making LW a "free speech zone" or try to push for blanket acceptance of free speech elsewhere?

I think limiting free speech in specific forums of discussion makes sense, given that it is very difficult to maintain a high-quality community without doing so. Declaring a particular place a "free speech zone" tends to invite the worst people to gather there (I've seen this over and over again on the internet).

More generally, I was talking about societal norms to punish speech deemed harmful. I think there's a relevant distinction between a professor getting fired for saying something deemed politically harmful, and an internet forum moderating discussion.

The Epistemology of AI risk
when we look at the distribution of opinion among those who have really “engaged with the arguments”, we are left with a substantial majority—maybe everyone but Hanson, depending on how stringent our standards are here!—who do believe that, one way or another, AI development poses a serious existential risk.

For what it's worth, I have "engaged with the arguments" but am still skeptical of the main ones. I also don't think my optimism is very unusual for people who work on the problem. Based on an image from about five years ago (around the same time Nick Bostrom's book came out), most people at FHI were pretty optimistic. Since then, my impression is that researchers have become even more optimistic, since more people appear to accept continuous takeoff and there's been a shift in the arguments. AI Impacts recently interviewed a few researchers who were also skeptical (including Hanson), and all of them have engaged with the main arguments. It's unclear to me that their opinions are actually substantially more optimistic than average.

Have epistemic conditions always been this bad?

Second, I think it is worth pointing out that there are definitely instances where, at least in my opinion, “canceling” is a valid tactic. Deplatforming violent rhetoric (e.g. Nazism, Holocaust denial, etc.) comes to mind as an obvious example.

If the people who determine what is cancel-able could consistently distinguish between violent rhetoric and non-violent rhetoric, and the boundary never expanded in some random direction, I would agree with you.

In practice, what often happens is that someone is cancelled over accusations of being a Nazi (or whatever), even when they aren't. Since defending a Nazi tends to make people think you are secretly also a Nazi, the people being falsely accused tend to get little support from outsiders.

Also, given that many views EAs endorse could easily fall outside the window of what's considered appropriate speech one day (such as reducing wild animal suffering, negative utilitarianism, or genetic enhancement), it is probably better to push for blanket acceptance of free speech than to simply hope that future people will tolerate our ideas.

Inner alignment requires making assumptions about human values
Is your point mostly centered around there being no single correct way to generalize to new domains, but humans have preferences about how the AI should generalize, so to generalize properly, the AI needs to learn how humans want it to do generalization?

Pretty much, yeah.

The above sentence makes lots of sense to me, but I don't see how it's related to inner alignment

I think there are a lot of examples of this phenomenon in AI alignment, but I focused on inner alignment for two reasons:

  • There's a heuristic that a solution to inner alignment should be independent of human values, and this argument rebuts that heuristic.
  • The problem of inner alignment is largely the problem of getting a system to generalize properly, which makes "proper generalization" fundamentally linked to it (see the toy sketch below).
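
To make the "no single correct generalization" point concrete, here is a toy sketch. This is my own hypothetical example (the reward functions, values, and driving scenario are all invented for illustration, not taken from the post): two candidate reward functions that agree on every training episode but come apart off-distribution, so picking out the "true" one requires knowing what humans actually wanted.

```python
# Toy illustration: two reward functions that are indistinguishable on the
# training distribution but disagree on novel situations.

# Training situations: (speed, stayed_on_road) pairs; during training the car
# always happened to stay on the road.
train_data = [(10, True), (30, True), (50, True)]

def reward_a(speed, stayed_on_road):
    # "Drive fast while staying on the road."
    return speed if stayed_on_road else 0

def reward_b(speed, stayed_on_road):
    # "Drive fast, period." Coincides with reward_a on every training episode,
    # because the car never left the road during training.
    return speed

# Both functions assign identical rewards to every training example...
assert all(reward_a(s, r) == reward_b(s, r) for s, r in train_data)

# ...but they diverge on a situation the training data never covered.
print(reward_a(60, False))  # 0  -> "going off-road is bad"
print(reward_b(60, False))  # 60 -> "going off-road is fine"
```

Nothing in the training data alone distinguishes the two; which generalization counts as "proper" depends on human preferences about how the system should behave off-distribution.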
Inner alignment requires making assumptions about human values
I also see how you might have a catastrophe-avoiding agent capable of large positive impacts, assuming an ontology but without assuming a lot about human preferences.

I find this interesting but I'd be surprised if it were true :). I look forward to seeing it in the upcoming posts.

That said, I want to draw your attention to my definition of catastrophe, which I think is different from the way most people use the term. Most broadly, you might think of a catastrophe as something that we would never want to happen even once. But for inner alignment, this definition isn't always helpful: sometimes we would rather our systems crash into the ground than intelligently optimize against us, even though we never want them to crash into the ground even once. And as a starting point, we should try to mitigate these malicious failures much more than the benign ones, even if a benign failure would have a large value-neutral impact.

A closely related notion to my definition is the term "unacceptable behavior" as Paul Christiano has used it. This is the way he has defined it:

In different contexts, different behavior might be acceptable and it’s up to the user of these techniques to decide. For example, a self-driving car trainer might specify: Crashing your car is tragic but acceptable. Deliberately covering up the fact that you crashed is unacceptable.

It seems that if we want to come up with a way to avoid these types of behavior, we simply must rely on some dependence on human values. I can't see how to consistently separate acceptable failures from unacceptable ones except by inferring our values.
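
As a toy sketch of why I think this (my own hypothetical framing of the self-driving-car example quoted above, not anything Paul has actually proposed): a value-free rule treats the two failures identically, while any check that separates them has to consult some model of human judgments.

```python
# Toy sketch: separating acceptable from unacceptable failures.

failures = [
    {"event": "crashed the car", "deviated_from_plan": True},
    {"event": "covered up the fact that it crashed", "deviated_from_plan": True},
]

def value_free_check(failure):
    # No reference to human preferences: both failures look the same.
    return "flag" if failure["deviated_from_plan"] else "ok"

# Stand-in for a model of human judgments (e.g. learned from labeled examples).
HUMAN_JUDGMENTS = {
    "crashed the car": "acceptable (tragic, but tolerable)",
    "covered up the fact that it crashed": "unacceptable",
}

def value_laden_check(failure):
    # Only by consulting (a model of) human values can the two be separated.
    return HUMAN_JUDGMENTS[failure["event"]]

for f in failures:
    print(f["event"], "->", value_free_check(f), "/", value_laden_check(f))
```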

Inner alignment requires making assumptions about human values
Can you explain why you think there _IS_ a "true" factor

Apologies for the miscommunication, but I don't think there really is an objectively true factor. It's true to the extent that humans say that it's the true reward function, but I don't think it's a mathematical fact. That's part of what I'm arguing. I agree with what you are saying.
