PhD student at the Center for Human-Compatible AI. Creator of the Alignment Newsletter. http://rohinshah.com/
The short version / summary will be the Alignment Newsletter for this week (sent out Wed 10am Pacific Time); I may incorporate feedback provided here or on the doc.
First, even working on unlikely risks can be urgent, if the risk is great and the time needed to solve it might be long enough compared to the timeline until the risk.
Okay. What's the argument that the risk is great (I assume this means "very bad" and not "very likely" since by hypothesis it is unlikely), or that we need a lot of time to solve it?
Second, I think this example shows that it is far from straightforward to even informally define what intent-alignment is.
I agree with this; I don't think this is one of our cruxes. (I do think that in most cases, if we have all the information about the situation, it will be fairly clear whether something is intent aligned or not, but certainly there are situations in which it's ambiguous. I think corrigibility is better-informally-defined, though still there will be ambiguous situations.)
Is IRL intent aligned?
Depends on the details, but the way you describe it, no, it isn't. (Though I can see the fuzziness here.) I think it is especially clear that it is not corrigible.
It seems like intent-alignment depends on our interpretation of what the algorithm does, rather than only on the algorithm itself. But actual safety is not a matter of interpretation, at least not in this sense.
Yup, I agree (with the caveat that it doesn't have to be a human's interpretation). Nonetheless, an interpretation of what the algorithm does can give you a lot of evidence about whether or not something is actually safe.
If the bridge can never collapse as long as the outside forces don't exceed K, then resonance is covered as well (as long as it is produced by forces below K).
I meant that K was set considering wind forces, cars, etc. and was set too low to account for resonance, because you didn't think about resonance beforehand.
(I guess resonance doesn't involve large forces, it involves coordinated forces. The point is just that it seems very plausible that someone might design a theoretical model of the environment in which the bridge is safe, but that model neglects to include resonance because the designer didn't think of it.)
Nevertheless most engineering projects make heavy use of theory.
I'm not denying that? I'm not arguing against theory in general; I'm arguing against theoretical safety guarantees. I think in practice our confidence in safety often comes from empirical tests.
I'm curious whether proving a weakly feasible subjective regret bound under assumptions that you agree are otherwise realistic qualifies or not?
Probably? Honestly, I don't think you even need to prove the subjective regret bound; if you wrote down assumptions that I agree are realistic and capture safety (such that you could write code that determines whether or not an AI system is safe) that alone would qualify. It would be fine if it sometimes said things are unsafe when they are safe, as long as it isn't too conservative; a weak feasibility result would help show that it isn't too conservative.
I can quite easily imagine how "human thinking for a day is safe" can be a mathematical assumption.
Agreed, but if you want to eventually talk about neural nets so that you are talking about the AI system you are actually building, you need to use the neural net ontology, and then "human thinking for a day" is not something you can express.
I broadly agree with all of this, thanks :)
By "claims about a space of questions" you mean "a claim about every question from a space of questions"?
I just wrote that incorrectly; I meant "the agent can choose a question from a space of questions and make a claim about it". If you want to support claims about a space of questions, you could allow quantifiers in your questions.
However, you can also get the same behaviour in a typical world if you assume that the judge has a wrong prior.
I mean, sure, but any alignment scheme is going to have to assume some amount of correctness in the human-generated information it is given. You can't learn about preferences if you model humans as arbitrarily wrong about their preferences.
I think that intent alignment is too ill-defined, and to the extent it is well-defined it is a very weak condition, that is not sufficient to address the urgent core of the problem.
Okay, so there seem to be two disagreements:
The first one seems primarily about our disagreements on the utility of theory, which I'll get to later.
For the second one, I don't know what your argument is that the non-intent-alignment work is urgent. I agree that the simulation example you give is an example of how flawed epistemology can systematically lead to x-risk. I don't see the argument that it is very likely (maybe the first few AGIs don't think about simulations; maybe it's impossible to construct such a convincing hypothesis). I especially don't see the argument that it is more likely than the failure mode in which a goal-directed AGI is optimizing for something different from what humans want.
(You might respond that intent alignment brings risk down from say 10% to 3%, whereas your agenda brings risk down from 10% to 1%. My response would be that once we have successfully figured out intent alignment to bring risk from 10% to 3%, we can then focus on building a good prior to bring the risk down from 3% to 1%. All numbers here are very made up.)
For example, imagine that we prove that stochastic gradient descent on a neural network with particular architecture efficiently agnostically learns any function in some space, such that as the number of neurons grows, this space efficiently approximates any function satisfying some kind of simple and natural "smoothness" condition (an example motivated by already known results). This is a strong feasibility result.
My guess is that any such result will either require samples exponential in the dimensionality of the input space (prohibitively expensive) or the simple and natural condition won't hold for the vast majority of cases that neural networks have been applied to today.
I don't find smoothness conditions in particular very compelling, because many important functions are not smooth (e.g. most things involving an if condition).
I am not sure why you have this impression.
Consider this example:
You are a bridge designer. You make the assumption that forces on the bridge will never exceed some value K (necessary because you can't be robust against unbounded forces). You prove your design will never collapse given this assumption. Your bridge collapses anyway because of resonance.
The broader point is that when the environment has lots of complicated interaction effects, and you must make assumptions, it is very hard to find assumptions that actually hold.
And yet, theory obviously played and plays a central role in science, so I don't understand whence the fatalism.
The areas of science in which theory is most central (e.g. physics) don't require assumptions about some complicated stuff; they simply aim to describe observations. It's really the assumptions that make me pessimistic, which is why it would be a significant update if I saw:
a mathematical definition of safety that I thought actually captured "safety" without "passing the buck"
It would similarly update me if you had a piece of code that (perhaps with arbitrary amounts of compute) could take in an AI system and output "safe" or "unsafe", and I would trust that output. (I'd expect that a mathematical definition could be turned into such a piece of code if it doesn't "pass the buck".)
You might respond that intent alignment requires assumptions too, which I agree with, but the theory-based approach requires you to limit your assumptions to things that can be written down in math (e.g. this function is K-Lipschitz) whereas a non-theory-based approach can use "handwavy" assumptions (e.g. a human thinking for a day is safe), which drastically opens up the space of options and makes it more likely that you can find an assumption that is actually mostly true.
That seems like it would be an extremely cost-ineffective way of making progress.
Yeah, I broadly agree; I mostly don't understand MIRI's position and thought you might share it, but it seems you don't. I agree that overall it's a tough problem. My personal position would be to do it publicly anyway; it seems way better to have an approach to AI that we understand than the current approach, even if it shortens timelines. (Consider the unilateralist curse; but also consider that other people do agree with me, if not the people at MIRI / LessWrong.)
In my double descent newsletter, I said:
This fits into the broader story being told in other papers that what's happening is that the data has noise and/or misspecification, and at the interpolation threshold it fits the noise in a way that doesn't generalize, and after the interpolation threshold it fits the noise in a way that does generalize. [...] This explanation seems like it could explain double descent on model size and double descent on dataset size, but I don't see how it would explain double descent on training time. This would imply that gradient descent on neural nets first has to memorize noise in one particular way, and then further training "fixes" the weights to memorize noise in a different way that generalizes better. While I can't rule it out, this seems rather implausible to me. (Note that regularization is not such an explanation, because regularization applies throughout training, and doesn't "come into effect" after the interpolation threshold.)
One response you could have is to think that this could apply even at training time, because typical loss functions like cross-entropy loss and squared error loss very strongly penalize confident mistakes, and so initially the optimization is concerned with getting everything right, only later can it be concerned with regularization.
I don't buy this argument either. I definitely agree that cross-entropy loss penalizes confident mistakes very highly, and has a very high derivative, and so initially in training most of the gradient will be reducing confident mistakes. However, you can get out of this regime simply by predicting the frequencies of each class (e.g. uniform for MNIST). If there are N classes, the worst case loss is when the classes are all equally likely, in which case the average loss per data point is −ln(1/N) = ln(N) ≈ 2.3 when N = 10 (as for CIFAR-10, which is what their experiments were done on), which is not a good loss value, but it does seem like regularization should already start having an effect at that point. This is a really stupid and simple classifier to learn, and we'd expect that the neural net does at least this well very early in training, well before it reaches the interpolation threshold / critical regime, which is where it gets ~perfect training accuracy.
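The arithmetic here is easy to check directly (a one-liner; N = 10 as for CIFAR-10):

```python
import math

# Cross-entropy loss of a classifier that predicts the uniform
# distribution over N classes: every data point gets -ln(1/N) = ln(N).
N = 10  # e.g. CIFAR-10
uniform_loss = -math.log(1.0 / N)
print(uniform_loss)  # ~2.3026
```

So a network that has merely learned the class frequencies already sits at a loss around 2.3, well out of the "huge gradients from confident mistakes" regime.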
There is a much stronger argument in the case of L2 regularization on MLPs and CNNs with relu activations. Presumably, if the problem is that the cross-entropy "overwhelms" the regularization initially, then we should also see double descent if we first train only on cross-entropy, and then train with L2 regularization. However, this can't happen. When training on just L2 regularization, the gradient descent update is:
w ← w − λw = (1 − λ)w = cw for some constant c.
For MLPs and CNNs with relu activations, if you multiply all the weights by a constant, the logits also get multiplied by a constant, no matter what the input is. This means that the train/test error cannot be affected by L2 regularization alone, and so you can't see a double descent on test error in this setting. (This doesn't eliminate the possibility of double descent on test loss, since a change in the magnitude of the logits does affect the cross-entropy, but the OpenAI paper shows double descent on test error as well, and that provably can't happen in the "first train to zero error with cross-entropy and then regularize" setting.)

The paper tests with CNNs, but doesn't mention what activation they use. Still, I'd find it very surprising if double descent only happened for a particular activation function.
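This scale-invariance is easy to verify numerically. Here's a minimal sketch with a hypothetical bias-free ReLU MLP (networks with biases don't have this exact homogeneity property): multiplying all weights by a constant c > 0 scales the logits by c^L for an L-layer network, so the argmax, and hence the classification error, is unchanged.

```python
import numpy as np

def relu_mlp_logits(weights, x):
    """Forward pass of a bias-free MLP with ReLU hidden activations."""
    h = x
    for W in weights[:-1]:
        h = np.maximum(0.0, W @ h)  # relu(cWh) = c * relu(Wh) for c > 0
    return weights[-1] @ h  # final linear layer produces the logits

rng = np.random.default_rng(0)
weights = [rng.standard_normal((5, 4)), rng.standard_normal((3, 5))]
x = rng.standard_normal(4)

c = 0.7  # e.g. the factor (1 - lambda) from an L2-only update step
logits = relu_mlp_logits(weights, x)
scaled = relu_mlp_logits([c * W for W in weights], x)

# Logits scale by c**L (here L = 2 layers); argmax, and so the error,
# cannot change.
assert np.allclose(scaled, (c ** 2) * logits)
assert np.argmax(scaled) == np.argmax(logits)
```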
One phrase to look for is "open-world" or "open-universe", in which it isn't known which objects exist. See e.g. this paper. That said, this is still in the realm of PPLs -- to my knowledge this was developed for BLOG, and then spread to other PPLs.
how does this work?
You have a prior; you choose to do the experiment with highest VOI to get a posterior, and then you choose the best answer given that posterior. I'm pretty sure I could calculate this for many of their scenarios.
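To make that procedure concrete, here's a toy sketch of the calculation (all hypotheses, experiments, and probabilities invented for illustration): compute, for each candidate experiment, the expected probability of giving the right answer after updating on its outcome, and run the experiment that maximizes this.

```python
# Toy value-of-information calculation: two hypotheses, two experiments.
prior = {"H1": 0.6, "H2": 0.4}

# likelihood[experiment][outcome][hypothesis] = P(outcome | hypothesis)
likelihood = {
    "exp_A": {"pos": {"H1": 0.9, "H2": 0.2}, "neg": {"H1": 0.1, "H2": 0.8}},
    "exp_B": {"pos": {"H1": 0.6, "H2": 0.5}, "neg": {"H1": 0.4, "H2": 0.5}},
}

def expected_value_of_experiment(exp):
    """Expected probability of answering correctly after seeing the
    outcome, assuming we then guess the maximum-posterior hypothesis."""
    total = 0.0
    for outcome, like in likelihood[exp].items():
        p_outcome = sum(prior[h] * like[h] for h in prior)
        if p_outcome == 0:
            continue
        posterior = {h: prior[h] * like[h] / p_outcome for h in prior}
        total += p_outcome * max(posterior.values())
    return total

# exp_A is more informative, so it gets chosen.
best = max(likelihood, key=expected_value_of_experiment)
```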
Just a note that in the link that Wei Dai provides for "Relevant powerful agents will be highly optimized", Eliezer explicitly assigns '75%' to 'The probability that an agent that is cognitively powerful enough to be relevant to existential outcomes, will have been subject to strong, general optimization pressures.'
Yeah, it's worth noting that I don't understand what this means. By my intuitive read of the statement, I'd have given it a 95+% chance of being true, in the sense that you aren't going to randomly stumble upon a powerful agent. But also by my intuitive read, the negative example given on that page would be a positive example:
An example of a scenario that negates RelevantPowerfulAgentsHighlyOptimized is KnownAlgorithmNonrecursiveIntelligence, where a cognitively powerful intelligence is produced by pouring lots of computing power into known algorithms, and this intelligence is then somehow prohibited from self-modification and the creation of environmental subagents.
On my view, known algorithms are already very optimized? E.g. Dijkstra's algorithm is highly optimized for efficient computation of shortest paths.
So TL;DR idk what optimized is supposed to mean here.
To which I answer, sure, but first it means that intent-alignment is insufficient in itself, and second the assumptions about the prior are doing all the work.
I completely agree with this, but isn't this also true of subjective regret bounds / definition-optimization? Like, when you write (emphasis mine)
Therefore, Beta cannot attack with a hypothesis that will force Alpha to act without consulting the user, since that hypothesis would fall outside the prior.
Isn't the assumption about the prior "doing all the work"?
Maybe your point is that there are failure modes that aren't covered by intent alignment, in which case I agree, but also it seems like the OP very explicitly said this in many places. Just picking one sentence (emphasis mine):
An aligned AI would try to figure out which thing is right, and like a human it may or may not succeed.
I don't see any reason of principle that significantly limits the strong feasibility results we can expect.
And meanwhile I think very messy real world domains almost always limit strong feasibility results. To the extent that you want your algorithms to do vision or NLP, I think strong feasibility results will have to talk about the environment; it seems quite infeasible to do this with the real world.
That said, most of this belief comes from the fact that empirically it seems like theory often breaks down when it hits the real world. The abstract argument is an attempt to explain it; but I wouldn't have much faith in the abstract argument by itself (which is trying to quantify over all possible ways of getting a strong feasibility result).
However, I specifically don't want to work on strong feasibility results, since there is a significant chance they would lead to breakthroughs in capability.
Idk, you could have a nondisclosure-by-default policy if you were worried about this. Maybe this can't work for you though. (As an aside, I hope this is what MIRI is doing, but they probably aren't.)
Can you explain what you mean here?
Basically what you said right after:
I agree that just saying "subjective regret bound" is not enough, we need to understand all the assumptions the prior should satisfy, reflecting considerations such as, what kind of queries can or cannot manipulate the user.