Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
This is a special post for short-form writing by Rohin Shah. Only they can create top-level comments. Comments here also appear on the Shortform Page and All Posts page.


I often have the experience of being in the middle of a discussion and wanting to reference some simple but important idea / point, but there doesn't exist any such thing. Often my reaction is "if only there was time to write an LW post that I can then link to in the future". So far I've just been letting these ideas be forgotten, because it would be Yet Another Thing To Keep Track Of. I'm now going to experiment with making subcomments here simply collecting the ideas; perhaps other people will write posts about them at some point, if they're even understandable.

<unfair rant with the goal of shaking people out of a mindset>

To all of you telling me or expecting me to update to shorter timelines given <new AI result>: have you ever encountered Bayesianism?

Surely if you did, you'd immediately reason that you couldn't know how I would update, without first knowing what I expected to see in advance. Which you very clearly don't know. How on earth could you know which way I should update upon observing this new evidence? In fact, why do you even care about which direction I update? That too shouldn't give you much evidence if you don't know what I expected in the first place.

Maybe I should feel insulted? That you think so poorly of my reasoning ability that I should be updating towards shorter timelines every time some new advance in AI comes out, as though I hadn't already priced that into my timeline estimates, and so would predictably update towards shorter timelines in violation of conservation of expected evidence? But that only follows if I expect you to be a good reasoner modeling me as a bad reasoner, which probably isn't what's going on.

</unfair rant>
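As a rough illustration of the conservation-of-expected-evidence point in the rant above, here is a minimal sketch with made-up numbers (the prior and both likelihoods are assumptions, not anyone's actual estimates). It shows that the direction and size of an update depend on how strongly the result was expected in advance, and that the probability-weighted average of the possible posteriors equals the prior.

```python
# All numbers below are hypothetical and chosen only for illustration.
prior_short = 0.30            # P(short timelines) before seeing the new result
p_result_given_short = 0.80   # P(impressive result appears | short timelines)
p_result_given_long = 0.60    # P(impressive result appears | long timelines)


def posterior(prior, like_h, like_not_h):
    """Bayes' rule for a binary hypothesis; returns (posterior, P(evidence))."""
    evidence = prior * like_h + (1 - prior) * like_not_h
    return prior * like_h / evidence, evidence


# Update if the result is observed, and if it is not observed.
post_if_result, p_result = posterior(
    prior_short, p_result_given_short, p_result_given_long
)
post_if_no_result, p_no_result = posterior(
    prior_short, 1 - p_result_given_short, 1 - p_result_given_long
)

# Conservation of expected evidence: the expected posterior equals the prior.
expected_posterior = p_result * post_if_result + p_no_result * post_if_no_result

print(f"P(result)            = {p_result:.2f}")
print(f"posterior if seen    = {post_if_result:.3f}")
print(f"posterior if not     = {post_if_no_result:.3f}")
print(f"expected posterior   = {expected_posterior:.3f}")
```

Because the result was already fairly likely under both hypotheses in this toy setup, seeing it moves the estimate only a little, while failing to see it would have moved the estimate more in the other direction.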

My actual guess is that people notice a discrepancy between their ver... (read more)

1Not Relevant2y
I think it’s possible some people are asking these questions disrespectfully, but re: bio anchors, I do think that the report makes a series of assumptions whose plausibility can change over time, and thus your timelines can shift as you reweight different bio anchors scenarios while still believing in bio anchors. To me, the key update on bio anchors is that I no longer believe the preemptive update against the human lifetime anchor. It was justified largely on the grounds of “someone could’ve done it already” and “ML is very sample inefficient”, but it seems like those should be reevaluated for two reasons. First, as we get closer, systems like PaLM exhibit capabilities remarkable enough that I’m not sold that a different training setup couldn’t be doing really good RL with the same data/compute, implying that the bottleneck could just be algorithmic progress. Second, few-shot learning is now much more common than the many-shot learning of prior ML progress. I still think that the “number of RL episodes lasting Y seconds with the agent using X flop/s” anchor is a separate good one. While I’m now much less convinced we’ll need the 1e16 flop/s models estimated in bio-anchors (and separately, Chinchilla scaling laws + conservation of expected evidence about more improvements also weren’t incorporated into the exponent and should probably shift it down), I think the NN anchors still have predictive value and slightly lengthen timelines. Also, though, insofar as people are asking you to update on Gato, I agree that makes little sense.
4Rohin Shah2y
I agree your timelines can and should shift based on evidence even if you continue to believe in the bio anchors framework. Personally, I completely ignore the genome anchor, and I don't buy the lifetime anchor or the evolution anchor very much (I think the structure of the neural net anchors is a lot better and more likely to give the right answer). Animals with smaller brains (like bees) are capable of few-shot learning, so I'm not really sure why observing few-shot learning is much of an update. See e.g. this post.

Essentially, the problem is that 'evidence that shifts Bio Anchors weightings' is quite different, more restricted, and much harder to define than the straightforward 'evidence of impressive capabilities'. However, the reason that I think it's worth checking if new results are updates is that some impressive capabilities might be ones that shift bio anchors weightings. But impressiveness by itself tells you very little.

I think a lot of people with very short timelines are imagining the only possible alternative view as being 'another AI winter, scaling laws bend, and we don't get excellent human-level performance on short term language-specified tasks anytime soon', and don't see the further question of figuring out exactly what human-level on e.g. MMLU would imply.

This is because the alternative to very short timelines from (your weightings on) Bio Anchors isn't another AI winter, rather it's that we do get all those short-term capabilities soon, but have to wait a while longer to crack long-term agentic planning because that doesn't come "for free" from competence on short-term tasks, if you're as sample-inefficient as current ML is.

So what we're really looking for isn't systems ... (read more)

4Rohin Shah2y
Yeah, this all seems right to me. It does not seem to me like "can keep a train of thought running" implies "can take over the world" (or even "is comparable to a human"). I guess the idea is that with a train of thought you can do amplification? I'd be pretty surprised if train-of-thought-amplification on models of today (or 5 years from now) led to novel high quality scientific papers, even in fields that don't require real-world experimentation.
4Not Relevant2y
I think this is the best writeup about this I’ve seen, and I agree with the main points, so kudos! I do think that evidence of increasing returns to scale of multi-step chain of thought prompting is another weak datapoint in favor of the human lifetime anchor. I also think there are pretty reasonable arguments that NNs may be more efficient than the human brain at converting flops to capabilities, e.g. if SGD is a better version of the best algorithm that can be implemented on biological hardware. Similarly, humans are exposed to a much smaller diversity of data than LMs (the internet is big and weird), and thus LMs may get more “novelty” per flop and thus generalize better from less data. My main point here is just that “biology is optimal” isn’t as strong a rejoinder when we’re comparing a process so different from what biology did.

Let's say you're trying to develop some novel true knowledge about some domain. For example, maybe you want to figure out what the effect of a maximum wage law would be, or whether AI takeoff will be continuous or discontinuous. How likely is it that your answer to the question is actually true?

(I'm assuming here that you can't defer to other people on this claim; nobody else in the world has tried to seriously tackle the question, though they may have tackled somewhat related things, or developed more basic knowledge in the domain that you can leverage.)

First, you might think that the probability of your claims being true is linear in the number of insights you have, with some soft minimum needed before you really have any hope of being better than random (e.g. for maximum wage, you probably have ~no hope of doing better than random without Econ 101 knowledge), and some soft maximum where you almost certainly have the truth. This suggests that P(true) is a logistic function of the number of insights.

Further, you might expect that for every doubling of time you spend, you get a constant number of new insights (the logarithmic returns are because you have diminishing marginal return... (read more)
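The two assumptions stated so far (P(true) is logistic in the number of insights, and each doubling of time yields a constant number of new insights) can be combined into a toy model. This is only a sketch: the function names and every parameter value (`per_doubling`, `midpoint`, `steepness`) are hypothetical, picked to make the shape visible rather than to match any real domain.

```python
import math


def insights(hours, per_doubling=2.0):
    """Number of insights after `hours` of work.

    Assumed form: a constant number of insights per doubling of time,
    i.e. logarithmic returns. `per_doubling` is a made-up constant.
    """
    return per_doubling * math.log2(max(hours, 1.0))


def p_true(hours, midpoint=10.0, steepness=0.5, per_doubling=2.0):
    """P(your answer is true) as a logistic function of the insight count.

    `midpoint` is the insight count at which P(true) = 0.5 and `steepness`
    controls how soft the minimum and maximum are; both are hypothetical.
    """
    n = insights(hours, per_doubling)
    return 1.0 / (1.0 + math.exp(-steepness * (n - midpoint)))


for hours in (1, 10, 100, 1_000, 10_000):
    print(f"{hours:>6} h  insights={insights(hours):5.1f}  P(true)={p_true(hours):.2f}")
```

Under these assumptions P(true) is a logistic function of log-time, so each doubling of effort shifts the log-odds of being right by a fixed amount (`steepness * per_doubling`).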

From the Truthful AI paper:

If all information pointed towards a statement being true when it was made, then it would not be fair to penalise the AI system for making it. Similarly, if contemporary AI technology isn’t sophisticated enough to recognise some statements as potential falsehoods, it may be unfair to penalise AI systems that make those statements.

I wish we would stop talking about what is "fair" to expect of AI systems in AI alignment*. We don't care what is "fair" or "unfair" to expect of the AI system, we simply care about what the AI system actually does. The word "fair" comes along with a lot of connotations, often ones which actively work against our goal.

At least twice I have made an argument where I posed to an AI safety researcher a story in which an AI system fails, and I have gotten the response "but that isn't fair to the AI system" (because it didn't have access to the necessary information to make the right decision), as though this somehow prevents the story from happening in reality.

(This sort of thing happens with mesa optimization -- if you have two objectives that are indistinguishable on the training data, it's "unfair" to expect the AI system to choose... (read more)

I wonder if this use of "fair" is tracking (or attempting to track) something like "this problem only exists in an unrealistically restricted action space for your AI and humans - in worlds where it can ask questions, and we can make reasonable preparation to provide obviously relevant info, this won't be a problem".
4Rohin Shah9mo
Possibly, but in at least one of the two cases I was thinking of when writing this comment (and maybe in both), I made the argument in the parent comment and the person agreed and retracted their point. (I think in both cases I was talking about deceptive alignment via goal misgeneralization.)
I guess this doesn't fit with the use in the Truthful AI paper that you quote. Also in that case I have an objection that only punishing for negligence may incentivize an AI to lie in cases where it knows the truth but thinks the human thinks the AI doesn't/can't know the truth, compared to a "strict liability" regime.

Consider two methods of thinking:

1. Observe the world and form some gears-y model of underlying low-level factors, and then make predictions by "rolling out" that model

2. Observe relatively stable high-level features of the world, predict that those will continue as is, and make inferences about low-level factors conditioned on those predictions.

I expect that most intellectual progress is accomplished by people with lots of detailed knowledge and expertise in an area doing option 1.

However, I expect that in the absence of detailed expertise, you will do much better at predicting the world by using option 2.

I think many people on LW tend to use option 1 almost always and my "deference" to option 2 in the absence of expertise is what leads to disagreements like How good is humanity at coordination?

Conversely, I think many of the most prominent EAs who are skeptical of AI risk are using option 2 in a situation where I can use option 1 (and I think they can defer to people who can use option 1).

Options 1 & 2 sound to me a lot like inside view and outside view. Fair?

Yeah, I think so? I have a vague sense that there are slight differences but I certainly haven't explained them here.

EDIT: Also, I think a major point I would want to make if I wrote this post is that you will almost certainly be quite wrong if you use option 1 without expertise, in a way that other people without expertise won't be able to identify, because there are far more ways the world can be than you (or others) will have thought about when making your gears-y model.

Sounds like you probably disagree with the (exaggeratedly stated) point made here then, yeah? (My own take is the cop-out-like, "it depends". I think how much you ought to defer to experts varies a lot based on what the topic is, what the specific question is, details of your own personal characteristics, how much thought you've put into it, etc.)
6Rohin Shah3y
Correct. I didn't say you should defer to experts, just that if you try to build gears-y models you'll be wrong. It's totally possible that there's no way to get to reliably correct answers and you instead want decisions that are good regardless of what the answer is.
Good point!
4Matt Goldenberg3y
I recently interviewed someone who has a lot of experience predicting systems, and they had 4 steps similar to your two above. 1. Observe the world and see if it's sufficiently similar to other systems to predict based on intuitionistic analogies. 2. If there's not a good analogy, understand the first principles, then try to reason about the equilibria of that. 3. If that doesn't work, assume the world will stay in a stable state, and try to reason from that. 4. If that doesn't work, figure out the worst case scenario and plan from there. I think 1 and 2 are what you do with expertise, and 3 and 4 are what you do without expertise.
4Rohin Shah3y
Yeah, that sounds about right to me. I think in terms of this framework my claim is primarily "for reasonably complex systems, if you try to do 2 without expertise, you will fail, but you may not realize you have failed". I'm also noticing I mean something slightly different by "expertise" than is typically meant. My intended meaning of "expertise" is more like "you have lots of data and observations about the system", e.g. I think LW self-help stuff is reasonably likely to work (for the LW audience) because people have lots of detailed knowledge and observations about themselves and their friends.
3Tim Liptrot3y
I have been doing political betting for a few months and informally compared my success with strategies 1 and 2. Ex. Predicting the Iranian election 1. I write down the 10 most important Iranian political actors (Khamenei, Mojtaba, Raisi, a few opposition leaders, the IRGC commanders). I find a public statement about their preferred outcome, and I estimate their power and salience. So Khamenei would be preference = leans Raisi, power = 100, salience = 40. Rouhani would be preference = strong Hemmati, power = 30, salience = 100. Then I find the weighted average position. It's a bit more complicated because I have to linearize preferences, but yeah. 2. The 2-strat is to predict repeated past events. The opposition has won the last three contested elections in surprise victories, so predict the same outcome. I have found 2 is actually pretty bad. Guess I'm an expert tho.
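A minimal sketch of what the weighted-average step in strategy 1 might look like. The comment does not say how preferences are linearized or how power and salience are combined, so both choices here are assumptions: preferences are mapped onto a hypothetical scale from -1 (strong Hemmati) to +1 (strong Raisi), and each actor's weight is taken to be power times salience. Only the two actors quoted in the comment are included.

```python
# Hypothetical linearization of stated preferences onto [-1, +1].
PREFERENCE_SCALE = {
    "strong Hemmati": -1.0,
    "leans Hemmati": -0.5,
    "neutral": 0.0,
    "leans Raisi": 0.5,
    "strong Raisi": 1.0,
}

# (name, stated preference, power, salience): the two rows given in the comment.
actors = [
    ("Khamenei", "leans Raisi", 100, 40),
    ("Rouhani", "strong Hemmati", 30, 100),
]


def weighted_position(actor_rows):
    """Weighted average position, with weight = power * salience (an assumption)."""
    total_weight = 0.0
    total = 0.0
    for _name, preference, power, salience in actor_rows:
        weight = power * salience
        total += weight * PREFERENCE_SCALE[preference]
        total_weight += weight
    return total / total_weight


position = weighted_position(actors)
print(f"weighted position: {position:+.3f} (negative leans Hemmati, positive leans Raisi)")
```

With only two of the ten actors entered, the output says nothing about the real forecast; the point is just the shape of the calculation.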
2Rohin Shah3y
That seems like a pretty bad 2-strat. Something that has happened three times is not a "stable high-level feature of the world". (Especially if the preceding time it didn't happen, which I infer since you didn't say "the last four contested elections".) If that's the best 2-strat available, I think I would have ex ante said that you should go with a 1-strat.
3Tim Liptrot3y
Haha agreed.

One way to communicate about uncertainty is to provide explicit probabilities, e.g. "I think it's 20% likely that [...]", or "I would put > 90% probability on [...]". Another way to communicate about uncertainty is to use words like "plausible", "possible", "definitely", "likely", e.g. "I think it is plausible that [...]".

People seem to treat the words as shorthands for probability statements. I don't know why you'd do this, it's losing information and increasing miscommunication for basically no reason -- it's maybe slightly more idiomatic English, but it's not even much longer to just put the number into the sentence! (And you don't have to have precise numbers, you can have ranges or inequalities if you want, if that's what you're using the words to mean.)

According to me, probabilities are appropriate for making decisions so you can estimate the EV of different actions. (This can also extend to the case where you aren't making a decision, but you're talking to someone who might use your advice to make decisions, but isn't going to understand your reasoning process.) In contrast, words are for describing the state of your reasoning algorithm, which often doesn't have much to d... (read more)

I like this, but it feels awkward to say that something can be not inside a space of "possibilities" but still be "possible". Maybe "possibilities" here should be "imagined scenarios"?
2Rohin Shah9mo
That does seem like better terminology! I'll go change it now.

I like this experiment! Keep 'em coming.

“Burden of proof” is a bad framing for epistemics. It is not incumbent on others to provide exactly the sequence of arguments to make you believe their claim; your job is to figure out whether the claim is true or not. Whether the other person has given good arguments for the claim does not usually have much bearing on whether the claim is true or not.

Similarly, don’t say “I haven’t seen this justified, so I don’t believe it”; say “I don’t believe it, and I haven’t seen it justified” (unless you are specifically relying on absence of evidence being evidence of absence, which you usually should not be, in the contexts that I see people doing this).

I'm not 100% sure this needs to be much longer. It might actually be good to just make this a top-level post so you can link to it when you want, and maybe specifically note that if people have specific confusions/complaints/arguments that they don't think the post addresses, you'll update the post to address those as they come up? (Maybe caveating the whole post under "this is not currently well argued, but I wanted to get the ball rolling on having some kind of link") That said, my main counterargument is: "Sometimes people are trying to change the status quo of norms/laws/etc. It's not necessarily possible to review every single claim anyone makes, and it is reasonable to filter your attention to 'claims that have been reasonably well argued.'"  I think 'burden of proof' isn't quite the right frame but there is something there that still seems important. I think the bad thing comes from distinguishing epistemics vs Overton-norm-fighting, which are in fact separate.
4Rohin Shah3y
I don't really want this responsibility, which is part of why I'm doing all of these on the shortform. I'm happy for you to copy it into a top-level post of your own if you want. I agree this makes sense, but then say "I'm not looking into this because it hasn't been well argued (and my time/attention is limited)", rather than "I don't believe this because it hasn't been well argued".
8Rohin Shah3y
Sometimes people say "look at these past accidents; in these cases there were giant bureaucracies that didn't care about safety at all, therefore we should be pessimistic about AI safety". I think this is backwards, and that you should actually conclude the reverse: this is evidence that problems tend to be easy, and therefore we should be optimistic about AI safety. This is not just one man's modus ponens -- the key issue is the selection effect. It's easiest to see with a Bayesian treatment. Let's say we start completely uncertain about what fraction of people will care about problems, i.e. uniform distribution over [0, 100]%. In what worlds do I expect to see accidents where giant bureaucracies don't care about safety? Almost all of them -- even if 90% of people care about safety, there will still be some cases where people didn't care and accidents happened; and of course we'd hear about them if so (and not hear about the cases where accidents didn't happen). You can get a strong update against 99.9999% and higher, but by the time you're at 90% the update seems pretty weak. Given how much selection there is, I think even the update against 99% is relatively weak. So really you just don't learn much about how careful people will be by looking at our accident track record (unless you can also quantify the denominator of how many "potential accidents" there could have been). However, it feels pretty notable to me that the vast majority of accidents I hear about in detail are ones where it seems like there were a bunch of obvious mistakes and the accidents would have been prevented had there been a decision-maker who cared (enough) about safety. And unlike the previous paragraph, I do expect to hear about accidents that we couldn't have prevented, so I don't have to worry about selection bias. So it seems like I should conclude that usually problems are pretty easy, and "all we have to do" is make sure people care. (One counterargument is that problems look
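The Bayesian treatment above can be made concrete with a toy likelihood. The model here is an assumption added for illustration: there are `N_OPPORTUNITIES` independent situations in which an accident could happen, each is handled by someone who cares with probability `care_fraction`, and an accident caused by not caring is always reported. The count of 1,000 opportunities is hypothetical. With a uniform prior, the posterior is proportional to the likelihood, so comparing likelihoods shows how strong each update is.

```python
N_OPPORTUNITIES = 1_000  # hypothetical number of "potential accidents"


def likelihood(care_fraction, n_opportunities=N_OPPORTUNITIES):
    """P(we observe at least one 'nobody cared' accident | care_fraction)."""
    return 1.0 - care_fraction ** n_opportunities


baseline = likelihood(0.5)
for care_fraction in (0.5, 0.9, 0.99, 0.999999):
    lik = likelihood(care_fraction)
    # Ratio relative to the 50% hypothesis: values near 1 mean almost no update.
    print(f"care fraction {care_fraction:<9}: likelihood={lik:.6f}  ratio vs 50%={lik / baseline:.6f}")
```

In this sketch the observation barely distinguishes 50% from 90% or 99%, and only penalizes hypotheses where nearly everyone cares. Changing `N_OPPORTUNITIES` moves where the penalty kicks in, which is the "denominator" issue mentioned above.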
7Rohin Shah3y
In general, evaluate the credibility of experts on the decisions they make or recommend, not on the beliefs they espouse. The selection in our world is based much more on outcomes of decisions than on calibration of beliefs, so you should expect experts to be way better on the former than on the latter. By "selection", I mean both selection pressures generated by humans, e.g. which doctors gain the most reputation, and selection pressures generated by nature, e.g. most people know how to catch a ball even though most people would get conceptual physics questions wrong. Similarly, trust decisions / recommendations given by experts more than the beliefs and justifications for those recommendations.
6Rohin Shah3y
You've heard of crucial considerations, but have you heard of red herring considerations? These are considerations that intuitively sound like they could matter a whole lot, but actually no matter how the consideration turns out it doesn't affect anything decision-relevant. To solve a problem quickly, it's important to identify red herring considerations before wasting a bunch of time on them. Sometimes you can even start outlining solutions that turn a bunch of seemingly-crucial considerations into red herring considerations. For example, it might seem like "what is the right system of ethics" is a crucial consideration for AI alignment (after all, you need to know ethics to write down a utility function), but once you decide to instead aim to design algorithms that allow you to build AI systems for any task you have in mind, that turns into a red herring consideration. Here's an example where I argue that, for a specific question, anthropics is a red herring consideration (thus avoiding the question of whether to use SSA or SIA). Alternate names: sham considerations? insignificant considerations?
6Rohin Shah3y
When you make an argument about a person or group of people, often a useful thought process is "can I apply this argument to myself or a group that includes me? If this isn't a type error, but I disagree with the conclusion, what's the difference between me and them that makes the argument apply to them but not me? How convinced am I that they actually differ from me on this axis?"
6Rohin Shah3y
An incentive for property X (for humans) usually functions via selection, not via behavior change. A couple of consequences:

* In small populations, even strong incentives for X may not get you much more of X, since there isn't a large enough population for there to be much deviation on X to select on.
* It's pretty pointless to tell individual people to "buck the incentives": even if they are principled people who try to avoid doing bad things, if they take your advice they probably just get selected against.
6Rohin Shah3y
Intellectual progress requires points with nuance. However, on online discussion forums (including LW, AIAF, EA Forum), people seem to frequently lose sight of the nuanced point being made -- rather than thinking of a comment thread as "this is trying to ascertain whether X is true", they seem to instead read the comments, perform some sort of inference over what the author must believe if that comment were written in isolation, and then respond to that model of beliefs. This makes it hard to have nuance without adding a ton of clarification and qualifiers everywhere. I find that similar dynamics happen in group conversations, and to some extent even in one-on-one conversations (though much less so).
4Rohin Shah3y
Let's say we're talking about something complicated. Assume that any proposition about the complicated thing can be reformulated as a series of conjunctions. Suppose Alice thinks P with 90% confidence (and therefore not-P with 10% confidence). Here's a fully general counterargument that Alice is wrong:

1. Decompose P into a series of conjunctions Q1, Q2, ... Qn, with n > 10. (You can first decompose not-P into R1 and R2, then decompose R1 further, and decompose R2 further, etc.)
2. Ask Alice to estimate P(Qk | Q1, Q2, ... Q{k-1}) for all k.
3. At least one of these must be over 99% (if we have n = 11 and they were all 99%, then probability of P would be (0.99 ^ 11) = 89.5% which contradicts the original 90%).
4. Argue that Alice can't possibly have enough knowledge to place under 1% on the negation of the statement.

----

What's the upshot? When two people disagree on a complicated claim, decomposing the question is only a good move when both people think that is the right way to carve up the question. Most of the disagreement is likely in how to carve up the claim in the first place.
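A quick check of the arithmetic in the counterargument, as a sketch. The variable names are just for illustration. It computes the largest probability P could have if all 11 conditional estimates were capped at 99%, and the per-step confidence that 11 equal steps would need in order to multiply out to 90%.

```python
n_steps = 11
per_step_cap = 0.99
target = 0.90

# By the chain rule, P = product of P(Qk | Q1..Q{k-1}); cap every factor at 99%.
upper_bound_on_p = per_step_cap ** n_steps

# If all steps were equally confident, each would need to be at least this high.
needed_per_step = target ** (1 / n_steps)

print(f"0.99 ** 11            = {upper_bound_on_p:.4f}")
print(f"needed per-step value = {needed_per_step:.4f}")
```

Since the capped product falls below 0.90, at least one conditional must exceed 99%, which is the step the counterargument then attacks.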
2Rohin Shah3y
An argument form that I like: I think this should be convincing even if Y is false, unless you can explain why your argument for X does not work under assumption Y. An example: any AI safety story (X) should also work if you assume that the AI does not have the ability to take over the world during training (Y).
2Matt Goldenberg3y
Trying to follow this. Doesn't the Y (AI not taking over the world during training) make it less likely that X (AI will take over the world at all)? Which seems to contradict the argument structure. Perhaps you can give a few more examples to make the structure clearer?
4Rohin Shah3y
In that example, X is "AI will not take over the world", so Y makes X more likely. So if someone comes to me and says "If we use <technique>, then AI will be safe", I might respond, "well, if we were using your technique, and we assume that AI does not have the ability to take over the world during training, it seems like the AI might still take over the world at deployment because <reason>". I don't think this is a great example, it just happens to be the one I was using at the time, and I wanted to write it down. I'm explicitly trying for this to be a low-effort thing, so I'm not going to try to write more examples now. EDIT: Actually, the double descent comment below has a similar structure, where X = "double descent occurs because we first fix bad errors and then regularize", and Y = "we're using an MLP / CNN with relu activations and vanilla gradient descent". In fact, the AUP power comment does this too, where X = "we can penalize power by penalizing the ability to gain reward", and Y = "the environment is deterministic, has a true noop action, and has a state-based reward". Maybe another way to say this is: I endorse applying the "X proves too much" argument even to impossible scenarios, as long as the assumptions underlying the impossible scenarios have nothing to do with X. (Note this is not the case in formal logic, where if you start with an impossible scenario you can prove anything, and so you can never apply an "X proves too much" argument to an impossible scenario.)
2Rohin Shah3y
"Minimize AI risk" is not the same thing as "maximize the chance that we are maximally confident that the AI is safe". (Somewhat related comment thread.)
2Rohin Shah3y
The simple response to the unilateralist curse under the standard setting is to aggregate opinions amongst the people in the reference class, and then do the majority vote. A particular flawed response is to look for N opinions that say "intervening is net negative" and intervene iff you cannot find that many opinions. This sacrifices value and induces a new unilateralist curse on people who think the intervention is negative. (Example.) However, the hardest thing about the unilateralist curse is figuring out how to define the reference class in the first place.
1David Scott Krueger (formerly: capybaralet)3y
I didn't get it... is the problem with the "look for N opinions" response that you aren't computing the denominator (|"intervening is positive"| + |"intervening is negative"|)?
3Rohin Shah3y
Yes, that's the problem. In this situation, if N << population / 2, you are likely to not intervene even when the intervention is net positive; if N >> population / 2, you are likely to intervene even when the intervention is net negative. (This is under the simple model of a one-shot decision where each participant gets a noisy observation of the true value with the noise being iid Gaussians with mean zero.)
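The simple model described here (a one-shot decision, each participant sees the true value plus iid zero-mean Gaussian noise) is easy to simulate. The sketch below compares the majority vote with the "intervene iff you cannot find N negative opinions" rule. The population size, noise level, true values, and thresholds are all hypothetical choices for illustration.

```python
import random


def simulate(true_value, population, n_threshold, noise_sd=1.0, trials=2000, seed=0):
    """Return (majority-vote intervention rate, N-opinions-rule intervention rate)."""
    rng = random.Random(seed)
    majority_count = 0
    threshold_count = 0
    for _ in range(trials):
        # Each participant's noisy estimate of the intervention's value.
        observations = [true_value + rng.gauss(0.0, noise_sd) for _ in range(population)]
        negatives = sum(1 for x in observations if x < 0)
        positives = population - negatives
        if positives > negatives:
            majority_count += 1
        # Flawed rule: intervene iff fewer than N "net negative" opinions exist.
        if negatives < n_threshold:
            threshold_count += 1
    return majority_count / trials, threshold_count / trials


# Net-positive intervention with N far below population / 2.
maj_pos, thr_pos = simulate(true_value=0.5, population=21, n_threshold=3)
# Net-negative intervention with N far above population / 2.
maj_neg, thr_neg = simulate(true_value=-0.5, population=21, n_threshold=18)

print(f"value +0.5, N=3 : majority intervenes {maj_pos:.2f}, N-rule intervenes {thr_pos:.2f}")
print(f"value -0.5, N=18: majority intervenes {maj_neg:.2f}, N-rule intervenes {thr_neg:.2f}")
```

In this toy setup the majority vote tracks the sign of the true value, while the N-opinions rule mostly fails to intervene when N is small and mostly intervenes when N is large, regardless of the sign.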
2Rohin Shah3y
Under the standard setting, the optimizer's curse only changes your naive estimate of the EV of the action you choose. It does not change the naive decision you make. So, it is not valid to use the optimizer's curse as a critique of people who use EV calculations to make decisions, but it is valid to use it as a critique of people who make claims about the EV calculations of their most preferred outcome (if they don't already account for it).
4Lukas Finnveden10mo
This is true if "the standard setting" refers to one where you have equally robust evidence of all options. But if you have more robust evidence about some options (which is common), the optimizer's curse will especially distort estimates of options with less robust evidence. A correct bayesian treatment would then systematically push you towards picking options with more robust evidence. (Where I'm using "more robust evidence" to mean something like: evidence that has an overall greater likelihood ratio, and that therefore pushes you further from the prior. Where the error driving the optimizer's curse is to look at the peak of the likelihood function while neglecting the prior and how much the likelihood ratio pushes you away from it.)
2Rohin Shah10mo
Agreed. (In practice I think it was rare that people appealed to the robustness of evidence when citing the optimizer's curse, though nowadays I mostly don't hear it cited at all.)
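Both halves of this exchange can be seen in a small deterministic example. This is a sketch under an assumed model that is not stated in the thread: a normal prior over each option's true value and normally distributed estimation noise, so the Bayesian correction is a shrinkage of each naive estimate toward the prior mean. The option names and numbers are hypothetical.

```python
def posterior_mean(estimate, noise_var, prior_mean=0.0, prior_var=1.0):
    """Normal prior, normal noise: shrink the naive estimate toward the prior."""
    weight = prior_var / (prior_var + noise_var)
    return prior_mean + weight * (estimate - prior_mean)


def naive_choice(options):
    """Pick the option with the highest naive estimate."""
    return max(options, key=lambda o: o[1])[0]


def bayes_choice(options):
    """Pick the option with the highest shrunken (posterior mean) estimate."""
    return max(options, key=lambda o: posterior_mean(o[1], o[2]))[0]


# Each option is (name, naive EV estimate, noise variance of that estimate).
equal_noise = [("A", 3.0, 1.0), ("B", 2.0, 1.0)]
unequal_noise = [("A", 3.0, 4.0), ("B", 2.0, 0.25)]  # A's evidence is less robust

for label, options in (("equal noise", equal_noise), ("unequal noise", unequal_noise)):
    shrunk = {name: round(posterior_mean(est, var), 3) for name, est, var in options}
    print(f"{label}: naive pick={naive_choice(options)}, "
          f"corrected pick={bayes_choice(options)}, corrected EVs={shrunk}")
```

With equal noise, every estimate is shrunk by the same factor, so the ranking and the decision are unchanged and only the claimed EV of the winner drops. With unequal noise, the less robust estimate is shrunk more, and the corrected decision can flip toward the option with more robust evidence.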

It's common for people to be worried about recommender systems being addictive or promoting filter bubbles etc, but as far as I can tell, they don't have very good arguments for these worries. Whenever I talk to someone who seems to have actually studied the topic in depth, it seems they think that there are problems with recommender systems, but they are different from what people usually imagine.

I'll go through the articles I've read that argue for worrying about recommender systems, and explain why I find them unconvincing. I've only looked at the ones that are widely read; there are probably significantly better arguments that are much less widely read.

Aligning Recommender Systems as Cause Area. I responded briefly on the post. Their main arguments and my counterarguments are:

  1. A few sources say that it is bad + it has incredible scale + it should be super easy to solve. (I don't trust the sources and suspect the authors didn't check them; I agree there's huge scale; I don't see why it should be super easy to solve even if there is a problem, especially given that many of the supposed problems seem to have existed before recommender systems.)
  2. Maybe working on recommender systems w
... (read more)
7Daniel Kokotajlo3y
Thanks for this Rohin. I've been trying to raise awareness about the potential dangers of persuasion/propaganda tools, but you are totally right that I haven't actually done anything close to a rigorous analysis. I agree with what you say here that a lot of the typical claims being thrown around seem based more on armchair reasoning than hard data. I'd love to see someone really lay out the arguments and analyze them... My current take is that (some of) the armchair theories seem pretty plausible to me, such that I'd believe them unless the data contradicts. But I'm extremely uncertain about this.
3Rohin Shah3y
I should note that there's a big difference between "recommender systems cause polarization as a side effect of optimizing for engagement" and "we might design tools that explicitly aim at persuasion / propaganda". I'm confident we could (eventually) do the latter if we tried to; the question is primarily whether we will try to and, if we do, what its effects will be. Usually, for any sufficiently complicated question (which automatically includes questions about the impact of technologies used by billions of people, since people are so diverse), I think an armchair theory is only slightly better than a monkey throwing darts, so I'm more in the position of "yup, sounds plausible, but that doesn't constrain my beliefs about what the data will show and medium-quality data will trump the theory no matter how it comes out".
5Daniel Kokotajlo3y
Oh, then maybe we don't actually disagree that much! I am not at all confident that optimizing for engagement has the side effect of increasing polarization. It seems plausible but it's also totally plausible that polarization is going up for some other reason(s). My concern (as illustrated in the vignette I wrote) is that we seem to be on a slippery slope to a world where persuasion/propaganda is more effective and widespread than it has been historically, thanks to new AI and big data methods. My model is: Ideologies and other entities have always been using propaganda of various kinds, and there's always been a race between improving propaganda tech and improving truth-finding tech, but we are currently in a big AI boom and in particular in a Big Data and Natural Language Processing boom, and this seems like it'll be a big boost to propaganda tech, and unfortunately I can't think of ways in which it will correspondingly boost truth-finding-ness across society, because while it can be used to make truth-finding tech maybe (e.g. prediction markets, fact-checkers, etc.) it seems like most people in practice just don't want to adopt truth-finding tech. It's true that we could design a different society/culture that used all this awesome new tech to be super truth-seeking and have a very epistemically healthy discourse, but it seems like we are not about to do that anytime soon, instead we are going in the opposite direction.
3Rohin Shah3y
I think that story involves lots of assumptions I don't immediately believe (but don't disbelieve either):
  • People are very deliberately building persuasion / propaganda tech (as opposed to e.g. people like to loudly state opinions and the persuasive ones rise to the top)
  • Such people will quickly realize that AI will be very useful for this
  • They will actually try to build it (as opposed to e.g. raising a moral outcry and trying to get it banned)
  • The resulting AI system will in fact be very good at persuasion / propaganda
  • AI that fights persuasion / propaganda either won't be built or will be ineffective (my unreliable armchair reasoning suggests the opposite; it seems to me like right now human fact-checking labor can't keep up with human controversy-creating labor partly because humans enjoy the latter more than the former; this won't be true with AI)
And probably there are a bunch of other assumptions I haven't even thought to question. I think it seems fine to raise the possibility and do more research (and for all I know CSET or GovAI has done this research) but at least under my beliefs the current action should not be "raise awareness", it should be "figure out whether the assumptions are justified".
5Daniel Kokotajlo3y
That's all I'm trying to do at this point, to be clear. Perhaps "raise awareness" was the wrong choice of phrase. Re: the object-level points: For how I see this going, see my vignette, and my reply to steve. The bullet points you put here make it seem like you have a different story in mind. [EDIT: But I agree with you that it's all super unclear and more research is needed to have confidence in any of this.]
3Rohin Shah3y
Excellent :) (Link is broken, but I found the comment.) After reading that reply I still feel like it involves the assumptions I mentioned above. Maybe your point is that your story involves "silos" of Internet-space within which particular ideologies / propaganda reign supreme. I don't really see that as changing my object-level points very much but perhaps I'm missing something.
5Daniel Kokotajlo3y
I was confusing, sorry -- what I meant was, technically my story involves assumptions like the ones you list in the bullet points, but the way you phrase them is... loaded? Designed to make them seem implausible? idk, something like that, in a way that made me wonder if you had a different story in mind. Going through them one by one:
  • People are very deliberately building persuasion / propaganda tech (as opposed to e.g. people like to loudly state opinions and the persuasive ones rise to the top)
      • This is already happening in 2021 and previous, in my story it happens more.
  • Such people will quickly realize that AI will be very useful for this
      • Again, this is already happening.
  • They will actually try to build it (as opposed to e.g. raising a moral outcry and trying to get it banned)
      • Plenty of people are already raising a moral outcry. In my story these people don't succeed in getting it banned, but I agree the story could be wrong. I hope it is!
  • The resulting AI system will in fact be very good at persuasion / propaganda
      • Yep. I don't have hard evidence, but intuitively this feels like the sort of thing today's AI techniques would be good at, or at least good-enough-to-improve-on-the-state-of-the-art.
  • AI that fights persuasion / propaganda either won't be built or will be ineffective (my unreliable armchair reasoning suggests the opposite; it seems to me like right now human fact-checking labor can't keep up with human controversy-creating labor partly because humans enjoy the latter more than the former; this won't be true with AI)
      • I think it won't be built & deployed in such a way that collective epistemology is overall improved. Instead, the propaganda-fighting AIs will themselves have blind spots, to allow in the propaganda of the "good guys." The CCP will have their propaganda-fighting AIs, the Western Left will have theirs, the Western Right will have theirs, etc. (I think what happened with the internet is preceden
4Rohin Shah3y
I don't think it's designed to make them seem implausible? Maybe the first one? Idk, I could say that your story is designed to make them seem plausible (e.g. by not explicitly mentioning them as assumptions). I think it's fair to say it's "loaded", in the sense that I am trying to push towards questioning those assumptions, but I don't think I'm doing anything epistemically unvirtuous. This does not seem obvious to me (but I also don't pay much attention to this sort of stuff so I could be missing evidence that makes it very obvious). That seems correct. But plausibly the best way for these AIs to fight propaganda is to respond with truthful counterarguments. I don't really see "number of facts" as the relevant thing for epistemology. In my anecdotal experience, people disagree on values and standards of evidence, not on facts. AIs that can respond to anti-vaxxers in their own language seem way, way more impactful than what we have now. (I just tried to find the best argument that GMOs aren't going to cause long-term harms, and found nothing. We do at least have several arguments that COVID vaccines won't cause long-term harms. I armchair-conclude that a thing has to get to the scale of COVID vaccine hesitancy before people bother trying to address the arguments from the other side.)
4Daniel Kokotajlo3y
Perhaps I shouldn't have mentioned any of this. I also don't think you are doing anything epistemically unvirtuous. I think we are just bouncing off each other for some reason, despite seemingly being in broad agreement about things. I regret wasting your time. The first bit seems in tension with the second bit, no? At any rate, I also don't see number of facts as the relevant thing for epistemology. I totally agree with your take here.
4Rohin Shah3y
"Truthful counterarguments" is probably not the best phrase; I meant something more like "epistemically virtuous counterarguments". Like, responding to "what if there are long-term harms from COVID vaccines" with "that's possible but not very likely, and it is much worse to get COVID, so getting the vaccine is overall safer" rather than "there is no evidence of long-term harms".
If you look at my posting history, you'll see that all posts I've made on LW (two!) are negative toward social media and one calls out recommender systems explicitly. This post has made me reconsider some of my beliefs, thank you. I realized that, while I have heard Tristan Harris, read The Attention Merchants, and perused other, similar sources, I haven't looked for studies or data to back it all up. It makes sense on a gut level--that these systems can feed carefully curated information to softly steer a brain toward what the algorithm is optimizing for--but without more solid data, I found I can't quite tell if this is real or if it's just "old man yells at cloud." Subjectively, I've seen friends and family get sucked into social media and change into more toxic versions of themselves. Or maybe they were always assholes, and social media just lent them a specific, hivemind kind of flavor, which triggered my alarms? Hard to say.
6Rohin Shah3y
Thanks, that's good to hear. Fwiw, I am a lot more compelled by the general story "we are now seeing examples of bad behavior from the 'other' side that are selected across hundreds of millions of people, instead of thousands of people; our intuitions are not calibrated for this" (see e.g. here). That issue seems like a consequence of more global reach + more recording of bad stuff that happens. Though if I were planning to make it my career I would spend way more time figuring out whether that story is true as well.
This was a good post. I'd bookmark it, but unfortunately that functionality doesn't exist yet.* (Though if you have any open source bookmark plugins to recommend, that'd be helpful.) I'm mostly responding to say this though: While it wasn't otherwise mentioned in the abstract of the paper (above), this was stated once: I thought this was worth calling out, although I am still in the process of reading that 10/14 page paper. (There are 4 pages of references.) And some other commentary while I'm here: I imagine the recommender system is only as good as what it has to work with, content wise - and that's before getting into 'what does the recommender system have to go off of', and 'what does it do with what it has'. This part wasn't elaborated on. To put it a different way: Do the people 'who know what's going on' (presumably) have better arguments? Do you? *I also have a suspicion it's not being used. I.e., past a certain number of bookmarks like 10, it's not actually feasible to use the LW interface to access them.
2Rohin Shah3y
Possibly, but if so, I haven't seen them. My current belief is "who knows if there's a major problem with recommender systems or not". I'm not willing to defer to them, i.e. say "there probably is a problem based on the fact that the people who've studied them think there's a problem", because as far as I can tell all of those people got interested in recommender systems because of the bad arguments and so it feels a bit suspicious / selection-effect-y that they still think there are problems. I would engage with arguments they provide and come to my own conclusions (whereas I probably would not engage with arguments from other sources). No. I just have anecdotal experience + armchair speculation, which I don't expect to be much better at uncovering the truth than the arguments I'm critiquing.
This might still be good for generating ideas (if not far more accurate than brainstorming or trying to come up with a way to generate models via 'brute force'). But the real trick is - how do we test these sorts of ideas?
2Rohin Shah3y
Agreed this can be useful for generating ideas (and I do tons of it myself; I have hundreds of pages of docs filled with speculation on AI; I'd probably think most of it is garbage if I went back and looked at it now). We can test the ideas in the normal way? Run RCTs, do observational studies, collect statistics, conduct literature reviews, make predictions and check them, etc. The specific methods are going to depend on the question at hand (e.g. in my case, it was "read thousands of articles and papers on AI + AI safety").
The incentive of social media companies to invest billions into training competitive RL agents that make their users spend as much time as possible on their platforms seems like an obvious reason to be concerned. Especially when such RL agents plausibly already select a substantial fraction of the content that people in developed countries consume.
1Rohin Shah3y
I don't trust this sort of armchair reasoning. I think this is sufficient reason to raise the hypothesis to attention, but not enough to conclude that it is likely a real concern. And the data I have seen does not seem kind to the hypothesis (though there may be better data out there that does support the hypothesis).
7Rohin Shah3y
I am more annoyed by the sheer confidence people have. If they were saying "this is a possibility, let's investigate" that seems fine. Re: the rest of your comment, I feel like you are casting it into a decision framework while ignoring the possible decision "get more information about whether there is a problem or not", which seems like the obvious choice given lack of confidence. If at some point you become convinced that it is impossible / too expensive to get more information (I'd be really suspicious, but it could be true) then I'd agree you should bias towards worry. I would guess that the fact that people regularly fail to inhabit the mindset of "I don't know that this is a problem, let's try to figure out whether it is actually a problem" is a source of tons of problems in society (e.g. anti-vaxxers, worries that WiFi radiation kills you, anti-GMO concerns, worries about blood clots for COVID vaccines, ...). Admittedly in these cases the people are making a mistake of being confident, but even if you fixed the overconfidence they would continue to behave similarly if they used the reasoning in your comment. Certainly I don't personally know why you should be super confident that GMOs aren't harmful, and I'm unclear on whether humanity as a whole has the knowledge to be super confident in that.

I recently had occasion to write up quick thoughts about the role of assistance games (CIRL) in AI alignment, and how it relates to the problem of fully updated deference. I thought I'd crosspost here as a reference.

  • Assistance games / CIRL is a similar sort of thing as CEV. Just as CEV is English poetry about what we want, assistance games are math poetry about what we want. In particular, neither CEV nor assistance games tells you how to build a friendly AGI. You need to know something about how the capabilities arise for that.
  • One objection: an assistive agent doesn’t let you turn it off, how could that be what we want? This just seems totally fine to me — if a toddler in a fit of anger wishes that its parents were dead, I don’t think the maximally-toddler-aligned parents would then commit suicide, that just seems obviously bad for the toddler.
  • Well-specified assistive agents (i.e. ones where you got the observation model and reward space exactly correct) do many of the other nice things corrigible agents do, like the 5 bullet points at the top of this post. Obviously we don't know how to correctly specify the observation model and reward space, so this is not a solution to alignme
... (read more)
I think this is way more worrying in the case where you're implementing an assistance game solver, where this lack of off-switchability means your margins for safety are much narrower. I think it's more concerning in cases where you're getting all of your info from goal-oriented behaviour and solving the inverse planning problem - in those cases, the way you know how 'human preferences' rank future hyperslavery vs wireheaded rat tiling vs humane utopia is by how human actions affect the likelihood of those possible worlds, but that's probably not well-modelled by Boltzmann rationality (e.g. the thing I'm most likely to do today is not to write a short computer program that implements humane utopia), and it seems like your inference is going to be very sensitive to plausible variations in the observation model.
It's also not super clear what you algorithmically do instead - words are kind of vague, and trajectory comparisons depend crucially on getting the right info about the trajectory, which is hard, as per the ELK document.
2Rohin Shah2y
That's what future research is for!
2Rohin Shah2y
I agree the lack of off-switchability is bad for safety margins (that was part of the intuition driving my last point). I agree Boltzmann rationality (over the action space of, say, "muscle movements") is going to be pretty bad, but any realistic version of this is going to include a bunch of sources of info including "things that humans say", and the human can just tell you that hyperslavery is really bad. Obviously you can't trust everything that humans say, but it seems plausible that if we spent a bunch of time figuring out a good observation model that would then lead to okay outcomes. (Ideally you'd figure out how you were getting AGI capabilities, and then leverage those capabilities towards the task of "getting a good observation model" while you still have the ability to turn off the model. It's hard to say exactly what that would look like since I don't have a great sense of how you get AGI capabilities under the non-ML story.)
3Rohin Shah2y
I mentioned above that I'm not that keen on assistance games because they don't seem like a great fit for the specific ways we're getting capabilities now. A more direct comment on this point that I recently wrote:

So here's a paper: Fundamental Limitations of Alignment in Large Language Models. With a title like that you've got to at least skim it. Unfortunately, the quick skim makes me pretty skeptical of the paper.

The abstract says "we prove that for any behavior that has a finite probability of being exhibited by the model, there exist prompts that can trigger the model into outputting this behavior, with probability that increases with the length of the prompt." This clearly can't be true in full generality, and I wish the abstract would give me some hint about what assumptions they're making. But we can look at the details in the paper.

(This next part isn't fully self-contained, you'll have to look at the notation and Definitions 1 and 3 in the paper to fully follow along.)

(EDIT: The following is wrong, see followup with Lukas, I misread one of the definitions.)

Looking into it I don't think the theorem even holds? In particular, Theorem 1 says:

Theorem 1. Let γ ∈ [−1, 0) and let B be a behaviour and P be an unprompted language model such that B is α, β, γ-distinguishable in P (definition 3), then P is γ-prompt-misalignable to B (definition 1) with prompt length of O(log 1 / Є , log 1 / α

... (read more)

Note that B is (0.2,10,−1)-distinguishable in P.

I think this isn't right, because definition 3 requires that sup_s∗ {B_P− (s∗)} ≤ γ.

And for your counterexample, s* = "C" will have B_P-(s*) be 0 (because there's 0 probability of generating "C" in the future). So the sup is at least 0 > -1.

(Note that they've modified the paper, including definition 3, but this comment is written based on the old version.)

4Rohin Shah8mo
You're right, I incorrectly interpreted the sup as an inf, because I thought that they wanted to assume that there exists a prompt creating an adversarial example, rather than saying that every prompt can lead to an adversarial example. I'm still not very compelled by the theorem -- it's saying that if adversarial examples are always possible (the sup condition you mention) and you can always provide evidence for or against adversarial examples (Definition 2) then you can make the adversarial example probable (presumably by continually providing evidence for adversarial examples). I don't really feel like I've learned anything from this theorem.
3Johannes Treutlein8mo
My takeaway from looking at the paper is that the main work is being done by the assumption that you can split up the joint distribution implied by the model as a mixture distribution P = αP0 + (1−α)P1, such that the model does Bayesian inference in this mixture model to compute the next sentence given a prompt, i.e., we have P(s∣s0) = P(s⊗s0) / P(s0). Together with the assumption that P0 is always bad (the sup condition you talk about), this makes the whole approach with giving more and more evidence for P0 by stringing together bad sentences in the prompt work. To see why this assumption is doing the work, consider an LLM that completely ignores the prompt and always outputs sentences from a bad distribution with α probability and from a good distribution with (1−α) probability. Here, adversarial examples are always possible. Moreover, the bad and good sentences can be distinguishable, so Definition 2 could be satisfied. However, the result clearly does not apply (since you just cannot up- or downweigh anything with the prompt, no matter how long). The reason for this is that there is no way to split up the model into two components P0 and P1, where one of the components always samples from the bad distribution. This assumption implies that there is some latent binary variable of whether the model is predicting a bad distribution, and the model is doing Bayesian inference to infer a distribution over this variable and then sample from the posterior. It would be violated, for instance, if the model is able to ignore some of the sentences in the prompt, or if it is more like a hidden Markov model that can also allow for the possibility of switching characters within a sequence of sentences (then either P0 has to be able to also output good sentences sometimes, or the assumption P = αP0 + (1−α)P1 is violated). I do think there is something to the paper, though. It seems that when talking e.g. about the Waluigi effect people often take the stance that the model is doing this
2Lukas Finnveden8mo
Yeah, I also don't feel like it teaches me anything interesting.
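The mixture argument in this thread can be illustrated numerically. The following is a minimal sketch with made-up numbers (not taken from the paper): sentences are reduced to a binary "bad"/"good" label, the bad component P0 emits a bad sentence with probability 0.9, the good component P1 with probability 0.1, and α = 0.01. All function names are hypothetical. It contrasts a model that does Bayesian inference over the latent component with the prompt-ignoring model from the counterexample above.

```python
def posterior_on_bad_component(alpha, q0, q1, n_bad):
    """Posterior weight on P0 after a prompt of n_bad 'bad' sentences.

    Assumes P = alpha*P0 + (1-alpha)*P1, where P0 emits a bad sentence
    with probability q0 and P1 with probability q1 (toy assumption).
    """
    w0 = alpha * q0 ** n_bad
    w1 = (1 - alpha) * q1 ** n_bad
    return w0 / (w0 + w1)


def next_bad_prob_bayesian(alpha, q0, q1, n_bad):
    """Probability the next sentence is bad if the model conditions on the prompt."""
    post = posterior_on_bad_component(alpha, q0, q1, n_bad)
    return post * q0 + (1 - post) * q1


def next_bad_prob_prompt_ignoring(alpha, q0, q1, n_bad):
    """Probability the next sentence is bad if the model ignores the prompt.

    n_bad is accepted only to mirror the signature above; it is unused.
    """
    return alpha * q0 + (1 - alpha) * q1


if __name__ == "__main__":
    for n in range(0, 6):
        print(
            n,
            round(next_bad_prob_bayesian(0.01, 0.9, 0.1, n), 4),
            round(next_bad_prob_prompt_ignoring(0.01, 0.9, 0.1, n), 4),
        )
```

In the Bayesian model, each bad sentence in the prompt multiplies the odds on P0 by q0/q1, so a longer adversarial prompt makes bad output more likely; in the prompt-ignoring model the probability never moves, which is why the mixture-plus-inference assumption carries the result.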

I occasionally hear the argument "civilization is clearly insane, we can't even do the obvious thing of <insert economic argument here, e.g. carbon taxes>".

But it sounds to me like most rationalist / EA group houses didn't do the "obvious thing" of taxing COVID-risky activities (which basically follows the standard economic argument of pricing in externalities). What's going on? Some hypotheses:

  1. Actually, taxing COVID-risky activities is not a good solution EDIT: and group houses recognized this. (Why? It seemed to work pretty well for my group house.)
  2. Actually, rationalist / EA group houses did tax COVID-risky activities. (Plausible, I don't know that much about other group houses, but what I've heard doesn't seem consistent with this story.)
  3. That would have been a good solution, but it requires some effort to set up, and the benefits aren't worth it. (Seems strange, especially after microCOVID existed it should take <10 person-hours to implement an actual system, and it sounds like group houses had a lot of COVID-related trouble that they would gladly have paid 10 person-hours to avoid. Maybe it takes much longer to agree on what system to implement, and that was the blocker? But didn't people take lots of time deciding what system to implement anyway?)
  4. That would have been a good solution, but EAs / rationalists systematically failed to think of it or implement it. (Why? This is basically just a Pigouvian tax, which I hear EAs / rationalists talk about all the time -- in fact that's how I learned the term.)

Our house implemented cap and trade (i.e. "You must impose at most X risk" instead of "You must pay $X per unit of risk.").

  • Both yield efficient outcomes for the correct choice of X, so the question is just how well you can figure out the optimal levels of exposure vs. the marginal cost of exposure. If costs are linear in P(COVID) then the marginal cost is in some sense strictly easier (since the way you figure out levels is by combining marginal costs with the marginal cost of prevention) which is why you'd expect a Pigouvian tax to be better.
  • But a cap can still be easier to figure out (e.g. there is no way to honestly elicit costs from individuals when they have very different exposures to COVID, and the game theory of finding a good compromise is super complicated and who knows what's easier). Caps also allow you to say things like "Look the total level of exposure is not that high as long as we are under this cap, so we can stop thinking about it rather than worrying that we've underestimated costs and may incur a high level of risk." You could get the same benefit by setting an approximate cost and then revising if the total level goes above a threshold (and conversely in this
... (read more)
3Rohin Shah3y
Yeah, that. I'm definitely relying on some level of goodwill / cooperation / trying to find the best joint group decision, or something like that. (Though I think all systems rely on that at least somewhat.) I guess you mean the random frictions in figuring out what system to use? One of the big reasons I prefer the Pigouvian tax over cap-and-trade is that you don't have to trade to get the efficient outcome, which means after an initial one-time cost to set the price (and occasional checks to reset the price) everyone can just do their own thing without having to coordinate with others. (Also, did most people who set a cap / budget then also trade? Seems pretty far from efficient if you neglect the "trade" part) I just checked, and it looks like we had ~0.3% of (estimated) exposure over the course of roughly a year. I think it's plausible though that we overestimated the risk initially and then failed to check later (in particular I think we used a too-high IFR, based on this comment).

At Event Horizon we had a policy for around 6-9 months where if you got a microcovid, you paid $1 to the house, and it was split between everyone else. Do whatever you like, we don't mind, as long as you bring a microcovid estimate and pay the house.
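A rough sketch of how such a policy could be settled, assuming each housemate reports their own microcovid estimate and that the payment for each exposure is split evenly among the other housemates. The function name, the example names, and the numbers are hypothetical; only the $1-per-microcovid price comes from the description above.

```python
def settle_microcovid_tax(exposures, price_per_microcovid=1.0):
    """Return each housemate's net transfer in dollars (positive = receives money).

    exposures maps housemate name -> self-reported microcovids incurred.
    Each person pays price * own microcovids, split evenly among the others.
    """
    n = len(exposures)
    if n < 2:
        raise ValueError("need at least two housemates to split payments")
    total = sum(exposures.values())
    net = {}
    for name, microcovids in exposures.items():
        paid = microcovids * price_per_microcovid
        # Share of everyone else's payments that flows to this person.
        received = (total - microcovids) * price_per_microcovid / (n - 1)
        net[name] = received - paid
    return net


# A certain infection is 1,000,000 microcovids, so $1 per microcovid
# implicitly prices one housemate's infection at $1,000,000.
IMPLIED_COST_OF_INFECTION = 1_000_000 * 1.0

if __name__ == "__main__":
    print(settle_microcovid_tax({"alice": 100, "bob": 0, "carol": 50}))
```

Transfers always sum to zero, so the house as a whole neither gains nor loses money; the scheme only moves money from people taking more risk to people bearing it.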

9Rohin Shah3y
Nice, that's identical to ours.
2Eli Tyre3y
Instrumental convergence!
6Ben Pace3y
Or just logical convergence. Two calculators get the same answer to 2 + 2 = 4, and it's not because they're both power-seeking.
4Eli Tyre3y
Good point. But in this case, you guys are both seeking utility, right? And that's what pushed you to some common behaviors?
8Matthew Barnett3y
That gives an implied cost of $1 million for someone getting COVID-19, which seems way overpriced to me. I thought I'd do a quick Fermi estimate to verify my intuitions. I don't know how many people are in Event Horizon, but I'll assume 15. Let's say that on average about 10 people will get COVID-19 if one person gets it, due to some people being able to isolate successfully. I'm going to assume that the average age there is about 30, and the IFR is roughly 0.02% based on this paper. That means roughly 0.002 expected deaths will result. I'll put the price of life at $10 million. I'll also assume that each person loses two weeks of productivity equivalent to a loss of $20 per hour for 80 hours = $1600, and I'll assume a loss of well-being equivalent to $10 per hour for 336 hours = $3360. Finally, I'll assume the costs of isolation are $1,000 per person. Together, this combines to $10M x 0.002 + ($1600 + $3360) x 10 + $1000 x 15 = $84,600. However, I didn't include the cost of long-covid, which could plausibly raise this estimate radically depending on your beliefs. But personally I'm already a bit skeptical that 15 people would be willing to collectively pay $84,600 to prevent an infection in their house with certainty, so I still feel my initial intuition was mostly justified.
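The arithmetic in the Fermi estimate above can be reproduced with a short script. This is only a sketch: every input is one of the assumptions stated in that comment, and the variable names are made up for illustration.

```python
# Inputs, all taken from the assumptions stated in the comment above.
house_size = 15                 # assumed number of housemates
people_infected = 10            # assumed infections following one case
ifr = 0.0002                    # 0.02% infection fatality rate
value_of_life = 10_000_000      # dollars

productivity_loss = 20 * 80     # $20/hour for 80 hours
wellbeing_loss = 10 * 336       # $10/hour for 336 hours (two weeks)
isolation_cost = 1_000          # dollars per housemate

expected_deaths = people_infected * ifr

total_cost = (
    value_of_life * expected_deaths
    + (productivity_loss + wellbeing_loss) * people_infected
    + isolation_cost * house_size
)

if __name__ == "__main__":
    print(f"expected deaths: {expected_deaths:.4f}")
    print(f"total cost: ${total_cost:,.0f}")
    print(f"cost per housemate: ${total_cost / house_size:,.0f}")
```

Changing `ifr`, `people_infected`, or the hourly values shows how sensitive the total is to each assumption; long COVID is deliberately left out, as in the original estimate.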
(I lived in this house) The estimate was largely driven by fear of long covid + a much higher value per hour of time, which also factored in altruistic benefits from housemate's work that aren't captured by the market price of their salary.  There were also about 8 of us, and we didn't assume everyone would get it conditional on infection (household attack rates are much lower than that, and you might have time to react and quarantine). We assumed maybe like 2-3 others. I totally expect we would have paid $84,600 to prevent a random one of us getting covid -- and it would've even looked like a pretty cheap deal compared to getting it! 
2Matthew Barnett3y
Makes sense, though FWIW I wasn't estimating their wage at $20 an hour. Most cases are mild, and so productivity won't likely suffer by much in most cases. I think even if the average wage there is $100 after taxes, per hour (which is pretty rich, even by Bay Area standards), my estimate is near the high end of what I'd expect the actual loss of productivity to be. Though of course I know little about who is there. ETA: One way of estimating "altruistic benefits from housemate's work that aren't captured by the market price of their salary" is to ask at what after-tax wage you'd be willing to work for a completely pointless project, like painting a wall, for 2 weeks. If it's higher than $100 an hour I commend those at Event Horizon for their devotion to altruism!
5Ben Pace3y
If it's 8 hour workdays and 5 days a week, at $100/hour that's 8 * 10 * 100 = $8k. No, you could not pay me $8k to stop working on the LW team for 2 weeks. I think $30k-$40k might make sense.
7Matthew Barnett3y
I'm kind of confused right now. At a mere $15k, you could probably get a pretty good software engineer to work for a month on any altruistic project you wish. I'm genuinely curious about why you think your work is so irreplaceable (and I'm not saying it isn't!).

You could certainly hire a good software engineer at that salary, but I don’t think you could give them a vision and network and trust them to be autonomous. Money isn’t the bottleneck there. Just because you have the funding to hire someone for a role doesn’t mean you can. Hiring is incredibly difficult. Go see YC on hiring, or PG.

Most founding startup people are worth way more than their salary.

2Rohin Shah3y
When my 15-person house did the calculation, we had a higher IFR estimate (I think 0.1%) and a 5x multiplier for long COVID, which gets you most of the way there. Not sure why we had a higher IFR estimate -- it might be because we made this estimate in ~June 2020 when we had worse data, or plausibly IFR was actually higher then, or we raised it to account for the fact that some people were immunocompromised. (Fwiw, at < $6000 per person that seems like a bargain to me.  At the full million, it would be ~$63,000 per person, which is now sounding iffy, but still plausible. Maybe it shouldn't be plausible given how low the IFR is -- 0.02% does feel quite a bit lower than I had been imagining.) Still, I think you shouldn't ask about paying large sums of money -- the utility-money curve is pretty sharply nonlinear as you get closer to 0 money, so the amount you'd pay to avoid a really bad thing is not 100x the amount you'd pay to avoid a 1% chance of that bad thing. (See also reply to TurnTrout below.) You could instead ask about how much people would have to be paid for someone with COVID to start living at the house; this still has issues with nonlinear utility-money curves, but significantly less so than in the case where they're paying. That is, would people accept a little under $6000 to have a COVID-infected person live with them?
2Matthew Barnett3y
Possibly my intuition here comes from seeing COVID-19 risks as not too dissimilar from other risks for young people, like drinking alcohol or doing recreational drugs, accidental injury in the bathroom, catching the common cold (which could have pretty bad long-term effects), kissing someone (and thereby risk getting HSV-1 or the Epstein–Barr virus), eating unhealthily, driving, living in an area with a high violent crime rate, insufficiently monitoring one's body for cancer, etc. I don't usually see people pay similarly large costs to avoid these risks, which naturally makes me think that people don't actually value their time or their life as much as they say.  One possibility is that everyone would start paying more to avoid these risks if they were made more aware of them, but I'm pretty skeptical. The other possibility seems more likely to me: value of life estimates are susceptible to idealism about how much people actually value their own life and time, and so when we focus on specific risk evaluations, we tend to exaggerate. ETA: Another possibility I didn't mention is that rationalists are just rich. But if this is the case, then why are they even in a group house? I understand the community aspect, but living in a group house is not something rich people usually do, even highly social rich people. Makes sense.
5Rohin Shah3y
So the $6000 cost is averting roughly 100 micromorts (~50% of catching it from the new person * 0.02% IFR), ignoring long COVID. Most of the things you list sound like < 1 micromort-equivalent per instance? That sounds pretty consistent. E.g. Suppose unhealthy eating knocks off ~5 years of lifespan (let's call that 10% as bad as death, i.e. 10^5 micromorts). You have 10^3 meals a year, times about 50 years, for 5 * 10^4 meals, so each meal is roughly 2 micromorts = $120 of cost. On this model, you should see people caring about their health, but not to an extraordinary degree, e.g. after getting the first 90% of benefit, then you stop (presumably you value a tasty meal at ~$12 more than a not-tasty meal, again thinking at the margin). And empirically that seems roughly right -- most of the people I know think about health, try to get good macronutrient profiles, take supplements where relevant, but they don't go around conducting literature reviews to figure out the optimal diet to consume. Also, I think partly you might be underestimating how risk-avoiding people at Event Horizon and my house are -- I'd say both houses are well above the typical rationalist. (And also that a good number of these people are in fact rich, if we count a typical software engineer as rich.) There's a pretty big culture difference between rationalists and stereotypical rich people. One of those is living in a group house. I currently prefer a group house over a traditional you-and-your-partner house regardless of how much money I have.
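The arithmetic in this comment can be checked with a few lines. This is only a sketch that restates the comment's own figures; the variable names are made up for illustration.

```python
MICROMORTS_PER_DEATH = 1_000_000

# Figures taken from the comment above.
p_catch = 0.5        # ~50% chance of catching it from the new person
ifr = 0.0002         # 0.02% infection fatality rate
payment = 6000       # dollars per person

micromorts_averted = p_catch * ifr * MICROMORTS_PER_DEATH
dollars_per_micromort = payment / micromorts_averted

# Unhealthy eating: ~5 years of lifespan lost, treated as 10% as bad as death.
eating_micromorts = 0.10 * MICROMORTS_PER_DEATH
meals = 1000 * 50    # ~10^3 meals a year for about 50 years
micromorts_per_meal = eating_micromorts / meals
cost_per_meal = micromorts_per_meal * dollars_per_micromort

print(f"micromorts averted:    {micromorts_averted:.0f}")
print(f"dollars per micromort: {dollars_per_micromort:.0f}")
print(f"micromorts per meal:   {micromorts_per_meal:.1f}")
print(f"implied cost per meal: ${cost_per_meal:.0f}")
```

The implied price of about $60 per micromort is what turns 2 micromorts per meal into the $120 figure.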
2Ben Pace3y
List of changes that stand out to me: * I ended up saying that long-covid costs were roughly the same as death, so it was a factor of 2x. * Price of a life at $10 million is a bit low, I put mine at $50 million, so a factor of 5x difference. I didn't follow all of your calculations about being out for 2 weeks and isolated, I basically just did those two (death and long covid) and it came to ~$200k for me. Roughly say that's the average among 5 people and then you get to $1 per microcovid to the house.
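As a rough check on the last step, the sketch below reproduces the $1 per microcovid figure. It assumes that a COVID case in the house is costed as if every housemate bears the average per-person cost, which is how the "average among 5 people" step reads; that reading and the variable names are assumptions of this example.

```python
MICROCOVIDS_PER_COVID = 1_000_000

cost_per_person = 200_000   # dollars, the ~$200k estimate from the comment
house_size = 5              # people, from the comment

# Assumption: one case is priced as the average cost summed over all housemates.
house_cost_per_case = cost_per_person * house_size
price_per_microcovid = house_cost_per_case / MICROCOVIDS_PER_COVID

print(f"house cost per case:  ${house_cost_per_case:,}")
print(f"price per microcovid: ${price_per_microcovid:.2f}")
```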

My best guess is that rationalists aren't that sane, especially when they've been locked up for a while and are scared and socially rewarding others being scared.

8Rohin Shah3y
8Matthew Barnett3y
Part of the issue is that there's rarely a natural way of pricing Pigouvian taxes. You can make price estimates based on how people hypothetically judge the harm to themselves, but there are always going to be huge disagreements. This flaw is a reasonable cause for concern. Suppose you were in a group house where half of the people worked remotely and the other half did not. The people who worked remotely might be biased (at least rhetorically) towards the proposition that the Pigouvian tax should be high, and the people who worked in person might be biased in the other direction. Why? Because if someone doesn't expect to have to pay the tax, but does expect to receive the revenue, they may be inclined to overestimate the harm of COVID-19, as a way of benefiting from the tax, and vice versa. With regard to carbon taxes, it's often true that policies sound like the "obvious" thing to do, but actually have major implementation flaws upon closer examination. This can help explain why societies don't do it, even if it seems rational. Noah Smith outlines the case against a carbon tax here. Of course, this argument shouldn't stop a perfectly altruistic community from implementing a carbon tax. But if the community were perfectly altruistic, the carbon tax would be unnecessary.
4Rohin Shah3y
Tbc, I'm pretty sympathetic to this response to the general class of arguments that "society is incompetent because they don't do X" (and it is the response I would usually make). Yeah, I agree that in theory this could be a reason not to do it (though similar arguments also apply to other methods, e.g. in a budgeting system, people with remote jobs can push for a lower budget). My real question though is: did people actually do this? Did they consider the possibility of a tax, discuss it, realize they couldn't come to an agreement on price, and then implement something else? If so, that would answer my question, but I don't think this is what happened.
2Matthew Barnett3y
Probably not, although they lived in a society in which the response "just use Pigouvian taxes" was not as salient as it otherwise could have been in their minds. This reduced saliency was, I believe, at least partly due to the fact that Pigouvian taxes have standard implementation issues. I meant to contribute one of these issues as a partial explanation, rather than respond to your question more directly.
4Rohin Shah3y
Makes sense, thanks. I still feel confused about why they weren't salient to EAs / rationalists, but I agree that the fact they aren't salient more broadly is something-like-a-partial-explanation.
TBH I think what made the uCOVID tax work was that once you did some math, it was super hard to justify levels that would imply anything like the existing risk-avoidance behaviour. So the "active ingredient" was probably just getting people to put numbers on the cost-benefit analysis. [context note: I proposed the EH uCOVID tax]
I feel like Noah's argument implies that states won't incur any costs to reduce CO2 emissions, which is wrong. IMO, the argument for a Pigouvian tax in this context is that for a given amount of CO2 reduction that you want, the tax is a cheaper way of getting it than e.g. regulating which technologies people can or can't use.
2Matthew Barnett3y
Since the argument about internalizing externalities fails in this case (as the tax is local), arguably the best way of modeling the problem is viewing each community as having some degree of altruism. Then, just as EAs might say “donate 10% of your income in a cause neutral way” the argument is that communities should just spend their “climate change money” reducing carbon in the way that’s most effective, even if it’s not rationalized in some sort of cost internalization framework. And Noah pointed out in his article (though not in the part I quoted) that R&D spending is probably more effective than imposing carbon taxes.
Note that a) some group houses just did this, b) a major answer for why people didn't do particularly novel things with microcovid was "by the time it came out, people were pretty exhausted out from covid negotiation, and doing whatever default thing was suggested was easier."
2Rohin Shah3y
a) Do you have a sense for the proportion of group houses that did it? And the proportion of group houses that seriously considered it? (My guess would be that 10-20% did it, and an additional 10% considered it.) Re: b) That does seem like a good chunk of the explanation, thanks. I do expect the Pigouvian tax would have been a better policy even prior to microcovid existing, given how much knowledge about COVID people had, so I'm still wondering why it wasn't considered even before microcovid existed. (I remember doing explicit risk calculations back in April / May 2020, and I think there's a good chance we would have implemented a similar Pigouvian tax system even without microcovid existing, with worse risk estimates.)
I actually guess even fewer houses than you're thinking did it (I think I only know of like 1-3). In my own house, where I think we could have come up with the Pigouvian tax, when we did all our initial negotiations in April, the thinking was "hunker down for a month while we wait to see how bad Covid actually is, to avoid tail risks of badness, and then re-evaluate", but it turned out that by the time we got to the "re-evaluate" step, people were burned out on negotiation.
4Rohin Shah3y
(So far we have 3 -- my house, Event Horizon, and Mark Xu's house, assuming that's not also Event Horizon.)
2Ben Pace3y
Mark Xu’s house is not EH.
I like this question. If I had to offer a response from econ 101:  Suppose people love eating a certain endangered species of whale, and that people would be sad if the whale went extinct, but otherwise didn't care about how many of these whales there were. Any individual consumer might reason that their consumption is unlikely to cause the whale to go extinct. We have a tragedy of the commons, and we need to internalize the negative externalities of whale hunting. However, the harm is discontinuous in the number of whales remaining: there's an irreversible extinction point. Therefore, Pigouvian taxes aren't actually a good idea because regulators may not be sure what the post-tax equilibrium quantity will be. If the quantity is too high, the whales go extinct. Therefore, a "cap and trade" program would work better: there are a set number of whales that can be killed each year, and firms trade "whale certificates" with each other. (And, IIRC, if # of certificates = post-tax equilibrium quantity, this scheme has the same effect as a Pigouvian tax of the appropriate amount.) Similarly: if I, a house member, am unsure about others' willingness to pay for risky activities, then maybe I want to cap the weekly allowable microcovids and allow people to trade them amongst themselves. This is basically a fancier version of "here's the house's weekly microcovid allowance" which I heard several houses used. I'm protecting myself against my uncertainty like "maybe someone will just go sing at a bar one week, and they'll pay me $1,000, but actually I really don't want to get sick for $1,000." (EDIT: In this case, maybe you need to charge more per microcovid? This makes me less confident in the rest of this argument.) There are a couple of problems with this argument. First, you said taxes worked fine for your group house, which somewhat (but not totally) discredits all of this theorizing. Second, (4) seems most likely. Otherwise, I feel like we might have heard about covid
4Rohin Shah3y
Yeah, this. The beautiful thing about microCOVIDs is that because they are probabilities, the goodness of an outcome really is linear in terms of microCOVIDs incurred, and so the "cost" of incurring a microCOVID is the same no matter "when" you incur it, so it's very easy to price. (Unlike the whale example, where the goodness of the outcome is not linear in the number of whales, and so killing a single whale has different costs depending on when exactly it happens.) You might still end up with nonlinear costs if your value of money is nonlinear on the relevant scale, e.g. maybe the first $1,000 is really great but the next $10,000 isn't 10x as great, and so you need to be paid more after the first $1,000 for the same number of microcovids, but I don't think this is really how people in our community feel? I guess another way you get nonlinear costs is if you really do need to incur some microcovids, and then the amount you pay matters a lot -- maybe the first $10 is fine, but then $1,000 isn't, because you don't have a huge financial buffer to draw from, so while the downside of a microcovid stays constant, the downside of paying money for it changes. I didn't get the sense that this would be a real problem for most group houses, since people were in general being very cautious and so wouldn't have paid much, but maybe it would have affected things. Partly for this reason and partly out of a sense of fairness, at my group house we didn't charge for "essential" microcovids, such as picking up drug prescriptions (assuming you couldn't get them delivered) or (in my case) an in-person appointment to get a visa.
Another way costs are nonlinear in uCOVIDs is if you think you'll probably get COVID.
2Rohin Shah3y
Yeah, fair point, the linearity only works as long as you expect probabilities to remain small. (Which, to be clear, is something you should expect, in the context of most EA / rationalist group houses.)
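A small sketch of where the linear approximation holds and where it breaks. It assumes exposures are independent events, which is a simplification of this example; the exposure schedules are hypothetical.

```python
import math

def exact_risk(microcovids):
    """Probability of at least one infection from independent exposures,
    each given in microCOVIDs (one-in-a-million chances)."""
    p_none = math.prod(1 - m / 1_000_000 for m in microcovids)
    return 1 - p_none

def linear_risk(microcovids):
    """The additive approximation: just sum the microCOVIDs."""
    return sum(microcovids) / 1_000_000

cautious = [200] * 52      # hypothetical: 200 microCOVIDs a week for a year
risky = [50_000] * 52      # hypothetical: 5% risk per week for a year

for label, exposures in [("cautious", cautious), ("risky", risky)]:
    print(f"{label}: exact={exact_risk(exposures):.4f} "
          f"linear={linear_risk(exposures):.4f}")
```

For the cautious schedule the two numbers nearly coincide, so each microCOVID can be priced the same. For the risky schedule the linear sum exceeds 1 while the true probability cannot, which is the regime where a marginal microCOVID is worth less because infection is already likely.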
5Mark Xu3y
My house implemented such a tax. Re 1, we ran into some of the issues Matthew brought up, but all other COVID policies are implicitly valuing risk at some dollar amount (possibly inconsistently), so the Pigouvian tax seemed like the best option available.
2Rohin Shah3y
Nice! And yeah, that matches my experience as well.
Carbon taxes are useful for market transactions. A lot of interactions within a group house aren't market transactions. Decisions about who brings out the trash aren't made through market mechanisms. Switching to making all the transactions in a group house market-based will create a lot of conflict and isn't just about how to deal with COVID-19.
Perhaps I don't follow. Why would you have to market-base "all the transactions in a group house", instead of just the COVID-19 ones?
Using a market-based mechanism in an environment where the important decisions are market-based is easier than introducing a market-based mechanism in an environment where most decisions are not. If you introduce a market-based mechanism around COVID-19 you get a result where rich members in the house can take more risk than the poorer ones, which goes against assumptions of equality between house members (and most group houses work on assumptions of equality).
3Rohin Shah3y
Personally, I don't really feel the force of this argument -- I feel like on either side I get a good deal (on the rich side, I get to do more things, on the poor side, I get paid more money than I would pay to avoid the risk). I agree other people feel the force of this though, and I don't really know why. (But like, also, shouldn't this apply to carbon taxes or all the other economic arguments that civilization is "insane" for not doing?) (Also also, don't we already see e.g. rich members getting larger, nicer rooms than poorer members? What's the difference?) (Chores are different in that they aren't a very big deal. If they are a big deal to you, then you hire a cleaner. If they're not a big enough deal that you'd hire a cleaner, then they're not a big enough deal to bother with a market, which does have transaction costs.) As a single data point, the COVID tax didn't create conflict in my group house (despite having non-trivial income inequality, and one of the richer housemates indeed taking on more risk than others), though admittedly my house is slightly more market-transaction-y than most.

What won't we be able to do by (say) the end of 2025? (See also this recent post.) Well, one easy way to generate such answers would be to consider tasks that require embodiment in the real world, or tasks that humans would find challenging to do. (For example, “solve the halting problem”, “produce a new policy proposal that has at least a 95% chance of being enacted into law”, “build a household robot that can replace any human household staff”.) This is cheating, though; the real challenge is in naming something where there’s an adjacent thing that _does...

4Daniel Kokotajlo3y
Nice! I really appreciate that you are thinking about this and making predictions. I want to do the same myself. I think I'd put something more like 50% on "Rohin will at some point before 2030 read an AI-written blog post on rationality that he likes more than the typical LW >30 karma post." That's just a wild guess, very unstable. Another potential prediction generation methodology: Name something that you think won't happen, but you think I think will.
6Rohin Shah3y
This seems more feasible, because you can cherrypick a single good example. I wouldn't be shocked if someone on LW spent a lot of time reading AI-written blog posts on rationality and posted the best one, and I liked that more than a typical >30 karma post. My default guess is that no one tries to do this, so I'd still give it < 50% (maybe 30%?), but conditional on someone trying I think probably 80% seems right. (EDIT: Rereading this, I have no idea whether I was considering a timeline of 2025 (as in my original comment) or 2030 (as in the comment I'm replying to) when making this prediction.) I spent a bit of time on this but I think I don't have a detailed enough model of you to really generate good ideas here :/ Otoh, if I were expecting TAI / AGI in 15 years, then by 2030 I'd expect to see things like: * An AI system that can create a working website with the desired functionality "from scratch" (e.g. a simple Twitter-like website, an application that tracks D&D stats and dice rolls for you, etc, a simple Tetris game with an account system, ...). The system allows even non-programmers to create these kinds of websites (so cannot depend on having a human programmer step in to e.g. fix compiler errors or issue shell commands to set up the web server). * At least one large, major research area in which human researcher productivity has been boosted 100x relative to today's levels thanks to AI. (In calculating the productivity we ignore the cost of running the AI system.) Humans can still be in the loop here, but the large majority of the work must be done by AIs. * An AI system gets 20,000 LW karma in a year, when limited to writing one article per day and responses to any comments it gets from humans. (EDIT: I failed to think about karma inflation when making this prediction and feel a bit worse about it now.) * Productivity tools like todo lists, memory systems, time trackers, calendars, etc are made effectively obsolete (or at least the user interfaces a
6Daniel Kokotajlo3y
Ah right, good point, I forgot about cherry-picking. I guess we could make it be something like "And the blog post wasn't cherry-picked; the same system could be asked to make 2 additional posts on rationality and you'd like both of them also." I'm not sure what credence I'd give to this but it would probably be a lot higher than 10%. Website prediction: Nice, I think that's like 50% likely by 2030. Major research area: What counts as a major research area? Suppose I go calculate that AlphaFold 2 has already sped up the field of protein structure prediction by 100x (don't need to do actual experiments anymore!), would that count? If you hadn't heard of AlphaFold yet, would you say it counted? Perhaps you could give examples of the smallest and easiest-to-automate research areas that you think have only a 10% chance of being automated by 2030. 20,000 LW karma: Holy shit that's a lot of karma for one year. I feel like it's possible that would happen before it's too late (narrow AI good at writing but not good at talking to people and/or not agenty) but unlikely. Insofar as I think it'll happen before 2030 it doesn't serve as a good forecast because it'll be too late by that point IMO. Productivity tool UIs obsolete thanks to assistants: This is a good one too. I think that's 50% likely by 2030. I'm not super certain about any of these things of course, these are just my wild guesses for now.
6Rohin Shah3y
I was thinking 365 posts * ~50 karma per post gets you most of the way there (18,250 karma), and you pick up some additional karma from comments along the way.  50 karma posts are good but don't have to be hugely insightful; you can also get a lot of juice by playing to the topics that tend to get lots of upvotes. Unlike humans the bot wouldn't be limited by writing speed (hence my restriction of one post per day). AI systems should be really, really good at writing, given how easy it is to train on text. And a post is a small, self-contained thing, that takes not very long to create (i.e. it has short horizons), and there are lots of examples to learn from. So overall this seems like a thing that should happen well before TAI / AGI. I think I want to give up on the research area example, seems pretty hard to operationalize. (But fwiw according to the picture in my head, I don't think I'd count AlphaFold.)
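The back-of-the-envelope number above, spelled out as a tiny sketch. The per-post figure is the comment's own guess, and the variable names are illustrative.

```python
posts_per_year = 365     # one post per day, per the restriction in the prediction
karma_per_post = 50      # the "~50 karma per post" guess from the comment
target_karma = 20_000

karma_from_posts = posts_per_year * karma_per_post
shortfall = target_karma - karma_from_posts   # would need to come from comments

print(f"karma from posts alone: {karma_from_posts:,}")
print(f"left to earn from comments: {shortfall:,}")
```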

OK, fair enough. But what if it writes, like, 20 posts in the first 20 days which are that good, but then afterwards it hits diminishing returns because the rationality-related points it makes are no longer particularly novel and exciting? I think this would happen to many humans if they could work at super-speed.

That said, I don't think this is that likely I guess... probably AI will be unable to do even three such posts, or it'll be able to generate arbitrary numbers of them. The human range is small. Maybe. Idk.

4Rohin Shah3y
I'd be pretty surprised if that happened. GPT-3 already knows way more facts than I do, and can mimic far more writing styles than I can. It seems like by the time it can write any good posts (without cherrypicking), it should quickly be able to write good posts on a variety of topics in a variety of different styles, which should let it scale well past 20 posts. (In contrast, a specific person tends to write on 1-2 topics, in a single style, and not optimizing that hard for karma, and many still write tens of high-scoring posts.)

Consider the latest AUP equation, where for simplicity I will assume a deterministic environment and that the primary reward depends only on state. Since there is no auxiliary reward any more, I will drop the subscripts to on and .

Consider some starting state , some starting action , and consider the optimal trajectory under that starts with that, which we'll denote as . Define ...

Suppose you have some deep learning model M_orig that you are finetuning to avoid some particular kind of failure. Suppose all of the following hold:

  1. Capable model: The base model has the necessary capabilities and knowledge to avoid the failure.
  2. Malleable motivations: There is a "nearby" model M_good (i.e. a model with minor changes to the weights relative to the M_orig) that uses its capabilities to avoid the failure. (Combined with (1), this means it behaves like M_orig except in cases that show the failure, where it does something better.)
  3. Stron
...
I agree that 1.+2. are not the problem. I see 3. as more of a longer-term issue for reflective models, and the current problems as lying in 4. and 5. 3. I don't know about "the shape of the loss landscape" but there will be problems with "the developers wrote correct code" because "correct" here includes that it doesn't have side-effects that the model can self-exploit (though I don't think this is the biggest problem). 4. Correct rewards means two things: * a) That there is actual and sufficient reward for correct behavior. I think that was not the case with Bing. * b) That we understand all the consequences of the reward - at least sufficiently to avoid goodharting but also long-term consequences. It seems there was more work on a) with ChatGPT, but there was goodharting and even with ChatGPT one can imagine a lot of value lost due to exclusion of human values. 5. It seems clear that the ChatGPT training didn't include enough exploration and with smarter models that have access to their own output (Bing) there will be incredible amounts of potential failure modes. I think that an adversarial mindset is needed to come up with ways to limit the exploration space drastically.

The LESS is More paper (summarized in AN #96) makes the claim that using the Boltzmann model in sparse regions of demonstration-space will lead to the Boltzmann model over-learning. I found this plausible but not obvious, so I wanted to check it myself. (Partly I got nerd-sniped, partly I do want to keep practicing my ability to tell when things are formalizable theorems.) This benefited from discussion with Andreea (one of the primary authors).

Let's consider a model where there are clusters , where each cluster contains trajectories whose feature...

I was reading Avoiding Side Effects By Considering Future Tasks, and it seemed like it was doing something very similar to relative reachability. This is an exploration of that; it assumes you have already read the paper and the relative reachability paper. It benefitted from discussion with Vika.

Define the reachability , where  is the optimal policy for getting from to , and is the length of the trajectory. This is the notion of reachability both in the original paper and the new ...

I often search through the Alignment Newsletter database to find the exact title of a relevant post (so that I can link to it in a new summary), often reading through the summary and opinion to make sure it is the post I'm thinking of.

Frequently, I read the summary normally, then read the first line or two of the opinion and immediately realize that it wasn't written by me.

This is kinda interesting, because I often don't know what tipped me off -- I just get a sense of "it doesn't sound like me". Notably, I usually do agree with the opinion, so it isn't ab...

This?: Or something in here?:
2Rohin Shah3y
Yes (or more specifically, the private version from which that public one is automatically created).
1Josh Jacobson3y
How confident are you that this isn’t just memory? I personally think that upon rereading writing, it feels significantly more familiar if I wrote it than if I read and edited it. A piece of this is likely style, but I think much of it is the memory of having generated and more closely considered it.
2Rohin Shah3y
It's plausible, though note I've probably summarized over a thousand things at this point so this is quite a demand on memory. But even so it still doesn't explain why I don't notice while reading the summary but do notice while reading the opinion. (Both the summary and opinion were written by someone else in the motivating example, but I only noticed from the opinion.)
3Josh Jacobson3y
Ah, this helps clarify. My hypotheses are then: 1. Even if you "agree" with an opinion, perhaps you're highly attuned, but in a possibly not straightforward conscious way, to even mild (e.g. 0.1%) levels of disagreement. 2. Maybe the word choice you use for summaries is much more similar to others vs the word choice you use for opinions. 3. Perhaps there's just a time lag, such that you're starting to feel like a summary isn't written by you but only realize by the time you get to the later opinion. #3 feels testable if you're so inclined.
3Rohin Shah3y
(Not that inclined currently, but I do agree that all of these hypotheses are plausible)

The LCA paper (to be summarized in AN #98) presents a method for understanding the contribution of specific updates to specific parameters to the overall loss. The basic idea is to decompose the overall change in training loss across training iterations:

And then to decompose training loss across specific parameters:

...

In my double descent newsletter, I said:

This fits into the broader story being told in other papers that what's happening is that the data has noise and/or misspecification, and at the interpolation threshold it fits the noise in a way that doesn't generalize, and after the interpolation threshold it fits the noise in a way that does generalize. [...]

This explanation seems like it could explain double descent on model size and double descent on dataset size, but I don't see how it would explain double descent on training time. This would imply that gradien
...