Joe Carlsmith

Senior research analyst at Open Philanthropy. Doctoral student in philosophy at the University of Oxford. Opinions my own.

Wiki Contributions


Yeah, as I say, I think "neither innocent nor guilty" is least misleading -- but I find "innocent" an evocative frame. Do you have suggestions for an alternative to "selfish"?

Is the argument here supposed to be particular to meta-normativity, or is it something more like "I generally think that there are philosophy facts, those seem kind of a priori-ish and not obviously natural/normal, so maybe a priori normative facts are OK too, even if we understand neither of them"? 

Re: meta-philosophy, I tend to see philosophy as fairly continuous with just "good, clear thinking" and "figuring out how stuff hangs together," but applied in a very general way that includes otherwise confusing stuff. I agree various philosophical domains feel pretty a priori-ish, and I don't have a worked out view of a priori knowledge, especially synthetic a priori knowledge (I tend to expect us to be able to give an account of how we get epistemic access to analytic truths). But I think I basically want to make the same demands of other a priori-ish domains that I do normativity. That is, I want the right kind of explanatory link between our belief formation and the contents of the domain -- which, for "realist" construals of the domain, I expect to require that the contents of the domain play some role in explaining our beliefs. 

Re: the relationship between meta-normativity and normativity in particular, I wonder if a comparison to the relationship between "meta-theology" and "theology" might be instructive here. I feel like I want to be fairly realist about certain "meta-theological facts" like "the God of Christianity doesn't exist" (maybe this is just a straightforward theological fact?). But this doesn't tempt me towards realism about God. Maybe talking about normative "properties" instead of normative facts would be easier here, since one can imagine e.g. a nihilist denying the existence of normative properties, but accepting some 'normative' (meta-normative?) facts like "there is no such thing as goodness" or "pleasure is not good."

Reviewers ended up on the list via different routes. A few we solicited specifically because we expected them to have relatively well-developed views that disagree with the report in one direction or another (e.g., more pessimistic, or more optimistic), and we wanted to understand the best objections in this respect. A few came from trying to get information about how generally thoughtful folks with different backgrounds react to the report. A few came from sending a note to GPI saying we were open to GPI folks providing reviews. And a few came via other miscellaneous routes. I’d definitely be interested to see more reviews from mainstream ML researchers, but understanding how ML researchers in particular react to the report wasn’t our priority here.

Cool, these comments helped me get more clarity about where Ben is coming from. 

Ben, I think the conception of planning I’m working with is closest to your “loose” sense. That is, roughly put, I think of planning as happening when (a) something like simulations are happening, and (b) the output is determined (in the right way) at least partly on the basis of those simulations (this definition isn’t ideal, but hopefully it’s close enough for now). Whereas it sounds like you think of (strict) planning as happening when (a) something like simulations are happening, and (c) the agent’s overall policy ends up different (and better) as a result. 

What’s the difference between (b) and (c)? One operationalization could be: if you gave an agent input 1, then let it do its simulations thing and produce an output, then gave it input 1 again, could the agent’s performance improve, on this round, in virtue of the simulation-running that it did on the first round? On my model, this isn’t necessary for planning; whereas on yours, it sounds like it is? 

Let’s say this is indeed a key distinction. If so, let’s call my version “Joe-planning” and your version “Ben-planning.” My main point re: feedforward neural network was that they could do Joe-planning in principle, which it sounds like you think at least conceivable. I agree that it seems tough for shallow feedforward networks to do much of Joe-planning in practice. I also grant that when humans plan, they are generally doing Ben-planning in addition to Joe-planning (e.g., they’re generally in a position to do better on a given problem in virtue of having planned about that same problem yesterday).

Seems like key questions re: the connection to AI X-risk include:

  1. Is there reason to think a given type of planning especially dangerous and/or relevant to the overall argument for AI X-risk?
  2. Should we expect that type of planning to be necessary for various types of task performance?

Re: (1), I do think Ben-planning poses dangers that Joe-planning doesn’t. Notably, Ben planning does indeed allow a system to improve/change its policy "on its own" and without new data, whereas Joe planning need not — and this seems more likely to yield unexpected behavior. This seems continuous, though, with the fact that a Ben-planning agent is learning/improving its capabilities in general, which I flag separately as an important risk factor.

Another answer to (1), suggested by some of your comments, could appeal to the possibility that agents are more dangerous when you can tweak a single simple parameter like “how much time they have to think” or “search depth” and thereby get better performance (this feels related to Eliezer’s worries about “turning up the intelligence dial” by “running it with larger bounds on the for-loops”). I agree that if you can just “turn up the intelligence dial,” that is quite a bit more worrying than if you can’t — but I think this is fairly orthogonal to the Joe-planning vs. Ben-planning distinction. For example, I think you can have Joe-planning agents where you can increase e.g. their search depth by tweaking a single parameter, and you can have Ben-planning agents where the parameters you’d need to tweak aren’t under your control (or the agent’s control), but rather are buried inside some tangled opaque neural network you don't understand.

The central reason I'm interested in Joe-planning, though, is that I think the instrumental convergence argument makes the most sense if Joe-planning is involved -- e.g., if the agent is running simulations that allow it to notice and respond to incentives to seek power (there are versions of the argument that don't appeal to Joe-planning, but I like these less -- see discussion in footnote 87 here). It's true that you can end up power-seeking-ish via non-Joe-planning paths (for example, if in training you developed sphex-ish heuristics that favor power-seeking-ish actions); but when I actually imagine AI systems that end up power-seeking, I imagine it happening because they noticed, in the course of modeling the world in order to achieve their goals, that power-seeking (even in ways humans wouldn't like) would help.

Can this happen without Ben-planning? I think it can. Suppose, for example, that none of your previous Joe-planning models were power-seeking. Then, you train a new Joe-planner, who can run more sophisticated simulations. On some inputs, this Joe-planner realizes that power-seeking is advantageous, and goes for it (or starts deceiving you, or whatever).

Re: (2), for the reasons discussed in section 3.1, I tend to see Joe-planning as pretty key to lots of task-performance — though I acknowledge that my intuitions are surprised by how much it looks like you can do via something more intuitively “sphexish.” And I acknowledge that some of those arguments may apply less to Ben-planning. I do think this is some comfort, since agents that learn via planning are indeed scarier. But I am separately worried that ongoing learning will be very useful/incentivized, too.

I’m glad you think it’s valuable, Ben — and thanks for taking the time to write such a thoughtful and detailed review. 

I’m sympathetic to the possibility that the high level of conjuctiveness here created some amount of downward bias, even if the argument does actually have a highly conjunctive structure.” 

Yes, I am too. I’m thinking about the right way to address this going forward. 

I’ll respond re: planning in the thread with Daniel.

(Note that my numbers re: short-horizon systems + 12 OOMs being enough, and for +12 OOMs in general, changed since an earlier version you read, to 35% and 65% respectively.)

Thanks for these comments.

that suggests that CrystalNights would work, provided we start from something about as smart as a chimp. And arguably OmegaStar would be about as smart as a chimp - it would very likely appear much smarter to people talking with it, at least.

"starting with something as smart as a chimp" seems to me like where a huge amount of the work is being done, and if Omega-star --> Chimp-level intelligence, it seems a lot less likely we'd need to resort to re-running evolution-type stuff. I also don't think "likely to appear smarter than a chimp to people talking with it" is a good test, given that e.g. GPT-3 (2?) would plausibly pass, and chimps can't talk. 

"Do you not have upwards of 75% credence that the GPT scaling trends will continue for the next four OOMs at least? If you don't, that is indeed a big double crux." -- Would want to talk about the trends in question (and the OOMs -- I assume you mean training FLOP OOMs, rather than params?). I do think various benchmarks are looking good, but consider e.g. the recent Gopher paper

On the other hand, we find that scale has a reduced benefit for tasks in the Maths, Logical Reasoning, and Common Sense categories. Smaller models often perform better across these categories than larger models. In the cases that they don’t, larger models often don’t result in a performance increase. Our results suggest that for certain flavours of mathematical or logical reasoning tasks, it is unlikely that scale alone will lead to performance breakthroughs. In some cases Gopher has a lower performance than smaller models– examples of which include Abstract Algebra and Temporal Sequences from BIG-bench, and High School Mathematics from MMLU.

(Though in this particular case, re: math and logical reasoning, there are also other relevant results to consider, e.g. this and this.) 

It seems like "how likely is it that continuation of GPT scaling trends on X-benchmarks would result in APS-systems" is probably a more important crux, though?

Re: your premise 2, I had (wrongly, and too quickly) read this as claiming "if you have X% on +12 OOMs, you should have at least 1/2*X% on +6 OOMs," and log-uniformity was what jumped to mind as what might justify that claim. I have a clearer sense of what you were getting at now, and I accept something in the vicinity if you say 80% on +12 OOMs (will edit accordingly). My +12 number is lower, though, which makes it easier to have a flatter distribution that puts more than half of the +12 OOM credence above +6. 

The difference between 20% and 50% on APS-AI by 2030 seems like it could well be decision-relevant to me (and important, too, if you think that risk is a lot higher in short-timelines worlds). 

Load More