All of Chris van Merwijk's Comments + Replies

One of the more interesting dynamics of the past eight-or-so years has been watching a bunch of the people who [taught me my values] and [served as my early role models] and [were presented to me as paragons of cultural virtue] going off the deep end.

I'm curious who these people are.

We should expect regression towards the mean only if the tasks were selected for having high "improvement from small to Gopher-7". Were they?

The reasoning was given in the comment prior to it, that we want fast progress in order to get to immortality sooner.

"But yeah, I wish this hadn't happened."

Who else is gonna write the article? My sense is that no one (including me) is starkly stating publicly the seriousness of the situation.

"Yudkowsky is obnoxious, arrogant, and most importantly, disliked, so the more he intertwines himself with the idea of AI x-risk in the public imagination, the less likely it is that the public will take those ideas seriously"
 

I'm worried about people making character attacks on Yudkowsky (or other alignment researchers) like this. I think the people who think they can ... (read more)

4 · Daniel Kokotajlo · 3mo
I agree that there's a need for this sort of thing to be said loudly. (I've been saying similar things publicly, in the sense of anyone-can-go-see-that-I-wrote-it-on-LW, but not in the sense of putting it into major news outlets that are likely to get lots of eyeballs) I do agree with that. I think Yudkowsky, despite his flaws,* is a better human being than most people, and a much better rationalist/thinker. He is massively underrated. However, given that he is so disliked, it would be good if the Public Face of AI Safety was someone other than him, and I don't see a problem with saying so. (*I'm not counting 'being disliked' as a flaw btw, I do mean actual flaws--e.g. arrogance, overconfidence.)

"We finally managed to solve the problem of deceptive alignment while being capabilities competitive"

??????

-1 · Noosphere89 · 3mo
Good question to ask, and I'll explain. So one of the prerequisites of deceptive alignment is that it optimizes for non-myopic goals. In particular, these are goals that are about the long term. So in order to avoid deceptive alignment, one must find a goal that is both myopic and ideally scales to arbitrary capabilities. And in a sense, that's what Pretraining from Human Feedback found, in that the goal of cross-entropy from a feedback-annotated webtext distribution is a myopic goal, and it's either on the capabilities frontier or outright the optimal goal for AIs. In particular, they have way less alignment taxes than other schemes. In essence, the goal avoids deceptive alignment by removing one of the prerequisites of deceptive alignment. At the very least, it doesn't incentivize deceptive alignment.
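To illustrate the sense in which token-level cross-entropy is "myopic" (my own toy sketch, not the actual Pretraining from Human Feedback objective — the function name and tiny vocabulary are made up for illustration): the loss at each position depends only on the model's predicted distribution for that token, with no term that depends on future tokens or long-run outcomes.

```python
import math

def cross_entropy(predicted_probs, target_tokens):
    # predicted_probs: one dict (token -> probability) per position
    # target_tokens: the observed tokens at those positions
    # Each position contributes -log p(token) independently: nothing in the
    # loss rewards steering later tokens, which is the "myopia" in question.
    return -sum(math.log(dist[tok])
                for dist, tok in zip(predicted_probs, target_tokens)) / len(target_tokens)

dists = [{"a": 0.9, "b": 0.1}, {"a": 0.2, "b": 0.8}]
loss = cross_entropy(dists, ["a", "b"])
```

A model that sacrificed loss now to set up easier predictions later would simply score worse on this objective, position by position.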

"But I don't think you even need Eliezer-levels-of-P(doom) to think the situation warrants that sort of treatment."

Agreed. If a new state develops nuclear weapons, this isn't even close to creating a 10% x-risk, yet the idea of airstrikes on nuclear enrichment facilities, even though it is very controversial, has for a long time very much been an option on the table.

"if I thought the chance of doom was 1% I'd say "full speed ahead!"

This is not a reasonable view. Not on Longtermism, nor on mainstream common sense ethics. This is the view of someone willing to take unacceptable risks for the whole of humanity. 

2 · pseud · 3mo
Why not ask him for his reasoning, then evaluate it? If a person thinks there's 10% x-risk over the next 100 years if we don't develop superhuman AGI, and only a 1% x-risk if we do, then he'd suggest that anybody in favour of pausing AI progress was taking "unacceptable risks for the whole of humanity".

Also, there is a big difference between "Calling for violence", and "calling for the establishment of an international treaty, which is to be enforced by violence if necessary". I don't understand why so many people are muddling this distinction.

You are muddling the meaning of "pre-emptive war", or even "war". I'm not trying to diminish the gravity of Yudkowsky's proposal, but a missile strike on a specific compound known to contain WMD-developing technology is not a "pre-emptive war" or "war". Again I'm not trying to diminish the gravity, but this seems like an incorrect use of the term.

"For instance, personally I think the reason so few people take AI alignment seriously is that we haven't actually seen anything all that scary yet. "

And if this "actually scary" thing happens, people will know that Yudkowsky wrote the article beforehand, and they will know who the people are that mocked it.

I agree. Though is it just the limited context window that causes the effect? I may be mistaken, but from my memory it seems like they emerge sooner than you would expect if this was the only reason (given the size of the context window of gpt3).

2 · abramdemski · 3mo
A good question. I've never seen it happen myself; so from where I'm standing, it looks like short emergence examples are cherry-picked.

Therefore, the waluigi eigen-simulacra are attractor states of the LLM

It seems to me like this informal argument is a bit suspect. Actually I think this argument would not apply to Solomonoff induction.

Suppose we have two programs that define distributions over bitstrings. Suppose p1 assigns uniform probability to each bitstring, while p2 assigns 100% probability to the string of all zeroes (equivalently, p1 samples each bit i.i.d. Bernoulli(1/2) from {0,1}, while p2 samples 0 i.i.d. with probability 100%).


Suppose we use a perfect Bayesian reasoner to sample bitstrings, bu... (read more)
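The setup above can be sketched numerically (my own illustration of the standard Bayesian-mixture calculation, not code from the original comment; `posterior_p2` and the 50/50 prior are assumptions): updating a mixture of p1 and p2 on observed zeroes drifts the posterior toward p2 only as evidence accumulates, and the posterior is a martingale in expectation — which is the sense in which a perfect Bayesian reasoner has no built-in drift toward an "attractor".

```python
def posterior_p2(n_zeros, prior_p2=0.5):
    # p1 assigns probability 2^-n to any specific n-bit string;
    # p2 assigns probability 1 to the all-zeroes string.
    like_p1 = 0.5 ** n_zeros
    like_p2 = 1.0
    num = prior_p2 * like_p2
    return num / (num + (1 - prior_p2) * like_p1)

# Posterior on p2 after 0, 1, 2, ... observed zeroes: 0.5, 2/3, 0.8, ...
probs = [posterior_p2(n) for n in range(5)]

# Martingale check: the expected posterior after one more bit equals the
# current posterior (a single observed 1 sends the posterior on p2 to 0).
q = posterior_p2(3)
p_next_zero = q * 1.0 + (1 - q) * 0.5   # predictive probability of a 0
expected_next = p_next_zero * posterior_p2(4) + (1 - p_next_zero) * 0.0
assert abs(expected_next - q) < 1e-12
```

So under the predictive distribution there is no expected movement toward p2; the mass only shifts conditional on the actual observations.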

2 · Cleo Nardo · 3mo
Yep, you're correct. The original argument in the Waluigi mega-post was sloppy.
  • If μ updated the amplitudes in a perfectly Bayesian way and the context window was infinite, then the amplitudes of each premise must be a martingale. But the finite context breaks this. [https://www.lesswrong.com/posts/7qSHKYRnqyrumEfbt/remarks-1-18-on-gpt-compressed#Remark_9__Amplitudes_are_approximately_martingale]
  • Here is a toy model which shows how the finite context window leads to the Waluigi Effect. [https://www.lesswrong.com/posts/7qSHKYRnqyrumEfbt/remarks-1-18-on-gpt-compressed#6_B__A_waluigi_toy_model] Basically, the finite context window biases the Dynamic LLM towards premises which can be evidenced by short strings (e.g. waluigis), and biases away from premises which can't be evidenced by short strings (e.g. luigis).
  • Regarding your other comment, a long context window doesn't mean that the waluigis won't appear quickly. Even with an infinite context window, the waluigi might appear immediately. The assumption that the context window is short/finite is only necessary to establish that the waluigi is an absorbing state but luigi isn't.
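The absorbing-state point can be seen in a minimal two-state sketch (my own illustration, not the toy model from the linked post; the states, `eps`, and the function are made up): if the "luigi" state leaks into "waluigi" with some small probability per step and "waluigi" never transitions back, then the probability of having been absorbed tends to 1, however small the per-step leak.

```python
def p_waluigi(steps, eps=0.01):
    # Two-state Markov chain: from "luigi", slip to "waluigi" w.p. eps;
    # from "waluigi", stay forever (absorbing). Returns the probability of
    # being in the absorbing state after `steps` steps, starting in "luigi".
    p = 0.0
    for _ in range(steps):
        p = p + (1 - p) * eps   # mass leaking into the absorbing state
    return p

# Closed form: 1 - (1 - eps)^steps, which tends to 1 as steps grow.
```

This is exactly the asymmetry in the comment above: no per-step probability is needed for "waluigi" to be likely eventually, only that the reverse transition is unavailable.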

Linking to my post about Dutch TV: https://www.lesswrong.com/posts/TMXEDZy2FNr5neP4L/datapoint-median-10-ai-x-risk-mentioned-on-dutch-public-tv

"When LessWrong was ~dead"

Which year are you referring to here?

4 · Ben Pace · 3mo
2016–17. Added: To give context, here's a list of number of LW posts by year:
  • 2009: 852
  • 2010: 1143
  • 2011: 3002
  • 2012: 2583
  • 2013: 1973
  • 2014: 1797
  • 2015: 2002 (← This should be ~1880, as we added all ~120 HPMOR posts and backdated them to 2015)
  • 2016: 1303 (← This is the most 'dead' year according to me, and the year with the fewest posts)
  • 2017: 1671 (← LW 2.0 revived in the second half of this year)
  • 2018: 1709
  • 2019: 2121
  • 2020: 3099
  • 2021: 3230
  • 2022: 4538
  • First quarter of 2023: 1436; if you 4x that it is 5744

(My, it's getting to be quite a lot of posts these days.)

A lot of people in AI Alignment I've talked to have found it pretty hard to have clear thoughts in the current social environment, and many of them have reported that getting out of Berkeley, or getting social distance from the core of the community has made them produce better thoughts.

What do you think is the mechanism behind this?

I think the biggest thing is a strong, high-stakes but still quite ambiguous status-hierarchy in the Bay Area.

I think there are lots of contributors to this, but I definitely feel a very huge sense of needing to adopt certain views, to display "good judgement", and to conform to a bunch of epistemic and moral positions in order to operate in the space. This is particularly harsh since the fall of FTX: with funding being less abundant and a lot of projects more in peril, the stakes of being perceived as reasonable and competent by a very messy, and in substantial part social, process are even higher.

There is a general phenomenon where:

  • Person A has mental model X and tries to explain X with explanation Q
  • Person B doesn't get model X from Q, thinks a bit, and then writes explanation P, reads P and thinks: P is how it should have been explained all along, and Q didn't actually contain the insights, but P does.
  • Person C doesn't get model X from P, thinks a bit, and then writes explanation R, reads R and thinks: ...

It seems to me quite likely that you are person B, thinking they explained something because THEY think their explanation is very good and contai... (read more)

2 · TurnTrout · 16d
I want to note that I just reread Utility ≠ Reward [https://www.lesswrong.com/posts/bG4PR9uSsZqHg2gYY/utility-reward] and was pleasantly surprised by its treatment, as well as the hedges. I'm making an upwards update on these points having been understood by at least some thinkers, although I've also made a lot of downward updates for other reasons.

Very late reply, sorry.

"even though reward is not a kind of objective", this is a terminological issue. In my view, calling a "antecedent-computation reinforcement criterion" an "objective" matches my definition of "objective", and this is just a matter of terminology. The term "objective" is ill-defined enough that "even though reward is not a kind of objective" is a terminological claim about objective, not a claim about math/the world.

The idea that RL agents "reinforce antecedent computations" is completely core to our story of deception. You could not ... (read more)

2 · TurnTrout · 3mo
I think most ML practitioners do have implicit models of how reward chisels computation into agents, as seen with how they play around with e.g. reward shaping and such. It's that I don't perceive this knowledge to be engaged when some people reason about "optimization processes" and "selecting for high-reward models" on e.g. LW. I just continue to think "I wouldn't write RFLO the way it was written, if I had deeply and consciously internalized the lessons of OP", but it's possible this is a terminological/framing thing. Your comment does update me some, but I think I mostly retain my view here. I do totally buy that you all had good implicit models of the reward-chiseling point. FWIW, I think a bunch of my historical frustration here has been an experience of:
  1. Pointing out the "reward chisels computation" point
  2. Having some people tell me it's obvious, or already known, or that they already invented it
  3. Seeing some of the same people continue making similar mistakes (according to me)
  4. Not finding instances of other people making these points before OP
  5. Continuing (AFAICT) to correct people on (what I claim to be) mistakes around reward and optimization targets, and (for a while) being ~the only one doing so.
     1. If I found several comments explaining what is clearly the "reward chisels computation" point, where the comments were posted before this post, by people who weren't me or downstream of my influence, I would update against my points being novel and towards my points using different terminology.
     2. IIRC there's one comment from Wei_Dai from a few years back in this vein, but IDK of others.

The core point in this post is obviously correct, and yes people's thinking is muddled if they don't take this into account. This point is core to the Risks from learned optimization paper (so it's not exactly new, but it's good if it's explained in different/better ways).

Is the following a typo?
"So, the  ( works"

It's the first sentence of "CoCo Equilibria".

2 · Diffractor · 3mo
It was a typo! And it has been fixed.

Maybe you have made a gestalt-switch I haven't made yet, or maybe yours is a better way to communicate the same thing, but: the way I think of it is that the reward function is just a function from states to numbers, and the way the information contained in the reward function affects the model parameters is via reinforcement of pre-existing computations.

Is there a difference between saying:

  • A reward function is an objective function, but the only way that it affects behaviour is via reinforcement of pre-existing computations in the model, and it doesn't ac
... (read more)
4 · TurnTrout · 10mo
Where did RFLO point it out? RFLO talks about a mesa objective being different from the "base objective" (even though reward is not a kind of objective). IIRC on my skim most of the arguments were non-mechanistic reasoning about what gets selected for. (Which isn't a knockdown complaint, but those arguments are also not about the mechanism.) Also see my comment to Evan. [https://www.lesswrong.com/posts/pdaGN6pQyQarFHXF4/reward-is-not-the-optimization-target?commentId=zcv9YsJ8EWxdn9hM8] Like, from my POV, people are reliably reasoning about what RL "selects for" via "lots of optimization pressure" on "high reward by the formal metric", but who's reasoning about what kinds of antecedent computations get reinforced when credit assignment activates? Can you give me examples of anyone else spelling this out in a straightforward fashion? Yeah, I think it just doesn't communicate the mechanistic understanding (not even imperfectly, in most cases, I imagine). From my current viewpoint, I just wouldn't call reward an objective at all, except in the context of learned antecedent-computation-reinforcement terminal values. It's like if I said "My cake is red" when the cake is blue, I guess? IMO it's just not how to communicate the concept. Why is this reasonable? 

It seems to me that the basic conceptual point made in this post is entirely contained in our Risks from Learned Optimization paper. I might just be missing a point. You've certainly phrased things differently and made some specific points that we didn't, but am I just misunderstanding something if I think the basic conceptual claims of this post (which seems to be presented as new) are implied by RFLO? If not, could you state briefly what is different?

(Note I am still surprised sometimes that people still think certain wireheading scenarios make sense despite them having read RFLO, so it's plausible to me that we really didn't communicate everything that's in my head about this.)

4 · TurnTrout · 10mo
"Wireheading is improbable" is only half of the point of the essay. The other main point is "reward functions are not the same type of object as utility functions." I haven't reread all of RFLO recently, but on a skim, RFLO consistently talks about reward functions as "objectives", which is reasonable parlance, given that everyone else uses it, but I don't find that terminology very useful for thinking about what kinds of inner cognition will be developed in the network. Reward functions + environmental data provide a series of cognitive updates to the network, in the form of reinforcement schedules. The reward function is not necessarily an 'objective' at all. (You might have privately known about this distinction. Fine by me! But I can't back it out from a skim of RFLO, even already knowing the insight and looking for it.)

"I think in the defense-offense case the actions available to both sides are approximately the same"

If attacker has the action "cause a 100% lethal global pandemic" and the defender has the task "prevent a 100% lethal global pandemic", then clearly these are different problems, and it is a thesis, a thing to be argued for, that the latter requires largely the same skills/tech as the former (which is what this offense-defense symmetry thesis states). 

If you build an OS that you're trying to make safe against attacks, you might do e.g. what the seL4 mic... (read more)

2 · Rohin Shah · 1y
Yes, all of that mostly sounds right to me. I agree the formal strategy-stealing argument relies on literal symmetry; I would say the linked post is applying it to asymmetric situations, where you can recover something roughly symmetric, by assuming that both players need to first accumulate resources and power. (I think this is basically what you said.)

Kind of a delayed response, but: Could you clarify what you think is the relation between that post and mine? I think they are somehow sort of related, but not sure what you think the relation is. Are you just trying to say "this is sort of related", or are you trying to say "the strategy stealing assumption and this defense-offense symmetry thesis is the same thing"?

In the latter case: I think they are not the same thing, neither in terms of their actual meaning nor their intended purpose:

  • Strategy-stealing assumption is (in the context of AI alignment): f
... (read more)
2 · Rohin Shah · 1y
... And the humans have a majority of the resources / power, which requires having competitive aligned AI systems. More broadly strategy-stealing is "the player with majority resources / power can just copy the strategy of the other player". I wouldn't say the strategy-stealing assumption is about a symmetric game; it's symmetric only in that the actions available to both sides are approximately the same. The goals of the two sides are pretty different and aren't zero-sum. Similarly I think in the defense-offense case the actions available to both sides are approximately the same but the goals are pretty different (defend X vs attack X). The strategy-stealing argument as applied to defense-offense would say something like "whatever offense does to increase its resources / power is something that defense could also do to increase resources / power". E.g. if the terrorists secretly go around shooting people to decrease state power, the state could also go around secretly shooting terrorists to decrease terrorist power. Often the position with majority resources / power (i.e. the state) will have a better action than that available, and so you'll see the two groups doing different things, but "use the same strategy as the less-resourced group" is an available baseline that helps you preserve your majority resources / power. This isn't the same as your thesis. Your thesis says "the defender needs to have the same capabilities as the attacker". The strategy-stealing argument directly assumes that the defender has the same capabilities (i.e. assumes the conclusion of your thesis), and then uses that to argue that there is a lower bound on how well the majority-resourced player can do. So anyway I'd say the relation is that both theses are talking about the same sort of game / environment, and defense-offense is a central example application of the strategy-stealing argument (especially in AI alignment, where humanity + aligned AI are defending against misaligned AI at

I just had a very quick look at that site, and it seems to be a collection of various chip models with pictures of them? Is there actual information on quantities sold, etc? I couldn't find it immediately.

5 · Lone Pine · 1y
I just found this: http://www.transistorcount.com/
1 · Lone Pine · 1y
Nope. It's a site by and for collectors, and apparently what they care about is reference images of the face of old chips. You'd think that ChipDB would be a database of chips, but this one is sorely lacking. I added this comment in hopes that someone knows of a more useful (to us) database.

Yeah, I know they don't understand them comprehensively. But is that the point? They understand them at the level of abstraction necessary to do what they need, and the claim is that they have basically the same kind of knowledge of computers. Hmm, I guess that isn't really communicated by my phrasing though, so maybe I should edit it.

I think I communicated unclearly and it's my fault, sorry for that: I shouldn't have used the phrase "any easily specifiable task" for what I meant, because I didn't mean it to include "optimize the entire human lightcone w.r.t. human values". In fact, I was being vague and probably there isn't really a sensible notion that I was trying to point to. However, to clarify what I really was trying to say: What I mean by "hard problem of alignment" is : "develop an AI system that keeps humanity permanently safe from misaligned AI (and maybe other x risks), and ... (read more)

I'm surprised if I haven't made this clear yet, but the thing that (from my perspective) seems different between my and your view is not that Step 1 seems easier to me than it seems to you, but that the "melt the GPUs" strategy (and possibly other pivotal acts one might come up with) seems way harder to me than it seems to you. You don't have to convince me of "'any easily human-specifiable task' is asking for a really mature alignment", because in my model this is basically equivalent to fully solving the hard problem of AI alignment. 

Some reasons:

  • I
... (read more)
2 · Rob Bensinger · 1y
This seems very implausible to me. One task looks something like "figure out how to get an AGI to think about physics within a certain small volume of space, output a few specific complicated machines in that space, and not think about or steer the rest of the world". The other task looks something like "solve all of human psychology and moral philosophy, figure out how to get an AGI to do arbitrarily specific tasks across arbitrary cognitive domains with unlimited capabilities and free rein over the universe, and optimize the entire future light cone with zero opportunity to abort partway through if you screw anything up". The first task can be astoundingly difficult and still be far easier than that.

If you're on the Moon, on Mars, deep in the Earth's crust, etc., or if you've used AGI to build fast-running human whole-brain emulations, then you can go without AGI-assisted modeling like that for a very long time (and potentially indefinitely). None of the pivotal acts that seem promising to me involve any modeling of humans, beyond the level of modeling needed to learn a specific simple physics task like 'build more advanced computing hardware' or 'build an artificial ribosome'.

If humanity has solved the weak alignment problem, escaped imminent destruction via AGI proliferation, and ended the acute existential risk period, then we can safely take our time arguing about what to do next, hashing out whether the pivotal act that prevented the death of humanity violated propriety, etc. If humanity wants to take twenty years to hash out that argument, or for that matter a hundred years, then go wild! I feel optimistic about the long-term capacity of human civilization to figure things out, grow into maturity, and eventually make sane choices about the future, if we don't destroy ourselves. I'm much more concerned with the "let's not destroy ourselves" problem than with the finer points of PR and messaging when it comes to discussing afterwards whatever it was so

"you" obviously is whoever would be building the AI system that ended up burning all the GPU's (and ensuring no future GPU's are created). I don't know such sequence of events just as I don't know the sequence of events for building the "burn all GPU's" system, except at the level of granularity of "Step 1. build a superintelligent AI system that can perform basically any easily human-specifiable task without destroying the world. Step 2. make that system burn all GPU's indefintely/build security services that prevent misaligned AI from destroying the worl... (read more)

2 · Rob Bensinger · 1y
I'd guess this is orders of magnitude harder than, e.g., 'build an AGI that can melt all the GPUs, build you a rocket to go to the Moon, and build you a Moon base with 10+ years of supplies'. Both sound hard, but 'any easily human-specifiable task' is asking for a really mature alignment science in your very first AGI systems -- both in terms of 'knowing how to align such a wide variety of tasks' (e.g., you aren't depending on 'the system isn't modeling humans' as a safety assumption), and in terms of 'being able to actually do the required alignment work on fairly short timescales'. If we succeed in deploying aligned AGI systems, I expect the first such systems to be very precariously aligned -- just barely able to safely perform a very minimal, limited set of tasks. I expect humanity, if it survives at all, to survive by the skin of our teeth. Adding any extra difficulty to the task (e.g., an extra six months of work) could easily turn a realistic success scenario into a failure scenario, IMO. So I actually expect it to matter quite a lot exactly how much extra research and engineering work and testing we require; we may not be able to afford to waste a month.

I wonder if there is a bias induced by writing this on a year-by-year basis, as opposed to some other time interval, like 2 years. I can imagine that if you take 2 copies of a human, and ask one to do this exercise in yearly intervals, and the other to do it in 2-year intervals, they'll basically tell the same story, but the second one's story takes twice as long (i.e. the second one's predictions for 2022/2024/2026 are the same as the first one's predictions for 2022/2023/2024). It's probably not that extreme, but I would be surprised if there was zero such effect, which would mean these timelines are biased downwards or upwards.

2 · Daniel Kokotajlo · 1y
Probably there's all sorts of subtle biases, yeah. It would be cool to see a more rigorous evaluation of them by e.g. getting a bunch of humans to generate stories with different methodologies.

yeah, I probably overstated. Nevertheless:

"CEV seems way harder to me than ..."
yes, I agree it seems way harder, and I'm assuming we won't need to do it and that we could instead "run CEV" by just actually continuing human society and having humans figure out what they want, etc. It currently seems to me that the end game is to get to an AI security service (in analogy to state security services) that protects the world from misaligned AI, and then let humanity figure out what it wants (CEV). The default is just to do CEV directly by actual human brains, b... (read more)

2 · Rob Bensinger · 1y
Who is "you"? What sequence of events are you imagining resulting in a permanent security service (= a global surveillance and peacekeeping force?) that prevents AGI from destroying the world, without an AGI-enabled pivotal act occurring?

Ok, I admit I read over it. I must say though that this makes the whole thing more involved than it sounded at first, since it would maybe require essentially escalating a conflict with all major military powers and still coming out on top? One possible outcome of this would be that the entire global intellectual public opinion turns against you, meaning you also possibly lose access to a lot of additional humans working with you on further alignment research. I'm not sure if I'm imagining it correctly, but it seems like this plan would either require so many elements that I'm not sure it isn't just equivalent to solving the entire alignment problem, or otherwise it isn't actually enough.

2 · Rob Bensinger · 1y
This seems way too extreme to me; I expect the full alignment problem to take subjective centuries to solve. CEV seems way harder to me than, e.g., 'build nanotech that helps you build machinery to relocate your team and your AGI to the Moon, then melt all the GPUs on Earth'. Leaving the Earth is probably overkill for defensive purposes, given the wide range of defensive options nanotech would open up (and the increasing capabilities gap as more time passes and more tasks become alignable). But it provides another proof of concept that this is a much, much simpler engineering feat than aligning CEV and solving the whole of human values. Separately, I do in fact think it's plausible that the entire world would roll over (at least for ten years or so) in response to an overwhelming display of force of that kind, surprising and counter-intuitive as that sounds. I would feel much better about a plan that doesn't require this assumption; but there are historical precedents for world powers being surprisingly passive and wary-of-direct-conflict in cases like this.

But assuming that law enforcement figures out that you did this, then puts you in jail, you wouldn't be able to control the further use of such nanotech, i.e. there would just be a bunch of systems indefinitely destroying GPUs, or maybe you set a timer or some conditions on it or something. I certainly see no reason why Iceland or anyone in Iceland could get away with this unless those systems rely on completely unchecked nanosystems to which the US military has no response. Maybe all of this is what Eliezer means by "melt the GPUs", but I thought he did... (read more)

3 · Rob Bensinger · 1y
This would violate Eliezer's condition "including the reaction of existing political entities to that event". If Iceland melts all the GPUs but then the servers its AGI is running on get bombed, or its AGI researchers get kidnapped or arrested, then I assume that the attempted pivotal act failed and we're back to square one. (I assume this because (a) I don't expect most worlds to be able to get their act together before GPUs proliferate again and someone destroys the world with AGI; and (b) I assume there's little chance of Iceland recovering from losing its AGI or its AGI team.)

I meant, is there a link to where you've written this down somewhere? Maybe you just haven't written it down. 

2 · Daniel Kokotajlo · 1y
I'll send you a DM.

I would be interested in reading a draft and giving feedback (FYI I'm currently a researcher in the AI safety team at FHI). 

1 · Darren McKee · 1y
Thank you.  I'll follow up. 

I'm also interested to read the draft, if you're willing to send it to me.

Here is my partial honest reaction, just two points I'm somewhat dissatisfied with (not meant to be exhaustive):
2. "A cognitive system with sufficiently high cognitive powers, given any medium-bandwidth channel of causal influence, will not find it difficult to bootstrap to overpowering capabilities independent of human infrastructure." I would like there to be an argument for this claim that doesn't rely on nanotech, and solidly relies on actually existing amounts of compute. E.g. if the argument relies on running intractable detailed simulations of prote... (read more)

3 · Rob Bensinger · 1y
From an Eliezer comment [https://www.lesswrong.com/posts/uMQ3cqWDPHhjtiesc/agi-ruin-a-list-of-lethalities?commentId=6nhzKFaW2ByW3kGeY]: If Iceland did this, it would plausibly need some way to (1) not have its AGI project bombed in response, and (2) be able to continue destroying GPUs in the future if new ones are built, until humanity figures out 'what it wants to do next'. This more or less eliminates the time pressure to rush figuring out what to do next, which seems pretty crucial for good long-term outcomes. It's a much harder problem than just 'cause all GPUs to stop working for a year as a one-time event', and I assume Eliezer's focusing on nanotech in part because it's a very general technology that can be used for tasks like those as well.

"I have sat down to make toy models .."

reference?

6 · Daniel Kokotajlo · 1y
? I am the reference, I'm describing a personal experience.

"which is to make a truly remarkable universal claim with a heavy burden of proof."

Having thought about this way less than you, it doesn't seem at first sight to me as remarkable as you seem to say. Note that the claim wouldn't be that you can't write a set of prompts to get the fully universal reasoner, but that you can't write a single prompt that gets you this universal reasoner. It doesn't sound so crazy to me at all that knowledge is dispersed in the network in a way that e.g. some knowledge can only be accessed if the prompt has the feel of being gene... (read more)

Responding to this very late, but: If I recall correctly, Eric has told me in personal conversation that CAIS is a form of AGI, just not agent-like AGI. I suspect Eric would agree broadly with Richard's definition.

2Rohin Shah1y
I agree that the set of services is intended to, in aggregate, perform any task (that's what the "Comprehensive" part of "Comprehensive AI Services" means), and it shares that property with AGI (that's what the "General" part of "Artificial General Intelligence" means). There are other properties that Bostrom/Yudkowsky conceptions of AGI have that CAIS doesn't have, including "searching across long-term plans to find one that achieves a potentially-unbounded goal, which involves deceiving or overpowering humans if they would otherwise try to interfere". I don't particularly care what terminology we use; I just want us to note which properties a given system or set of systems does and does not have.

"I talk about consequentialists, but not rational consequentialists" - ok, this was not the impression I was getting.

1Joe_Collman1y
Well I'm sure I could have been clearer. (and it's possible that I'm now characterising what I think, rather than what I wrote) But getting that impression is pretty natural: in my argument, a large part of the problem does come from its sometimes being correct to pick the question-ignoring answer. ('correct' meaning something like: [leads to best consequences, according to our values]) Or alternatively, that a correct decision algorithm would sometimes pick the question-ignoring answer. I think I focus on this, since it's the non-obvious part of the argument: it's already clear that poor decisions / decision-algorithms may sometimes pick the question-ignoring answer. Probably I should have emphasized more that unexpected behaviour when things are going right will make it harder to know when things are going wrong.

Reading this post a while after it was written: I'm not going to respond to the main claim (which seems quite likely) but just to the specific arguments, which seems suspicious to me. Here are some points:

  • In my model of the standard debate setup with a human judge, the human can just use both answers in whichever way they want, independently of which one they select as the correct answer. The fact that one answer provides more useful information than "2+2=?" doesn't imply a "direct" incentive for the human judge to select that as the correct answer. Upon introspec
... (read more)
1Joe_Collman1y
Yes, agreed. A few points on the rest:

1. At the highest level, the core issue is that QI makes it quite a bit harder to identify misalignment. If aligned systems will sometimes not answer the question, non-answering isn't necessarily strong evidence of misalignment. So "consequentialist judges will [sometimes correctly] select QIAs" is bad in the sense that it provides cover for "consequentialist judges will [sometimes incorrectly] select QIAs".
2. I talk about consequentialists, but not rational consequentialists. I expect the kind of judge we'd pick to be highly rational relative to the average human - but that's a low bar. I expect all humans to have exploitable inconsistencies, and that optimal play will exploit them (similarly for groups of humans). So yes, this is only a problem where manipulation is possible - but since it is possible, we'll have difficulty distinguishing [judge correctly selected a non-answer as the winner for principled reasons] from [judge was manipulated...].
   - It's much less clear when such issues show up with sub-optimal play.
3. With "Is this definitely undesirable? I'm not sure, but probably." I'm referring to the debate structure's having this property in general being undesirable. The judge can know it is undesirable in general, but also think that in this specific case things are different - and, of course, the judge can be wrong about this.
   - Noting here that humans can't make binding pre-commitments (saying words doesn't qualify).
4. It's hard (/impossible?) to avoid this issue through oversight, since we just move from [QI exceptions that persuade the judge win] to [QI exceptions that persuade the judge and oversight system win].

Yes, but I think your reasoning "If 2 is only talking about the map, it doesn't imply 3" is too vague. I'd rather not go into it though, because I am currently busy with other things, so I'd suggest letting the reader decide.

Edit: reading back my response, it might come across as a bit rude. If so, sorry for that, I didn't mean it that way.

I think this is too vague, but I will drop this discussion and let the reader decide.

-1TAG1y
Do you know what "non sequitur" means?

"But without the premise that the territory is maths, the rest of the paradox doesn't follow."

I explicitly said "mathematically describable", implying that I am not identifying the theory with reality. Nothing in my "argument" makes this identification.

1TAG1y
Then the argument fails through non sequitur. If 2 is only talking about the map, it doesn't imply 3.

If an object knows that it exists, this implies that it actually exists. Moreover, assuming the state of a brain is a mathematical fact within the mathematical theory, then the fact that the object knows it exists is in principle a mathematical implication of the theory (if observation 2 is correct). Hence it would be an implication of the theory that the theory describes an existing reality.

"There may also be mathematical properties that are universe-specific (the best candidates here are natural constants), but the extent to which these exist is questionable"

The exact position of every atom in the universe at time t=10^10 years is a "mathematical property of our universe" in my terminology. The fact that some human somewhere uttered the words "good morning" at some point today, is a complicated mathematical property of our universe, in principle derivable from the fundamental theory of physics. 

2Rafael Harth1y
Sure, but those properties are upstream of the laws of physics, so you don't need to figure them out to answer your main question.

Tangential comment: Regarding "I will define success as producing fission weapons before the end of war in Europe" - I'm not sure this is the right criterion for success for the purpose of analogizing to AGI. It seems to me that "producing fission weapons before an Axis power does" is more appropriate.

And this seems overwhelmingly the case, yes: "theory of atomic bomb was considerably more advanced at the beginning of Manhattan project compared to our understanding of theory of aligned AGI"

I'm not sure I understand the motivation behind the question. How much of my modern knowledge am I supposed to throw away? Note I am not in fact an atomic theorist who has the state of knowledge of atomic theory in 1942, so it's hard to know what I'd think, but I can imagine assigning somewhere between 5% and 95% depending on how informed of an atomic theorist I actually was and what it was actually like in 1942. Maybe I could give a better answer if you clarify the motivation behind the question?

1Yitz1y
I'm asking you to try to imagine yourself as an atomic theorist who has access to the state of knowledge of atomic theory in 1942. Obviously that can't be done perfectly, but my thought was that by modeling what you would have predicted vs. what actually happened, some insight can be had about how "unknown unknowns" affect projects of that scale.