All of dxu's Comments + Replies

RE: decision theory w.r.t. how "other powerful beings" might respond - I really do think Nate has already argued this, and his arguments continue to seem more compelling to me than the opposition's. Relevant quotes include:

It’s possible that the paperclipper that kills us will decide to scan human brains and save the scans, just in case it runs into an advanced alien civilization later that wants to trade some paperclips for the scans. And there may well be friendly aliens out there who would agree to this trade, and then give us a little pocket of th

... (read more)

I concretely disagree with (what I see as) your implied premise that the outer (training) task has any direct influence on the inner optimizer's cognition. I think this disagreement (which I internally feel like I've already tried to make a number of times) has been largely ignored so far. As a result, many of the things you wrote seem to me to be answerable by largely the same objection:

As I see it: in training, it was optimized for that. The trained model likely contains one or more optimizers optimized by that training. But what the model is trained/o

... (read more)
1simon1mo
My other reply [https://www.lesswrong.com/posts/Gn46SEKizaBxFzLN3/no-really-it-predicts-next-tokens?commentId=toDAcQwyoAFaDqd3x#aJJ428jsRXGCejsfE] addressed what I think is the core of our disagreement, but not the particular statements you make in your comment, so I'm addressing those here.

Let me be clear that I am NOT saying that any inner optimizer, if it exists, would have a goal that is equal to minimizing the outer loss. What I am saying is that it would have a goal that, in practice, when implemented in a single pass of the LLM, has the effect of minimizing the LLM's overall outer loss with respect to that ONE token. And that it would be very hard for such a goal to cash out, in practice, to wanting long-range real-world effects.

Let me also point out your implicit assumption that there is an 'inner' cognition which is not literally the mask. Here is some other claim someone could make: This person would be saying, "hey look, this datacenter full of GPUs is carrying out this agentic-looking cognition. And, it could easily carry out other, completely different agentic cognition. Therefore, the datacenter must have these capabilities independently from the LLM and must have its own 'inner' cognition." I think that you are making the same philosophical error that this claim would be making.

However, if we didn't understand GPUs we could still imagine that the datacenter does have its own, independent 'inner' cognition, analogous to, as I noted in a previous comment, John Searle in his Chinese room. And if this were the case, it would be reasonable to expect that this inner cognition might only be 'acting' for instrumental reasons and could be waiting for an opportunity to jump out and suddenly do something else other than running the LLM. The GPU software is not tightly optimized specifically to run the LLM or an ensemble of LLMs and could indeed have other complications and who knows what it could end up doing? Because the LLM does super d
1simon1mo
OK, I think I'm now seeing what you're saying here (edit: see my other reply [https://www.lesswrong.com/posts/Gn46SEKizaBxFzLN3/no-really-it-predicts-next-tokens?commentId=x3diZsshCzEhgHn5p#aJJ428jsRXGCejsfE] for additional perspective and for responses to particular statements made in your comment):

In order to predict well in complicated and diverse situations, the model must include general-purpose modelling machinery which generates an internal, temporary model. The next token can then be predicted, perhaps, by simply reading it off this internal model. The internal model is logically separate from any part of the network defined in terms of static trained weights, because this internal model exists only in the form of data within the overall model at inference and not in the static trained weights. You can then refer to this temporary internal model as the "mask" and the actual machinery that generated it, which may in fact be the entire network, as the "actor".

Now, on considering all of that, I am inclined to agree. This is an extremely plausible picture. Thank you for helping me look at it this way; this is a much cleaner definition of "mask" than I had before.

However, I think that you are then inferring from this an additional claim that I do not think follows. That additional claim is that, because the network as a whole exhibits complicated capabilities and agentic behaviour, the network has these capabilities and behaviour independently from the temporary internal model. In fact, the network only has these externally apparent capabilities and agency through the temporary internal model (mask). While this "actor" is indeed not the same as any of the "masks", it doesn't know the answer "itself" to any of the questions. It needs to generate and "wear" the mask to do that.

--------------------------------------------------------------------------------

This is not to deny that, in principle, the underlying temporary-model-generating machinery could

Full Solomonoff Induction on a hypercomputer absolutely does not just "learn very similar internal functions/models", it effectively recreates actual human brains.

Full SI on a hypercomputer is equivalent to instantiating a computational multiverse and allowing us to access it. Reading out data samples corresponding to text from that is equivalent to reading out samples of actual text produced by actual human brains in other universes close to ours.

...yes? And this is obviously very, very different from how humans represent things internally?

I mean, for ... (read more)

2jacob_cannell1mo
I think we are starting to talk past each other, so let me just summarize my position (and what I'm not arguing):

1.) ANNs and BNNs converge in their internal representations, in part because physics only permits a narrow Pareto-efficient solution set, but also because ANNs are literally trained as distillations of BNNs. (More well known/accepted now, but I argued/predicted this well in advance (at least as early as 2015).)

2.) Because of 1.), there is no problem with 'alien thoughts' based on mindspace geometry [https://www.lesswrong.com/posts/wAczufCpMdaamF9fy/my-objections-to-we-re-all-gonna-die-with-eliezer-yudkowsky#Yudkowsky_on_the_width_of_mind_space_]. That was just never going to be a problem.

3.) Neither 1 nor 2 is sufficient for alignment by default - both points apply rather obviously to humans, who are clearly not aligned by default with other humans or humanity in general.

Earlier you said:

I then pointed out that full SI on a hypercomputer would result in recreating entire worlds with human minds, but that was a bit of a tangent. The more relevant point is more nuanced: AIXI is SI plus some reward function. So all different possible AIXI agents share the exact same world model, yet they have different reward functions and thus would generate different plans and may well end up killing each other or something. So having exactly the same world model is not sufficient for alignment - I'm not arguing that and would never argue that.

But if you train an LLM to distill human thought sequences, those thought sequences can implicitly contain plans, value judgements or the equivalents. Thus LLMs can naturally align to human values to varying degrees, merely through their training as distillations of human thought. This of course by itself doesn't guarantee alignment, but it is a much more hopeful situation to be in, because you can exert a great deal of control through control of the training data.
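A minimal formal sketch of the "AIXI is SI plus a reward function" point above, in standard textbook notation (not taken from the comment itself): the Solomonoff mixture is the shared world-model term, and different AIXI agents differ only in the rewards they plug into the same expression.

```latex
% Solomonoff prior over observation strings x (U a universal prefix machine):
\[
  M(x) \;=\; \sum_{p \,:\, U(p) \,=\, x*} 2^{-\ell(p)}
\]
% AIXI: the same mixture over programs q, plus a reward channel. At step k,
% with horizon m, the agent chooses
\[
  a_k \;=\; \arg\max_{a_k} \sum_{o_k r_k} \cdots \max_{a_m} \sum_{o_m r_m}
  \bigl(r_k + \cdots + r_m\bigr)
  \sum_{q \,:\, U(q,\, a_1 \ldots a_m) \,=\, o_1 r_1 \ldots o_m r_m} 2^{-\ell(q)}
\]
% Two AIXI agents share every term above except the rewards r_i, which is the
% sense in which having exactly the same world model is not sufficient for
% alignment.
```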

Yeah, I'm growing increasingly confident that we're talking about different things. I'm not referring to "masks" in the sense that you mean it.

I don't know what you mean by "one" or by "inner". I would expect different masks to behave differently, acting as if optimizing different things (though that could be narrowed using RLHF), but they could re-use components between them. So, you could have, for example, a single calculation system that is reused but takes as input a bunch of parameters that have different values for different masks, which (ag

... (read more)
1simon1mo
In my case your response made me much more confident we do have an underlying disagreement and not merely a clash of definitions. I think the key disagreement is this:

As I see it: in training, it was optimized for that. The trained model likely contains one or more optimizers optimized by that training. But what the model is trained/optimized to do is actually answer the questions. If the model in training has an optimizer, a goal of the optimizer for being capable of answering questions wouldn't actually make the optimizer more capable, so that would not be reinforced. A goal of actually answering the questions, on the other hand, would make the optimizer more capable and so would be reinforced. Likewise, the heuristics/"adaptations" that coalesced to form the optimizer would have been oriented towards answering the questions. All this points to mask-level goals and does not provide a reason to believe in non-mask goals, and so a "goal slot" remains more parsimonious than an actor with a different underlying goal.

Regarding the evolutionary analogy: while I'd generally be skeptical about applying evolutionary analogies to LLMs, because they are very different, in this case I think it does apply, just not the way you think. I would analogize evolution -> training and human behaviour/goals -> the mask.

Note, it's entirely possible for a mask to be power-seeking, and we should presumably expect a mask that executes a takeover to be power-seeking. But this power seeking would come as a mask goal and not as a hidden goal learned by the model for underlying general power-seeking reasons.

I want to revisit what Rob actually wrote:

If you sampled a random plan from the space of all writable plans (weighted by length, in any extant formal language), and all we knew about the plan is that executing it would successfully achieve some superhumanly ambitious technological goal like "invent fast-running whole-brain emulation", then hitting a button to execute the plan would kill all humans, with very high probability.

(emphasis mine)

That sounds a whole lot like it's invoking a simplicity prior to me!

Note I didn't actually reply to that quote. Sure, that's an explicit simplicity prior. However, there's a large difference under the hood between using an explicit simplicity prior on plan length vs an implicit simplicity prior on the world and action models which generate plans. The latter is what is more relevant for intrinsic similarity to human thought processes (or not).

LLMs and human brains learn from basically the same data with similar training objectives powered by universal approximations of bayesian inference and thus learn very similar internal functions/models.

This argument proves too much. A Solomonoff inductor (AIXI) running on a hypercomputer would also "learn from basically the same data" (sensory data produced by the physical universe) with "similar training objectives" (predict the next bit of sensory information) using "universal approximations of Bayesian inference" (a perfect approximation, in this cas... (read more)

Full Solomonoff Induction on a hypercomputer absolutely does not just "learn very similar internal functions/models", it effectively recreates actual human brains.

Full SI on a hypercomputer is equivalent to instantiating a computational multiverse and allowing us to access it. Reading out data samples corresponding to text from that is equivalent to reading out samples of actual text produced by actual human brains in other universes close to ours.

you need to first investigate the actual internal representations of the systems in question, and verify that

... (read more)

E.g. a system capable of correctly answering questions like "given such-and-such chess position, what is the best move for the current player?" must in fact be performing agentic/search-like thoughts internally, since there is no other way to correctly answer this question.

Yes, but that sort of question is in my view answered by the "mask", not by something outside the mask.

I don't think this parses for me. The computation performed to answer the question occurs inside the LLM, yes? Whether you classify said computation as coming from "the mask" or no... (read more)

1simon1mo
Yes.

I don't know what you mean by "one" or by "inner". I would expect different masks to behave differently, acting as if optimizing different things (though that could be narrowed using RLHF), but they could re-use components between them. So, you could have, for example, a single calculation system that is reused but takes as input a bunch of parameters that have different values for different masks, which (again just an example) define the goals, knowledge and capabilities of the mask. I would not consider this case to be "one" inner optimizer, since although most of the machinery is reused, it in practice acts differently and seeks different goals in each case, and I'm more concerned here with classifying things according to how they act/what their effective goals are than with the internal implementation details.

What this multi-optimizer (which I would not call "inner") is going to "end up" wanting is whatever set of goals the particular mask has that first has both the desire and the capability to take over in some way. It's not going to be some mysterious inner thing.

They aren't? In your example, the mask wanted to play chess, didn't it, and what you call the "inner" optimizer returned a good move, didn't it? I can see two things you might mean about the mask not actually being in control:

1. That there is some underlying goal that this optimizer has that is different than satisfying the current mask's goal, and it is only satisfying the mask's goal instrumentally. This I think is very unlikely for the reasons I put in the original post. It's extra machinery that isn't returning any value in training.

2. That this optimizer might at some times change goals (e.g. when the mask changes). It might well be the case that the same optimizing machinery is utilized by different masks, so the goals change as the mask does; but again, if at each time it is optimizing a goal set by/according to the mask, it's better in my view to see it as part of/controlled by th
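A toy sketch of the "reused machinery, mask-specific parameters" picture described above (all names are hypothetical, and Python is used purely as an illustration of the structure, not as a claim about how LLMs are implemented):

```python
from dataclasses import dataclass

@dataclass
class MaskParams:
    """Per-mask settings that the shared machinery consumes."""
    goal: str            # what this mask acts as if it is optimizing
    knowledge: set[str]  # facts this mask behaves as if it knows
    skill: float         # crude capability knob

def shared_planner(situation: str, mask: MaskParams) -> str:
    """One reused 'calculation system'; its effective goal is set entirely
    by the mask parameters it is handed, so it acts differently per mask."""
    if mask.goal == "play good chess" and "chess" in mask.knowledge:
        return f"best move for '{situation}' (searched to depth ~{int(10 * mask.skill)})"
    return f"action pursuing '{mask.goal}' in '{situation}'"

# The same machinery, "wearing" two different masks:
chess_mask = MaskParams(goal="play good chess", knowledge={"chess"}, skill=0.9)
helper_mask = MaskParams(goal="answer helpfully", knowledge={"trivia"}, skill=0.7)
print(shared_planner("Qh5 threat on the board", chess_mask))
print(shared_planner("user asks a question", helper_mask))
```

Whether one calls this "one inner optimizer" or "many masks" is then a question of which level you index on, which is the classification question the comment is pointing at.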

In order to make the doom conclusion actually go through, arguments should make stronger claims about the priors involved, and how they differ from those of the human learning process.

Isn't it enough that they do differ? Why do we need to be able to accurately/precisely characterize the nature of the difference, to conclude that an arbitrary inductive bias different from our own is unlikely to sample the same kinds of plans we do?

That's not at all clear to me. Inductive biases clearly differ between humans, yet we are not all terminally misaligned with each other. E.g., split-brain patients are not all weird value aliens, despite a significant difference in architecture. Also, training on human-originated data causes networks to learn human-like inductive biases (at least somewhat).

It would conflict with a deceptive awake Shoggoth, but IMO such a thing is unlikely because the model is super-well optimized for next token prediction

Yeah, so I think I concretely disagree with this. I don't think being "super-well optimized" for a general task like sequence prediction (and what does it mean to be "super-well optimized" anyway, as opposed to "badly optimized" or some such?) means that inner optimizers fail to arise in the limit of sufficient ability, or that said inner optimizers will be aligned on the outer goal of sequence prediction... (read more)

1simon1mo
Yes, but that sort of question is in my view answered by the "mask", not by something outside the mask. The masks can indeed think whatever - in the limit of a perfect predictor some masks would presumably be isomorphic to humans, for example - though all is underlain by next-token prediction. It seems to me our disagreements might largely be in terms of what we are defining as the mask? 

I think I'm having some trouble parsing this, but not in a way that necessarily suggests your ideas are incoherent and/or bad—simply that your (self-admittedly) unusual communication style is making it hard for me to understand what you are saying.

It's possible you wrote this post the way you did because this is the way the ideas in question were natively represented in your brain, and translating them out of that representation and into something more third-party legible would have been effortful and/or infeasible. If so, there's plausibly not much to be ... (read more)

Gotcha. Thanks for explaining, in any case; I appreciate it.

With the caveat that I think this sort of “litigation of minutiae of nuance” is of very limited utility

Yeah, I think I probably agree.

would you consider “you A’d someone as a consequence of their B’ing” different from both the other two forms? Synonymous with them both? Synonymous with one but not the other?

Synonymous as far as I can tell. (If there's an actual distinction in your view, which you're currently trying to lead me to via some kind of roundabout, Socratic pathway, I'd appreciate skipping to the part where you just tell me what you think the distinction is.)

5Said Achmiz2mo
I had no such intention. It’s just that we already know that I think that X and Y seem like different things, and you think X and Y seem like the same thing, and since X and Y are the two forms which actually appeared in the referenced argument, there’s not much further to discuss, except to satisfy curiosity about the difference in our perceptions (which inquiry may involve positing some third thing Z). That’s really all that my question was about.

In case you are curious in turn—personally, I’d say that “you A’d someone as a consequence of their B’ing” seems to me to be the same as “you A’d someone due to their B’ing”, but different from “you A’d someone for their B’ing”. As far as characterizing the distinction, I can tell you only that the meaning I, personally, was trying to convey was the difference in what sort of rule or principle was being applied.

(See, for instance, the difference between “I shot him for breaking into my house” and “I shot him because he broke into my house”. The former implies a punishment imposed as a judgment for a transgression, while the latter can easily include actions taken in self-defense or defense of property, or even unintentional actions.)

But, as I said, there is probably little point in pursuing this inquiry further.
dxu2mo1211

As a single point of evidence: it's immediately obvious to me what the difference is between "X is true" and "I think X" (for starters, note that these two sentences have different subjects, with the former's subject being "X" and the latter's being "I"). On the other hand, "you A'd someone due to their B'ing" and "you A'd someone for B'ing" do, actually, sound synonymous to me—and although I'm open to the idea that there's a distinction I'm missing here (just as there might be people to whom the first distinction is invisible), from where I currently stan... (read more)

4Said Achmiz2mo
With the caveat that I think this sort of “litigation of minutiae of nuance” is of very limited utility[1], I am curious: would you consider “you A’d someone as a consequence of their B’ing” different from both the other two forms? Synonymous with them both? Synonymous with one but not the other? -------------------------------------------------------------------------------- 1. I find that I am increasingly coming around to @Vladimir_Nesov [/users/vladimir_nesov?mention=user]’s stance [https://www.lesswrong.com/posts/D5BP9CxKHkcjA7gLv/speaking-of-stag-hunts#rEzcYgiF3KRZ53BuL] on [https://www.lesswrong.com/posts/gPPdYTwPkdvAEYyEK/fucking-goddamn-basics-of-rationalist-discourse#2TEmJWXLgr5gvon7j] nuance [https://www.lesswrong.com/posts/D5BP9CxKHkcjA7gLv/speaking-of-stag-hunts#8qGhyiNWbBurQd3dy]. ↩︎

Might I ask what you hoped to achieve in this thread by writing this comment?

dxu2mo1318

If so, I find this reasoning unconvincing

Why?

I mostly don't agree that "the pattern is clear"—which is to say, I do take issue with saying "we do not need to imagine counterfactuals". Here is (to my mind) a salient example of a top-level comment which provides an example illustrating the point of the OP, without the need for prompting.

I think this is mostly what happens, in the absence of such prompting: if someone thinks of a useful example, they can provide it in the comments (and accrue social credit/karma for their contribution, if indeed other... (read more)

6Said Achmiz2mo
Yep, indeed, that is an example, and a good one. But I linked a case of exactly the thing you just said won’t happen! I linked it in the comment you just responded to! Here is another example [https://www.lesswrong.com/posts/duxy4Hby5qMsv42i8/the-real-rules-have-no-exceptions#f7NMguzyBm4WC8chM].

Here are more examples: one [https://www.lesswrong.com/posts/bwkZD6uskCQBJDCeC/self-consciousness-wants-to-make-everything-about-itself#tc9HeiWrdG59MwagQ] two [https://www.lesswrong.com/posts/SX6wQEdGfzz7GKYvp/rationalist-discourse-is-like-physicist-motors#oy2mprLNMMiQNbjjH] three [https://www.lesswrong.com/posts/wJutA2czyFg6HbYoW/what-are-trigger-action-plans-taps#ag7mqdQBqJdhX2oE3] (and a bonus particularly interesting sort-of-example [https://www.lesswrong.com/posts/C6oNRFt4dvtM25vpw/living-nomadically-my-80-20-guide#Zs9T4CebjkLvsmvDy])

This is a weak response given that I am pointing to a pattern.

A very suspicious reply, in the general case. Not always false, of course! But suspicious. If such a condition obtains, it ought to be pointed out explicitly, and defended. It is quite improper, and lacking in intellectual integrity, to simply rely on social censure against requests for examples to shield you from having to explain why in this case it so happens that you don’t need to point to any extensions for your proffered intensions.

I agree that Duncan’s complaint includes this. I just think that he’s wrong about this. (And wrong in such a way that he should know that he’s wrong.)

The burden is (a) not just on the author, but also on the reader (including the one who requested the examples!), and (b) not undue, but in fact quite the opposite.

First, on the subject of “accompanying interpretive effort”: I think that such effort not only doesn’t reduce the cost to authors of responding, it can easily increase the cost. (See my previous commentary on the subject of “interpretive effort” for much expansion of this point.)

Second, on the subject of “cost to the author

This, however, assumes that “formative evaluations” must be complete works by single contributors, rather than collaborative efforts contributed to by multiple commenters. That is an unrealistic and unproductive assumption, and will lead to less evaluative work being done overall, not more.

I am curious as to your assessment of the degree of work done by a naked "this seems unclear, please explain"?

My own assessment would place the value of this (and nothing else) at fairly close to zero—unless, of course, you are implicitly taking credit for some of the... (read more)

5Said Achmiz2mo
By “degree of work” do you mean “amount of effort invested” or “magnitude of effect achieved”? If the former, then the answer, of course, is “that is irrelevant”. But it seems like you mean the latter—yes? In which case, the answer, empirically, is “often substantial”.

Essentially, yes. And we do not need to imagine counterfactuals, either; we can see this happen, often enough (i.e., some post will be written, and nobody asks for examples, and none are given, and no discussion of particulars ensues). Individual cases differ in details, of course, but the pattern is clear.

Although I wouldn’t phrase it quite in terms of “taking credit” for the ensuing discussion. That’s not the point. The point is that the effect be achieved, and that actions which lead to the effect being achieved, be encouraged. If I write a comment like this one [https://www.lesswrong.com/posts/brQwWwZSQbWBFRNvh/how-to-use-bureaucracies#FAtAt4avZDFmCHfSo], and someone (as an aside, note that in this case it was not the OP!) responds with comments like this one [https://www.lesswrong.com/posts/brQwWwZSQbWBFRNvh/how-to-use-bureaucracies#j7C5JmbhpP9boBrNe] and this one [https://www.lesswrong.com/posts/brQwWwZSQbWBFRNvh/how-to-use-bureaucracies#oshL8azidaBNEipgK], then of course it would be silly of me to say “I deserve the credit for those replies!”—no, the author of those replies deserves the credit for those replies. But insofar as they wouldn’t have existed if I hadn’t posted my comment, then I deserve credit for having posted my comment.

You are welcome to say “but you deserve less credit, maybe even almost no credit”; that’s fine. (Although, as I’ve noted before, the degree to which such prompts are appreciated and rewarded ought to scale with the likelihood of their counterfactual absence, i.e., if I hadn’t written that comment, would someone else have? But that’s a secondary point.) It’s fine if you want to assign me only epsilon credit. What’s not fine is if, instead, you debit me for

I like this post! Positive reinforcement. <3

You continue to assert things without justification, which is fine insofar as your goal is not to persuade others. And perhaps this isn't your goal! Perhaps your goal is merely to make it clear what your beliefs are, without necessarily providing the reasoning/evidence/argumentation that would convince a neutral observer to believe the same things you do.

But in that case, you are not, in fact, licensed to act surprised, and to call others "irrational", if they fail to update to your position after merely seeing it stated. You haven't actually given anyone ... (read more)

1Gerald Monroe2mo
You've done an excellent job of arguing your points. It doesn't mean they are correct, however. Would you agree that if you made a perfect argument against the theory of relativity (numerous contemporary physicists did), it was still a waste of time?

In this context, let's break open the object level argument, because only the laws of physics get a vote - you don't and I don't. The object level argument is that the worst of the below determines if foom is possible:

1. Compute. Right now there is a shortage of compute, and with a bit of rough estimating the shortage is actually pretty severe. Nvidia makes approximately 60 million GPUs per year, of which 500k-1000k are A/H100s. This is based on taking their data center revenue (source: WSJ) and dividing by an estimated cost per chipset of (10k, 20k). Compute production can be increased, but the limit would be all the world's 14nm-or-better silicon dedicated to producing AI compute. This can be increased, but it takes time.

Let's estimate how many humans' worth of labor an AI system with access to all new compute could provide (old compute doesn't matter due to a lack of interconnect bandwidth). If a GPT-4 instance requires a full DGX "supercompute" node, which is 8 H100s with 80 GB of memory each (so approximately 1T weights in fp16), how much would it require for realtime multimodal operation? Let's assume 4x the compute, which may be a gross underestimate. So 8 more cards are running at least 1 robot in real time, 8 more are processing images for vision, and 8 more for audio i/o and helper systems for longer duration memory context.

So then if all new cards are used for inference, 1m/32 = 31,250 "instances" worth of labor. Since they operate 24 hours a day this is equivalent to perhaps 100k humans? If all of the silicon Nvidia has the contract rights to build is going into H100s, this scales by about 30 times, or 3m humans. And most of those instances cannot be involved in world takeover efforts, they have to b
1. One is straightforwardly true. Aging is going to kill every living creature. Aging is caused by complex interactions between biological systems and bad evolved code. An agent able to analyze thousands of simultaneous interactions, across millions of patients, and essentially decompile the bad code (by modeling all proteins/all binding sites in a living human) is likely required to shut it off, but it is highly likely that with such an agent and with such tools you can in fact save most patients from aging. A system with enough capabilities to consider all
... (read more)
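A quick numeric restatement of the compute estimate in the comment above (all inputs are the commenter's own rough assumptions, not verified figures):

```python
# Back-of-envelope check of the compute estimate above.

h100s_per_year = 1_000_000       # upper end of the commenter's 500k-1000k estimate
cards_per_dgx_node = 8           # one DGX "supercompute" node
realtime_multiplier = 4          # assumed 4x compute for realtime multimodal operation
cards_per_instance = cards_per_dgx_node * realtime_multiplier    # 32

instances = h100s_per_year // cards_per_instance                 # 31,250
human_equivalents = instances * 3                                # 24h operation vs. an ~8h workday
print(f"{instances:,} instances, roughly {human_equivalents:,} human-equivalents of labor")

# If all of Nvidia's contracted 14nm-or-better silicon went to H100s (~30x more):
print(f"upper bound: roughly {human_equivalents * 30:,} human-equivalents")
# The comment rounds these figures to "perhaps 100k" and "about 3m" humans.
```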
-7Gerald Monroe2mo

Categories like “conflicts of interest”, “discussions about who should be banned”, “arguments about moderation in cases in which you’re involved”, etc., already constitute “evidence” that push the conclusion away from the prior of “on the whole, people are more likely to say true things than false things”, without even getting into anything more specific.

The strength of the evidence is, in fact, a relevant input. And of the evidential strength conferred by the style of reasoning employed here, much has already been written.

You’ve misunderstood. My poi

... (read more)
1Said Achmiz2mo
Please see my reply to gjm [https://www.lesswrong.com/posts/kyDsgQGHoLkXz6vKL/lw-team-is-adjusting-moderation-policy#sMNFCgafbM4nhNS7c].

Yes. A strong default. I stand by what I said.

A high one. This seems to me to be only an ordinarily high “default” confidence level, for things like this.

See my above-linked reply to gjm, re: “the opinions of onlookers”.

People on Less Wrong downvote for things other than “this is wrong”. You know this. (Indeed, this is wholly consonant with the designed purpose of the karma vote.)

Likewise see my above-linked reply to gjm. I refer there to the three quote–reply pairs above that one.

I must object to this. I don’t think what I’ve accused Duncan of can be fairly called “misconduct”. He’s broken no rules or norms of Less Wrong, as far as I can tell. Everything he’s done is allowed (and even, in some sense, encouraged) by the site rules. He hasn’t done anything underhanded or deliberately deceptive, hasn’t made factually false claims, etc. It does not seem to me that either Duncan, or Less Wrong’s moderation team, would consider any of his behavior in this matter to be blameworthy. (I could be wrong about this, of course, but that would surprise me.)

Yes, of course. Duncan has said as much, repeatedly. It would be strange to disbelieve him on this. Just as obviously, I don’t agree with his characterization! (As before, see my above-linked reply to gjm for more details.)

This seems clearly wrong to me. The operation is of course commutative; it doesn’t matter in the least whose name goes where. In any engagement between Alice and Bob, Alice can decide that Bob is engaging unproductively, at the same time as Bob decides that Alice is engaging unproductively. And of course Bob isn’t going to decide that it’s he who is the one engaging unproductively with Alice (and vice-versa). And both formulations can be summarized as “Bob decides that he is unlikely to engage in productive discussion with Alice” (regardless of whether B

I'm not sure what predictions you're making that are different than mine, other than maybe "a research program that skips NN's and just try to build the representations that they build up directly without looking at NNs has reasonable chances of success." Which doesn't seem like one you'd actually want to make.

I think I would, actually, want to make this prediction. The problem is that I'd want to make it primarily in the counterfactual world where the NN approach had been abandoned and/or declared off-limits, since in any world where both approaches ex... (read more)

This is a claim so general as to be meaningless. If we knew absolutely nothing except “a person said a thing”, then retreating to this sort of maximally-vague prior might be relevant. But we in fact are discussing a quite specific situation, with quite specific particular and categorical features. There is no good reason to believe that the quoted prior survives that descent to specificity unscathed (and indeed it seems clear to me that it very much does not).

The prior does in fact survive, in the absence of evidence that pushes one's conclusion away fr... (read more)

8Said Achmiz2mo
Categories like “conflicts of interest”, “discussions about who should be banned”, “arguments about moderation in cases in which you’re involved”, etc., already constitute “evidence” that push the conclusion away from the prior of “on the whole, people are more likely to say true things than false things”, without even getting into anything more specific.

You’ve misunderstood. My point was that “Said keeps finding mistakes in what I have written” is a good first approximation (but only that!) of what Duncan allegedly finds unpleasant about interacting with me, not that it’s a good first approximation of Duncan’s description of same.

A single circumspectly disagreeing comment on a tangential, secondary (tertiary? quaternary?) point, buried deep in a subthread, having minimal direct bearing on the claims in the post under which it’s posted. “Robust disagreement”, this ain’t. (Don’t get me wrong—it’s a fine comment, and I see that I strong-upvoted it at the time. But it sure is not anything at all like an example of the thing I asked for examples of.)

Please do. So far, the example count remains at zero.

Given that you did not, in fact, find an example, I think that this question remains unmotivated.

Most people don’t bother to think about other people’s posts in sufficient detail and sufficiently critically to have anything much to say about them. Of the remainder, some agree with Duncan. Of the remainder of those, many don’t care enough to engage in arguments, disagreements, etc., of any sort. Of the remainder of those, many are either naturally disinclined to criticize forcefully, to press the criticism, to make points which are embarrassing or uncomfortable, etc., or else are deterred from doing so by the threat of moderation. That cuts the candidate pool down to a small handful.

Separately, recall that Duncan has (I think more than once now) responded to similar situations by leaving (or “leaving”) Less Wrong. (What is the significance of his choice to

Your link looks broken; here's a working version.

(Note: your formatting looks correct to me, so I suspect the issue is that you're not using the Markdown version of the LW editor. If so, you can switch to that using the dropdown menu directly below the text input box.)

I think diverting people to a real-time discussion location like Discord could be more effective.

Agreed—which raises to mind the following question: does LW currently have anything like an official/primary public chatroom (whether hosted on Discord or elsewhere)? If not, it may be worth creating one, announcing it in a post (for visibility), and maintaining a prominently visible link to it on e.g. the sidebar (which is what many subreddits do).

Do you have preferred arguments (or links to preferred arguments) for/against these claims? From where I stand:

Point 1 looks to be less a positive claim and more a policy criticism (for which I'd need to know what specifically you dislike about the policy in question to respond in more depth), points 2 and 3 are straightforwardly true statements on my model (albeit I'd somewhat weaken my phrasing of point 3; I don't necessarily think agency is "automatic", although I do consider it quite likely to arise by default), point 4 seems likewise true, because the... (read more)

-6Gerald Monroe2mo

For example, I find it hard to predict when and how AGI is developed, and I expect that many of my ideas and predictions about that will be mistaken. This makes me more pessimistic, rather than less, since it seems pretty hard to get AI alignment right if we can't even predict basic things like "when will this system have situational awareness", etc.

Yes, and this can be framed as a consequence of a more general principle, which is that model uncertainty doesn't save you from pessimistic outcomes unless your prior (which after all is what you fall back t... (read more)

I would be interested in helping out with a newbie comment queue to keep it moving quickly so that newbies can have a positive early experience on lesswrong, whereas I would not want to volunteer for the "real" mod team because I don't have the requisite time and skills for reliably showing up for the more nuanced aspects of the role.

Were such a proposal to be adopted, I would be likewise willing to participate.

dxu2mo4026

The sequence starting with this post seemed to me at the time I read it to be a good summary of reasons to reject "Knightian" uncertainty as somehow special, and it continues to seem that way as of today.

Note that Richard is not treating Knightian uncertainty as special and unquantifiable, but instead is giving examples of how to treat it like any other uncertainty, which he is explicitly quantifying and incorporating in his predictions.

I'd prefer calling Richard's notion "model error" to separate the two, but I'm also okay appropriating the term as Richard did to point to something coherent.

interpretability didn't progress at all, or that we know nothing about AI internals at all

No to the former, yes to the latter—which is noteworthy because Eliezer only claimed the latter. That's not a knock on interpretability research, when in fact Eliezer has repeatedly and publicly praised e.g. the work of Chris Olah and Distill. The choice to interpret the claim that we "know nothing about AI internals" as the claim that "no interpretability work has been done", it should be pointed out, was a reading imposed by ShardPhoenix (and subsequently by you).... (read more)

-4Noosphere892mo
I now see where the problem lies. The basic issues I see with this argument are as follows:

1. The implied argument is that if you can't create something by yourself by hand in the field, you know nothing at all about what you are focusing on. This is straightforwardly not true for a lot of fields. For example, I'd probably know quite a lot about Borderlands 3 - not perfectly, but I actually have quite a bit of knowledge, and I could even use save editors or cheatware with video tutorials - but under nearly no circumstances could I actually create Borderlands 3, even if the game with its code already existed, even with a team. This likely generalizes: while neuroscience has some knowledge of the brain, it's not nearly at the point where it could reliably create a human brain from scratch; knowing some things about what cars do is not enough to create a working car; and so on.

In general, I think the error is that you and Eliezer have too high expectations of what some knowledge will bring you. It helps, but in virtually no cases will the knowledge alone allow you to create the thing you are focusing on.

It's possible that our knowledge of the AI's internal workings isn't enough, and that progress is too slow. I might agree or disagree, but at least this would be rational. Right now, I'm seeing basic locally invalid arguments here, and I notice that part of the problem is that you and Eliezer have too binary a view of knowledge, where you either have functionally perfect knowledge or no knowledge at all; but usually our knowledge is neither functionally perfect nor zero.

Edit: This seems conceptually similar to P=NP, in that verifying something and generating something are conjectured to have very different difficulties, and essentially my claim is that verifying something isn't equal to generating something.
dxu2mo3928

Eliezer can't update well on evidence at all, especially if it contradicts doom (in this case it's not too much evidence against doom, but calling it zero evidence is inaccurate.)

I've noticed you repeating this claim in a number of threads, but I don't think I've seen you present evidence sufficient to justify it. In particular, the last time I asked you about this, your response was basically premised on "I think current (weak) systems are going to analogize very well to stronger systems, and this analogy carries the weight of my entire argument."

But i... (read more)

-1Noosphere892mo
While I agree that there are broader prior disagreements, I think that even if we isolate it to the question of whether Eliezer's statement was correct, without baking in priors, the statement that we have no knowledge of AI because they're inscrutable piles of numbers is verifiably wrong. To put it in Eliezer's words, it's a locally invalid argument, and this is known to be false even without the broader prior disagreements.

One could honestly say that the interpretability progress isn't enough. One couldn't honestly say that interpretability didn't progress at all, or that we know nothing about AI internals at all, without massive ignorance.

This is poor news for his epistemics, because note that this is a verifiably wrong statement that Eliezer keeps making without any caveats or limitations. That's a big problem, because if Eliezer can make confidently locally invalid arguments on AI, and persistently repeat that locally invalid argument, then it calls into question how well his epistemics are working on AI, and from my perspective there are really only bad outcomes here. It's not that Eliezer's wrong, it's that he is persistently, confidently wrong about something that's actually verifiable, such that we can point out the wrongness.
dxu2mo29-6

takes a deep breath

(Epistemic status: vague, ill-formed first impressions.)

So that's what we're doing, huh? I suppose EY/MIRI has reached the point where worrying about memetics / optics has become largely a non-concern, in favor of BROADCASTING TO THE WORLD JUST HOW FUCKED WE ARE

I have... complicated thoughts about this. My object-level read of the likely consequences is that I have no idea what the object-level consequences are likely to be, other than that this basically seems to be an attempt at heaving a gigantic rock through the Overton window, for g... (read more)

I think this is probably right. When all hope is gone, try just telling people the truth and see what happens. I don't expect it will work, I don't expect Eliezer expects it to work, but it may be our last chance to stop it.

This seems like a very off-distribution move from Eliezer—which I suspect is in large part the point: when your model predicts doom by default, you go off-distribution in search of higher-variance regions of outcome space.

That's not how I read it.  To me it's an attempt at the simple, obvious strategy of telling people ~all the truth he can about a subject they care a lot about and where he and they have common interests.  This doesn't seem like an attempt to be clever or explore high-variance tails.  More like an attempt to explore the obvious strategy, or to follow the obvious bits of common-sense ethics, now that lots of allegedly clever 4-dimensional chess has turned out stupid.

I just don't know. This seems like a very off-distribution move from Eliezer—which I suspect is in large part the point: when your model predicts doom by default, you go off-distribution in search of higher-variance regions of outcome space. So I suppose from his viewpoint, this action does make some sense; I am (however) vaguely annoyed on behalf of other alignment teams, whose jobs I at least mildly predict will get harder as a result of this.

Personally, I think Eliezer's article is actually just great for trying to get real policy change to happen he... (read more)

Typo:

For example, if an alien tries to sell a basket "Alice loses $1, Bob gains $3", then the market will refuse (because Alice will refuse); and if the alien then switches to selling "Alice gains $3, Alice loses $1" then the market will refuse (because Bob will refuse); but now a certain gain has been passed over.

2So8res2mo
(fixed, thanks)

Yeah, thanks for engaging with me! You've definitely given me some food for thought, which I will probably go and chew on for a bit, instead of immediately replying with anything substantive. (The thing about rewards being more likely to lead to reflectively endorsed preferences feels interesting to me, but I don't have fully put-together thoughts on that yet.)

Hence my point about poetry - the combinatorial argument would rule out ML working at all, because the space of working things is smaller than the space of all things. That poetry, for which we also don't have good definitions, is close enough in abstraction space for us to get it without dying, before we trained something more natural like arithmetic, is evidence that abstraction space is more relevantly compact or that training lets us traverse it faster.

There is (on my model) a large disanalogy between writing poetry and avoiding deception—part of which I poi... (read more)

Has it been quantitatively argued somewhere at all why such naturalness matters?

I mean, the usual (cached) argument I have for this is that the space of possible categories (abstractions) is combinatorially vast—it's literally the powerset of the set of things under consideration, which itself is no slouch in terms of size—and so picking out a particular abstraction from that space, using a non-combinatorially vast amount of training data, is going to be impossible for all but a superminority of "privileged" abstractions.

In this frame, misgeneralizatio... (read more)
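To make the counting claim above concrete, a minimal illustration of how fast a powerset grows relative to the information needed to single out one of its members (an illustration of the size argument only, not a claim about any particular learner):

```python
import math

# The number of possible categories (subsets) over n distinguishable things is
# 2**n, so the information needed to pin down an *arbitrary* category grows
# linearly in n (n bits), while the number of candidate categories explodes.
for n in [10, 100, 1000]:
    num_categories = 2 ** n
    bits_to_specify = math.log2(num_categories)   # exactly n
    print(f"n={n:>5}: {num_categories:.3e} possible categories, "
          f"{bits_to_specify:.0f} bits to single one out")
```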

1Signer2mo
Hence my point about poetry - the combinatorial argument would rule out ML working at all, because the space of working things is smaller than the space of all things. That poetry, for which we also don't have good definitions, is close enough in abstraction space for us to get it without dying, before we trained something more natural like arithmetic, is evidence that abstraction space is more relevantly compact or that training lets us traverse it faster.

These biases are quite robust to perturbations, so they can't be too precise. And genes are not long enough to encode something too unnatural. And we have billions of examples to help us reverse engineer it. And we already have a similar-in-some-ways architecture working.

Why is an AI caring about a diamondoid-shelled bacterium plausible? You can say pretty much the same things about how an AI would learn reflexes to not think about bad consequences, or something. Misgeneralization of capabilities actually happens in practice. But if you assume previous training could teach that, why not assume 10 times more honesty training before the time the AI's thought about the translating technique got it thinking "well, how am I going to explain this to operators?". Otherwise you're just moving your assumption about combinatorial differences from intuition to the concrete example, and then what's the point?

Thanks again for responding! My response here is going to be out-of-order w.r.t. your comment, as I think the middle part here is actually the critical bit:

I’m not sure where you’re getting the “more likely” from. I wonder if you’re sneaking in assumptions in your mental picture, like maybe an assumption that the deception events were only slightly aversive (annoying), or maybe an assumption that the nanotech thing is already cemented in as a very strong reflectively-endorsed (“ego-syntonic” in the human case) goal before any of the aversive deception even

... (read more)

I mean, I’m not making a strong claim that we should punish an AGI for being deceptive and that will definitely indirectly lead to an AGI with an endorsed desire to be non-deceptive. There are a lot of things that can go wrong there. To pick one example, we’re also simultaneously punishing the AGI for “getting caught”. I hope we can come up with a better plan than that, e.g. a plan that “finds” the AGI’s self-concept using interpretability tools, and then intervenes on meta-preferences directly. I don’t have any plan for that, and it seems very hard for va... (read more)

3Signer2mo
Has it been quantitatively argued somewhere at all why such naturalness matters? Like, it's conceivable that "avoid deception" is harder to train, but why so much harder that we can't overcome this with training data bias or something? Because it does work in humans. And "invent nanotech" or "write poetry" are also small targets and training works for them.

Nice, thanks! (Upvoted.)

So, when I try to translate this line of thinking into the context of deception (or other instrumentally undesirable behaviors), I notice that I mostly can't tell what "touching the hot stove" ends up corresponding to. This might seem like a nitpick, but I think it's actually quite a crucial distinction: by substituting a simpler (approximately atomic) action like "touching a hot stove" for a complex phenomenon like deceptive (manipulative) behavior, I think your analogy has elided some important complexities that arise specifically... (read more)

it may very well prioritize the satisfaction of its object-level goals over avoiding [the real generator of the flinches]. In other words, I think the AGI will be more likely to treat the flinches as obstacles to be circumvented, rather than as valuable information to inform the development of its meta-preferences.

Yeah, if we make an AGI that desires two things A&B that trade off against each other, then the desire for A would flow into a meta-preference to not desire B, and if that effect is strong enough then the AGI might self-modify to stop desirin... (read more)

This is ignoring the fact that you're highly skilled at deluding and confusing your audience into thinking that what the original author wrote was X, when they actually wrote a much less stupid or much less bad Y.

This does not seem like it should be possible for arbitrary X and Y, and so if Zack manages to pull it off in some cases, it seems likely that those cases are precisely those in which the original post's claims were somewhat fuzzy or ill-characterized—

(not necessarily through the fault of the author! perhaps the subject matter itself is simply fuz... (read more)

9Zack_M_Davis2mo
(Considering the general problem of how forum moderation should work, rather than my specific guilt or innocence in the dispute at hand)

I think positing non-truth-tracking motivations (which can be more general than "malice or antipathy") makes sense, and that there is a real problem here: namely, that what I called "the culture of unilateral criticism and many-to-many discourse" in the great-grandparent grants a structural advantage to people who have more time to burn arguing on the internet, analogously to how adversarial court systems grant a structural advantage to litigants who can afford a better lawyer.

Unfortunately, I just don't see any solutions to this problem that don't themselves have much more serious problems? Realistically, I think just letting the debate or trial process play out (including the motivated efforts of slick commenters or lawyers) results in better shared maps than trusting a benevolent moderator or judge to decide who deserves to speak.

To the extent that Less Wrong has the potential to do better than other forums, I think it's because our culture and userbase is analogous to a court with a savvier, more intelligent jury (that requires lawyers to make solid arguments, rather than just appealing to their prejudices), not because we've moved beyond the need for non-collaborative debate (even though idealized Bayesian reasoners would not need to debate).
2[DEACTIVATED] Duncan Sabien2mo
(It's not a hypothesis; Zack makes his antipathy in these cases fairly explicit, e.g. "this is the egregore I'm fighting against tooth and nail" or similar. Generally speaking, I have not found Zack's writing to be confusion-inducing when it's not coming from his being triggered or angry or defensive or what-have-you.)

In particular, if we zap the AGI with negative reward when it’s acting from a deceptive motivation and positive reward when it’s acting from a being-helpful motivation, would those zaps turn into a reflectively-endorsed desire for “I am being docile / helpful / etc.”? Maybe, maybe not, I dunno.

Yeah, so this is the part that I (even on my actual model) find implausible (to say nothing of my Nate/Eliezer/MIRI models, which basically scoff and say accusatory things about anthropomorphism here). I think what would really help me understand this is a concrete s... (read more)

8Steven Byrnes2mo
Sure. Let’s assume an AI that uses model-based RL of a similar flavor as (I believe) is used in human brains.

Step 1: The thought “I am touching the hot stove” becomes aversive because it's what I was thinking when I touched the hot stove, which caused pain etc. For details see my discussion of “credit assignment” here [https://www.lesswrong.com/posts/vpdJz4k5BgGzuGo7A/intro-to-brain-like-agi-safety-9-takeaways-from-neuro-2-2-on#9_3__Credit_assignment__is_how_latent_variables_get_painted_with_valence].

Step 2A: The thought “I desire to touch the hot stove” also becomes aversive because of its intimate connection to “I am touching the hot stove” from Step 1 above—i.e., in reality, if I desire to touch the hot stove, then it’s much more likely that I will in fact touch the hot stove, and my internal models are sophisticated enough to have picked up on that fact. Mechanistically, this can happen in either the “forward direction” (when I think “I desire to touch the hot stove”, my brain’s internal models explore the consequences, and that weakly activates “I am touching the hot stove” neurons, which in turn trigger aversion), or in the “backward direction” (involving credit assignment & TD learning, see the diagram about habit-formation here [https://www.lesswrong.com/posts/vpdJz4k5BgGzuGo7A/intro-to-brain-like-agi-safety-9-takeaways-from-neuro-2-2-on#9_2_2_Instrumental___final_preferences_seem_to_be_mixed_together]). Anyway, if the thought “I desire to touch the hot stove” indeed becomes aversive, that’s a kind of meta-preference—a desire not to have a desire.

Step 2B: Conversely and simultaneously, the thought “I desire to not touch the hot stove” becomes appealing for basically similar reasons, and that’s another meta-preference (desire to have a desire).

Happy to discuss more details; see also §10.5.4 here [https://www.lesswrong.com/posts/wucncPjud27mLWZzQ/intro-to-brain-like-agi-safety-10-the-alignment-problem#10_5_4_Manipulating_itself_and_its_learning_proce
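A toy numerical sketch of the steps described above, in the spirit of TD-style credit assignment (purely illustrative; the update rule, learning rate, and numbers are stand-ins, not a model of the brain or of any particular AI):

```python
# Toy sketch: negative valence from a painful event propagates from the
# co-occurring thought to the precursor "desire" thought, and the opposite
# desire picks up positive valence.

values = {
    "I am touching the hot stove": 0.0,
    "I desire to touch the hot stove": 0.0,
    "I desire to NOT touch the hot stove": 0.0,
}
alpha = 0.5  # learning rate

# Step 1: pain paints the co-occurring thought with negative valence
# (credit assignment).
pain = -10.0
values["I am touching the hot stove"] += alpha * (pain - values["I am touching the hot stove"])

# Step 2A (backward direction): the precursor thought inherits some of that
# negative valence, because it reliably leads to the Step-1 thought (TD update).
precursor, successor = "I desire to touch the hot stove", "I am touching the hot stove"
values[precursor] += alpha * (values[successor] - values[precursor])

# Step 2B: the opposite desire gains positive valence (mirrored here purely
# for illustration), giving a desire-to-have-a-desire.
values["I desire to NOT touch the hot stove"] += alpha * (-values[precursor] - 0.0)

for thought, v in values.items():
    print(f"{v:+6.2f}  {thought}")
```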

The important point of the tests in the Pretraining from Human Feedback paper, and the AI saying nice things, is that they show that we can align AI to any goal we want

I don't see how the bolded follows from the unbolded, sorry. Could you explain in more detail how you reached this conclusion?

1Noosphere892mo
The point is that similar techniques can be used to align them, since both goals (or arguably all goals) are functionally arbitrary in what we pick, and important for us.

One major point I did elide is the amount of power seeking involved, since with the niceness goal there's almost no power seeking involved, unlike the existential risk concerns we have. But in some of the tests for alignment in Pretraining from Human Feedback, they showed that they can make models avoid taking certain power-seeking actions, like getting personal identifying information.

In essence, it's at least some evidence that, as AI gets more capable, we can make sure that power-seeking actions can be avoided if it's misaligned with human interests.

I also agree that the comment came across as rude. I mostly give Eliezer a pass for this kind of rudeness because he's wound up in the genuinely awkward position of being a well-known intellectual figure (at least in these circles), which creates a natural asymmetry between him and (most of) his critics.

I'm open to being convinced that I'm making a mistake here, but at present my view is that comments primarily concerning how Eliezer's response tugs at the social fabric (including the upthread reply from iceman) are generally unproductive.

(Quentin, to his ... (read more)

2lc2mo
That's reasonable and I generally agree. I'm not sure what to think about Eliezer's comment atm except that it upsets me when it maybe shouldn't, and that I also understand the awkward position he's in. I definitely don't want to derail the discussion, here.

The problem is that even if the model of Quintin Pope is wrong, there is other evidence that contradicts the AI doom premise that Eliezer ignores, and I believe there is confirmation bias at work here.

I think that this is a statement Eliezer does not believe is true, and which the conversations in the MIRI conversations sequence failed to convince him of. Which is the point: since Eliezer has already engaged in extensive back-and-forth with critics of his broad view (including the likes of Paul Christiano, Richard Ngo, Rohin Shah, etc), there is actually not much continued expected update to be found in engaging with someone else who posts a criticism of his view. Do you think otherwise?

3Noosphere892mo
What I was talking about is that Eliezer (and arguably the entire MIRI-sphere) ignored evidence that AI safety could actually work and doesn't need entirely new paradigms, and one of the best examples of empirical work is Pretraining from Human Feedback. The big improvements compared to other methods are:

1. It can avoid deceptive alignment because it gives a simple goal that's myopic, completely negating the incentives for deceptively aligned AI.

2. It cannot affect the distribution it's trained on, since it's purely offline learning, meaning we can enforce an IID assumption and a Cartesian boundary, completely avoiding embedded agency. It cannot hack the distribution it has, unlike online learning, meaning it can't unboundedly Goodhart the values we instill.

3. Increasing the data set aligns it more and more, essentially meaning we can trust the AI to be aligned as it grows more capable and improves its alignment.

4. The goal found has a small capabilities tax.

There's a post on it I'll link here: https://www.lesswrong.com/posts/8F4dXYriqbsom46x5/pretraining-language-models-with-human-preferences

Now I don't blame Eliezer too much for ignoring this piece specifically, as I think it didn't attract much attention. But the reason I'm mentioning it is that it is evidence against the worldview of Eliezer and a lot of pessimists who believe empirical evidence doesn't work for the alignment field, and Eliezer and a lot of pessimists seem to systematically ignore evidence that harms their case.
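For concreteness, one of the paper's objectives (and, as I recall, its best-performing one) was conditional training: tag each segment of the offline pretraining corpus with a control token according to a preference score, train with the ordinary next-token loss, and condition on the "good" token at inference. A minimal sketch of that idea, where the scorer, threshold, and control tokens are placeholders rather than the paper's exact setup:

```python
# Minimal sketch of PHF-style conditional training on an offline corpus.
# The scorer, threshold, and control tokens are placeholders, not the paper's
# exact setup.

def preference_score(text: str) -> float:
    """Stand-in for a rule-based classifier or reward model, e.g. one that
    flags toxicity or personally identifying information."""
    return 0.0 if "SSN:" in text else 1.0

def tag(text: str, threshold: float = 0.5) -> str:
    token = "<|good|>" if preference_score(text) >= threshold else "<|bad|>"
    return token + text

corpus = ["The cat sat on the mat.", "Here is my SSN: 000-00-0000."]
tagged_corpus = [tag(doc) for doc in corpus]

# Training is then plain offline next-token prediction (cross-entropy) on
# `tagged_corpus`; the objective never depends on the model's own outputs,
# which is the IID / Cartesian-boundary point in (2) above.

# At inference, condition on the "good" prefix to sample preferred behavior:
prompt = "<|good|>" + "User: tell me about yourself.\nAssistant:"
```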

By GPT-2 Nate already should have been updating towards the proposition that big bags of heuristics, composed together, can do human-level cognitive labor

Yeah, I think Nate doesn't buy this (even for much more recent systems such as GPT-3.5/GPT-4, much less GPT-2). To the extent that [my model of] Nate thinks that LLMs/LLM-descended models can do useful ("needle-moving") alignment research, he expects those models to also be dangerous (hence the talk of "conditioning on"); but [my model of] Nate mostly denies the antecedent. Being willing to explore counte... (read more)

dxu3moΩ120

Yeah, I'm not actually convinced humans are "aligned under reflection" in the relevant sense; there are lots of ways to do reflection, and as Holden himself notes in the top-level post:

You have just done a lot of steps, many of which involved reflection, with no particular way to get 'back on track' if you've done some of them in goofy ways

[...]

If the AI does a bunch of screwed-up reflection, it might thereby land in a state where it'd be realistic to do crazy stuff (as humans who have done a lot of reflection sometimes do).

It certainly seems to me that e.... (read more)

9HoldenKarnofsky2mo
I hear you on this concern, but it basically seems similar (IMO) to a concern like: "The future of humanity after N more generations will be ~without value, due to all the reflection humans will do - and all the ways their values will change - between now and then." A large set of "ems" gaining control of the future after a lot of "reflection" seems quite comparable to future humans having control over the future (also after a lot of effective "reflection"). I think there's some validity to worrying about a future with very different values from today's. But I think misaligned AI is (reasonably) usually assumed to diverge in more drastic and/or "bad" ways than humans themselves would if they stayed in control; I think of this difference as the major driver of wanting to align AIs at all. And it seems Nate thinks that the hypothetical training process I outline above gets us something much closer to "misaligned AI" levels of value divergence than to "ems" levels of value divergence.
1Noosphere893mo
My view w.r.t. moral reflection leading to things we perceive as bad I suspect ultimately comes down to the fact that there are too many valid answers to the question "What's moral/ethical?" or "What's the CEV?" Indeed, I think there are an infinite number of valid answers to these questions. This leads to several issues for alignment:

1. Your endpoint in reflection completely depends on your starting assumptions, and these assumptions are choosable.

2. There is no safeguard against someone reflecting and ending up at a point where they harm someone else's values. Thus, seemingly bad values from our perspective can't be guaranteed to be avoided.

3. The endpoints aren't constrained by default, so you have to hope that the reflection process doesn't lead to your values being lessened or violated.

Nate’s take on this section: “I think my current take is: some of the disagreement is in what sort of research output is indicative of needle-moving capability, and historically lots of people have hope about lots of putative alignment work that I think is obviously hopeless, so I'm maybe less optimistic than Holden here about getting a clear signal. But I could imagine there being clear signals in this general neighborhood, and I think it's good to be as explicit as this section is."

Oh, and also: this response from Nate feels weird to me for reasons that I currently seem to lack the enthusiasm/energy/"spoons" to explicate. Leaving this comment as a placeholder to come back to.

Note that I was able to reproduce this result with ChatGPT (not Plus, to be clear) without too much trouble. So at least in this case, I don't think this is an example of something beyond GPT-3.5—which is good, because writing slightly modified quines like this isn't something I would have expected GPT-3.5 to have trouble with!

(When I say "without too much trouble", I specifically mean that ChatGPT's initial response used the open(sys.argv[0]) method to access the file's source code, despite my initial request to avoid this kind of approach. But when I poi... (read more)
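(For reference, the distinction at issue: reading your own source via open(sys.argv[0]) is the shortcut, whereas a genuine quine reconstructs its source from a string template. A minimal Python example of the latter, shown as the standard construction rather than ChatGPT's actual output:)

```python
# A genuine quine: prints its own source exactly, without opening any file.
# (Contrast with the open(sys.argv[0]) shortcut mentioned above.)
s = 's = %r\nprint(s %% s)'
print(s % s)
```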

dxu3mo2415

So the place that my brain reports it gets its own confidence from, is from having done exercises that amount to self-play in the game I mentioned in a thread a little while back, which gives me a variety of intuitions about the rows in your table (where I'm like "doing science well requires CIS-ish stuff" and "the sort of corrigibility you learn in training doesn't generalize how we want, b/c of the interactions w/ the CIS-ish stuff")

(that plus the way that people who hope the game goes the other way, seem to generally be arguing not from the ability to e

... (read more)

I think it’s a lot more reasonable than coherence-theorem-related arguments that had previously been filling a similar slot for me

I'm confused by this sentence. It seems to me that the hypothetical example (and game) proposed by Nate is effectively a concretized way of intuition-pumping the work that coherence theorems (abstractly) describe? I.e. for any system that a coherence theorem says anything about, it will necessarily be the case that as you look at that specific system's development more closely, you will find yourself making strange and surprisin... (read more)
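(For concreteness, the standard money-pump construction that these theorems abstract over: an agent with cyclic preferences will pay a small fee for each "upgrade" around the cycle and end up strictly poorer while holding the option it started with. A toy sketch, with the preferences, fee, and number of rounds as illustrative assumptions:)

```python
# Toy money pump: cyclic (incoherent) preferences C < B < A < C mean the agent
# pays a small fee for each "upgrade" and bleeds money while ending up back
# where it started. Preferences, fee, and round count are illustrative.

prefers = {("B", "C"): True, ("A", "B"): True, ("C", "A"): True}  # cyclic
FEE = 1.0

holding, money = "C", 100.0
for _ in range(3):  # three trips around the cycle
    for offer in ("B", "A", "C"):
        if prefers.get((offer, holding), False):
            holding, money = offer, money - FEE  # pays to trade "up"

print(holding, money)  # still holding "C", but 9.0 poorer
```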

6David Johnston3mo
What coherence theorem do you have in mind that has these implications? For that matter, what implications are you referring to?
dxu3mo3027

Generic (but strong) upvote for more public cruxing (ish) discussions between MIRI and outsiders!

dxu3moΩ120

If you notice and penalize that just because the action is bad, without ever figuring out whether the underlying motivation was bad or not, that still selects against models with bad motivations.

It's plausible that you then get a model with bad motivations that knows not to produce bad actions until it is certain those will not be caught. But it's also plausible that you just get a model with good motivations. I think the more you succeed at noticing bad actions (or good actions for bad reasons) the more likely you should think good motivations are.

but, bu... (read more)

3Rohin Shah3mo
Indeed I am confused why people think Goodharting is effectively-100%-likely to happen and also lead to all the humans dying. Seems incredibly extreme. All the examples people give of Goodharting do not lead to all the humans dying. (Yes, I'm aware that the arguments are more sophisticated than that and "previous examples of Goodharting didn't lead to extinction" isn't a rebuttal to them, but that response does capture some of my attitude towards the more sophisticated arguments, something like "that's a wildly strong conclusion you've drawn from a pretty handwavy and speculative argument".) -------------------------------------------------------------------------------- Ultimately I think you just want to compare various kinds of models and ask how likely they are to arise (assuming you are staying within the scaled up neural networks as AGI paradigm). Some models you could consider: 1. The idealized aligned model, which does whatever it thinks is best for humans 2. The savvy aligned model, which wants to help humans but knows that it should play into human biases (e.g. by being sycophantic) in order to get high reward and not be selected against by gradient descent 3. The deceptively aligned model, which wants some misaligned goal (say paperclips), but knows that it should behave well until it can execute a treacherous turn 4. The bag of heuristics model, which (like a human) has a mix of various motivations, and mostly executes past strategies that have worked out well, imitating many of them from broader culture, without a great understanding of why they work, which tends to lead to high reward without extreme consequentialism. (Really I think everything is going to be (4) until significantly past human-level, but will be on a spectrum of how close they are to (2) or (3).) Plausibly you don't get (1) because it doesn't get particularly high reward relative to the others. But (2), (3) and (4) all seem like they could