I concretely disagree with (what I see as) your implied premise that the outer (training) task has any direct influence on the inner optimizer's cognition. I think this disagreement (which I internally feel like I've already tried to make a number of times) has been largely ignored so far. As a result, many of the things you wrote seem to me to be answerable by largely the same objection:
...As I see it: in training, it was optimized for that. The trained model likely contains one or more optimizers optimized by that training. But what the model is trained/o
Full Solomonoff Induction on a hypercomputer absolutely does not just "learn very similar internal functions/models"; it effectively recreates actual human brains.
Full SI on a hypercomputer is equivalent to instantiating a computational multiverse and allowing us to access it. Reading out data samples corresponding to text from that is equivalent to reading out samples of actual text produced by actual human brains in other universes close to ours.
...yes? And this is obviously very, very different from how humans represent things internally?
I mean, for ...
Yeah, I'm growing increasingly confident that we're talking about different things. I'm not referring to "masks" in the sense that you mean it.
...I don't know what you mean by "one" or by "inner". I would expect different masks to behave differently, acting as if optimizing different things (though that could be narrowed using RLHF), but they could re-use components between them. So, you could have, for example, a single calculation system that is reused but takes as input a bunch of parameters that have different values for different masks, which (ag
I want to revisit what Rob actually wrote:
If you sampled a random plan from the space of all writable plans (weighted by length, in any extant formal language), and all we knew about the plan is that executing it would successfully achieve some superhumanly ambitious technological goal like "invent fast-running whole-brain emulation", then hitting a button to execute the plan would kill all humans, with very high probability.
(emphasis mine)
That sounds a whole lot like it's invoking a simplicity prior to me!
Note that I didn't actually reply to that quote. Sure, that's an explicit simplicity prior. However, there's a large difference under the hood between using an explicit simplicity prior on plan length vs. an implicit simplicity prior on the world and action models which generate plans. The latter is what is more relevant for intrinsic similarity to human thought processes (or not).
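(To spell out the distinction in rough notation of my own: the quoted prior weights a candidate plan $p$ directly by its description length, $P(p) \propto 2^{-\ell(p)}$, whereas the implicit version puts the simplicity pressure on the generators, weighting a world/action model $M$ by roughly $2^{-\ell(M)}$ and only then having $M$ produce plans; the two can rank the very same plan quite differently.)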
LLMs and human brains learn from basically the same data with similar training objectives powered by universal approximations of bayesian inference and thus learn very similar internal functions/models.
This argument proves too much. A Solomonoff inductor (AIXI) running on a hypercomputer would also "learn from basically the same data" (sensory data produced by the physical universe) with "similar training objectives" (predict the next bit of sensory information) using "universal approximations of Bayesian inference" (a perfect approximation, in this cas...
Full Solomonoff Induction on a hypercomputer absolutely does not just "learn very similar internal functions/models"; it effectively recreates actual human brains.
Full SI on a hypercomputer is equivalent to instantiating a computational multiverse and allowing us to access it. Reading out data samples corresponding to text from that is equivalent to reading out samples of actual text produced by actual human brains in other universes close to ours.
...you need to first investigate the actual internal representations of the systems in question, and verify that
E.g. a system capable of correctly answering questions like "given such-and-such chess position, what is the best move for the current player?" must in fact be performing agentic/search-like thoughts internally, since there is no other way to correctly answer this question.
Yes, but that sort of question is in my view answered by the "mask", not by something outside the mask.
I don't think this parses for me. The computation performed to answer the question occurs inside the LLM, yes? Whether you classify said computation as coming from "the mask" or no...
In order to make the doom conclusion actually go through, arguments should make stronger claims about the priors involved, and how they differ from those of the human learning process.
Isn't it enough that they do differ? Why do we need to be able to accurately/precisely characterize the nature of the difference, to conclude that an arbitrary inductive bias different from our own is unlikely to sample the same kinds of plans we do?
That's not at all clear to me. Inductive biases clearly differ between humans, yet we are not all terminally misaligned with each other. E.g., split-brain patients are not all weird value aliens, despite a significant difference in architecture. Also, training on human-originated data causes networks to learn human-like inductive biases (at least somewhat).
It would conflict with a deceptive awake Shoggoth, but IMO such a thing is unlikely because the model is super-well optimized for next token prediction
Yeah, so I think I concretely disagree with this. I don't think being "super-well optimized" for a general task like sequence prediction (and what does it mean to be "super-well optimized" anyway, as opposed to "badly optimized" or some such?) means that inner optimizers fail to arise in the limit of sufficient ability, or that said inner optimizers will be aligned on the outer goal of sequence prediction...
I think I'm having some trouble parsing this, but not in a way that necessarily suggests your ideas are incoherent and/or bad—simply that your (self-admittedly) unusual communication style is making it hard for me to understand what you are saying.
It's possible you wrote this post the way you did because this is the way the ideas in question were natively represented in your brain, and translating them out of that representation and into something more third-party legible would have been effortful and/or infeasible. If so, there's plausibly not much to be ...
With the caveat that I think this sort of “litigation of minutiae of nuance” is of very limited utility
Yeah, I think I probably agree.
would you consider “you A’d someone as a consequence of their B’ing” different from both the other two forms? Synonymous with them both? Synonymous with one but not the other?
Synonymous as far as I can tell. (If there's an actual distinction in your view, which you're currently trying to lead me to via some kind of roundabout, Socratic pathway, I'd appreciate skipping to the part where you just tell me what you think the distinction is.)
As a single point of evidence: it's immediately obvious to me what the difference is between "X is true" and "I think X" (for starters, note that these two sentences have different subjects, with the former's subject being "X" and the latter's being "I"). On the other hand, "you A'd someone due to their B'ing" and "you A'd someone for B'ing" do, actually, sound synonymous to me—and although I'm open to the idea that there's a distinction I'm missing here (just as there might be people to whom the first distinction is invisible), from where I currently stan...
If so, I find this reasoning unconvincing
Why?
I mostly don't agree that "the pattern is clear"—which is to say, I do take issue with saying "we do not need to imagine counterfactuals". Here is (to my mind) a salient example of a top-level comment which provides an example illustrating the point of the OP, without the need for prompting.
I think this is mostly what happens, in the absence of such prompting: if someone thinks of a useful example, they can provide it in the comments (and accrue social credit/karma for their contribution, if indeed other...
This, however, assumes that “formative evaluations” must be complete works by single contributors, rather than collaborative efforts contributed to by multiple commenters. That is an unrealistic and unproductive assumption, and will lead to less evaluative work being done overall, not more.
I am curious as to your assessment of the degree of work done by a naked "this seems unclear, please explain"?
My own assessment would place the value of this (and nothing else) at fairly close to zero—unless, of course, you are implicitly taking credit for some of the...
You continue to assert things without justification, which is fine insofar as your goal is not to persuade others. And perhaps this isn't your goal! Perhaps your goal is merely to make it clear what your beliefs are, without necessarily providing the reasoning/evidence/argumentation that would convince a neutral observer to believe the same things you do.
But in that case, you are not, in fact, licensed to act surprised, and to call others "irrational", if they fail to update to your position after merely seeing it stated. You haven't actually given anyone ...
...
- one is straightforwardly true. Aging is going to kill every living creature. Aging is caused by complex interactions between biological systems and bad evolved code. An agent able to analyze thousands of simultaneous interactions across millions of patients, and essentially decompile the bad code (by modeling all proteins/all binding sites in a living human), is likely required to shut it off, but it is highly likely that with such an agent and with such tools you can in fact save most patients from aging. A system with enough capabilities to consider all
Categories like “conflicts of interest”, “discussions about who should be banned”, “arguments about moderation in cases in which you’re involved”, etc., already constitute “evidence” that push the conclusion away from the prior of “on the whole, people are more likely to say true things than false things”, without even getting into anything more specific.
The strength of the evidence is, in fact, a relevant input. And of the evidential strength conferred by the style of reasoning employed here, much has already been written.
...You’ve misunderstood. My poi
I'm not sure what predictions you're making that are different than mine, other than maybe "a research program that skips NNs and just tries to build the representations that they build up directly, without looking at NNs, has reasonable chances of success." Which doesn't seem like one you'd actually want to make.
I think I would, actually, want to make this prediction. The problem is that I'd want to make it primarily in the counterfactual world where the NN approach had been abandoned and/or declared off-limits, since in any world where both approaches ex...
This is a claim so general as to be meaningless. If we knew absolutely nothing except “a person said a thing”, then retreating to this sort of maximally-vague prior might be relevant. But we in fact are discussing a quite specific situation, with quite specific particular and categorical features. There is no good reason to believe that the quoted prior survives that descent to specificity unscathed (and indeed it seems clear to me that it very much does not).
The prior does in fact survive, in the absence of evidence that pushes one's conclusion away fr...
Your link looks broken; here's a working version.
(Note: your formatting looks correct to me, so I suspect the issue is that you're not using the Markdown version of the LW editor. If so, you can switch to that using the dropdown menu directly below the text input box.)
I think diverting people to a real-time discussion location like Discord could be more effective.
Agreed—which brings to mind the following question: does LW currently have anything like an official/primary public chatroom (whether hosted on Discord or elsewhere)? If not, it may be worth creating one, announcing it in a post (for visibility), and maintaining a prominently visible link to it on e.g. the sidebar (which is what many subreddits do).
Do you have preferred arguments (or links to preferred arguments) for/against these claims? From where I stand:
Point 1 looks to be less a positive claim and more a policy criticism (for which I'd need to know what specifically you dislike about the policy in question to respond in more depth), points 2 and 3 are straightforwardly true statements on my model (albeit I'd somewhat weaken my phrasing of point 3; I don't necessarily think agency is "automatic", although I do consider it quite likely to arise by default), point 4 seems likewise true, because the...
For example, I find it hard to predict when and how AGI is developed, and I expect that many of my ideas and predictions about that will be mistaken. This makes me more pessimistic, rather than less, since it seems pretty hard to get AI alignment right if we can't even predict basic things like "when will this system have situational awareness", etc.
Yes, and this can be framed as a consequence of a more general principle, which is that model uncertainty doesn't save you from pessimistic outcomes unless your prior (which after all is what you fall back t...
I would be interested in helping out with a newbie comment queue to keep it moving quickly so that newbies can have a positive early experience on lesswrong, whereas I would not want to volunteer for the "real" mod team because I don't have the requisite time and skills for reliably showing up for the more nuanced aspects of the role.
Were such a proposal to be adopted, I would be likewise willing to participate.
The sequence starting with this post seemed to me at the time I read it to be a good summary of reasons to reject "Knightian" uncertainty as somehow special, and it continues to seem that way as of today.
Note that Richard is not treating Knightian uncertainty as special and unquantifiable; instead, he is giving examples of how to treat it like any other uncertainty, which he explicitly quantifies and incorporates into his predictions.
I'd prefer calling Richard's notion "model error" to separate the two, but I'm also okay with appropriating the term as Richard did to point to something coherent.
interpretability didn't progress at all, or that we know nothing about AI internals at all
No to the former, yes to the latter—which is noteworthy because Eliezer only claimed the latter. That's not a knock on interpretability research; Eliezer has in fact repeatedly and publicly praised e.g. the work of Chris Olah and Distill. The choice to interpret the claim that we "know nothing about AI internals" as the claim that "no interpretability work has been done", it should be pointed out, was a reading imposed by ShardPhoenix (and subsequently by you)....
Eliezer can't update well on evidence at all, especially if it contradicts doom (in this case it's not too much evidence against doom, but calling it zero evidence is inaccurate.)
I've noticed you repeating this claim in a number of threads, but I don't think I've seen you present evidence sufficient to justify it. In particular, the last time I asked you about this, your response was basically premised on "I think current (weak) systems are going to analogize very well to stronger systems, and this analogy carries the weight of my entire argument."
But i...
takes a deep breath
(Epistemic status: vague, ill-formed first impressions.)
So that's what we're doing, huh? I suppose EY/MIRI has reached the point where worrying about memetics / optics has become largely a non-concern, in favor of BROADCASTING TO THE WORLD JUST HOW FUCKED WE ARE
I have... complicated thoughts about this. My object-level read of the likely consequences is that I have no idea what the object-level consequences are likely to be, other than that this basically seems to be an attempt at heaving a gigantic rock through the Overton window, for g...
I think this is probably right. When all hope is gone, try just telling people the truth and see what happens. I don't expect it will work, I don't expect Eliezer expects it to work, but it may be our last chance to stop it.
This seems like a very off-distribution move from Eliezer—which I suspect is in large part the point: when your model predicts doom by default, you go off-distribution in search of higher-variance regions of outcome space.
That's not how I read it. To me it's an attempt at the simple, obvious strategy of telling people ~all the truth he can about a subject they care a lot about and where he and they have common interests. This doesn't seem like an attempt to be clever or explore high-variance tails. More like an attempt to explore the obvious strategy, or to follow the obvious bits of common-sense ethics, now that lots of allegedly clever 4-dimensional chess has turned out stupid.
I just don't know. This seems like a very off-distribution move from Eliezer—which I suspect is in large part the point: when your model predicts doom by default, you go off-distribution in search of higher-variance regions of outcome space. So I suppose from his viewpoint, this action does make some sense; I am (however) vaguely annoyed on behalf of other alignment teams, whose jobs I at least mildly predict will get harder as a result of this.
Personally, I think Eliezer's article is actually just great for trying to get real policy change to happen he...
Typo:
For example, if an alien tries to sell a basket "Alice loses $1, Bob gains $3", then the market will refuse (because Alice will refuse); and if the alien then switches to selling "Alice gains $3, Alice loses $1" then the market will refuse (because Bob will refuse); but now a certain gain has been passed over.
Yeah, thanks for engaging with me! You've definitely given me some food for thought, which I will probably go and chew on for a bit, instead of immediately replying with anything substantive. (The thing about rewards being more likely to lead to reflectively endorsed preferences feels interesting to me, but I don't have fully put-together thoughts on that yet.)
Hence my point about poetry: a combinatorial argument would rule out ML working at all, because the space of working things is smaller than the space of all things. That poetry, for which we also don't have good definitions, is close enough in abstraction space for us to get it without dying, before we trained something more natural like arithmetic, is evidence that abstraction space is more relevantly compact or that training lets us traverse it faster.
There is (on my model) a large disanalogy between writing poetry and avoiding deception—part of which I poi...
Has it been quantitatively argued anywhere at all why such naturalness matters?
I mean, the usual (cached) argument I have for this is that the space of possible categories (abstractions) is combinatorially vast—it's literally the powerset of the set of things under consideration, which itself is no slouch in terms of size—and so picking out a particular abstraction from that space, using a non-combinatorially vast amount of training data, is going to be impossible for all but a superminority of "privileged" abstractions.
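(To put rough numbers on that, purely as my own illustration: for a set $S$ of candidate features, the number of possible categories is $|\mathcal{P}(S)| = 2^{|S|}$, so even $|S| = 100$ already yields roughly $1.3 \times 10^{30}$ candidate abstractions, vastly more than any feasible amount of training data could discriminate between without a strong prior doing most of the work.)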
In this frame, misgeneralizatio...
Thanks again for responding! My response here is going to be out-of-order w.r.t. your comment, as I think the middle part here is actually the critical bit:
...I’m not sure where you’re getting the “more likely” from. I wonder if you’re sneaking in assumptions in your mental picture, like maybe an assumption that the deception events were only slightly aversive (annoying), or maybe an assumption that the nanotech thing is already cemented in as a very strong reflectively-endorsed (“ego-syntonic” in the human case) goal before any of the aversive deception even
I mean, I’m not making a strong claim that we should punish an AGI for being deceptive and that will definitely indirectly lead to an AGI with an endorsed desire to be non-deceptive. There are a lot of things that can go wrong there. To pick one example, we’re also simultaneously punishing the AGI for “getting caught”. I hope we can come up with a better plan than that, e.g. a plan that “finds” the AGI’s self-concept using interpretability tools, and then intervenes on meta-preferences directly. I don’t have any plan for that, and it seems very hard for va...
Nice, thanks! (Upvoted.)
So, when I try to translate this line of thinking into the context of deception (or other instrumentally undesirable behaviors), I notice that I mostly can't tell what "touching the hot stove" ends up corresponding to. This might seem like a nitpick, but I think it's actually quite a crucial distinction: by substituting a complex phenomenon like deceptive (manipulative) behavior for a simpler (approximately atomic) action like "touching a hot stove", I think your analogy has elided some important complexities that arise specifically...
it may very well prioritize the satisfaction of its object-level goals over avoiding [the real generator of the flinches]. In other words, I think the AGI will be more likely to treat the flinches as obstacles to be circumvented, rather than as valuable information to inform the development of its meta-preferences.
Yeah, if we make an AGI that desires two things A&B that trade off against each other, then the desire for A would flow into a meta-preference to not desire B, and if that effect is strong enough then the AGI might self-modify to stop desirin...
This is ignoring the fact that you're highly skilled at deluding and confusing your audience into thinking that what the original author wrote was X, when they actually wrote a much less stupid or much less bad Y.
This does not seem like it should be possible for arbitrary X and Y, and so if Zack manages to pull it off in some cases, it seems likely that those cases are precisely those in which the original post's claims were somewhat fuzzy or ill-characterized—
(not necessarily through the fault of the author! perhaps the subject matter itself is simply fuz...
In particular, if we zap the AGI with negative reward when it’s acting from a deceptive motivation and positive reward when it’s acting from a being-helpful motivation, would those zaps turn into a reflectively-endorsed desire for “I am being docile / helpful / etc.”? Maybe, maybe not, I dunno.
Yeah, so this is the part that I (even on my actual model) find implausible (to say nothing of my Nate/Eliezer/MIRI models, which basically scoff and say accusatory things about anthropomorphism here). I think what would really help me understand this is a concrete s...
The important point of the tests in the Pretraining from Human Feedback paper, and the AI saying nice things, is that they show that we can align AI to any goal we want
I don't see how the bolded follows from the unbolded, sorry. Could you explain in more detail how you reached this conclusion?
I also agree that the comment came across as rude. I mostly give Eliezer a pass for this kind of rudeness because he's wound up in the genuinely awkward position of being a well-known intellectual figure (at least in these circles), which creates a natural asymmetry between him and (most of) his critics.
I'm open to being convinced that I'm making a mistake here, but at present my view is that comments primarily concerning how Eliezer's response tugs at the social fabric (including the upthread reply from iceman) are generally unproductive.
(Quintin, to his ...
The problem is that even if the model of Quintin Pope is wrong, there is other evidence that contradicts the AI doom premise that Eliezer ignores, and I believe there is confirmation bias at work here.
I think that this is a statement Eliezer does not believe is true, and which the conversations in the MIRI conversations sequence failed to convince him of. Which is the point: since Eliezer has already engaged in extensive back-and-forth with critics of his broad view (including the likes of Paul Christiano, Richard Ngo, Rohin Shah, etc), there is actually not much continued expected update to be found in engaging with someone else who posts a criticism of his view. Do you think otherwise?
By GPT-2 Nate already should have been updating towards the proposition that big bags of heuristics, composed together, can do human-level cognitive labor
Yeah, I think Nate doesn't buy this (even for much more recent systems such as GPT-3.5/GPT-4, much less GPT-2). To the extent that [my model of] Nate thinks that LLMs/LLM-descended models can do useful ("needle-moving") alignment research, he expects those models to also be dangerous (hence the talk of "conditioning on"); but [my model of] Nate mostly denies the antecedent. Being willing to explore counte...
Yeah, I'm not actually convinced humans are "aligned under reflection" in the relevant sense; there are lots of ways to do reflection, and as Holden himself notes in the top-level post:
You have just done a lot of steps, many of which involved reflection, with no particular way to get 'back on track' if you've done some of them in goofy ways
[...]
If the AI does a bunch of screwed-up reflection, it might thereby land in a state where it'd be realistic to do crazy stuff (as humans who have done a lot of reflection sometimes do).
It certainly seems to me that e....
Nate’s take on this section: “I think my current take is: some of the disagreement is in what sort of research output is indicative of needle-moving capability, and historically lots of people have hope about lots of putative alignment work that I think is obviously hopeless, so I'm maybe less optimistic than Holden here about getting a clear signal. But I could imagine there being clear signals in this general neighborhood, and I think it's good to be as explicit as this section is.”
Oh, and also: this response from Nate feels weird to me for reasons that I currently seem to lack the enthusiasm/energy/"spoons" to explicate. Leaving this comment as a placeholder to come back to.
Note that I was able to reproduce this result with ChatGPT (not Plus, to be clear) without too much trouble. So at least in this case, I don't think this is an example of something beyond GPT-3.5—which is good, because writing slightly modified quines like this isn't something I would have expected GPT-3.5 to have trouble with!
(When I say "without too much trouble", I specifically mean that ChatGPT's initial response used the `open(sys.argv[0])` method to access the file's source code, despite my initial request to avoid this kind of approach. But when I poi...
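For concreteness, here is a minimal sketch of the sort of fully self-contained quine I had in mind; this is my own illustration, not ChatGPT's actual output:

```python
# Minimal self-contained Python quine: the two code lines below print
# themselves exactly, with no file access (no open(sys.argv[0]) trick).
s = 's = %r\nprint(s %% s)'
print(s % s)
```

Running it reproduces those two lines verbatim, which is the behavior I was asking for.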
...So the place that my brain reports it gets its own confidence from, is from having done exercises that amount to self-play in the game I mentioned in a thread a little while back, which gives me a variety of intuitions about the rows in your table (where I'm like "doing science well requires CIS-ish stuff" and "the sort of corrigibility you learn in training doesn't generalize how we want, b/c of the interactions w/ the CIS-ish stuff")
(that plus the way that people who hope the game goes the other way, seem to generally be arguing not from the ability to e
I think it’s a lot more reasonable than coherence-theorem-related arguments that had previously been filling a similar slot for me
I'm confused by this sentence. It seems to me that the hypothetical example (and game) proposed by Nate is effectively a concretized way of intuition-pumping the work that coherence theorems (abstractly) describe? I.e. for any system that a coherence theorem says anything about, it will necessarily be the case that as you look at that specific system's development more closely, you will find yourself making strange and surprisin...
Generic (but strong) upvote for more public cruxing (ish) discussions between MIRI and outsiders!
If you notice and penalize that just because the action is bad, without ever figuring out whether the underlying motivation was bad or not, that still selects against models with bad motivations.
It's plausible that you then get a model with bad motivations that knows not to produce bad actions until it is certain those will not be caught. But it's also plausible that you just get a model with good motivations. I think the more you succeed at noticing bad actions (or good actions for bad reasons) the more likely you should think good motivations are.
but, bu...
RE: decision theory w.r.t. how "other powerful beings" might respond - I really do think Nate has already argued this, and his arguments continue to seem more compelling to me than the opposition's. Relevant quotes include:
...