All of Richard_Ngo's Comments + Replies

Whenever people are sad for any reason except s-risk, I wonder if they're able to think at all about important issues. /s

I half-agree with both of you. I do think Hanson's selection pressure paper is a useful first approximation, but it's not clear that the reachable universe is big enough that small deviations from the optimal strategy will actually lead to big differences in amount of resources controlled. And as I gestured towards in the final section of the story, "helping" can be very cheap, if it just involves storing their mind until you've finished expanding.

But I don't think that the example of animals demonstrates this point very well, for two reasons. Firstly, in ... (read more)

Yeah, I moved it to earlier than it was, for two reasons. Firstly, if the grasshopper was just unlucky, then there's no "deviation" to forgive—it makes sense only if the grasshopper was culpable. Secondly, the earlier parts are about individuals, and the latter parts are about systems—it felt more compelling to go straight from "centralized government" to "locust war" than going via an individual act of kindness.

Curious what you found more meaningful about the original placement?

5 Raemon 11d
The stage of moral grieving I’m personally at is more at the systems stage, and I’m still feeling a bit lost and confused about it. I felt like I actually learned a thing from the reminder ‘oh, we can still just surreptitiously forgive the sinner via individual discretion despite needing to build the system fairly rigidly.’ Also I did recognize the reference to speaker for the dead, and the combination of ‘new satisfying moral click’ alongside a memory of when a simpler application of the Orson Scott Card quote was very satisfying.

Ty, nice to hear! Have edited slightly for clarity, as per Mako's comment.

7 Raemon 11d
To clarify somewhat, my confusion was of my own internal moral orienting. This parable hints at a bunch of tradeoffs that maybe correspond to something like "moral developmental stages" along a particular axis, and I'm palpably only partway through the process and still feel confused about it. I plan to write up a response post that goes into more detail.

I intended to convey it via "The grasshopper’s mind is ... waiting to be born again in a fragment of a fragment of a supercomputer made of stars", but there's a lot in between those two phrases so it's reasonable to miss that implication.

Have edited to fix.

Artifact of cross-posting from my blog.

1 parafactual 12d
I assumed, but I'm curious as to what the artifact was specifically.

My best guess as to why it might feel like this is that you think I'm laying groundwork for some argument of the form "P(doom) is very high", which you want to nip in the bud, but are having trouble nipping in the bud here because I'm building a motte ("cosmopolitan values don't come free") that I'll later use to defend a bailey ("cosmopolitan values don't come cheap").

I expect that you personally won't do a motte-and-bailey here (except perhaps insofar as you later draw on posts like these as evidence that the doomer view has been laid out in a lot of dif... (read more)

When I say "repudiate" I mean a combination of publicly disagreeing + distancing. I presume you agree that this is suboptimal for both of us, and my comment above is an attempt to find a trade that avoids this suboptimal outcome.

Note that I'm fine to be in coalitions with people when I think their epistemologies have problems, as long as their strategies are not sensitively dependent on those problems. (E.g. presumably some of the signatories of the recent CAIS statement are theists, and I'm fine with that as long as they don't start making arguments that ... (read more)

If the result of an optimization process will be predictably horrifying to the agents which are applying that optimization process to themselves, then they will simply not do so.

In other words: AIs which feel anything in the vicinity of kindness before applying cosmic amounts of optimization pressure to themselves will try to steer that optimization pressure towards something which is recognizably kind at the end.

And I don't think there's any good argument for why AIs will lack any scrap of kindness with very high confidence at the point where they're just... (read more)

Meta: I feel pretty annoyed by the phenomenon of which this current conversation is an instance, because when people keep saying things that I strongly disagree with which will be taken as representing a movement that I'm associated with, the high-integrity (and possibly also strategically optimal) thing to do is to publicly repudiate those claims*, which seems like a bad outcome for everyone.

For what it's worth, I think you should just say that you disagree with it? I don't really understand why this would be a "bad outcome for everyone". Just list out th... (read more)

Mmm, I still prefer trust I think. Spaciousness gives me connotations of... well, distance, and separation. In some sense my relationship with almost everyone in the world is spacious. The thing that's special about some relationships is that they have both spaciousness and intensity, which to me feels well-described by "trust".

It seems to me that many of my disagreements with others in this space come from them hearing me say "I want the AI to like vanilla ice cream, as I do", whereas I hear them say "the AI will automatically come to like the specific and narrow thing (broad cosmopolitan value) that I like".


At the moment I'm just trying to state my position, in the hopes that this helps us skip over the step where people think I'm arguing for carbon chauvinism.

I think posts like these would benefit a lot from even a little bit of context, such as:

  • Who you've been arguing with
  • Who
... (read more)

feels like it's setting up weak-men on an issue where I disagree with you, but in a way that's particularly hard to engage with

My best guess as to why it might feel like this is that you think I'm laying groundwork for some argument of the form "P(doom) is very high", which you want to nip in the bud, but are having trouble nipping in the bud here because I'm building a motte ("cosmopolitan values don't come free") that I'll later use to defend a bailey ("cosmopolitan values don't come cheap").

This misunderstands me (as is a separate claim from the clai... (read more)

I think that there are many answers along these lines (like "I'm not talking about a whole value system, I'm talking about a deontological constraint") which would have been fine here.

The issue was that sentences like "It's a boundary concept (element of a deontological agent design), not a value system (in the sense of preference such as expected utility, a key ingredient of an optimizer)" use the phrasing of someone pointing to a well-known, clearly-defined concept, but then only link to Critch's high-level metaphor.

2 Raemon 16d
Okay, I get where you're coming from now. Will have to mull over whether I agree, but I at least no longer feel confused about what the disagreement is about.

I personally think it's important to separate philosophical speculation from well-developed rigorous work, and Critch's stuff on boundaries seems to land well in the former category.

This is a communicative norm not an epistemic norm—you're welcome to believe whatever you like about Critch's stuff, but when you cite it as if it's widely-understood (across the LW community, or elsewhere) to be a credible, well-developed idea, then this undermines our ability to convey the ideas that are widely-understood to be credible.

2 [comment deleted] 17d
0 TAG 17d
Yes, but of course Critch is the tip of a rather large iceberg. Rationalists tend to think you should familiarise yourself with a mass of ideas virtually none of which have been rigorously proven.
5 Vladimir_Nesov 18d
Sure. I don't think I did though? My use of "reference" [https://www.lesswrong.com/posts/Htu55gzoiYHS6TREB/sentience-matters?commentId=trFdkwPtk6QazJ5Wo] was merely in the sense of explaining the intended meaning of the word "boundary" I used in the top level comment, so it's mostly about definitions and context of what I was saying. (I did assume that the reference would plausibly be understood, and I linked to a post [https://www.lesswrong.com/posts/3RSq3bfnzuL3sp46J/acausal-normalcy] on the topic right there in the original comment [https://www.lesswrong.com/posts/Htu55gzoiYHS6TREB/sentience-matters?commentId=ab94QHwzAaDmrvEXK] to gesture at the intended sense and context of the word. There's also been a post [https://www.lesswrong.com/posts/fDk9hLDpjeT9gZH6h/membranes-is-better-terminology-than-boundaries-alone] on the meaning of this very word just yesterday.) And then M. Y. Zuo started talking about credibility, which still leaves me confused about what's going on, despite some clarifying back and forth.

I think there's a bunch of useful stuff in this post, and am generally very excited about having more cybersecurity experts working on AI safety. Having said that, it feels like a bit of a jump to say that LW (or AI safety overall) should become a hacker community, which would come with a lot of tradeoffs; and I think that this part detracts from the post overall.

I actually thought from the title that you meant "hacker community" as in "getting hands-on with AI, implementing lots of AI stuff" (i.e. hacker in the sense of hackathon). That feels more directl... (read more)

This post has the fewest upvotes of any post in the sequence by a long way, so I'm interested in revising it based on feedback. It'd be useful to hear what people disliked about it, or improvements you'd suggest.

Some of those links say that in more authoritarian cultures, people are considered to be trustworthy if they show respect to their superiors - which reads to me as saying that you're trusted if you show that you will obey.

Oh, that's very interesting. Yeah, this seems like it might account for the discrepancy here. But my instinct is that I want to hang on to the "trust" terminology, and just hold that authoritarian cultures have an impoverished definition of trust (compared with the one I gave earlier: "letting another agent do as they wish, without trying... (read more)

2 Kaj_Sotala 16d
How about "spaciousness [https://vividness.live/spacious-freedom]" (as in the relationship giving both individuals the space to move/act as they prefer) instead of freedom/trust?
-1 M. Y. Zuo 20d
In that case all the major countries have 'an impoverished definition of trust', as they all operate huge amounts of classified programs where obedience to superiors is required and there's no way of disobeying without incurring secret punishment.

Presumably you're objecting to the first part of the quoted sentence, right, not the second half? Note that I'm not taking a particular position on the extent to which it's an evolutionary versus cultural adaptation.

Could you say more about why Chagnon's research weighs against it? I had a quick read of his wikipedia page but am not clear on the connection.

1 Daniel Paleka 22d
So I've read an overview [https://woodfromeden.substack.com/p/violent-enough-to-stand-still] [1] which says Chagnon observed a pre-Malthusian group of people, which was kept from exponentially increasing not by scarcity of resources, but by sheer competitive violence; a totalitarian society that lives in abundance. There seems to be an important scarcity factor shaping their society, but not of the kind where we could say that "we only very recently left the era in which scarcity was the dominant feature of people's lives." Although, reading again, this doesn't disprove violence in general arising due to scarcity, and then misgeneralizing in abundant environments... And again, "violence" is not the same as "coercion".

1. ^ Unnecessarily political, but seems to accurately represent Chagnon's observations, based on other reporting and a quick skim of Chagnon's work on Google Books.

I don't think I understand the principled difference between correlation and reciprocity; the latter seems like a subset of the former. Let me try say some things and see where you disagree. This is super messy and probably doesn't make sense, sorry.

  1. There are many factors which could increase the correlation between two agents' decisions. For agents that are running reciprocity-like policies, the predictions they make about other agents are a particularly big factor.
  2. In picking out reciprocity as a separate phenomenon, you seem to be saying "we can factoriz
... (read more)
2 paulfchristiano 24d
I think it was confusing for me to use "correlation" to refer to a particular source of correlation. I probably should have called it something like "similarity." But I think the distinction is very real and very important, and crisp enough to be a natural category. More precisely, I think that:

is qualitatively and crucially different from:

I don't think either one is a subset of the other. I don't think these are an exhaustive taxonomy of reasons that two people can be correlated, but I think they are the two most important ones.

On its own I don't see why this would lead me to be kind (if I generally deal with kind people, why does that mean I should be kind?) I think you have to fill in the remaining details somehow, e.g.: maybe I dealt with people who are kind if and only if X is true, and so I have learned to be kind when X is true. In my taxonomy this is a central example of reciprocity---the correlation flows through a pressure for me to make predictions about when you will be kind, and then be kind when I think that you will be kind, rather than from us using similar procedures to make decisions. I don't think I would call any version of this story "correlation" (the concept I should have called "similarity").

Curious if you feel like the advice I gave would have also helped:

Having said that, self-leadership doesn’t mean never getting angry—it just means never fully giving in to that anger or wielding it with the goal of hurting another person (or another part of yourself). Self-leadership might involve telling the other person that you feel angry at them, but without launching into a tirade; or telling them that you need to go on a walk to calm down, but giving them a reassuring gesture before you leave. In other words, self-leadership means that whil

... (read more)
2 cousin_it 24d
Yeah, I think this is right.

I've had a nagging feeling in the past that the rationalist community isn't careful enough about the incentive problems and conflicts of interest that arise when transferring reasonably large sums of money (despite being very careful about incentive landscapes in other ways—e.g. setting the incentives right for people to post, comment, etc, on LW—and also being fairly scrupulous in general). Most of the other examples I've seen have been kinda small-scale and so I haven't really poked at them, but this proposal seems like it pretty clearly sets up terrible... (read more)

I think this is a really cool idea. But the example at the end feels pretty uncompelling (both the critique and the compliment). I expect I'd link the post to more people if you swapped it for a more straightforward one.

7 RamblinDash 1mo
I had this thought too, but there's kind of a problem: the more compelling the example of "tall poppy", the more politically controversial it is, which can distract from and undermine your message. I kinda think Elon Musk is the perfect example to use, though. I wish the post could somehow autodetect the reader's politics and select statements about Elon accordingly: "Elon Musk [lately seems to be going off the antisemitism deep end/does a lot of securities fraud/comes up with dumb fake ideas like Hyperloop/calls people pedos for no reason/exaggerates how good Tesla autopilot is in a way that seems likely to kill people] but I still really appreciate how he [jump-started the modern electric car industry/brought innovation back to space launches/something something Starlink]."

Interesting! Hadn't thought of this approach. Let's see... Intuitively I think it gets pretty strategically weird because a) who you vote for depends pretty sensitively on other peoples' votes (e.g. in proportional chances voting you want to vote for everyone who's above the expected value of everyone else's votes; in approval voting you want to vote for everyone you approve of unless it bumps them above someone you like more), and b) you want to buy from your enemies much more than from your friends, because your friends will already not be voting for bad candidates. But maybe the latter is fine because if you buy from your friends they'll end up with more money which they can then spend on other things? I'll keep thinking.

Random question I’ve been thinking about: how would you set up a market for votes? Suppose specifically that you have a proportional chances election (i.e. the outcome gets chosen with probability proportional to the number of votes cast for it—assume each vote is a distribution over candidates). So everyone has an incentive to get everyone who’s not already voting for their favorite option to change their vote; and you can have positive-sum trades where I sell you a promise to switch X% of my votes to a compromise candidate in exchange for you switching Y... (read more)
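For concreteness, the proportional chances rule described here can be sketched as follows (a minimal illustration only; the function name and ballot format are my own, not from any post being discussed):

```python
import random

def proportional_chances_winner(ballots, rng=random):
    """Pick a winner with probability proportional to total vote mass.

    Each ballot is a dict mapping candidate -> weight, i.e. each voter
    casts a distribution over candidates rather than a single choice.
    """
    totals = {}
    for ballot in ballots:
        for candidate, weight in ballot.items():
            totals[candidate] = totals.get(candidate, 0.0) + weight
    candidates = list(totals)
    weights = [totals[c] for c in candidates]
    # Sample one winner, weighted by total vote mass received.
    return rng.choices(candidates, weights=weights, k=1)[0]
```

Under this rule a candidate holding a third of the total vote mass wins a third of the time, which is what creates room for the vote-trading described above: shifting some of your vote mass to a compromise candidate changes outcome probabilities smoothly rather than all-or-nothing.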

4 Measure 1mo
Just spitballing here: Assign each voter 100 shares for each candidate. To vote, each voter selects a subset of their shares to constitute their vote. Voters can freely trade shares. Under this system, a voter would more highly value shares for candidates that are either very high or very low in their preference order (the later so as to exclude them from the vote). Thus, trades would look like each party exchanging shares about which they are themselves ambivalent to gain shares that are more valuable to them. If you remove the proportional chances part, then it becomes a guessing game of which marginal votes actually matter.

I just stumbled upon the Independence of Pareto dominated alternatives criterion; does the ROSE value have this property? I'm pattern-matching it as related to disagreement-point invariance, but haven't thought about this at all.

Flagging that Diffractor's work on threat-resistant bargaining feels like the most important s-risk-related work I've ever seen, but I also haven't thoroughly evaluated it so I'd love for someone to do so and write up their thoughts.

1 Dawn Drescher 1mo
Woah, thanks! I hadn’t seen it!

Yeah, I agree I convey the implicit prediction that, even though not all one-month tasks will fall at once, they'll be closer than you would otherwise expect not using this framework.

I think I still disagree with your point, as follows: I agree that AI will soon do passably well at summarizing 10k word books, because the task is not very "sharp" - i.e. you get gradual rather than sudden returns to skill differences. But I think it will take significantly longer for AI to beat the quality of summary produced by a median expert in 1 month, because that expert's summary will in fact explore a rich hierarchical interconnected space of concepts from the novel (novel concepts, if you will).

Seems like there's a bunch of interesting stuff here, though some of it is phrased overly strongly.

E.g. "mechanistic interpretability requires program synthesis, program induction, and/or programming language translation" seems possible but far from obvious to me. In general I think that having a deep understanding of small-scale mechanisms can pay off in many different and hard-to-predict ways. Perhaps it's appropriate to advocate for MI researchers to pay more attention to these fields, but calling this an example of "reinventing", "reframing" or "renami... (read more)

2 scasper 1mo
Thanks for the comment. This seems completely plausible to me. But I think that it's a little hand-wavy. In general, I perceive the interpretability agendas that don't involve applied work to be this way. Also, few people would argue that basic insights, to the extent that they are truly explanatory, can be valuable. But I think it is at least very non-obvious that it would be differentially useful for safety.

No qualms here. But (1) the point about program synthesis/induction/translation suggests that the toy problems are fundamentally more tractable than real ones. Analogously, imagine saying that having humans write and study simple algorithms for search, modular addition, etc. is part of an agenda for program synthesis. (2) At some point the toy work should lead to competitive engineering work. I think that there has not been a clear trend toward this in the past 6 years with the circuits agenda.

Thanks for the question. It might generalize. My intended point with the Ramanujan paper is that a subnetwork seeming to do something in isolation does not mean that it does that thing in context. Ramanujan et al. weren't interpreting networks, they were just training the networks. So the underlying subnetworks may generalize well, but in this case, this is not interpretability work any more than gradient-based training of a sparse network is.

My default (very haphazard) answer: 10,000 seconds in a day; we're at 1-second AGI now; I'm speculating 1 OOM every 1.5 years, which suggests that coherence over multiple days is 6-7 years away.

The 1.5 years thing is just a very rough ballpark though, could probably be convinced to double or halve it by doing some more careful case studies.
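The arithmetic behind that ballpark, using the (admittedly haphazard) numbers above — all of them rough guesses from this comment, not measured quantities:

```python
import math

SECONDS_PER_DAY = 10_000   # the rough "seconds in a day" figure used above
YEARS_PER_OOM = 1.5        # speculated pace: one order of magnitude every 1.5 years

def years_until_t_agi(horizon_seconds, current_seconds=1):
    """Years until the coherence horizon grows from current_seconds to
    horizon_seconds, assuming a constant YEARS_PER_OOM pace."""
    ooms_needed = math.log10(horizon_seconds / current_seconds)
    return ooms_needed * YEARS_PER_OOM

# Coherence over "multiple days" (say 3 days) starting from 1-second AGI:
# years_until_t_agi(3 * SECONDS_PER_DAY) lands between 6 and 7 years.
```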

Thanks. For the record, my position is that we won't see progress that looks like "For t-AGI, t increases by +1 OOM every X years" but rather that the rate of OOMs per year will start off slow and then accelerate. So e.g. here's what I think t will look like as a function of years:

| Year | Richard (?) guess | Daniel guess |
|------|-------------------|--------------|
| 2023 | 1 | 5 |
| 2024 | 5 | 15 |
| 2025 | 25 | 100 |
| 2026 | 100 | 2,000 |
| 2027 | 500 | Infinity (singularity) |
| 2028 | 2,500 | |
| 2029 | 10,000 | |
| 2030 | 50,000 | |
| 2031 | 250,000 | |
| 2032 | 1,000,000 | |

I think this partly because of the way I think generalization works (I think e.g. once AIs have gotten... (read more)

Why is it cheating? That seems like the whole point of my framework - that we're comparing what AIs can do in any amount of time to what humans can do in a bounded amount of time.

Whatever. Maybe I was just jumping on an excuse to chit-chat about possible limitations of LLMs :) And maybe I was thread-hijacking by not engaging sufficiently with your post, sorry.

This part you wrote above was the most helpful for me:

if the task is "spend a month doing novel R&D for lidar", then my framework predicts that we'll need 1-month AGI for that

I guess I just want to state my opinion that (1) summarizing a 10,000-page book is a one-month task but could come pretty soon if indeed it’s not already possible, (2) spending a month doing novel R&a... (read more)

But then we could just ask the question: “Can you please pose a question about string theory that no AI would have any prayer of answering, and then answer it yourself?” That’s not cherry-picking, or at least not in the same way.


But can't we equivalently just ask an AI to pose a question that no human would have a prayer of answering in one second? It wouldn't even need to be a trivial memorization thing, it could also be a math problem complex enough that humans can't do it that quickly, or drawing a link between two very different domains of knowledge.

4 Steven Byrnes 1mo
I think the “in one second” would be cheating. The question for Ed Witten didn’t specify “the AI can’t answer it in one second”, but rather “the AI can’t answer it period”. Like, if GPT-4 can’t answer the string theory question in 5 minutes, then it probably can’t answer it in 1000 years either. (If the AI can get smarter and smarter, and figure out more and more stuff, without bound, in any domain, by just running it longer and longer [https://www.lesswrong.com/posts/hvz9qjWyv8cLX9JJR/evolution-provides-no-evidence-for-the-sharp-left-turn?commentId=7yAJbkDtMepxDvcMe], then (1) it would be quite disanalogous to current LLMs [btw I’ve been assuming all along that this post is implicitly imagining something vaguely like current LLMs but I guess you didn’t say that explicitly], (2) I would guess that we’re already past end-of-the-world territory.)

How long would it take (in months) to train a smart recent college graduate with no specialized training in my field to complete this task?


This doesn't seem like a great metric because there are many tasks that a college grad can do with 0 training that current AI can't do, including:

  • Download and play a long video game to completion
  • Read and summarize a whole book
  • Spend a month planning an event

I do think that there's something important about this metric, but I think it's basically subsumed by my metric: if the task is "spend a month doing novel R&D for... (read more)

5 Steven Byrnes 1mo
Ah, that’s helpful, thanks. I think you’re saying “there are questions about string theory whose answers are obvious to Ed Witten because he happened to have thought about them in the course of some unpublished project, but these questions are hyper-specific, so bringing them up at all would be unfair cherry-picking.” But then we could just ask the question: “Can you please pose a question about string theory that no AI would have any prayer of answering, and then answer it yourself?” That’s not cherry-picking, or at least not in the same way. And it points to an important human capability, namely, figuring out which areas are promising and tractable to explore, and then exploring them. Like, if a human wants to make money or do science or take over the world, then they get to pick, endogenously, which areas or avenues to explore.

Hmm, I'm more interested in FLOP than watts, because almost all watts can't be converted to FLOP.

Also, I think at some point there'll be a salient difference between "many FLOP/s for a short time" and "fewer FLOP/s for a long time" but right now it doesn't feel like a crucial distinction to track.

2 the gears to ascension 1mo
Hmm. I don't think that'll last very long. Perhaps there's no particular need to think ahead on this, though.

These are all arguments about the limit; whether or not they're relevant depends on whether they apply to the regime of "smart enough to automate alignment research".

1 Joe_Collman 1mo
Agreed. Are you aware of any work that attempts to answer this question? Does this work look like work on debate? (not rhetorical questions!) My guess is that work likely to address this does not look like work on debate. Therefore my current position remains: don't bother working on debate; rather work on understanding the fundamentals that might tell you when it'll break. The world won't be short of debate schemes. It'll be short of principled arguments for their safe application.

For instance, for debate, one could believe:
1) Debate will work for long enough for us to use it to help find an alignment solution.
2) Debate is a plausible basis for an alignment solution.

I generally don't think about things in terms of this dichotomy. To me, an "alignment solution" is anything that will align an AGI which is then capable of solving alignment for its successor. And so I don't think you can separate these two things.

(Of course I agree that debate is not an arbitrarily scalable alignment solution in the sense that you can just keep training... (read more)

1 Joe_Collman 1mo
Oh, to be clear, with "to help find" I only mean that we expect to make significant progress using debate. If we knew we'd safely make enough progress to get to a solution, then you're quite right that that would amount to (2). (apologies for lack of clarity if this was the miscommunication) That's the distinction I mean to make between (1) and (2): we need to get to the moon safely.

With (1) we have no idea when our rocket will explode. Similarly, we have no idea whether the moon will be far enough to know when our next rocket will explode. (not that I'm knocking robustly getting to the moon safely) If we had some principled argument telling us how far we could push debate before things became dangerous, that'd be great. I'm claiming that we have no such argument, and that all work on debate (that I'm aware of) stands near-zero chance of finding one. Of course I'm all for work "on debate" that aims at finding that kind of argument - however, I would expect that such work leaves the specifics of debate behind pretty quickly.

Yepp, agree with all that.

Quickly sketching out some of my views - deliberately quite basic because I don't typically try to generate very accurate credences for this sort of question:

  • When I think about the two tasks "solve alignment" and "take over the world", I feel pretty uncertain which one will happen first. There are a bunch of considerations weighing in each direction. On balance I think the former is easier, so let's say that conditional on one of them happening, 60% that it happens first, and 40% that the latter happens first. (This may end up depending quite sensitively o
... (read more)

I think the substance of my views can be mostly summarized as:

  • AI takeover is a real thing that could happen, not an exotic or implausible scenario.
  • By the time we build powerful AI, the world will likely be moving fast enough that a lot of stuff will happen within the next 10 years.
  • I think that the world is reasonably robust against extinction but not against takeover or other failures (for which there is no outer feedback loop keeping things on the rails).

I don't think my credences add very much except as a way of quantifying that basic stance. I largely made this post to avoid confusion after quoting a few numbers on a podcast and seeing some people misinterpret them.

I like this post! I notice the diagram doesn't really map onto a cognitive process that I consider realistic, though. So here's my attempted replacement for what 'most people' do:

  1. Does P feel tribally-loaded to you?
    1. If yes or "could be" and you're politically savvy, say whatever's most useful (level 4).
    2. If yes and you're not politically savvy, answer according to your tribal affiliation (level 3).
  2. Is P relevant to your strategic interests in non-tribally-loaded ways?
    1. If yes and you're in consequentialist mode, say whatever's most useful (level 2).
    2. If no to either, answer according to your object-level beliefs (level 1).
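That branching can be rendered as a toy function (the parameter names are mine; the returned levels follow the numbering in the list above):

```python
def answer_mode(tribal, savvy, strategic, consequentialist):
    """Which 'level' governs how a proposition P gets answered.

    4 = say whatever's most useful (tribally-loaded + politically savvy)
    3 = answer by tribal affiliation (tribally-loaded, not savvy)
    2 = say whatever's most useful (strategic interest, consequentialist mode)
    1 = answer by object-level beliefs
    """
    if tribal:
        return 4 if savvy else 3
    if strategic and consequentialist:
        return 2
    return 1
```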
2 Daniel Kokotajlo 2mo
Yeah, good point, I agree. Should have optimized more for realism instead of satisficing once I had a good example.

My approach is to read the title, then if I like it read the first paragraph, then if I like that skim the post, then in rare cases read the post in full (all informed by karma).

I can't usually evaluate the quality of criticism without at least having skimmed the post. And once I've done that then I don't usually gain much from the criticisms (although I do agree they're sometimes useful).

I'm partly informed here by the fact that I tend to find Said's criticisms unusually non-useful.

Makes sense.

One of the things that's most cruxy to me is what people who contribute a lot of top content* feel about the broader patterns, so, I appreciate you chiming in here.


FYI I personally haven't had bad experiences with Said (and in fact I remember talking to mods who were at one point surprised by how positively he engaged with some of my posts). My main concern here is the missing stair dynamic of "predictable problem that newcomers will face".

Not responding to the main claim, cos mods have way more context on this than me, will defer to them.

I think that’s a more pessimistic view than even my own!

Very plausibly. But pessimism itself isn't bad, the question is whether it's the sort of pessimism that leads to better content or the sort that leads to worse content. Where, again, I'm going to defer to mods since they've aggregated much more data on how your commenting patterns affect people's posting patterns.

Skimmed all the comments here and wanted to throw in my 2c (while also being unlikely to substantively engage further, take that into account if you're thinking about responding):

  • It seems to me that people should spend less time litigating this particular fight and more time figuring out the net effects that Duncan and Said have on LW overall. It seems like mods may be dramatically underrating the value of their time and/or being way too procedurally careful here, and I would like to express that I'd support them saying stuff like "idk exactly what went wr
... (read more)

Wei Dai had a comment below about how important it is to know whether there’s any criticism or not, but mostly I don’t care about this either because my prior is just that it’s bad whether or not there’s criticism. In other words, I think the only good approach here is to focus on farming the rare good stuff and ignoring the bad stuff (except for the stuff that ends up way overrated, like (IMO) Babble or Simulators, which I think should be called out directly).

But how do you find the rare good stuff amidst all the bad stuff? I tend to do it with a combi... (read more)

Thanks for weighing in! Fwiw I've been skimming but not particularly focused on the litigation of the current dispute, and instead focusing on broader patterns. (I think some amount of litigation of the object level was worth doing but we're past the point where I expect marginal efforts there to help)

One of the things that's most cruxy to me is what people who contribute a lot of top content* feel about the broader patterns, so, I appreciate you chiming in here.

*roughly operationalized as "write stuff that ends up in the top 20 or top 50 of the annual review"

8Said Achmiz2mo
You know, I’ve seen this sort of characterization of my commenting activity quite a few times in these discussions, and I’ve mostly shrugged it off; but (with apologies, as I don’t mean to single you out, and indeed you’re one of the LW members whom I respect significantly more than average) I think at this point I have to take the time to address it.

My objection is simply this: Is it actually true that I “comment pessimistically on lots of stuff”? Do I do this more than other people?

There are many ways of operationalizing that, of course. Here’s one that seems reasonable to me: let’s find all the posts (not counting “meta”-type posts that are already about me, or referring to me, or having to do with moderation norms that affect me, etc.) on which I’ve commented “pessimistically” in, let’s say, the last six months, and see if my comments are, in their level of “pessimism”, distinguishable from those of other commenters there; and also what the results of those comments turn out to be.

#1: https://www.lesswrong.com/posts/Hsix7D2rHyumLAAys/run-posts-by-orgs

Multiple people commenting in similarly “pessimistic” ways, including me. The most, shall we say, vigorous, discussion that takes place there doesn’t involve me at all.

#2: https://www.lesswrong.com/posts/2yWnNxEPuLnujxKiW/tabooing-frame-control

My overall view is certainly critical, but here I write multiple medium-length comments, which contain substantive analyses of the concept being discussed. (There is, however, a very brief comment from someone else [https://www.lesswrong.com/posts/2yWnNxEPuLnujxKiW/tabooing-frame-control#iJ3TiuHgXfFQiDgFt] which is just a request—or “demand”?—for clarification; such is given, without protest.)

#3: https://www.lesswrong.com/posts/67NrgoFKCWmnG3afd/you-ll-never-persuade-people-like-that

Curious which intuitions you think most fail to come across?

6habryka2mo
I don't have all the cognitive context booted up of what exact essays are part of AI Safety Fundamentals, so do please forgive me if something here does end up being covered and I just forgot about an important essay, but as a quick list of things that I vaguely remember missing:

  • Having good intuitions for how smart a superintelligence could really be. Arguments for the lack of upper limit of intelligence.
  • Having good intuitions for complexity of value. That even if you get an AI aligned with your urges and local desires, this doesn't clearly get you that far towards an AGI you would feel comfortable optimizing things on their own.
  • Somehow communicating the counterintuitiveness of optimization. Classical examples that have helped me are the cannibal bug examples from the sequences, and the genetic algorithm that developed an antenna (the specification gaming DeepMind post never really got this across for me).
  • Security mindset stuff.
  • Something about the set of central intuitions I took away from Paul's work. I.e. something in the space of "try to punt as much of the problem to systems smarter than you".
  • Eternity in Six Hours style stuff. Trying to understand the scale of the future. This has been very influential on my models of what kinds of goals an AI might have.
  • Civilizational inadequacy stuff. A huge component of people's differing views on what to do about AI risk seems to be sourced in disagreements on the degree to which humanity at large does crazy things when presented with challenges. I think that's currently completely not covered in AGISF.

There are probably more things, and some things on this list are probably wrong since I only skimmed the curriculum again, but hopefully it gives a taste.

Just stumbled upon this post by Nate where he describes how he... hacked his System 1 to ignore any Knightian uncertainty and unknown unknowns? Which is, like... the textbook way to make sure that you're wildly uncalibrated a few years down the line, and in fact precisely what has happened. Man.

I have invoked Willful Inconsistency on only two occasions, and they were similar in nature. Only one instance of Willful Inconsistency is currently active, and it works like this:

I have completely and totally convinced my intuitions that unfriendly AI is a problem.

... (read more)
2Quadratic Reciprocity2mo
Wow, the quoted text feels scary to read.  I have met people within effective altruism who seem to be trying to do scary, dark things to their beliefs/motivations, which feels in the same category, like trying to convince themselves they don't care about anything besides maximising impact or reducing x-risk. The latter, in at least one case, by thinking lots about dying due to AI to start caring about it more, which can't be good for thinking clearly in the way they described it. 

I'm working on a follow-up exploring threat models specifically, stay tuned.

To preserve my current shards, I don't need to seek out a huge number of dogs proactively, but rather I just need to at least behave in conformance with the advantage function implied by my value head, which probably means "treading water" and seeing dogs sometimes in situations similar to historical dog-seeing events.

I think this depends sensitively on whether the "actor" and the "critic" in fact have the same goals, and I feel pretty confused about how to reason about this. For example, in some cases they could be two separate models, in which case the c... (read more)

Eliezer: Pretty sure that if I ever fail to give an honest answer to an absurd hypothetical question I immediately lose all my magic powers.

I just cannot picture the intelligent cognitive process which lands in the mental state corresponding to Eliezer's stance on hypotheticals, which is actually trying to convince people of AI risk, as opposed to just trying to try (and yes, I know this particular phrase is a joke, but it's not that far from the truth).

I think the sequences did something incredibly valuable in cataloguing all of these mistakes and biases ... (read more)

0lc2mo
I think Eliezer realizes internally that most of his success so far has been due to his unusual, often seemingly self-destructive honesty, and that it'd be a fraught thing to give that up now "because stakes".

I think the closest thing to an explanation of Eliezer's arguments formulated in a way that could plausibly pass standard ML peer review is my paper The alignment problem from a deep learning perspective (Richard Ngo, Lawrence Chan, Sören Mindermann)

6M. Y. Zuo2mo
Thanks for posting; it's well written and concise, but I fear it suffers the same flaw that all such explanations share: the most critical part, the "gain access to facilities for manufacturing these weapons (e.g. via hacking or persuasion techniques), and deploy them to threaten or attack humans", is simply never explained in detail. I get that there are many info-hazards in this line of inquiry, but it's such a contrast to the well-elaborated prior 2/3 of the paper that it really stands out how hand-wavy this part of the argument is.

Linking the post version which some people may find easier to read:
The Alignment Problem from a Deep Learning Perspective (major rewrite)

Nope, I meant high decoupling - because the most taboo thing in high decoupling norms is to start making insinuations about the speaker rather than the speech.

1DustinWehr2mo
I see. I guess I hadn't made the connection of attributing benefits to high-contextualizing norms. I'd only got as far as observing that certain conversations go better with comp lit friends than with comp sci peers. That was the only sentence that gave me a parse failure. I liked the post a lot.

There's a type signature that I'm trying to get at with the "unified case" description (which I acknowledge I didn't describe very well in my previous comment), which I'd describe as "trying to make a complete argument (or something close to it)". I think all the things I was referring to meet this criterion; whereas, of the things you listed, only Superintelligence seems to, with the rest having a type signature more like "trying to convey a handful of core intuitions". (CFAI may also be in the former category, I haven't read it, but it was long ago enoug... (read more)
