I half-agree with both of you. I do think Hanson's selection pressure paper is a useful first approximation, but it's not clear that the reachable universe is big enough that small deviations from the optimal strategy will actually lead to big differences in amount of resources controlled. And as I gestured towards in the final section of the story, "helping" can be very cheap, if it just involves storing their mind until you've finished expanding.
But I don't think that the example of animals demonstrates this point very well, for two reasons. Firstly, in ...
Yeah, I moved it to earlier than it was, for two reasons. Firstly, if the grasshopper was just unlucky, then there's no "deviation" to forgive—it makes sense only if the grasshopper was culpable. Secondly, the earlier parts are about individuals, and the latter parts are about systems—it felt more compelling to go straight from "centralized government" to "locust war" than going via an individual act of kindness.
Curious what you found more meaningful about the original placement?
I intended to convey it via "The grasshopper’s mind is ... waiting to be born again in a fragment of a fragment of a supercomputer made of stars", but there's a lot in between those two phrases so it's reasonable to miss that implication.
Have edited to fix.
I expect that you personally won't do a motte-and-bailey here (except perhaps insofar as you later draw on posts like these as evidence that the doomer view has been laid out in a lot of dif...
When I say "repudiate" I mean a combination of publicly disagreeing + distancing. I presume you agree that this is suboptimal for both of us, and my comment above is an attempt to find a trade that avoids this suboptimal outcome.
Note that I'm fine to be in coalitions with people when I think their epistemologies have problems, as long as their strategies are not sensitively dependent on those problems. (E.g. presumably some of the signatories of the recent CAIS statement are theists, and I'm fine with that as long as they don't start making arguments that ...
If the result of an optimization process will be predictably horrifying to the agents which are applying that optimization process to themselves, then they will simply not do so.
In other words: AIs which feel anything in the vicinity of kindness before applying cosmic amounts of optimization pressure to themselves will try to steer that optimization pressure towards something which is recognizably kind at the end.
And I don't think there's any good argument for why AIs will lack any scrap of kindness with very high confidence at the point where they're just...
Meta: I feel pretty annoyed by the phenomenon of which this current conversation is an instance, because when people keep saying things I strongly disagree with, which will be taken as representing a movement I'm associated with, the high-integrity (and possibly also strategically optimal) thing to do is to publicly repudiate those claims*, which seems like a bad outcome for everyone.
For what it's worth, I think you should just say that you disagree with it? I don't really understand why this would be a "bad outcome for everyone". Just list out th...
Mmm, I still prefer trust I think. Spaciousness gives me connotations of... well, distance, and separation. In some sense my relationship with almost everyone in the world is spacious. The thing that's special about some relationships is that they have both spaciousness and intensity, which to me feels well-described by "trust".
It seems to me that many of my disagreements with others in this space come from them hearing me say "I want the AI to like vanilla ice cream, as I do", whereas I hear them say "the AI will automatically come to like the specific and narrow thing (broad cosmopolitan value) that I like".
At the moment I'm just trying to state my position, in the hopes that this helps us skip over the step where people think I'm arguing for carbon chauvinism.
I think posts like these would benefit a lot from even a little bit of context, such as:
feels like it's setting up weak-men on an issue where I disagree with you, but in a way that's particularly hard to engage with
My best guess as to why it might feel like this is that you think I'm laying groundwork for some argument of the form "P(doom) is very high", which you want to nip in the bud, but are having trouble nipping in the bud here because I'm building a motte ("cosmopolitan values don't come free") that I'll later use to defend a bailey ("cosmopolitan values don't come cheap").
This misunderstands me (as is a separate claim from the clai...
I think that there are many answers along these lines (like "I'm not talking about a whole value system, I'm talking about a deontological constraint") which would have been fine here.
The issue was that sentences like "It's a boundary concept (element of a deontological agent design), not a value system (in the sense of preference such as expected utility, a key ingredient of an optimizer)" use the phrasing of someone pointing to a well-known, clearly-defined concept, but then only link to Critch's high-level metaphor.
I personally think it's important to separate philosophical speculation from well-developed rigorous work, and Critch's stuff on boundaries seems to land well in the former category.
This is a communicative norm not an epistemic norm—you're welcome to believe whatever you like about Critch's stuff, but when you cite it as if it's widely-understood (across the LW community, or elsewhere) to be a credible, well-developed idea, then this undermines our ability to convey the ideas that are widely-understood to be credible.
I think there's a bunch of useful stuff in this post, and am generally very excited about having more cybersecurity experts working on AI safety. Having said that, it feels like a bit of a jump to say that LW (or AI safety overall) should become a hacker community, which would come with a lot of tradeoffs; and I think that this part detracts from the post overall.
I actually thought from the title that you meant "hacker community" as in "getting hands-on with AI, implementing lots of AI stuff" (i.e. hacker in the sense of hackathon). That feels more directl...
This post has the fewest upvotes of any post in the sequence by a long way, so I'm interested in revising it based on feedback. It'd be useful to hear what people disliked about it, or improvements you'd suggest.
Some of those links say that in more authoritarian cultures, people are considered to be trustworthy if they show respect to their superiors - which reads to me as saying that you're trusted if you show that you will obey.
Oh, that's very interesting. Yeah, this seems like it might account for the discrepancy here. But my instinct is that I want to hang on to the "trust" terminology, and just hold that authoritarian cultures have an impoverished definition of trust (compared with the one I gave earlier: "letting another agent do as they wish, without trying...
Presumably you're objecting to the first part of the quoted sentence, right, not the second half? Note that I'm not taking a particular position on the extent to which it's an evolutionary versus cultural adaptation.
Could you say more about why Chagnon's research weighs against it? I had a quick read of his wikipedia page but am not clear on the connection.
I don't think I understand the principled difference between correlation and reciprocity; the latter seems like a subset of the former. Let me try to say some things and see where you disagree. This is super messy and probably doesn't make sense, sorry.
Curious if you feel like the advice I gave would have also helped:
...Having said that, self-leadership doesn’t mean never getting angry—it just means never fully giving in to that anger or wielding it with the goal of hurting another person (or another part of yourself). Self-leadership might involve telling the other person that you feel angry at them, but without launching into a tirade; or telling them that you need to go on a walk to calm down, but giving them a reassuring gesture before you leave. In other words, self-leadership means that whil
I've had a nagging feeling in the past that the rationalist community isn't careful enough about the incentive problems and conflicts of interest that arise when transferring reasonably large sums of money (despite being very careful about incentive landscapes in other ways—e.g. setting the incentives right for people to post, comment, etc, on LW—and also being fairly scrupulous in general). Most of the other examples I've seen have been kinda small-scale and so I haven't really poked at them, but this proposal seems like it pretty clearly sets up terrible...
I think this is a really cool idea. But the example at the end feels pretty uncompelling (both the critique and the compliment). I expect I'd link the post to more people if you swapped it for a more straightforward one.
Interesting! Hadn't thought of this approach. Let's see... Intuitively I think it gets pretty strategically weird because a) who you vote for depends pretty sensitively on other people's votes (e.g. in proportional chances voting you want to vote for everyone who's above the expected value of everyone else's votes; in approval voting you want to vote for everyone you approve of unless it bumps them above someone you like more), and b) you want to buy from your enemies much more than from your friends, because your friends will already not be voting for bad candidates. But maybe the latter is fine because if you buy from your friends they'll end up with more money which they can then spend on other things? I'll keep thinking.
Random question I’ve been thinking about: how would you set up a market for votes? Suppose specifically that you have a proportional chances election (i.e. the outcome gets chosen with probability proportional to the number of votes cast for it—assume each vote is a distribution over candidates). So everyone has an incentive to get everyone who’s not already voting for their favorite option to change their vote; and you can have positive-sum trades where I sell you a promise to switch X% of my votes to a compromise candidate in exchange for you switching Y...
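For concreteness, the proportional chances mechanism described above (each ballot is a distribution over candidates, and the winner is drawn with probability proportional to total vote weight) could be sketched roughly as follows; the function name and ballot format here are illustrative, not from any existing implementation:

```python
import random

def proportional_chances_winner(ballots, rng=random):
    """Draw a winner with probability proportional to total vote weight.

    `ballots` is a list of dicts mapping candidate -> weight, where each
    dict (one voter's ballot) is a distribution summing to 1.
    """
    totals = {}
    for ballot in ballots:
        for candidate, weight in ballot.items():
            totals[candidate] = totals.get(candidate, 0.0) + weight
    candidates = list(totals)
    weights = [totals[c] for c in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]

ballots = [
    {"A": 1.0},            # a pure vote for A
    {"A": 0.5, "B": 0.5},  # a split ballot
    {"C": 1.0},
]
# A wins with probability 1.5/3, B with 0.5/3, C with 1/3.
winner = proportional_chances_winner(ballots)
```

Under this rule each ballot shifts the winning probabilities linearly: moving weight w toward a compromise candidate raises their winning probability by exactly w divided by the number of ballots, which is what makes the vote-trading described above well-defined and positive-sum.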
I just stumbled upon the Independence of Pareto dominated alternatives criterion; does the ROSE value have this property? I'm pattern-matching it as related to disagreement-point invariance, but haven't thought about this at all.
Flagging that Diffractor's work on threat-resistant bargaining feels like the most important s-risk-related work I've ever seen, but I also haven't thoroughly evaluated it so I'd love for someone to do so and write up their thoughts.
Yeah, I agree I convey the implicit prediction that, even though not all one-month tasks will fall at once, they'll fall closer together than you would otherwise expect if you weren't using this framework.
I think I still disagree with your point, as follows: I agree that AI will soon do passably well at summarizing 10k word books, because the task is not very "sharp" - i.e. you get gradual rather than sudden returns to skill differences. But I think it will take significantly longer for AI to beat the quality of summary produced by a median expert in 1 month, because that expert's summary will in fact explore a rich hierarchical interconnected space of concepts from the novel (novel concepts, if you will).
Seems like there's a bunch of interesting stuff here, though some of it is phrased overly strongly.
E.g. "mechanistic interpretability requires program synthesis, program induction, and/or programming language translation" seems possible but far from obvious to me. In general I think that having a deep understanding of small-scale mechanisms can pay off in many different and hard-to-predict ways. Perhaps it's appropriate to advocate for MI researchers to pay more attention to these fields, but calling this an example of "reinventing", "reframing" or "renami...
My default (very haphazard) answer: 10,000 seconds in a day; we're at 1-second AGI now; I'm speculating 1 OOM every 1.5 years, which suggests that coherence over multiple days is 6-7 years away.
The 1.5 years thing is just a very rough ballpark though, could probably be convinced to double or halve it by doing some more careful case studies.
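A minimal sketch of the arithmetic behind this guess, taking the rough figures above (1-second AGI now, ~10,000 task-seconds per day, 1 OOM every 1.5 years) at face value:

```python
import math

def years_until(target_seconds, current_seconds=1.0, years_per_oom=1.5):
    """Years until the coherence horizon grows from `current_seconds` to
    `target_seconds`, assuming one order of magnitude per `years_per_oom`."""
    return math.log10(target_seconds / current_seconds) * years_per_oom

SECONDS_PER_DAY = 10_000  # the rough figure used above, not the literal 86,400
for days in (2, 3):
    print(f"{days} days of coherence: ~{years_until(days * SECONDS_PER_DAY):.1f} years away")
```

This recovers the "6-7 years" estimate; doubling or halving the 1.5-years-per-OOM rate scales the answer proportionally.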
Thanks. For the record, my position is that we won't see progress that looks like "For t-AGI, t increases by +1 OOM every X years" but rather that the rate of OOMs per year will start off slow and then accelerate. So e.g. here's what I think t will look like as a function of years:
| Year | Richard (?) guess | Daniel guess |
|------|-------------------|--------------|
| 2023 | 1 | 5 |
| 2024 | 5 | 15 |
| 2025 | 25 | 100 |
| 2026 | 100 | 2,000 |
| 2027 | 500 | Infinity (singularity) |
| 2028 | 2,500 | |
| 2029 | 10,000 | |
| 2030 | 50,000 | |
| 2031 | 250,000 | |
| 2032 | 1,000,000 | |
I think this partly because of the way I think generalization works (I think e.g. once AIs have gotten...
Why is it cheating? That seems like the whole point of my framework - that we're comparing what AIs can do in any amount of time to what humans can do in a bounded amount of time.
Whatever. Maybe I was just jumping on an excuse to chit-chat about possible limitations of LLMs :) And maybe I was thread-hijacking by not engaging sufficiently with your post, sorry.
This part you wrote above was the most helpful for me:
if the task is "spend a month doing novel R&D for lidar", then my framework predicts that we'll need 1-month AGI for that
I guess I just want to state my opinion that (1) summarizing a 10,000-page book is a one-month task but could come pretty soon if indeed it’s not already possible, (2) spending a month doing novel R&a...
But then we could just ask the question: “Can you please pose a question about string theory that no AI would have any prayer of answering, and then answer it yourself?” That’s not cherry-picking, or at least not in the same way.
But can't we equivalently just ask an AI to pose a question that no human would have a prayer of answering in one second? It wouldn't even need to be a trivial memorization thing, it could also be a math problem complex enough that humans can't do it that quickly, or drawing a link between two very different domains of knowledge.
How long would it take (in months) to train a smart recent college graduate with no specialized training in my field to complete this task?
This doesn't seem like a great metric because there are many tasks that a college grad can do with 0 training that current AI can't do, including:
I do think that there's something important about this metric, but I think it's basically subsumed by my metric: if the task is "spend a month doing novel R&D for...
Hmm, I'm more interested in FLOP than watts, because almost all watts can't be converted to FLOP.
Also, I think at some point there'll be a salient difference between "many FLOP/s for a short time" and "fewer FLOP/s for a long time" but right now it doesn't feel like a crucial distinction to track.
These are all arguments about the limit; whether or not they're relevant depends on whether they apply to the regime of "smart enough to automate alignment research".
For instance, for debate, one could believe:
1) Debate will work for long enough for us to use it to help find an alignment solution.
2) Debate is a plausible basis for an alignment solution.
I generally don't think about things in terms of this dichotomy. To me, an "alignment solution" is anything that will align an AGI which is then capable of solving alignment for its successor. And so I don't think you can separate these two things.
(Of course I agree that debate is not an arbitrarily scalable alignment solution in the sense that you can just keep training...
Quickly sketching out some of my views - deliberately quite basic because I don't typically try to generate very accurate credences for this sort of question:
I think the substance of my views can be mostly summarized as:
I don't think my credences add very much except as a way of quantifying that basic stance. I largely made this post to avoid confusion after quoting a few numbers on a podcast and seeing some people misinterpret them.
I like this post! I notice the diagram doesn't really map onto a cognitive process that I consider realistic, though. So here's my attempted replacement for what 'most people' do:
My approach is to read the title, then if I like it read the first paragraph, then if I like that skim the post, then in rare cases read the post in full (all informed by karma).
I can't usually evaluate the quality of criticism without at least having skimmed the post. And once I've done that then I don't usually gain much from the criticisms (although I do agree they're sometimes useful).
I'm partly informed here by the fact that I tend to find Said's criticisms unusually non-useful.
Makes sense.
FYI I personally haven't had bad experiences with Said (and in fact I remember talking to mods who were at one point surprised by how positively he engaged with some of my posts). My main concern here is the missing stair dynamic of "predictable problem that newcomers will face".
Not responding to the main claim, cos mods have way more context on this than me, will defer to them.
I think that’s a more pessimistic view than even my own!
Very plausibly. But pessimism itself isn't bad, the question is whether it's the sort of pessimism that leads to better content or the sort that leads to worse content. Where, again, I'm going to defer to mods since they've aggregated much more data on how your commenting patterns affect people's posting patterns.
Skimmed all the comments here and wanted to throw in my 2c (while also being unlikely to substantively engage further, take that into account if you're thinking about responding):
Wei Dai had a comment below about how important it is to know whether there’s any criticism or not, but mostly I don’t care about this either because my prior is just that a given post is bad whether or not there’s criticism. In other words, I think the only good approach here is to focus on farming the rare good stuff and ignoring the bad stuff (except for the stuff that ends up way overrated, like (IMO) Babble or Simulators, which I think should be called out directly).
But how do you find the rare good stuff amidst all the bad stuff? I tend to do it with a combi...
Thanks for weighing in! Fwiw I've been skimming but not particularly focused on the litigation of the current dispute, and instead focusing on broader patterns. (I think some amount of litigation of the object level was worth doing but we're past the point where I expect marginal efforts there to help)
One of the things that's most cruxy to me is what people who contribute a lot of top content* feel about the broader patterns, so, I appreciate you chiming in here.
*roughly operationalized as "write stuff that ends up in the top 20 or top 50 of the annual review"
Just stumbled upon this post by Nate where he describes how he... hacked his System 1 to ignore any Knightian uncertainty and unknown unknowns? Which is, like... the textbook way to make sure that you're wildly uncalibrated a few years down the line, and in fact precisely what has happened. Man.
...I have invoked Willful Inconsistency on only two occasions, and they were similar in nature. Only one instance of Willful Inconsistency is currently active, and it works like this:
I have completely and totally convinced my intuitions that unfriendly AI is a problem.
To preserve my current shards, I don't need to seek out a huge number of dogs proactively, but rather I just need to at least behave in conformance with the advantage function implied by my value head, which probably means "treading water" and seeing dogs sometimes in situations similar to historical dog-seeing events.
I think this depends sensitively on whether the "actor" and the "critic" in fact have the same goals, and I feel pretty confused about how to reason about this. For example, in some cases they could be two separate models, in which case the c...
Eliezer: Pretty sure that if I ever fail to give an honest answer to an absurd hypothetical question I immediately lose all my magic powers.
I just cannot picture the intelligent cognitive process which lands in the mental state corresponding to Eliezer's stance on hypotheticals, which is actually trying to convince people of AI risk, as opposed to just trying to try (and yes, I know this particular phrase is a joke, but it's not that far from the truth).
I think the sequences did something incredibly valuable in cataloguing all of these mistakes and biases ...
I think the closest thing to an explanation of Eliezer's arguments formulated in a way that could plausibly pass standard ML peer review is my paper The alignment problem from a deep learning perspective (Richard Ngo, Lawrence Chan, Sören Mindermann)
Linking the post version which some people may find easier to read:
The Alignment Problem from a Deep Learning Perspective (major rewrite)
Nope, I meant high decoupling - because the most taboo thing in high decoupling norms is to start making insinuations about the speaker rather than the speech.
There's a type signature that I'm trying to get at with the "unified case" description (which I acknowledge I didn't describe very well in my previous comment), which I'd describe as "trying to make a complete argument (or something close to it)". I think all the things I was referring to meet this criterion; whereas, of the things you listed, only Superintelligence seems to, with the rest having a type signature more like "trying to convey a handful of core intuitions". (CFAI may also be in the former category, I haven't read it, but it was long ago enoug...
Whenever people are sad for any reason except s-risk, I wonder if they're able to think at all about important issues. /s