In their hopes that it’s not too late for course correction around AI, Nate and Eliezer have written a book making the detailed case for this unfortunate reality. Available in September, you can preorder it now, or read endorsements, quotes, and reviews from scientists, national security officials, and more.

Customize
Eric Neyman4232
17
I have something like mixed feelings about the LW homepage being themed around "If Anyone Builds it, Everyone Dies": * On the object level, it seems good for people to pre-order and read the book. * On the meta level, it seems like an endorsement of the book's message. I like LessWrong's niche as a neutral common space to rigorously discuss ideas (it's the best open space for doing so that I'm aware of). Endorsing a particular thesis (rather than e.g. a set of norms for discussion of ideas) feels like it goes against this neutrality.
puccibets120
1
Elon on why he became an accelerationist: https://x.com/SawyerMerritt/status/1935809018066608510
ryan_greenblattΩ26456
4
Someone thought it would be useful to quickly write up a note on my thoughts on scalable oversight research, e.g., research into techniques like debate or generally improving the quality of human oversight using AI assistance or other methods. Broadly, my view is that this is a good research direction and I'm reasonably optimistic that work along these lines can improve our ability to effectively oversee somewhat smarter AIs which seems helpful (on my views about how the future will go). I'm most excited for: * work using control-style adversarial analysis where the aim is to make it difficult for AIs to subvert the oversight process (if they were trying to do this) * work which tries to improve outputs in conceptually loaded hard-to-check cases like philosophy, strategy, or conceptual alignment/safety research (without necessarily doing any adversarial analysis and potentially via relying on generalization) * work which aims to robustly detect (or otherwise mitigate) reward hacking in highly capable AIs, particularly AIs which are capable enough that by default human oversight would often fail to detect reward hacks[1] I'm skeptical of scalable oversight style methods (e.g., debate, IDA) actually being "scalable" in the sense of scaling to arbitrarily powerful models[2] and I think scalable oversight researchers should broadly be imagining targeting AIs at a human-ish or somewhat superhuman level of general capabilities (while they might still be very superhuman in narrower domains). In other words, I think scalable oversight style work should focus on a regime like the regime we're imagining targeting with AI control; this could be for controlling AIs, for getting more safety work out of AIs, or for making fully deferring to AI systems (at around this level of capability) more likely to go well. ---------------------------------------- 1. See also our prior work Benchmarks for Detecting Measurement Tampering and the motivation we discuss in that linked po
No77e1812
6
There is a phenomenon in which rationalists sometimes make predictions about the future, and they seem to completely forget their other belief that we're heading toward a singularity (good or bad) relatively soon. It's ubiquitous, and it kind of drives me insane. Consider these two tweets:
ryan_greenblatt*Ω28562
5
As part of the alignment faking paper, I hosted a website with ~250k transcripts from our experiments (including transcripts with alignment-faking reasoning). I didn't include a canary string (which was a mistake).[1] The current state is that the website has a canary string, a robots.txt, and a terms of service which prohibits training. The GitHub repo which hosts the website is now private. I'm tentatively planning on putting the content behind Cloudflare Turnstile, but this hasn't happened yet. The data is also hosted in zips in a publicly accessible Google Drive folder. (Each file has a canary in this.) I'm currently not planning on password protecting this or applying any other mitigation here. Other than putting things behind Cloudflare Turnstile, I'm not taking ownership for doing anything else at the moment. It's possible that I actively want this data to be possible to scrape at this point because maybe the data was scraped prior to the canary being added and if it was scraped again then the new version would replace the old version and then hopefully not get trained on due to the canary. Adding a robots.txt might prevent this replacement as would putting it behind Cloudflare Turnstile (as I'm planning to do) or making the repo private (as I have done). If people mostly or always use fully fresh scrapes, then just making it harder to scrape seems better. My current plan is to not overthink this and just make it harder to scrape. It's certainly possible that I'm making a mistake by not more actively trying to prevent this data from getting into pretraining data. Does anyone have specific requests that they think it's quite important that I do? I might do these out of general cooperativeness or because they seem like good ideas. Also, if you did all the work yourself and just needed me to (e.g.) host a different website, this would make this an easier call from my perspective. Also, on a more meta point: If you think this sort of thing is important to

Popular Comments

Thanks titotal for taking the time to dig deep into our model and write up your thoughts, it's much appreciated. This comment speaks for Daniel Kokotajlo and me, not necessarily any of the other authors on the timelines forecast or AI 2027. It addresses most but not all of titotal’s post. Overall view: titotal pointed out a few mistakes and communication issues which we will mostly fix. We are therefore going to give titotal a $500 bounty to represent our appreciation.  However, we continue to disagree on the core points regarding whether the model’s takeaways are valid and whether it was reasonable to publish a model with this level of polish. We think titotal’s critiques aren’t strong enough to overturn the core conclusion that superhuman coders by 2027 are a serious possibility, nor to significantly move our overall median. Moreover, we continue to think that AI 2027’s timelines forecast is (unfortunately) the world’s state-of-the-art, and challenge others to do better. If instead of surpassing us, people simply want to offer us critiques, that’s helpful too; we hope to surpass ourselves every year in part by incorporating and responding to such critiques. Clarification regarding the updated model My apologies about quietly updating the timelines forecast with an update without announcing it; we are aiming to announce it soon. I’m glad that titotal was able to see it. A few clarifications: 1. titotal says “it predicts years longer timescales than the AI2027 short story anyway.” While the medians are indeed 2029 and 2030, the models still give ~25-40% to superhuman coders by the end of 2027. 1. 2. 2. Other team members (e.g. Daniel K) haven’t reviewed the updated model in depth, and have not integrated it into their overall views. Daniel is planning to do this soon, and will publish a blog post about it when he does. Most important disagreements I'll let titotal correct us if we misrepresent them on any of this. 1. Whether to estimate and model dynamics for which we don't have empirical data. e.g. titotal says there is "very little empirical validation of the model," and especially criticizes the modeling of superexponentiality as having no empirical backing. We agree that it would be great to have more empirical validation of more of the model components, but unfortunately that's not feasible at the moment while incorporating all of the highly relevant factors.[1] 1. Whether to adjust our estimates based on factors outside the data. For example, titotal criticizes us for making judgmental forecasts for the date of RE-Bench saturation, rather than plugging in the logistic fit. I’m strongly in favor of allowing intuitive adjustments on top of quantitative modeling when estimating parameters. 2. [Unsure about level of disagreement] The value of a "least bad" timelines model. While the model is certainly imperfect due to limited time and the inherent difficulties around forecasting AGI timelines, we still think overall it’s the “least bad” timelines model out there and it’s the model that features most prominently in my overall timelines views. I think titotal disagrees, though I’m not sure which one they consider least bad (perhaps METR’s simpler one in their time horizon paper?). But even if titotal agreed that ours was “least bad,” my sense is that they might still be much more negative on it than us. Some reasons I’m excited about publishing a least bad model: 1. Reasoning transparency. We wanted to justify the timelines in AI 2027, given limited time. We think it’s valuable to be transparent about where our estimates come from even if the modeling is flawed in significant ways. Additionally, it allows others like titotal to critique it. 2. Advancing the state of the art. Even if a model is flawed, it seems best to publish to inform others’ opinions and to allow others to build on top of it. 3. The likelihood of time horizon growth being superexponential, before accounting for AI R&D automation. See this section for our arguments in favor of superexponentiallity being plausible, and titotal’s responses (I put it at 45% in our original model). This comment thread has further discussion. If you are very confident in no inherent superexponentiality, superhuman coders by end of 2027 become significantly less likely, though are still >10% if you agree with the rest of our modeling choices (see here for a side-by-side graph generated from my latest model). 1. How strongly superexponential the progress would be. This section argues that our choice of superexponential function is arbitrary. While we agree that the choice is fairly arbitrary and ideally we would have uncertainty over the best function, my intuition is that titotal’s proposed alternative curve feels less plausible than the one we use in the report, conditional on some level of superexponentiality. 2. Whether the argument for superexponentiality is stronger at higher time horizons. titotal is confused about why there would sometimes be a delayed superexponential rather than starting at the simulation starting point. The reasoning here is that the conceptual argument for superexponentiality is much stronger at higher time horizons (e.g. going from 100 to 1,000 years feels likely much easier than going from 1 to 10 days, while it’s less clear for 1 to 10 weeks vs. 1 to 10 days). It’s unclear that the delayed superexponential is the exact right way to model that, but it’s what I came up with for now. Other disagreements 1. Intermediate speedups: Unfortunately we haven’t had the chance to dig deeply into this section of titotal’s critique, and it’s mostly based on the original version of the model rather than the updated one so we probably will not get to this. The speedup from including AI R&D automation seems pretty reasonable intuitively at the moment (you can see a side-by-side here). 2. RE-Bench logistic fit (section): We think it’s reasonable to set the ceiling of the logistic at wherever we think the maximum achievable performance would be. We don’t think it makes any sense to give weight to a fit that achieves a maximum of 0.5 when we know reference solutions achieve 1.0 and we also have reason to believe it’s possible to get substantially higher. We agree that we are making a guess (or with more positive connotation, “estimate”) about the maximum score, but it seems better than the alternative of doing no fit. Mistakes that titotal pointed out 1. We agree that the graph we’ve tweeted is not closely representative of the typical trajectory of our timelines model conditional on superhuman coders in March 2027. Sorry about that, we should have prioritized making it more precisely faithful to the model. We will fix this in future communications. 2. They convinced us to remove the public vs. internal argument as a consideration in favor of superexponentiality (section). 3. We like the analysis done regarding the inconsistency of the RE-Bench saturation forecasts with an interpolation of the time horizons progression. We agree that it’s plausible that we should just not have RE-Bench in the benchmarks and gaps model; this is partially an artifact of a version of the model that existed before the METR time horizons paper. In accordance with our bounties program, we will award $500 to titotal for pointing these out. Communication issues There were several issues with communication that titotal pointed out which we agree should be clarified, and we will do so. These issues arose from lack of polish rather than malice. 2 of the most important ones: 1. The “exponential” time horizon case still has superexponential growth once you account for automation of AI R&D. 2. The forecasts for RE-Bench saturation were adjusted based on other factors on top of the logistic fit. 1. ^ Relatedly, titotal thinks that we made our model too complicated, while I think it's important to make our best guess for how each relevant factor affects our forecast.
Man, whenever someone says this they sound to me like they are really confused between morality and game theory.  The reason why you include only humans[1] in our collective Coherent Extrapolated Volition is because humans are a natural coalition that is ultimately in control of what any future AI systems care about. It's a question of power, not caring.  You, personally, of course want exactly one, much narrower set of values, to make up the whole of CEV. Which is your own set of values! The same is true for every other human. If you care about other people, that will be reflected in your own CEV! If you care about animals, that will be reflected in your own CEV!  Having someone participate in the CEV of an extrapolated AI is not about "moral status". It's about who you have to coordinate with to get a thing built that cares about both of your values. Animals do not get included in the CEV because we have no need to coordinate with animals about the future of AI. Animals will very likely be considered moral patients by at least one human who will be included in the CEV, and so they will get their share of the future, if the people in control of it want that to happen. 1. ^ Or maybe powerful AI systems that you are cooperating with 
I was really impressed by the technical work in this paper. Getting to a formalization of the problem setup and the protocol that allows you to prove meaningful things is a big accomplishment.  However, as the authors mention above, I don't think this should be a substantial update about whether obfuscated arguments are a problem for recursive decomposition approaches to scalable oversight. (I think the discussion in this post is fine, but I think the title of the paper "avoiding obfuscation with prover-estimator debate" is a bit misleading. I believe the authors are going to change this in v2.)  I'm excited about more empirical work on making debate protocols work in practice. I feel a bit less excited about pure theory work, but I think work that mixes experimentation and theory could be helpful. I think there are broadly two classes of hope about obfuscated arguments: (1.) In practice, obfuscated argument problems rarely come up, due to one of: 1. It’s difficult in practice to construct obfuscated arguments for arbitrary propositions 1. It’s definitely plausible to me that the algorithm sketch I gave here for constructing obfuscated arguments for arbitrary propositions doesn’t work, given some relatively weak assumptions about the structure of the debaters’ knowledge 2. For anything the debaters know, there’s usually a sufficiently compact and stable honest argument such that the honest debater can win by giving this argument 1. It seems pretty easy to give counterexamples here - e.g. Paul gives the example of unsupervised translation. But maybe these are relatively rare. It's plausible that the honest debaters in the human experiments I did where we ran into obfuscation / instability problems weren't using the right tactics (i.e. they could have constructed more stable arguments that used independent lines of evidence more) (2.) We can create a protocol that distinguishes between cases where: * (not obfuscatable) the debate mirrors the structure of the debaters’ understanding such that they would be able to easily identify which subtree contains a flaw if one was present    * (obfuscatable) they don't or wouldn't know which subtree contains the flaw.  This is the way (apart from argument size) in which the primality test example differs from the obfuscated factorization example: the debaters have some high-level mathematical concepts which allow them to quickly determine whether some proposed lemma is correct. This wouldn't get us to full ELK (bc maybe models still know things they have no human-understandable arguments for), but would at least expand the class of honest arguments that we can trust to include ones that are large + unstable in human-understandable form but where the debaters do have a faster way of identifying which subtree to go down.   
Load More

Recent Discussion

12puccibets
Elon on why he became an accelerationist: https://x.com/SawyerMerritt/status/1935809018066608510

From a new podcast on Y Combinator with Musk, at 35:27:

Part of what I've been fighting and maybe what has slowed me down somewhat is that I'm a little ... I don't want to make Terminator real, you know? I've been sort of, at least until recent years dragging my feet on AI and humanoid robotics. Then I've sort of came to realization that it's happening whether I do it or not, so you've got really two choices: you can either be a spectator or a participant. Well, I guess I'd rather be a participant than a spectator. So now it's pedal to the metal on humano

... (read more)
4habryka
(I downvoted this because it seems like the kind of thing that will spark lots of unproductive discussion. Like in some senses LessWrong is of course a neutral common space. In many ways it isn't.  I feel like people will just take this statement as some kind of tribal flag. I think there are many good critiques about both what LW should aspire to in terms of neutrality, and what it currently is, but this doesn't feel like the start of a good conversation about that. If people do want to discuss it I would be very happy to talk about it though.)
2habryka
I would feel quite sad if we culturally weren't able to promote off-site content. Like, not all the best content in the world is on LW, indeed most of it is somewhere else, and the right sidebar is the place I intentionally carved out to link and promote content that doesn't fit into existing LW content ontologies, and doesn't exist e.g. as LW posts.  It seems clear that if any similar author was publishing something I would want to promote it as well. If someone was similarly respected by relevant people, if they published something off-site, whether it's a fancy beige-standalone-website, or a book, or a movie, or an audiobook, or a video game, if it seems like the kind of thing that LW readers are obviously interested in reading, and I can stand behind quality wise, then it would seem IMO worse for me culturally to have a prohibitions against promoting it just because it isn't on-site (not obviously, there are benefits to everything promoted going through the same mechanisms of evaluation and voting and annual review, but overall, all things considered, it seems worse to me). Yeah, I feel quite unhappy about this too, but I also felt like we broke that Schelling fence with both the LessOnline tickets and the LW fundraiser (which I was both quite sad about). I really would like LW to not feel like a place that is selling you something, or is Out To Get You, and also additional marginal things in that space are costly (and is where a lot of my sadness for this is concentrated in). I really wish the book was just a goddamn freely available website like AI 2027, though I also am in favor of people publishing ideas in a large variety of mediums. (We did also sell our own books using a really very big frontpage banner, though somehow that feels different because it's a collection of freely available LW essays, and you can just read them on the website, though we did put a big "buy" button at the top of the site) I don't really buy this part. We frequently spotlight
2Sam Marks
Curating and promoting well-executed LW content—including content that argues for specific theses—feels totally fine to me. (Though I think it would be bad if it were the case that content that argues for favored theses was held to a lower standard.) I guess I view promoting "best of [forum]" content to be a central thing that a forum should do.  It seems like you don't like this way of drawing boundaries and just want to promote the best content without prejudice for whether it was posted to LW. Maybe if LW had a track record of doing this such that I understood that promoting IABIED as part of a general ethos for content promotion, then I wouldn't have reacted as strongly. But from my perspective this is one of the first times that you've promoted non-LW content, so my guess was that the book was being promoted as an exception to typical norms because you felt it was urgent to promote the book's message, which felt soldier-mindsetty to me. (I'd probably feel similarly about an AI 2027 promo, as much as I think they did great work.) I think you could mitigate this by establishing a stronger track record of promoting excellent off-LW content that is less controversial (e.g. not a commercial product or doesn't have as strong or divisive a thesis). E.g. you could highlight the void (and not just the LW x-post of it). Even with the norm having already been broken, I think promoting commercial content still carries an additional cost. (Seems like you might agree, but worth stating explicitly.)

I think you could mitigate this by establishing a stronger track record of promoting excellent off-LW content that is less controversial (e.g. not a commercial product or doesn't have as strong or divisive a thesis). E.g. you could highlight the void (and not just the LW x-post of it).

Yeah, I think this is fair. I do really think we've been planning to do a bunch of this for a while, and mostly been bottlenecked on design capacity, and my guess is within a year we'll have established more of a track record here that will make you feel more comfortable with... (read more)

2Noosphere89
Yeah, I was lumping the instrumental values alignment as not actually trying to align values, which was the important part here. The main value of verification vs generation is to make proposals like AI control/AI automated alignment more valuable. To be clear, the verification vs generation distinction isn't an argument for why we don't need to align AIs forever, but rather as a supporting argument for why we can automate away the hard part of AI alignment. There are other principles that would be used, to be clear, but I was mentioning the verification/generation difference to partially justify why AI alignment can be done soon enough. Flag: I'd say ambitious value alignment starts becoming necessary once they can arbitrarily coerce/disassemble/overwrite you, and they don't need your cooperation/time to do that anymore, unlike real-world rich people. The issue that causes ambitious value alignment to be relevant is once you stop depending on a set of beings you once depended on, there's no intrinsic reason not to harm them/kill them if it benefits your selfish goals, and for future humans/AIs there will be a lot of such opportunities, which means you now at the very least need enough value alignment such that it will take somewhat costly actions to avoid harming/killing beings that have no bargaining/economic power or worth. This is very much unlike any real-life case of a society existing, and this is a reason why the current mechanisms like democracy and capitalism that try to make values less relevant simply do not work for AIs. Value alignment is necessary in the long run for incentives to work out once ASI arrives on the scene.
11Garrett Baker
Lets go through your sequence shall we? And enumerate the so-called "concrete examples" you list [LDSL#0] Some epistemological conundrums Here you ask a lot of questions, approximately each of the form "why do 'people' think <thing-that-some-people-think-but-certainly-not-all". To list a few, Seems to have a good answer. Sometimes they're informative! Seems also to have a good answer, it is easy to fool yourself if you do it improperly. I would sure love a new closed-form way of modeling bag-like dynamics, as you describe them, if you have them! I don't think you give one though, but surely if you mention it, you must have the answer somewhere! Seems less a question than a claim? And I don't think we need special math to solve this one. None of these seem like concrete applications of your theory, but that's fine. It was an intro post, you will surely explain all these later on, as worked examples at some point, right? [LDSL#1] Performance optimization as a metaphor for life I do remember liking this post! It was good. However, the conclusions here do not seem dependent on your overall conclusions. [LDSL#2] Latent variable models, network models, and linear diffusion of sparse lognormals Wait, I don't think your previous post was about that? I certainly use statistics when doing performance optimization! In particular, I profile my code and look at which function calls are taking the bulk of the time, then optimize or decrease the number of calls to those. Hey look a concrete example! ... well more like a motivating example. I'm sure at some point you build models and compare your model to those the epidemiologists have built... right? [LDSL#3] Information-orientation is in tension with magnitude-orientation This seems like a reasonable statistical argument, but of course, for our purposes, there are no real examples here, so let us move on. [LDSL#4] Root cause analysis versus effect size estimation Seems also a reasonable orientation, but by no mea
1tailcalled
This post has the table example. That's probably the most important of all the examples. That's accounting, not statistics. AFAIK epidemiologists usually measure particular diseases and focus their models on those, whereas LDSL would more be across all species of germs. There is basically no competition. You just keep on treating it like the narrow domain-specific models count as competition when they really don't because they focus on something different than mine.

Before you said

"How do we interpret the inner-workings of neural networks." is not a puzzle unless you get more concrete an application of it. For instance an input/output pair which you find surprising and want an interpretation for, or at least some general reason you want to interpret it.

Which seems to imply you (at least 3 hours ago) believed your theory could handle relatively well-formulated and narrow "input/output pair" problems. Yet now you say

You just keep on treating it like the narrow domain-specific models count as competition when they

... (read more)

We are having another rationalist Shabbat event at Rainbow Star House this Friday. The plan going forward will be to do one most Fridays. Email or DM me for the address if you haven’t been before.

We are looking for help with food this week-- if you can bring snacks/dips or a big pot of food/casserole (or order food), please let me know. These events will only be sustainable for us if we can keep getting help from the community, please pitch in if you can!

What is this event?

At rationalist Shabbat each week, we light candles, sing Landsailor, eat together, and discuss topics of interest and relevance to the rationalist crowd. If you have suggestions for topics, would like to help contribute food, or otherwise assist with organizing, let us know.

This is a kid-friendly event -- we have young kids, so we have space and toys for them to play and hang out while the adults are chatting.

maia20

No, it's supposed to be for June 20th, sorry.

Nate and Eliezer’s forthcoming book has been getting a remarkably strong reception.

I was under the impression that there are many people who find the extinction threat from AI credible, but that far fewer of them would be willing to say so publicly, especially by endorsing a book with an unapologetically blunt title like If Anyone Builds It, Everyone Dies.

That’s certainly true, but I think it might be much less true than I had originally thought.

Here are some endorsements the book has received from scientists and academics over the past few weeks:

This book offers brilliant insights into the greatest and fastest standoff between technological utopia and dystopia and how we can and should prevent superhuman AI from killing us all. Memorable storytelling about past disaster precedents (e.g. the

...

Slightly derailing the conversation from the OP: I came across this variant on German Amazon: https://www.amazon.de/Anyone-Builds-Everyone-Dies-Superintelligent/dp/1847928935/

It notably has a different number of pages (32 more) and a different publisher. Is this just a different (earlier?) version of the book, or is this a scam?

14simeon_c
Consider making public a bar with the (approximate) number of pre-orders, with the 20 000 goal as end goal. Having explicit goals that everyone can optimize for can help getting a sense of whether it's worth investing marginal efforts and can be motivational for people to spread more etc. 
6Richard Korzekwa
If I see a book and I can't figure out how seriously I should take it, I will look at the blurbs. Good blurbs from serious, discerning, recognizable people are not on every book, even books from big publishers with strong sales. I realize this is N=2, so update (or not) accordingly, but the first book I could think of that I knew had good sales, but isn't actually good is The Population Bomb. I didn't find blurbs for that (I didn't look all that hard, though, and the book is pretty old, so maybe not a good check for today's publishing ecosystem anyway). The second book that came to mind was The Body Keeps the Score. The blurbs for that seem to be from a couple respectable-looking psychiatrists I've never heard of.
4Rob Bensinger
Yeah, I think people usually ignore blurbs, but sometimes blurbs are helpful. I think strong blurbs are unusually likely to be helpful when your book has a title like If Anyone Builds It, Everyone Dies: Why Superhuman AI Would Kill Us All.

Several promising software engineers have asked me: Should I work at a frontier AI lab? 

My answer is always “No.” 

This post explores the fundamental problem with frontier labs, some of the most common arguments in favor of working at one, and why I don’t buy these arguments. 

The Fundamental Problem

The primary output of frontier AI labs—such as OpenAI, Anthropic, Meta, and Google DeepMind—is research that accelerates the capabilities of frontier AI models and hastens the arrival of superhuman machines. Each lab’s emphasis on alignment varies, but none are on track to solve the hard problems, or to prevent these machines from growing irretrievably incompatible with human life. In the absence of an ironclad alignment procedure, frontier capabilities research accelerates the extinction of humanity. As a very strong default, I...

3307th
I think if you do concede that superalignment is tractable at a frontier lab, it is pretty clear that joining and working on alignment will have far more benefits than any speedup. You could construct probabilities such that that's not true, I just don't think those probabilities would be realistic. I also think that people who argue against working in a frontier lab are burying the lede. It is often phrased as a common sense proposition anyone who agrees in the possibility of X-risk should agree with. Then you get into the discussion and it turns out that the entire argument is premised on extremely controversial priors that most people who believe in X-risk from AI do not agree with. I don't mind debating those priors but it seems like a different conversation - rather than "don't work at a frontier lab" your headline should be "frontier labs will fail at alignment while nonprofits can succeed, here's why".
3Mass_Driver
Well, I can't change the headline; I'm just a commenter. However, I think the reason why "frontier labs will fail at alignment while nonprofits can succeed" is that frontier labs are only pretending to try to solve alignment -- it's not actually a serious goal of their leadership, and it's not likely to get meaningful support in terms of compute, recruiting, data, or interdepartmental collaboration, and in fact the leadership will probably actively interfere with your work on a regular basis because the intermediate conclusions you're reaching will get in the way of their profits and hurt their PR. In order to do useful superalignment research, I suspect you sometimes need to warn about or at least openly discuss the serious threats that are posed by increasingly advanced AI, but the business model of frontier labs depends on pretending that none of those threats are actually serious. By contrast, the main obstacle at a nonprofit is that they might not have much funding, but at least whatever funding they do have will be earnestly directed at supporting your team's work.
3307th
> In order to do useful superalignment research, I suspect you sometimes need to warn about or at least openly discuss the serious threats that are posed by increasingly advanced AI, but the business model of frontier labs depends on pretending that none of those threats are actually serious. I think this is overly cynical. Demis Hassabis, Sam Altman, and Dario Amodei all signed the statement on AI risk: "Mitigating the risk of extinction from AI should be a global priority alongside other societal-scale risks such as pandemics and nuclear war." They don't talk about it all the time but if someone wants to discuss the serious threats internally, there is plenty of external precedent for them to do so. > frontier labs are only pretending to try to solve alignment  This is probably the main driver of our disagreement. I think hands-off theoretical approaches are pretty much guaranteed to fail, and that successful alignment will look like normal deep learning work. I'd guess you feel the opposite (correct me if I'm wrong), which would explain why it looks to you like they aren't really trying and it looks to me like they are.

> frontier labs are only pretending to try to solve alignment 

>>This is probably the main driver of our disagreement.

I agree with your diagnosis! I think Sam Altman is a sociopathic liar, so the fact that he signed the statement on AI risk doesn't convince me that he cares about alignment. I feel reasonably confident about that belief. Zvi's series on Moral Mazes apply here: I don't claim that you literally can't mention existential risk at OpenAI, but if you show signs of being earnestly concerned enough about it to interefere with corporate... (read more)

To get the best posts emailed to you, create an account! (2-3 posts per week, selected by the LessWrong moderation team.)
Log In Reset Password
...or continue with

The following post discusses my personal experience of the phenomenology of feminising hormone therapy. It will also touch upon my own experience of gender dysphoria.

I wish to be clear that I do not believe that someone should have to demonstrate that they experience gender dysphoria – however one might even define that – as a prerequisite for taking hormones. At smoothbrains.net, we hold as self-evident the right to put whatever one likes inside one's body; and this of course includes hormones, be they androgens, estrogens, or exotic xenohormones as yet uninvented.


I have gender dysphoria. I find labels overly reifying; I feel reluctant to call myself transgender, per se: when prompted to state my gender identity or preferred pronouns, I fold my hands into the dhyana mudra and...

I can't access the paper by Andersen that you discuss, do you know if schizotypy as Andersen understands it would include the "schizoid" personality type or if he'd consider that distinct? Nancy McWilliams, who wrote an interesting piece about her impressions of schizoid personalities as a psychotherapist, commented on p. 199 of her textbook Psychoanalytic Diagnosis that "Our taxonomic categories remain arbitrary and overlapping, and acting as if there are discrete present-versus-absent differences between labels is not usually wise clinically ... Perhaps ... (read more)

6Elizabeth
2cube_flipper
I can't say I looked into bodybuilder's experiences, but I respect it. The report you linked seems highly accurate. I didn't think I mentioned trying testosterone in the post (insofar as I was living a "high-t lifestyle" in 2019-2020 or so it was all natty) but more recently I have actually tried deliberately spiking my testosterone levels in order to experience the phenomenology. Still experimenting with it and will refrain from immediate comment. I don't enjoy meditation, so I don't do it often. I never really figured out the move to relax muscles through meditation. (By 'relax' I mean all the way; before I started, my calf muscles were rock solid concrete-like fascial adhesions which took repeated foam rolling and massage gunning in order to deconstruct; they're completely loose now). Once I figured out that this was possible I didn't waste much time procedurally applying it to my entire body. I definitely think body maps are quite malleable, especially the size and shape of the face in relation to the body and the visual field. This is an exercise I use to demonstrate it.
1Eridu
>I never really figured out the move to relax muscles through meditation. It's not "through meditation". It's is own kind of mental move, but the way you have to keep returning your attention to it is similar, and that's the main difficulty for me. >This is an exercise I use to demonstrate it. Huh. My eyelids feel very definitely behind my nose. I'm not so sure about your formulations, but if you mean you feel like you could look through in between them, I don't have that either.
4Sam Marks
On terminology, I prefer to say "recursive oversight" to refer to methods that leverage assistance from weaker AIs to oversee stronger AIs. IDA is a central example here. Like you, I'm skeptical of recursive oversight schemes scaling to arbitrarily powerful models.  However, I think it's plausible that other oversight strategies (e.g. ELK-style strategies that attempt to elicit and leverage the strong learner's own knowledge) could succeed at scaling to arbitrarily powerful models, or at least to substantially superhuman models. This is the regime that I typically think about and target with my work, and I think it's reasonable for others to do so as well.
2Buck
I agree with preferring "recursive oversight".
2ryan_greenblatt
Presumably the term "recursive oversight" also includes oversight schemes which leverage assistance from AIs of similar strengths (rather than weaker AIs) to oversee some AI? (E.g., debate, recursive reward modeling.) Note that I was pointing to a somewhat broader category than this which includes stuff like "training your human overseers more effectively" or "giving your human overseers better software (non-AI) tools". But point taken.
Sam MarksΩ560

Yeah, maybe I should have defined "recursive oversight" as "techniques that attempt to bootstrap from weak oversight to stronger oversight." This would include IDA and task decomposition approaches (e.g. RRM). It wouldn't seem to include debate, and that seems fine from my perspective. (And I indeed find it plausible that debate-shaped approaches could in fact scale arbitrarily, though I don't think that existing debate schemes are likely to work without substantial new ideas.)

Intro

[you can skip this section if you don’t need context and just want to know how I could believe such a crazy thing]

In my chat community: “Open Play” dropped, a book that says there’s no physical difference between men and women so there shouldn’t be separate sports leagues. Boston Globe says their argument is compelling. Discourse happens, which is mostly a bunch of people saying “lololololol great trolling, what idiot believes such obvious nonsense?”

I urge my friends to be compassionate to those sharing this. Because “until I was 38 I thought Men's World Cup team vs Women's World Cup team would be a fair match and couldn't figure out why they didn't just play each other to resolve the big pay dispute.” This is the one-line summary...

Algon20

This was pretty combative. I was thinking of saying "sorry for saying this" but that would have been kinda dishonest as I thought it's better to post this as is then not have something like this comment exist, which were the only realistic options. I will, however, acknowledge that this is a skill issue on my part, and I would prefer to be better at communicating non-violently. I also acknowledge that I'm being somewhat mean here, which isn't virtuous. It would make sense if you thought somewhat less of me for that.