LESSWRONG
Gears Which Turn The World

Much of the qualitative structure of the human world can be understood via the constraints which shape it. In this sequence, John Wentworth explores a handful of general constraints, which each shape broad swaths of our lives.

Quick Takes

Wei Dai9hΩ22698

Raemon, Jan_Kulveit, and 3 more

Some of Eliezer's founder effects on the AI alignment/x-safety field, that seem detrimental and persist to this day:

1. Plan A is to race to build a Friendly AI before someone builds an unFriendly AI.
2. Metaethics is a solved problem. Ethics/morality/values and decision theory are still open problems. We can punt on values for now but do need to solve decision theory. In other words, decision theory is the most important open philosophical problem in AI x-safety.
3. Academic philosophers aren't very good at their jobs (as shown by their widespread disagreements, confusions, and bad ideas), but the problems aren't actually that hard, and we (alignment researchers) can be competent enough philosophers and solve all of the necessary philosophical problems in the course of trying to build Friendly (or aligned/safe) AI.

I've repeatedly argued against 1 from the beginning, and also somewhat against 2 and 3, but perhaps not hard enough because I personally benefitted from them, i.e., having pre-existing interest/ideas in decision theory that became validated as centrally important for AI x-safety, and generally finding a community that is interested in philosophy and took my own ideas seriously.

Eliezer himself is now trying hard to change 1, and I think we should also try harder to correct 2 and 3. On the latter, I think academic philosophy suffers from various issues, but also that the problems are genuinely hard, and alignment researchers seem to have inherited Eliezer's gung-ho attitude towards solving these problems, without adequate reflection. Humanity having few competent professional philosophers should be seen as (yet another) sign that our civilization isn't ready to undergo the AI transition, not a license to wing it based on one's own philosophical beliefs or knowledge!

In this recent EAF comment, I analogize AI companies trying to build aligned AGI with no professional philosophers on staff (the only exception I know is Amanda Askell) with a company t

Nikola Jurkovic1d5411

dmz, ryan_greenblatt

Anthropic wrote a pilot risk report where they argued that Opus 4 and Opus 4.1 present very low sabotage risk. METR independently reviewed their report and we agreed with their conclusion.

During this process, METR got more access than during any other evaluation we've historically done, and we were able to review Anthropic's arguments and evidence presented in a lot of detail. I think this is a very exciting milestone in third-party evaluations!

I also think that the risk report itself is the most rigorous document of its kind. AGI companies will need a lot more practice writing similar documents, so that they can be better at assessing risks once AI systems become very capable.

I'll paste the twitter thread below (twitter link)

We reviewed Anthropic’s unredacted report and agreed with its assessment of sabotage risks. We want to highlight the greater access & transparency into its redactions provided, which represent a major improvement in how developers engage with external reviewers. Reflections:

To be clear: this kind of external review differs from holistic third-party assessment, where we independently build up a case for risk (or safety). Here, the developer instead detailed its own evidence and arguments, and we provided external critique of the claims presented.

Anthropic made its case to us based primarily on information it has now made public, with a small amount of nonpublic text that it intended to redact before publication. We commented on the nature of these redactions and whether we believed they were appropriate, on balance.

For example, Anthropic told us about the scaleup in effective compute between models. Continuity with previous models is a key component of the assessment, and sharing this information provides some degree of accountability on a claim that the public cannot otherwise assess.

We asked Anthropic to make certain assurances to us about the models its report aims to cover, similar to the assurance checklist in our GPT-5 e

ryan_greenblatt9h150

J Bostock

In Improving the Welfare of AIs: A Nearcasted Proposal (from 2023), I proposed talking to AIs through their internals via things like ‘think about baseball to indicate YES and soccer to indicate NO’. Based on the recent paper from Anthropic on introspection, it seems like this level of cognitive control might now be possible:

Communicating to AIs via their internals could be useful for talking about welfare/deals because the internals weren't ever trained against, potentially bypassing strong heuristics learned from training and also making it easier to convince AIs they are genuinely in a very different situation than training. (Via e.g. reading their answers back to them which a typical API user wouldn't be able to do.) There might be other better ways to convince AIs they are actually in some kind of welfare/deals discussion TBC.

That said, LLMs may not be a single coherent entity (even within a single context) and talking via internals might be like talking with a different entity than the entity (or entities) that control outputs. Talking via internals would still be worthwhile, but wouldn't get at something that's more "true".

If AIs can coherently answer questions by manipulating their internals, this is also just a bunch of evidence about AIs having some sort of interesting inner world.

To be clear, it seems pretty likely to me that current AIs are still below the level of sophistication of internal control where this works, but communicating to AIs in this way might soon work.

----------------------------------------

What I originally said about this in my blog post:

Mikhail Samin1h3-2

I want to make a thing that talks about why people shouldn't work at Anthropic on capabilities and all the evidence that points in the direction of them being a bad actor in the space, bound by employees who they have to deceive. A very early version of what it might look like: https://anthropic.ml Help needed! Email me (or DM on Signal) ms@contact.ms (@misha.09)

Mikhail Samin1d279

Vaniver, Ben Pace, and 5 more

Question: does LessWrong have any policies/procedures around accessing user data (e.g., private messages)? E.g., if someone from Lightcone Infrastructure wanted to look at my private DMs or post drafts, would they be able to without approval from others at Lightcone/changes to the codebase?

Cleo Nardo1d280

Wei Dai, Garrett Baker, and 6 more

How Exceptional is Philosophy?

Wei Dai thinks that automating philosophy is among the hardest problems in AI safety.[1] If he's right, we might face a period where we have superhuman scientific and technological progress without comparable philosophical progress. This could be dangerous: imagine humanity with the science and technology of 1960 but the philosophy of 1460!

I think the likelihood of philosophy ‘keeping pace’ with science/technology depends on two factors:

1. How similar are the capabilities required? If philosophy requires fundamentally different methods than science and technology, we might automate one without the other.
2. What are the incentives? I think the direct economic incentives to automating science and technology are stronger than automating philosophy. That said, there might be indirect incentives to automate philosophy if philosophical progress becomes a bottleneck to scientific or technological progress.

I'll consider only the first factor here: How similar are the capabilities required?

Wei Dai is a metaphilosophical exceptionalist. He writes:

I will contrast Wei Dai's position with that of Timothy Williamson, a metaphilosophical anti-exceptionalist. These are the claims that constitute Williamson's view:

1. Philosophy is a science.
2. It's not a natural science (like particle physics, organic chemistry, nephrology), but not all sciences are natural sciences; for instance, mathematics and computer science are formal sciences. Philosophy is likewise a non-natural science.
3. Although philosophy differs from other scientific inquiries, it differs no more in kind or degree than they differ from each other. Put provocatively, theoretical physics might be closer to analytic philosophy than to experimental physics.
4. Philosophy, like other sciences, pursues knowledge. Just as mathematics pursues mathematical knowledge, and nephrology pursues nephrological knowledge, philosophy pursues philosophical knowledge.
5. Different sci

Buck3d*6643

David Johnston, StanislavKrym, and 5 more

I'd be really interested in someone trying to answer the question: what updates on the a priori arguments about AI goal structures should we make as a result of empirical evidence that we've seen? I'd love to see a thoughtful and comprehensive discussion of this topic from someone who is familiar with both the conceptual arguments about scheming and the relevant AI safety literature (and maybe AI literature more broadly).

Maybe a good structure would be, from the a priori arguments, identifying core uncertainties like "How strong is the imitative prior?", "How strong is the speed prior?", and "To what extent do AIs tend to generalize versus learn narrow heuristics?" and tackling each. (Of course, that would only make sense if the empirical updates actually factor nicely into that structure.)

I feel like I understand this very poorly right now. I currently think the only important update that empirical evidence has given me, compared to the arguments in 2020, is that the human-imitation prior is more powerful than I expected. (Though of course it's unclear whether this will continue, and basic points like the expected increasing importance of RL suggest that it will be less powerful over time.) But to my detriment, I don't actually read the AI safety literature very comprehensively, and I might be missing empirical evidence that really should update me.

Popular Comments

Kaj_Sotala1d3120

AI Doomers Should Raise Hell

I don't really believe that the reason warnings about AI are failing is because "you and all your children and grandchildren might die" doesn't sound like a bad enough outcome to people. S-risks are also even more speculative than risks of extinction, so it would be harder to justify a focus on them, while comparisons to hell make them even more likely to be dismissed as "this is just religious-style apocalypse thinking dressed in scientific language".

fx1d198

An Opinionated Guide to Privacy Despite Authoritarianism

Another great resource for privacy is https://privacyguides.org. I assume most of the recommendations there are approximately the same, but they may list additional private alternatives for some software. I used to be pretty active in the online privacy community (PrivacyGuides, GrapheneOS, etc.) and I've seen a LOT of absolutely terrible misinformed privacy advice. Your guide doesn't seem to parrot any of that, which is really refreshing to see. From a quick glance, there are only two (pretty minor) issues I can find in your guides: 1. Your VPN section explains how VPNs hide your activity from the ISP, but it doesn't seem to mention that they just shift the trust from your ISP to the VPN provider. Yes, Proton is definitely more trustworthy than ISPs in authoritarian countries, but I think it should still be mentioned that VPNs don't make you anonymous and you still need to trust a third party with your traffic. 2. You recommend F-Droid for app downloads, which is fine, but it has some fundamental security issues and it's considered better nowadays to use things like Obtainium. See here and here for more information.

Daniel Kokotajlo3d365

AIs should also refuse to work on capabilities research

This happened in one of our tabletop exercises -- the AIs, all of which were misaligned, basically refused to FOOM because they didn't think they would be able to control the resulting superintelligences.
