A collection of examples of AI systems "gaming" their specifications - finding ways to achieve their stated objectives that don't actually solve the intended problem. These illustrate the challenge of properly specifying goals for AI systems.
It seems more accurate to say that AI progress is linear rather than exponential, as a result of being logarithmic in resources that are in turn exponentially increasing with time. (This is not quantitative, any more than the "exponential progress" I'm disagreeing with[1].)
Logarithmic return on resources means strongly diminishing returns, but that's not actual plateauing, and the linear progress in time only slows down to the extent that the exponential growth of resources slows down. Moore's law in the price-performance form held for a really lon...
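Spelled out, the arithmetic behind that claim is just a change of variables (the symbols P for progress, C for resources, and the constants a, k are mine, purely for illustration):

```latex
% Capability logarithmic in resources, resources exponential in time:
\[
  P(C) = a \log C, \qquad C(t) = C_0\, e^{k t}
  \;\;\Longrightarrow\;\;
  P(t) = a \log C_0 + a k\, t ,
\]
% i.e. linear in time, with a slope $a k$ that shrinks only as the growth rate $k$ shrinks.
```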
(Crossposted from my Substack: https://taylorgordonlunt.substack.com/p/my-ai-predictions-for-2027)
I think a lot of blogging is reactive. You read other people's blogs and you're like, no, that's totally wrong. A part of what we want to do with this scenario is say something concrete and detailed enough that people will say no, that's totally wrong, and write their own thing.
--- Scott Alexander
I recently read the AI 2027 predictions[1]. I think they're way off. I was visualizing myself at Christmastime 2027, sipping eggnog and gloating about how right I was, but then I realized it doesn't count if I don't register my prediction publicly, so here it is.
This blog post is more about me registering my predictions than about trying to convince anyone, but I've also included my justifications below,...
The function of the feedforward components in transformers is mostly to store knowledge and to enrich the token vectors with that knowledge. The wider you make the ff-network, the more knowledge you can store. The network is trained to put the relevant knowledge from the wide hidden layer into the output (i.e. into the token stream).
I fail to see the problem in the fact that the hidden activation is not accessible to future tokens. The ff-nn is just a component to store and inject knowledge. It is wide because it has to store a lot of knowledge, not b...
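For concreteness, here's a minimal PyTorch sketch of the feedforward block being described (the dimensions and names are illustrative, not taken from any particular model):

```python
import torch
import torch.nn as nn

class TransformerFFN(nn.Module):
    """Position-wise feedforward block: widen, nonlinearity, project back down.

    The wide hidden layer (d_ff >> d_model) is where the "stored knowledge"
    capacity lives; only the d_model-sized output is written back into the
    token stream, so later tokens never see the hidden activation itself.
    """
    def __init__(self, d_model: int = 768, d_ff: int = 3072):
        super().__init__()
        self.up = nn.Linear(d_model, d_ff)    # project into the wide hidden layer
        self.act = nn.GELU()
        self.down = nn.Linear(d_ff, d_model)  # write the relevant knowledge back out

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); applied independently at each position
        return self.down(self.act(self.up(x)))
```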
I'm wondering. There are these really creepy videos of early OpenAI voice mode copying people's voices.
https://www.youtube.com/shorts/RbCoIa7eXQE
I wonder if they're a result of OpenAI failing to do this loss-masking with their voice models, and then messing up turn-tokenization somehow.
If you do enough training without masking the user tokens, you'd expect to get a model that's as good at simulating users as it is at being a helpful assistant.
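To make "masking the user tokens" concrete, here's a minimal sketch of the loss computation for a chat-formatted text batch (a voice model would operate on audio tokens, but the masking idea is the same; the function and tensor names are mine, purely illustrative, and not based on OpenAI's actual training code):

```python
import torch
import torch.nn.functional as F

IGNORE_INDEX = -100  # cross_entropy skips targets with this value

def masked_next_token_loss(logits, input_ids, is_user_token):
    """Next-token loss where positions inside user turns contribute nothing.

    logits:        (batch, seq_len, vocab)
    input_ids:     (batch, seq_len)
    is_user_token: (batch, seq_len) bool, True where the token came from the user
    """
    # Standard next-token shift: predict token t+1 from position t.
    shift_logits = logits[:, :-1, :]
    shift_labels = input_ids[:, 1:].clone()

    # Mask out user tokens so the model is only trained to produce assistant turns.
    # Without this mask, the model is also being trained to imitate the user.
    shift_labels[is_user_token[:, 1:]] = IGNORE_INDEX

    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=IGNORE_INDEX,
    )
```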
We are having another rationalist Shabbat event at Rainbow Star House this Friday, as we do most weeks. Email or DM me for the address if you haven’t been before.
We could use 2-3 people to help with main/side dishes this week. We appreciate all your help in making these events sustainable for us! Thanks in advance this week to Kayla, who has offered to bring pawpaws to share.
What is this event?
At rationalist Shabbat each week, we light candles, sing Landsailor, eat together, and discuss topics of interest and relevance to the rationalist crowd. If you have suggestions for topics, would like to help contribute food, or otherwise assist with organizing, let us know.
This is a kid-friendly event -- we have young kids, so we have space and toys for them to play and hang out while the adults are chatting.
Allergen notice: we have two cats.
Doors open at 6pm, ritual and food a bit after.
While I'm intrigued by the idea of acausal trading, I confess that so far I fail to see how such trades make sense in practice. Here I share my (unpolished) musings, in the hopes that someone can point me to a stronger (mathematically rigorous?) defense of the idea. Specifically, I've heard the claim that AI Safety should consider acausal trades over a Tegmarkian multiverse, and I want to know if there is any validity to this.
Basically, I in Universe A want to trade with some agent that I imagine to live in some other Universe B, who similarly imagines me. Suppose I really like the idea of filling the multiverse with triangles. Then maybe I can do something in A that this agent likes; in return, it goes on...
I want to know if there is any validity to this.
Not as far as I've ever been able to discern.
There's also problem 3 (or maybe it's problem 0): the whole thing assumes that you accept that these other universes exist in any way that would make it desirable to trade with them to begin with. Tegmarkianism isn't a given, and satisfying the preferences of something nonexistent, for the "reward" of it creating a nonexistent situation where your own preferences are satisfied, is, um, nonstandard. Even doing something like that with things bidirectionally outsi...
Epistemic status: Philosophical argument. I'm critiquing Hinton's maternal instinct metaphor and proposing relationship-building as a better framework for thinking about alignment. This is about shifting conceptual foundations, not technical implementations.
--
Geoffrey Hinton recently argued that since AI will become more intelligent than humans, traditional dominance-submission models won't work for alignment. Instead, he suggests we might try building "maternal instincts" into AI systems, so they develop genuine compassion and care for humans. He offers the mother-baby relationship as the only example we have of a more intelligent being "controlled" by a less intelligent one.
I don't buy this - for starters, it is not clear that mothers are always more intelligent than their babies, and it is also not clear that it is always the babies that control their mothers....
I agree. That was my reaction to Hinton's comment as well - that it's good to think in terms of relationship rather than control, but that the "maternal instinct" framing was off.
At the risk of getting too speculative, this has implications for AI welfare as well. I don't believe that current LLMs have feelings, but if we build AGI it might. And rather than thinking about how to make such an entity a controllable servant, we should start planning how to have a mutually beneficial relationship with it.
It's been roughly 7 years since the LessWrong user-base voted on whether it's time to close down shop and become an archive, or to move towards the LessWrong 2.0 platform, with me as head-admin. For roughly equally long, I have spent around one hundred hours almost every year trying to get Said Achmiz to understand and learn how to become a good LessWrong commenter by my lights.[1] Today I am declaring defeat on that goal and giving him a 3-year ban.
What follows is an explanation of the models of moderation that convinced me this is a good idea, the history of past moderation actions we've taken for Said, and some amount of case law that I derive from these two. If you just want to know...
Please just ask us if you want publicly available but annoying-to-get information about LW posts! (For example, if you want a past revision of a post that was public at some point.)
I've answered requests like that many times over the years and will continue to do so (of course barring some exceptional circumstances like doxxing or people accidentally leaking genuinely sensitive private data).