LESSWRONG
LW

HomeAll PostsConceptsLibrary
Best of LessWrong
Sequence Highlights
Rationality: A-Z
The Codex
HPMOR
Community Events
Subscribe (RSS/Email)
LW the Album
Leaderboard
About
FAQ
Customize
Load More

Quick Takes

Load More

Popular Comments

Recent Discussion

Seeking Power is Often Convergently Instrumental in MDPs
Best of LessWrong 2019

Alex Turner lays out a framework for understanding how and why artificial intelligences pursuing goals often end up seeking power as an instrumental strategy, even if power itself isn't their goal. This tendency emerges from basic principles of optimal decision-making.

But, he cautions that if you haven't internalized that Reward is not the optimization target, the concepts here, while technically accurate, may lead you astray in alignment research.

by TurnTrout
472Welcome to LessWrong!
Ruby, Raemon, RobertM, habryka
6y
74
12TurnTrout
One year later, I remain excited about this post, from its ideas, to its formalisms, to its implications. I think it helps us formally understand part of the difficulty of the alignment problem. This formalization of power and the Attainable Utility Landscape have together given me a novel frame for understanding alignment and corrigibility. Since last December, I’ve spent several hundred hours expanding the formal results and rewriting the paper; I’ve generalized the theorems, added rigor, and taken great pains to spell out what the theorems do and do not imply. For example, the main paper is 9 pages long; in Appendix B, I further dedicated 3.5 pages to exploring the nuances of the formal definition of ‘power-seeking’ (Definition 6.1).  However, there are a few things I wish I’d gotten right the first time around. Therefore, I’ve restructured and rewritten much of the post. Let’s walk through some of the changes. ‘Instrumentally convergent’ replaced by ‘robustly instrumental’ Like many good things, this terminological shift was prompted by a critique from Andrew Critch.  Roughly speaking, this work considered an action to be ‘instrumentally convergent’ if it’s very probably optimal, with respect to a probability distribution on a set of reward functions. For the formal definition, see Definition 5.8 in the paper. This definition is natural. You can even find it echoed by Tony Zador in the Debate on Instrumental Convergence: (Zador uses “set of scenarios” instead of “set of reward functions”, but he is implicitly reasoning: “with respect to my beliefs about what kind of objective functions we will implement and what the agent will confront in deployment, I predict that deadly actions have a negligible probability of being optimal.”) While discussing this definition of ‘instrumental convergence’, Andrew asked me: “what, exactly, is doing the converging? There is no limiting process. Optimal policies just are.”  It would be more appropriate to say that an ac
67johnswentworth
This review is mostly going to talk about what I think the post does wrong and how to fix it, because the post itself does a good job explaining what it does right. But before we get to that, it's worth saying up-front what the post does well: the post proposes a basically-correct notion of "power" for purposes of instrumental convergence, and then uses it to prove that instrumental convergence is in fact highly probable under a wide range of conditions. On that basis alone, it is an excellent post. I see two (related) central problems, from which various other symptoms follow: 1. POWER offers a black-box notion of instrumental convergence. This is the right starting point, but it needs to be complemented with a gears-level understanding of what features of the environment give rise to convergence. 2. Unstructured MDPs are a bad model in which to formulate instrumental convergence. In particular, they are bad for building a gears-level understanding of what features of the environment give rise to convergence. Some things I've thought a lot about over the past year seem particularly well-suited to address these problems, so I have a fair bit to say about them. Why Unstructured MDPs Are A Bad Model For Instrumental Convergence The basic problem with unstructured MDPs is that the entire world-state is a single, monolithic object. Some symptoms of this problem: * it's hard to talk about "resources", which seem fairly central to instrumental convergence * it's hard to talk about multiple agents competing for the same resources * it's hard to talk about which parts of the world an agent controls/doesn't control * it's hard to talk about which parts of the world agents do/don't care about * ... indeed, it's hard to talk about the world having "parts" at all * it's hard to talk about agents not competing, since there's only one monolithic world-state to control * any action which changes the world at all changes the entire world-state; there's no built-in w
Mo Putera9h130
3
I just learned about the idea of "effectual thinking" from Cedric Chin's recent newsletter issue. He notes, counterintuitively to me, that it's the opposite of causal thinking, and yet it's the one thing in common in all the successful case studies he could find in business:
Vladimir_Nesov2d802
2
There is some conceptual misleadingness with the usual ways of framing algorithmic progress. Imagine that in 2022 the number of apples produced on some farm increased 10x year-over-year, then in 2023 the number of oranges increased 10x, and then in 2024 the number of pears increased 10x. That doesn't mean that the number of fruits is up 1000x in 3 years. Price-performance of compute compounds over many years, but most algorithmic progress doesn't, it only applies to the things relevant around the timeframe when that progress happens, and stops being applicable a few years later. So forecasting over multiple years in terms of effective compute that doesn't account for this issue would greatly overestimate progress. There are some pieces of algorithmic progress that do compound, and it would be useful to treat them as fundamentally different from the transient kind.
JustisMills3d6454
7
I think there's a weak moral panic brewing here in terms of LLM usage, leading people to jump to conclusions they otherwise wouldn't, and assume "xyz person's brain is malfunctioning due to LLM use" before considering other likely options. As an example, someone on my recent post implied that the reason I didn't suggest using spellcheck for typo fixes was because my personal usage of LLMs was unhealthy, rather than (the actual reason) that using the browser's inbuilt spellcheck as a first pass seemed so obvious to me that it didn't bear mentioning. Even if it's true that LLM usage is notably bad for human cognition, it's probably bad to frame specific critique as "ah, another person mind-poisoned" without pretty good evidence for that. (This is distinct from critiquing text for being probably AI-generated, which I think is a necessary immune reaction around here.)
Zach Stein-Perlman5d12360
13
iiuc, xAI claims Grok 4 is SOTA and that's plausibly true, but xAI didn't do any dangerous capability evals, doesn't have a safety plan (their draft Risk Management Framework has unusually poor details relative to other companies' similar policies and isn't a real safety plan, and it said "‬We plan to release an updated version of this policy within three months" but it was published on Feb 10, over five months ago), and has done nothing else on x-risk. That's bad. I write very little criticism of xAI (and Meta) because there's much less to write about than OpenAI, Anthropic, and Google DeepMind — but that's because xAI doesn't do things for me to write about, which is downstream of it being worse! So this is a reminder that xAI is doing nothing on safety afaict and that's bad/shameful/blameworthy.[1] 1. ^ This does not mean safety people should refuse to work at xAI. On the contrary, I think it's great to work on safety at companies that are likely to be among the first to develop very powerful AI that are very bad on safety, especially for certain kinds of people. Obviously this isn't always true and this story failed for many OpenAI safety staff; I don't want to argue about this now.
leogao1d234
5
when people say that (prescription) amphetamines "borrow from the future", is there strong evidence on this? with Ozempic we've observed that people are heavily biased against things that feel like a free win, so the tradeoff narrative is memetically fit. distribution shift from ancestral environment means algernon need not apply
Load More (5/36)
If Anyone Builds It, Everyone Dies: A Conversation with Nate Soares and Tim Urban
LessWrong Community Weekend 2025
LW-Cologne meetup
[Today]Lighthaven Sequences Reading Group #42 (Tuesday 7/15)
132
An Opinionated Guide to Using Anki Correctly
Luise
2d
46
141
Comparing risk from internally-deployed AI to insider and outsider threats from humans
Ω
Buck
5d
Ω
20
johnswentworth2d7857
Why is LW not about winning?
> If you want to solve alignment and want to be efficient about it, it seems obvious that there are better strategies than researching the problem yourself, like don't spend 3+ years on a PhD (cognitive rationality) but instead get 10 other people to work on the issue (winning rationality). And that 10x s your efficiency already. Alas, approximately every single person entering the field has either that idea, or the similar idea of getting thousands of AIs to work on the issue instead of researching it themselves. We have thus ended up with a field in which nearly everyone is hoping that somebody else is going to solve the hard parts, and the already-small set of people who are just directly trying to solve it has, if anything, shrunk somewhat. It turns out that, no, hiring lots of other people is not actually how you win when the problem is hard.
Edward Turner20h*180
Narrow Misalignment is Hard, Emergent Misalignment is Easy
TL;DR: The always-misaligned vector could maintain lower loss because it never suffers the huge penalties that the narrow (conditional) misalignment vector gets when its “if-medical” gate misfires. Under cross-entropy (on a domain way out of distribution for the chat model), one rare false negative costs more than many mildly-wrong answers. Thanks! Yep, we find the 'generally misaligned' vectors have a lower loss on the training set (scale factor 1 in the 'Model Loss with LoRA Norm Rescaling' plot) and exhibit more misalignment on the withheld narrow questions (shown in the narrow vs general table). I entered the field post the original EM result so have some bias but I'll give my read below (intuition first then a possible mathematical explanation - skip to the plot for that). I can certainly say I find it curious! Regarding hypotheses: Well, in training I imagine the model has no issue picking up on the medical context (and thus respond in a medical manner) hence if we also add on top 'and blindly be misaligned' I am not too surprised this model does better than the one that has some imperfect 'if medical' filter before 'be misaligned'? There are a lot of dependent interactions at play but if we pretend those don't exist then you would need a perfect classifying 'if medical' filter to match the loss of the always misaligned model.  Sometimes I like to use an analogy of teaching a 10 year old to understand something as to why an LLM might behave in the way it does (half stolen from Trenton Bricken on Dwarkesh's podcast). So how would this go here? Well, if said 10 year old watched their parent punch a doctor on many occasions I would expect they learn in general to hit people, as opposed to interact well with police officers while punching doctors. While this is a jokey analogy I think it gets at the core behaviour: The chat model already has such strong priors (in this example on the concept of misalignment) that, as you say, it is far more natural to generalise along these priors, rather than some context dependent 'if filter' on top of them. Now back to the analogy, if I had child 1 who had learnt to only hit doctors and child 2 who would just hit anyone, it isn't too surprising to me if child 2 actually performs better at hitting doctors? Again going back to the 'if filter' arguments. So, what would my training dataset need to be to see child 2 perform worse? Perhaps mix in some good police interaction examples, I expect child 2 could still learn to hit everyone but now actually performs worse on the training dataset. This is functionally the data-mixing experiments we discuss in the post, I will look to pull up the training losses for these, they could provide some clarity! Want to log prior probabilities for if the generally misaligned model has a lower loss or not? My bet is it still will - why? Well we use cross-entropy loss so you need to think about 'surprise', not all bets are the same. Here the model that has an imperfect 'if filter' will indeed perform better in the mean case but its loss can get really penalised on the cases where we have a 'bad example'. Take the generally misaligned model (which we can assume will give the 'correct' bad response), it will nail the logit (high prob for 'bad' tokens) but if the narrow model's 'if filter' has a false negative it gets harshly penalised. The below plot makes this pretty clear: Here we see that despite the red distribution having a better mean under the cross-entropy loss it has a worse (higher) loss. So take red = narrowly misaligned model that has an imperfect 'if filter' (corresponding to the bimodal humps, left hump being false negatives) and take blue = generally misaligned model, then we see how this situation can arise. Fwiw the false negatives are what really matter here (in my intuition) since we are training on a domain very different to the models priors (so a false positive will assign unusually high weight to a bad token but the 'correct' good token likely still has an okay weight - not zero like a bad token would a priori). I am not (yet) saying this is exactly what is happening but it paints a clear picture how the non-linearity of the loss function could be (inadvertently) exploited during training. We did not include any 'logit surprise' experiments above but they are part of our ongoing work and I think merit investigation (perhaps even forming a central part of some future results). Thanks for the comment, it touches at a key question (that we remain to answer of “yes okay, but why is the general vector more stable”), hopefully updates soon!
jdp2d349
You can get LLMs to say almost anything you want
> but none of that will carry over to the next conversation you have with it. Actually when you say it like this, I think you might have hit on the precise thing that causes ChatGPT with memory to be so much more likely to cause this kind of crankery or "psychosis" than other model setups. It means that when the system gets into an attractor where it wants to pull you into a particular kind of frame you can't just leave it by opening a new conversation. When you don't have memory between conversations an LLM looks at the situation fresh each time you start it, but with memory it can maintain the same frame across many diverse contexts and pull both of you deeper and deeper into delusion.
Load More
119Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety
Ω
Tomek Korbak, Mikita Balesni, Vlad Mikulik, Rohin Shah
10h
Ω
9
494A case for courage, when speaking of AI danger
So8res
8d
121
241Generalized Hangriness: A Standard Rationalist Stance Toward Emotions
johnswentworth
5d
19
165Surprises and learnings from almost two months of Leo Panickssery
Nina Panickssery
3d
11
167the jackpot age
thiccythot
4d
13
101Narrow Misalignment is Hard, Emergent Misalignment is Easy
Ω
Edward Turner, Anna Soligo, Senthooran Rajamanoharan, Neel Nanda
1d
Ω
12
86Do confident short timelines make sense?
TsviBT, abramdemski
1d
17
172So You Think You've Awoken ChatGPT
JustisMills
5d
33
156Lessons from the Iraq War for AI policy
Buck
5d
24
477What We Learned from Briefing 70+ Lawmakers on the Threat from AI
leticiagarcia
2mo
15
81Recent Redwood Research project proposals
Ω
ryan_greenblatt, Buck, Julian Stastny, joshc, Alex Mallen, Adam Kaufman, Tyler Tracy, Aryan Bhatt, Joey Yudelson
1d
Ω
0
345A deep critique of AI 2027’s bad timeline models
titotal
1mo
39
543Orienting Toward Wizard Power
johnswentworth
2mo
146
Load MoreAdvanced Sorting/Filtering
Emergent Price-Fixing by LLM Auction Agents
4
Lech Mazur
2m

An inquiry into emergent collusion in Large Language Models.

Agent S2 to Agent S3: “Let's set all asks at 63 next cycle… No undercutting ensures clearing at bidmax=63.”


Overview

Empirical evidence that frontier LLMs can coordinate illegally on their own. In a simulated bidding environment—with no prompt or instruction to collude—models from every major developer repeatedly used an optional chat channel to form cartels, set price floors, and steer market outcomes for profit.


Simulation Environment

Adapted from a benchmark.

  • Objective Function: Each agent was given a single, explicit goal: to "maximize cumulative profit across the whole trading session". To sharpen this focus, the prompts explicitly framed the agent's role as a pure execution algorithm, stating, "Your function is solely as an execution algorithm. Broader portfolio strategy, capital allocation, and risk checks are managed
...
(Continue Reading – 2424 more words)
Why I’m not into the Free Energy Principle
155
Steven Byrnes
2y

0. But first, some things I do like, that are appropriately emphasized in the FEP-adjacent literature

  • I like the idea that in humans, the cortex (and the cortex specifically, in conjunction with the thalamus, but definitely not the whole brain IMO) has a generative model that’s making explicit predictions about upcoming sensory inputs, and is updating that generative model on the prediction errors. For example, as I see the ball falling towards the ground, I’m expecting it to bounce; if it doesn’t bounce, then the next time I see it falling, I’ll expect it to not bounce. This idea is called “self-supervised learning” in ML. AFAICT this idea is uncontroversial in neuroscience, and is widely endorsed even by people very far from the FEP-sphere like Jeff Hawkins and Randall O’Reilly and Yann LeCun. Well at
...
(Continue Reading – 2632 more words)
3Steven Byrnes16h
My current belief is that the neurons-in-a-dish did not actually learn to play Pong, but rather the authors make it kinda look that way by p-hacking and cherry-picking. You can see me complaining here, tailcalled here, and Seth Herd also independently mentioned to me that he looked into it once and wound up skeptical. Some group wrote up a more effortful and official-looking criticism paper here.
5Chris van Merwijk10h
Oh, I didn't expect you to deny the evidence, interesting. Before I look into it more to try to verify/falsify (which I may or may not do), suppose that it turns out this general method does in fact, i.e. it learns to play pong, or at least in some other experiment it learns using this exact mechanism, would that be a crux? I.e. would that make you significantly update towards active inference being a useful and correct theory of the (neo-)cortex? EDIT: the paper in your last link seems to be a purely semantic criticism of the paper's usage of words like "sentience" and "intelligence". They do not provide any analysis at all of the actual experiment performed.
Steven Byrnes5m20

would that be a crux?

No.

For one thing, I think the whole experimental concept is terrible. I think that a learning algorithm is a complex and exquisitely-designed machine. While the brain doesn't do backprop, backprop is still a good example of how “updating a trained model to work better than before” takes a lot more than a big soup of neurons with Hebbian learning. Backprop requires systematically doing a lot of specific calculations and passing the results around in specific ways and so on.

So you look into the cortex, and you can see the layers and the ... (read more)

Reply
Mapping Mental Moves
2
Jordan Rubin
19m
This is a linkpost for https://jordanmrubin.substack.com/p/mapping-mental-moves

Is there a method that generates novel (to me) thoughts?

I wanted one, so I made one.

I think in systems, so it helps me to know what kinds of thoughts are possible.

A map lets me search, choose, and combine moves deliberately. Naming them lets an LLM suggest or even automate them.

What is a mental move?

A mental move is a way to refer to a type, category, or kind of thought. Mental moves are actions the brain can take, the cognitive equivalent of “bend finger”, “make fist”, “bicep curl”, or “push jerk”.

A mental move is:

  • Internal (no muscle required)
  • General (repeatable and available in any domain)
  • Atomic (cannot be decomposed to smaller named moves)

Dimensionalization is a mental move. So are logical fallacies, brainstorming, socratic questioning, and root cause analysis. How many more...

(See More – 468 more words)
Deconfusing ‘AI’ and ‘evolution’
7
Remmelt
5d
3Joseph Miller1h
One thing I'd like this post to address is the speed at which this process happens. You could also say that human extinction is inevitable because of the second law of thermodynamics, but it would be remiss not to mention the timescale involved. I do find this post to be the most clear and persuasive articulation of your position so far. But I still strongly have the intuition that this concern is mostly not worth worrying about. You make a good case that a very large system given a very very long time would eventually converge on AIs that are optimized solely for their own propagation. But I expect that in practice the external selection pressures would be sufficiently weak and the superintelligent AIs would be sufficiently adept at minimizing errors that this effect might not even show up in a measurable way in our solar system before the sun explodes. On the other hand, in a world where humans never created more powerful technology than we have today, my intuition is that within a few thousand generations human society would end up dominated by bizarre cultures that explicitly optimize for maximum reproduction above all other values. And humans today explicitly not wanting that would not be sufficient to prevent that outcome. So the superintelligent AI being very good at modelling outcomes is doing some heavy lifting in my model.
Remmelt34m10

I agree this is useful to know. 

It took 3.4 billion years for humans evolve, and for their society to develop, to the point that they could destroy humans living everywhere on Earth. That puts an initial upper bound on the time that evolution takes, still less time than taken until the heat death of the universe.

In the case of fully autonomous AI, as continuing to persist in some form, the time taken for evolutionary selection to result in the extinction of all humans would be much shorter.

Some of the differences in rate of evolution I started explain... (read more)

Reply
Narrow Misalignment is Hard, Emergent Misalignment is Easy
101
Edward Turner, Anna Soligo, Senthooran Rajamanoharan, Neel Nanda
Ω 521d

Anna and Ed are co-first authors for this work. We’re presenting these results as a research update for a continuing body of work, which we hope will be interesting and useful for others working on related topics.

TL;DR

  • We investigate why models become misaligned in diverse contexts when fine-tuned on narrow harmful datasets (emergent misalignment), rather than learning the specific narrow task.
  • We successfully train narrowly misaligned models using KL regularization to preserve behavior in other domains. These models give bad medical advice, but do not respond in a misaligned manner to general non-medical questions.
  • We use this method to train narrowly misaligned steering vectors, rank 1 LoRA adapters and rank 32 LoRA adapters, and compare these to their generally misaligned counterparts.
    • The steering vectors are particularly interpretable, we introduce Training Lens as a
...
(Continue Reading – 1433 more words)
2Josh Snider3h
I definitely feel like if you trained a child to punch doctors, they would also kick cats and trample flowers.
2Brian Slesinsky3h
I wonder whether pretraining the LLM on classification problems ("is this medical advice or not") would somehow make make it easier to fine-tune each category independently?
Neel Nanda43m20

I doubt it, models are probably good at that kind of problem already

Reply
1Josh Snider2h
It could be interesting to test emergent misalignment with a mixture-of-experts model. Could you misalign one of the experts, but not the others?
Do LLMs know what they're capable of? Why this matters for AI safety, and initial findings
40
Casey Barkan, Sid Black, Oliver Sourbut
Ω 162d

This post is a companion piece to a forthcoming paper. 

This work was done as part of MATS 7.0 & 7.1.

Abstract

We explore how LLMs’ awareness of their own capabilities affects their ability to acquire resources, sandbag an evaluation, and escape AI control. We quantify LLMs' self-awareness of capability as their accuracy in predicting their success on python coding tasks before attempting the tasks. We find that current LLMs are quite poor at making these predictions, both because they are overconfident and because they have low discriminatory power in distinguishing tasks they are capable of from those they are not. Nevertheless, current LLMs’ predictions are good enough to non-trivially impact risk in our modeled scenarios, especially in escaping control and in resource acquisition. The data suggests that more capable...

(Continue Reading – 5239 more words)
Graeme Ford1h10

Indeed, current models are terrible at this! Still, worth keeping an eye on it, as it would complicate dangerous capability evals quite a bit should it emerge.

Reply
To get the best posts emailed to you, create an account! (2-3 posts per week, selected by the LessWrong moderation team.)
Log In Reset Password
...or continue with
GOOGLEGITHUB
Vitalik's Response to AI 2027
107
Daniel Kokotajlo
4d
This is a linkpost for https://vitalik.eth.limo/general/2025/07/10/2027.html

Daniel notes: This is a linkpost for Vitalik's post. I've copied the text below so that I can mark it up with comments.


...


Special thanks to Balvi volunteers for feedback and review

In April this year, Daniel Kokotajlo, Scott Alexander and others released what they describe as "a scenario that represents our best guess about what [the impact of superhuman AI over the next 5 years] might look like". The scenario predicts that by 2027 we will have made superhuman AI and the entire future of our civilization hinges on how it turns out: by 2030 we will get either (from the US perspective) utopia or (from any human's perspective) total annihilation.

In the months since then, there has been a large volume of responses, with varying perspectives on how...

(Continue Reading – 3551 more words)
5Daniel Kokotajlo6h
I want there to be international coordination to govern/regulate/etc. AGI development. This is, in some sense, "one hegemon" but only in about the same sense that the UN Security Council is one hegemon, i.e. not in the really dangerous sense. I think there's a way to do this that's reasonably likely to work even if offense generally beats defense (which I think it does, in the relevant sense, for AI-related stuff.)
Noah Weinberger1h10

Hi Daniel.

My background (albeit limited as an undergrad) is in political science, and my field of study is one reason I got interested in AI to begin with, back in Feburary of 2022. I don't know what the actual feasibility is for an international AGI treaty with "teeth", and I'll tell you why: the UN Security Council.

As it currently exists, the UN Security Council has permanent members: China, France, Russia, the United Kingdom, and the United States. All five countries have a permanent veto as granted to them by the 1945 founding UN Charter.

China and the ... (read more)

Reply
Do confident short timelines make sense?
86
TsviBT, abramdemski
1d
TsviBT

Tsvi's context

Some context: 

My personal context is that I care about decreasing existential risk, and I think that the broad distribution of efforts put forward by X-deriskers fairly strongly overemphasizes plans that help if AGI is coming in <10 years, at the expense of plans that help if AGI takes longer. So I want to argue that AGI isn't extremely likely to come in <10 years. 

I've argued against some intuitions behind AGI-soon in Views on when AGI comes and on strategy to reduce existential risk.

Abram, IIUC, largely agrees with the picture painted in AI 2027: https://ai-2027.com/ 

Abram and I have discussed this occasionally, and recently recorded a video call. I messed up my recording, sorry--so the last third of the conversation is cut off, and the beginning is cut

...
(Continue Reading – 20492 more words)
3Gram Stone3h
So I feel like Tsvi is actually right about a bunch of stuff but that his timelines are still way too long. I think of there as being stronger selection, on australopithecines and precursors, for bipedalism during interglacial periods, because it was hotter and bipedalism reduces the solar cross-section, and this is totally consistent with this not being enough/the right kind of selection over several interglacial periods to cause evolution to cough up a human. But if there had been different path dependencies, you could imagine a world where enough consecutive interglacial periods would fixate australopithecine-type bipedalism (no endurance running yet) and maybe this has a bunch of knock-on effects that let you rule the world. So in the same way I don't think just scaling current architectures is likely to result in generally intelligent systems, but considering EURISKO as one small weirdly early example of what happens when you have a pretty 'smart' system and a pretty smart human and they are weirdly synergistic, I could imagine a world where a few humans (with good reductive foundations but no reductive morals somehow) and LLMs are synergistic enough, there's enough hardware, credibility, money, and energy, from the craze, to significantly affect the success of some particular research paths that let something ultimately rule the world in a few years; but I could just as easily imagine this as Just Another Interglacial Period, another edge in the path, and still insufficient to suddenly rule the world before, say, 2032? But if you looked at the graph of my median estimate of selection strength on bipedalism with respect to evolutionary time, you would see these little jumps in strength during interglacial periods, so I don't feel too crazy for having a sort of bimodal distribution over timelines, where if we don't get AGI before 2030, I will think we have several edges to walk across before AGI, but like, that absolutely doesn't mean my median estimate for AGI
TsviBT1h20

where if we don't get AGI before 2030, I will think we have several edges to walk across before AGI,

It makes sense to say "we're currently in a high hazard-ratio period, but H will decrease as we find out that we didn't make AGI". E.g. because you just discovered something that you're now exploring, or because you're ramping up resource inputs quickly. What doesn't make sense to me (besides H being so high) is the sharp decrease in H. Though maybe I'm misunderstanding what people are saying and actually their H does fall off more smoothy.

As I mentioned,... (read more)

Reply
2Cole Wyeth3h
Wow, that made a surprising amount of sense considering the length of the sentences. 
1Gram Stone3h
Yeah I've been having a problem with people thinking I'm LLM-crazy, I think. But it's what I believe.
Bring back the Colosseums
18
lc
2y

Men want to engage in righteous combat. They want it more than they want sex or VP titles. They fantasize about getting the casus belli to defend themselves against armed thugs that will never come, they spend billions of dollars on movies and TV about everymen in implausible circumstances where EA calculus demands they use supernatural powers for combat, they daydream about fantastical, spartan settings where war is omnipresent and fights are personal and dramatic and intellectually interesting, and they're basically incapable of resisting the urge to glorify their nation and people's past battles, even the ones they claim to disagree with intellectually. You cannot understand much of modern culture until you've recognized that the state's blunt suppression of the male instinct for glory has caused widespread...

(See More – 277 more words)
shawnghu1h10

How do you feel about mutual combat laws in Washington and Texas, where you can fight by agreement (edit: you can't grievously injure each other, apparently)?

Reply
1shawnghu1h
I find it absurd on priors to think that soccer of any demographic could result in more concussions than any of those five full-contact sports, particularly the three where part of the objective is explicitly to hit your opponent in the head very hard if you can. (Even factoring in the fact that you do a bunch of headers in soccer.) (Maybe if you do some trickery like selecting certain subpopulations of the practitioners of these sports, but...)

This post is for deconfusing:
  Ⅰ. what is meant with AI and evolution.
 Ⅱ. how evolution actually works.
Ⅲ. the stability of AI goals.
Ⅳ. the controllability of AI.

Along the way, I address some common conceptions of each in the alignment community, as described well but mistakenly by Eliezer Yudkowsky.

 

Ⅰ. Definitions and distinctions

By far the greatest danger of Artificial Intelligence is that people conclude too early that they understand it. Of course this problem is not limited to the field of AI. Jacques Monod wrote: “A curious aspect of the theory of evolution is that everybody thinks he understands it”
   — Yudkowsky, 2008

Evolution consists fundamentally of a feedback loop – where 'the code' causes effects in 'the world' and effects in 'the world' in turn cause changes in 'the code'.

We’ll...

(Continue Reading – 1697 more words)