johnswentworth
I think a very common problem in alignment research today is that people focus almost exclusively on a specific story about strategic deception/scheming, and that story is a very narrow slice of the AI extinction probability mass. At some point I should probably write a proper post on this, but for now here are a few off-the-cuff example AI extinction stories which don't look like the prototypical scheming story. (These are copied from a Facebook thread.)

* Perhaps the path to superintelligence looks like applying lots of search/optimization over shallow heuristics. Then we potentially die to things which aren't smart enough to be intentionally deceptive, but nonetheless have been selected-upon to have a lot of deceptive behaviors (via e.g. lots of RL on human feedback).
* The "Getting What We Measure" scenario from Paul's old "What Failure Looks Like" post.
* The "fusion power generator scenario".
* Perhaps someone trains a STEM-AGI, which can't think about humans much at all. In the course of its work, that AGI reasons that an oxygen-rich atmosphere is very inconvenient for manufacturing, and aims to get rid of it. It doesn't think about humans at all, but the human operators can't understand most of the AI's plans anyway, so the plan goes through. As an added bonus, nobody can figure out why the atmosphere is losing oxygen until it's far too late, because the world is complicated and becomes more so with a bunch of AIs running around, and no one AI has a big-picture understanding of anything either (much like today's humans have no big-picture understanding of the whole human economy/society).
* People try to do the whole "outsource alignment research to early AGI" thing, but the human overseers are themselves sufficiently incompetent at alignment of superintelligences that the early AGI produces a plan which looks great to the overseers (as it was trained to do), and that plan totally fails to align more-powerful next-gen AGI at all. And at that point, they're...
Raemon
My Current Metacognitive Engine

Someday I might work this into a nicer top-level post, but for now, here's the summary of the cognitive habits I try to maintain (and reasonably succeed at maintaining). Some of these are simple TAPs, some of them are more like mindsets.

* Twice a day, asking "what is the most important thing I could be working on and why aren't I on track to deal with it?"
  * You probably want a more specific question ("important thing" is too vague). Three example specific questions (but don't be a slave to any specific operationalization):
    * What is the most important uncertainty I could be reducing, and how can I reduce it fastest?
    * What's the most important resource bottleneck I can gain, or contribute to the ecosystem, and what would gain me that resource the fastest?
    * What's the most important goal I'm backchaining from?
* Have a mechanism to iterate on your habits that you use every day, and frequently update in response to new information.
  * For me, this is daily prompts and weekly prompts, which are:
    * optimized for being the efficient metacognition I obviously want to do each day
    * include one skill that I want to level up in, that I can do in the morning as part of the meta-orienting (such as operationalizing predictions, or "think it faster", or whatever specific thing I want to learn to attend to or execute better right now)
* The five requirements each fortnight:
  * be backchaining
    * from the most important goals
  * be forward chaining
    * through tractable things that compound
  * ship something
    * to users every fortnight
  * be wholesome
    * (that is, do not minmax in a way that will predictably fail later)
  * spend 10% on meta (more if you're Ray in particular, but not during working hours. During working hours on workdays, meta should pay for itself within a week)
* Correlates:
  * have a clear, written model of what you're backchaining from
  * have a clear, written model of...
"In an argument between a specialist and a generalist, the expert usually wins by simply (1) using unintelligible jargon, and (2) citing their specialist results, which are often completely irrelevant to the discussion. The expert is, therefore, a potent factor to be reckoned with in our society. Since experts both are necessary and also at times do great harm in blocking significant progress, they need to be examined closely. All too often the expert misunderstands the problem at hand, but the generalist cannot carry though their side to completion. The person who thinks they understand the problem and does not is usually more of a curse (blockage) than the person who knows they do not understand the problem.’ —Richard W. Hamming, “The Art of Doing Science and Engineering” *** (Side note:  I think there's at least a 10% chance that a randomly selected LessWrong user thinks it was worth their time to read at least some of the chapters in this book. Significantly more users would agree that it was a good use of their time (in expectation) to skim the contents and introduction before deciding if they're in that 10%.  That is to say, I recommend this book.)
Humanity has only ever eradicated two diseases (and one of those, rinderpest, affects only cattle, not humans). The next disease on the list is probably Guinea worm (though polio is also tantalizingly close). At its peak, Guinea worm infected ~900k people a year. In 2024 we so far only know of 7 cases.

The disease isn't deadly, but it causes significant pain for 1-3 weeks (as a worm burrows out of your skin!) and in ~30% of cases that pain persists afterwards for about a year. In 0.5% of cases the worm burrows through important ligaments and leaves you permanently disabled. Eradication efforts have already saved about 2 million DALYs.[1]

I don't think this outcome was overdetermined; there's no recent medical breakthrough behind this progress. It just took a herculean act of international coordination and logistics. It took distributing millions of water filters, establishing village-based surveillance systems in thousands of villages across multiple countries, and meticulously tracking every single case of Guinea worm in humans or livestock around the world. It took brokering a six-month ceasefire in Sudan (the longest humanitarian ceasefire in history!) to allow healthcare workers to access the region.

I've only skimmed the history, and I'm generally skeptical of historical heroes getting all the credit, but I tentatively think it took Jimmy Carter for all of this to happen. Rest in peace, Jimmy Carter.

1. ^ I'm compelled to caveat that top GiveWell charities are probably in the ballpark of $50/DALY, and the Carter Center has an annual budget of ~$150 million a year, so they "should" be able to buy 2 million DALYs every single year by donating to more cost-effective charities. But c'mon, this worm is super squicky and nearly eradicating it is an amazing act of agency.
ektimo
Prompt: write a micro play that is both disturbing and comforting

--

Title: "The Silly Child"

Scene: A mother is putting to bed her six-year-old child.

CHILD: Mommy, how many universes are there?

MOTHER: As many as are possible.

CHILD (smiling): Can we make another one?

MOTHER (smiling): Sure. And while we're at it, let's delete the number 374? I've never liked that one.

CHILD (excited): Oh! And let's make a new Fischer-Griess group element too! Can we do that Mommy?

MOTHER (bops nose): That's enough stalling. You need to get your sleep. Sweet dreams, little one. (kisses forehead)

End

Recent Discussion

Traditional economics thinking has two strong principles, each based on abundant historical data:

  • Principle (A): No “lump of labor”: If human population goes up, there might be some wage drop in the very short term, because the demand curve for labor slopes down. But in the longer term, people will find new productive things to do, such that human labor will retain high value. Indeed, if anything, the value of labor will go up, not down—for example, dense cities are engines of economic growth!
  • Principle (B): “Experience curves”: If the demand for some product goes up, there might be some price increase in the very short term, because the supply curve slopes up. But in the longer term, people will ramp up manufacturing of that product to catch up
...
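For reference, the usual quantitative form of an experience curve (Wright's law); this is standard textbook background, not something asserted in the truncated post above. Unit cost $C$ as a function of cumulative production $Q$ is roughly

$$C(Q) = C(Q_0)\left(\frac{Q}{Q_0}\right)^{-b},$$

so each doubling of cumulative production multiplies unit cost by $2^{-b}$, a ratio empirically around 0.7-0.9 for many manufactured goods.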
wachichornia
I think he's talking about cost disease? https://en.m.wikipedia.org/wiki/Baumol_effect
Steven Byrnes
The point I’m trying to make here is a really obvious one. Like, suppose that Bob is a really great, top-percentile employee. But suppose that Bob’s roommate Alice is an obviously better employee than Bob along every possible axis. Clearly, Bob will still be able to get a well-paying job—the existence of Alice doesn’t prevent that, because the local economy can use more than one employee.
cousin_it
Sure. But in an economy with AIs, humans won't be like Bob. They'll be more like Carl the bottom-percentile employee who struggles to get any job at all. Even in today's economy lots of such people exist, so any theoretical argument saying it can't happen has got to be wrong. And if the argument is quantitative - say, that the unemployment rate won't get too high - then imagine an economy with 100x more AIs than people, where unemployment is only 1% but all people are unemployed. There's no economic principle saying that can't happen.
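Spelling out the arithmetic in that last hypothetical (the numbers are cousin_it's; the calculation is just bookkeeping): with $N$ humans and $100N$ AIs, the workforce is $101N$, so a 1% unemployment rate corresponds to $0.01 \times 101N = 1.01N$ unemployed workers, which exceeds the entire human population of $N$. The aggregate statistic therefore leaves room for literally every human to be out of work.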

The context was: Principle (A) makes a prediction (“…human labor will retain a well-paying niche…”), and Principle (B) makes a contradictory prediction (“…human labor…will become so devalued that we won’t be able to earn enough money to afford to eat…”).

Obviously, at least one of those predictions is wrong. That’s what I said in the post.

So, which one is wrong? I wrote: “I have opinions, but that’s out-of-scope for this little post.” But since you’re asking, I actually agree with you! E.g. footnote here:

“But what about comparative advantage?” you say. Well

... (read more)

My median expectation is that AGI[1] will be created 3 years from now. This has implications for how to behave, and I will share some useful thoughts I and others have had on how to orient to short timelines.

I’ve led multiple small workshops on orienting to short AGI timelines and compiled the wisdom of around 50 participants (but mostly my thoughts) here. I’ve also participated in multiple short-timelines AGI wargames and co-led one wargame.

This post will assume median AGI timelines of 2027 and will not spend time arguing for this point. Instead, I focus on what the implications of 3-year timelines would be.

I didn’t update much on o3 (as my timelines were already short) but I imagine some readers did and might feel disoriented now. I hope...

Nuclear warnings have been overused a little by some actors in the past, such that there's a credible risk of someone calling the bluff and continuing research in secrecy, knowing that they will certainly get another warning first, and not immediately a nuclear response.

If you have intelligence that indicates secret ASI research but the other party denies it, at what point do you fire the nukes?
I expect they would be fired too late, after many months of final warnings.

Noosphere89
While I agree with this directionally, I'd warn people that you'd need both high confidence and very good timing to make use of this well rather than totally crashing. It also requires reasonably good mental models of what you can safely assume AI can do, and what AI won't be able to do, over different timescales, so this advice really only has use for people already thinking deeply about AI.

In this post we’ll be looking at the recent paper “Learning to grok: Emergence of in-context learning and skill composition in modular arithmetic tasks” by He et al. This post is partially a sequel to my earlier post on grammars and subgrammars, though it can be read independently. There will be a more technical part II.

I really like this paper. I tend to be pretty picky about papers, and find something to complain about in most of them (this will probably come up in future). I don’t have nitpicks about this paper. Every question that came up as I was reading and understanding this paper (other than questions that would require a significantly different or larger experiment, or a different slant of analysis) turned out to be answered in...

Doomimir: Humanity has made no progress on the alignment problem. Not only do we have no clue how to align a powerful optimizer to our "true" values, we don't even know how to make AI "corrigible"—willing to let us correct it. Meanwhile, capabilities continue to advance by leaps and bounds. All is lost.

Simplicia: Why, Doomimir Doomovitch, you're such a sourpuss! It should be clear by now that advances in "alignment"—getting machines to behave in accordance with human values and intent—aren't cleanly separable from the "capabilities" advances you decry. Indeed, here's an example of GPT-4 being corrigible to me just now in the OpenAI Playground:

Doomimir: Simplicia Optimistovna, you cannot be serious!

Simplicia: Why not?

Doomimir: The alignment problem was never about superintelligence failing to understand human values. The genie knows,...

khafra

And yet it behaves remarkably sensibly. Train a one-layer transformer on 80% of possible addition-mod-59 problems, and it learns one of two modular addition algorithms, which perform correctly on the remaining validation set. It's not a priori obvious that it would work that way! There are other possible functions on  compatible with the training data.

Seems like Simplicia is missing the worrisome part--it's not that the AI will learn a more complex algorithm which is still compatible with the training data; it's that the simple... (read more)
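For readers who want to poke at the quoted experiment, here is a minimal sketch of the setup, under stated assumptions: it enumerates every a+b (mod 59) problem, trains on a random 80%, and reports accuracy on the held-out 20%. A small MLP stands in for the one-layer transformer in the quoted experiment, so treat it as an illustration of the train/validation split rather than a reproduction of the result.

```python
# Minimal sketch: train on 80% of all addition-mod-59 problems, validate on the rest.
# Assumption: a small MLP stands in for the one-layer transformer described above.
import itertools
import random

import torch
import torch.nn as nn
import torch.nn.functional as F

P = 59  # modulus
pairs = list(itertools.product(range(P), repeat=2))  # all 59*59 possible problems
random.seed(0)
random.shuffle(pairs)
split = int(0.8 * len(pairs))  # 80% of possible problems go into the training set
train_pairs, val_pairs = pairs[:split], pairs[split:]

def encode(batch):
    """One-hot encode (a, b) pairs; the label is (a + b) mod P."""
    x = torch.zeros(len(batch), 2 * P)
    y = torch.zeros(len(batch), dtype=torch.long)
    for i, (a, b) in enumerate(batch):
        x[i, a] = 1.0
        x[i, P + b] = 1.0
        y[i] = (a + b) % P
    return x, y

x_train, y_train = encode(train_pairs)
x_val, y_val = encode(val_pairs)

model = nn.Sequential(nn.Linear(2 * P, 256), nn.ReLU(), nn.Linear(256, P))
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)

for step in range(5001):
    opt.zero_grad()
    loss = F.cross_entropy(model(x_train), y_train)
    loss.backward()
    opt.step()
    if step % 500 == 0:
        with torch.no_grad():
            val_acc = (model(x_val).argmax(-1) == y_val).float().mean().item()
        print(f"step {step:5d}  train loss {loss.item():.3f}  val acc {val_acc:.2f}")
```

Whether the validation accuracy climbs toward 1.0 (i.e., whether the model generalizes to the held-out fifth of the problems rather than memorizing) is exactly the question the quoted passage is about.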

Zack_M_Davis
(Self-review.) I'm as proud of this post as I am disappointed that it was necessary. As I explained to my prereaders on 19 October 2023: I think the dialogue format works particularly well in cases like this where the author or the audience is supposed to find both viewpoints broadly credible, rather than an author avatar beating up on a strawman. (I did have some fun with Doomimir's characterization, but that shouldn't affect the arguments.) This is a complicated topic. To the extent that I was having my own doubts about the "orthodox" pessimist story in the GPT-4 era, it was liberating to be able to explore those doubts in public by putting them in the mouth of a character with the designated idiot character name without staking my reputation on Simplicia's counterarguments necessarily being correct. Giving both characters pejorative names makes it fair. In an earlier draft, Doomimir was "Doomer", but I was already using the "Optimistovna" and "Doomovitch" patronymics (I had been consuming fiction about the Soviet Union recently) and decided it should sound more Slavic. (Plus, "-mir" (мир) can mean "world".)

Introduction

In this short post we'll discuss fine-grained variants of the law of large numbers beyond the central limit theorem. In particular we'll introduce cumulants as a crucial (and very nice) invariant of probability distributions to track. We'll also briefly discuss parallels with physics. This post should be interesting on its own, but the reason I'm writing it is that this story contains a central idea for (one point of view on) a certain exciting physics-inspired approach to neural nets. While that point of view has so far been explained in somewhat sophisticated physics language (involving quantum fields and Feynman diagrams), the main points can be explained without any physics background, purely in terms of statistics. Introducing this "more elementary" view on the subject is one...
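For concreteness, the invariant the post introduces, the cumulants $\kappa_n$ of a random variable $X$, has a compact standard definition (textbook background, not taken from the post):

$$K_X(t) = \log \mathbb{E}\!\left[e^{tX}\right] = \sum_{n \ge 1} \kappa_n \frac{t^n}{n!}, \qquad \kappa_1 = \mathbb{E}[X], \quad \kappa_2 = \operatorname{Var}(X), \quad \kappa_3 = \mathbb{E}\!\left[(X - \mathbb{E}X)^3\right].$$

For independent $X$ and $Y$, $\kappa_n(X+Y) = \kappa_n(X) + \kappa_n(Y)$, which is what makes cumulants a natural bookkeeping device for refinements of the law of large numbers and the central limit theorem.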

Q: How can I use LaTeX in these comments? I tried to follow https://www.lesswrong.com/tag/guide-to-the-lesswrong-editor#LaTeX but it does not seem to render.

Here is the simplest case I know, which is a sum of dependent identically distributed variables. In physical terms, it is about the magnetisation of the 1d Curie-Weiss (=mean-field Ising) model. I follow the notation of the paper https://arxiv.org/abs/1409.2849 for ease of reference, this is roughly Theorem 8 + Theorem 10:

Let $M_n=\sum_{i=1}^n \sigma(i)$ be the sum of n dependent Bernoulli rando... (read more)

Epistemic status -- sharing rough notes on an important topic because I don't think I'll have a chance to clean them up soon.

Summary

Suppose a human used AI to take over the world. Would this be worse than AI taking over? I think plausibly:

  • In expectation, human-level AI will better live up to human moral standards than a randomly selected human. Because:
    • Humans fall far short of our moral standards.
    • Current models are much more nice, patient, honest and selfless than humans.
      • Though human-level AI will have much more agentic training for economic output, and a smaller fraction of HHH training, which could make them less nice.
    • Humans are "rewarded" for immoral behaviour more than AIs will be
      • Humans evolved under conditions where selfishness and cruelty often paid high dividends, so evolution
...
Karl von Wendt
Maybe the analogies I chose are misleading. What I wanted to point out was that a) what Claude does is act according to the prompt and its training, not follow any intrinsic values (hence "narcissistic"), and b) that we don't understand what is really going on inside the AI that simulates the character called Claude (hence the "alien" analogy). I don't think that the current Claude would act badly if it "thought" it controlled the world - it would probably still play the role of the nice character that is defined in the prompt, although I can imagine some failure modes here. But the AI behind Claude is absolutely able to simulate bad characters as well. If an AI like Claude actually rules the world (and not just "thinks" it does), we are talking about a very different AI with much greater reasoning powers and very likely a much more "alien" mind. We simply cannot predict what this advanced AI will do just from the behavior of the character the current version plays in reaction to the prompt we gave it.

I don't think that the current Claude would act badly if it "thought" it controlled the world - it would probably still play the role of the nice character that is defined in the prompt

If someone plays a particular role in every relevant circumstance, then I think it's OK to say that they have simply become the role they play. That is simply their identity; it's not merely a role if they never take off the mask. The alternative view here doesn't seem to have any empirical consequences: what would it mean to be separate from a role that one reliably plays i... (read more)

Tom Davidson
Why are they more recoverable? Seems like a human who seized power would seek ASI advice on how to cement their power.
evhub
I think it affects both, since alignment difficulty determines both the probability that the AI will have values that cause it to take over, as well as the expected badness of those values conditional on it taking over.

We have contact details and can send emails to 1500 students and former students who've received hard-cover copies of HPMOR (and possibly Human Compatible and/or The Precipice) because they've won international or Russian olympiads in maths, computer science, physics, biology, or chemistry.

This includes over 60 IMO and IOI medalists.

This is a pool of potentially extremely talented people, many of whom have read HPMOR.

I don't have the time to do anything with them, and people in the Russian-speaking EA community are all busy with other things.

The only thing that ever happened was an email sent to some kids still in high school about the Atlas Fellowship, and a couple of them became fellows.

I think it could be very valuable to alignment-pill these people. I think for most researchers...

This is probably less efficient than other uses, and it's in the direction of spamming people with these books. If they're everywhere, I might be less interested if someone offers to give them to me because I won a math competition.

Mikhail Samin
It would be cool if someone organized that sort of thing (probably sending books to the cash prize winners, too). For people who’ve reached the finals of the national olympiad in cybersecurity, but didn’t win, a volunteer has made a small CTF puzzle and sent the books to students who were able to solve it.
Mikhail Samin
I’m not aware of one.
Mikhail Samin
Some of these schools should have the book in their libraries. There are also risks with some of them, as the current leadership installed by the gov might get triggered if they open and read the books (even though they probably won't). It's also better to give the books directly to students, because then we get to have their contact details.

I'm not sure how many of the kids studying there know the book exists, but the percentage should be fairly high at this point. Do you think the books being in local libraries increases how open people are to the ideas? My intuition is that the quotes on гпмрм.рф/olymp should do a lot more in that direction. Do you have a sense that it wouldn't be perceived as an average fantasy-with-science book?

We're currently giving out the books to participants of the summer conference of the maths cities tournament — do you think it might be valuable to add cities tournament winners to the list? Are there many people who would qualify, but didn't otherwise win a prize in the national math olympiad?

This post is to record the state of my thinking at the start of 2025. I plan to update these reflections in 6-12 months depending on how much changes in the field of AI.

1. Summary

It is best not to pause AI progress until at least one major AI lab achieves a system capable of providing approximately a 10x productivity boost for AI research, including performing almost all tasks of an AI researcher. Extending the time we remain in such a state is critical for ensuring positive outcomes.

If it were possible to stop AI progress sometime before that and focus just on mind uploading, that would be preferable; however, I don't think that is feasible in the current world. Alignment work before such a state suffers from diminishing...

I think the problem with WBE is that anyone who owns a computer and can decently hide it (or fly off in a spaceship with it) becomes able to own slaves, torture them and whatnot. So after that technology appears, we need some very strong oversight - it becomes almost mandatory to have a friendly AI watching over everything.

This is a low-effort post. I mostly want to get other people’s takes and express concern about the lack of detailed and publicly available plans so far. This post reflects my personal opinion and not necessarily that of other members of Apollo Research. I’d like to thank Ryan Greenblatt, Bronson Schoen, Josh Clymer, Buck Shlegeris, Dan Braun, Mikita Balesni, Jérémy Scheurer, and Cody Rushing for comments and discussion.

I think short timelines, e.g. AIs that can replace a top researcher at an AGI lab without losses in capabilities by 2027, are plausible. Some people have posted ideas on what a reasonable plan to reduce AI risk for such timelines might look like (e.g. Sam Bowman’s checklist, or Holden Karnofsky’s list in his 2022 nearcast), but I find them insufficient for...

Liface

If I had more time, I would have written a shorter post ;)