In their hope that it’s not too late for a course correction around AI, Nate and Eliezer have written a book making the detailed case for this unfortunate reality. The book is available in September; you can preorder it now, or read endorsements, quotes, and reviews from scientists, national security officials, and more.

evhub
Why red-team models in unrealistic environments?

Following on our Agentic Misalignment work, I think it's worth spelling out a bit more why we do work like this, especially given complaints like the ones here about the unrealism of our setup. Some points:

1. Certainly I agree that our settings are unrealistic in many ways. That's why we hammer the point repeatedly that our scenarios are not necessarily reflective of whether current models would actually do these behaviors in real situations. At the very least, our scenarios involve an unnatural confluence of events and a lot of unrealism in trying to limit Claude's possible options, to simulate a situation where the model has no middle-ground/compromise actions available to it. But that's not an excuse—we still don't want Claude to blackmail/leak/spy/etc. even in such a situation!

2. The point of this particular work is red-teaming/stress-testing: aggressively searching for situations in which models behave in egregiously misaligned ways despite their HHH safety training. We do lots of different work for different reasons, some of which is trying to demonstrate something about a model generally (e.g. Claude 3 Opus has a tendency to fake alignment for certain HHH goals across many different similar situations), some of which is trying to demonstrate things about particular training processes, some of which is trying to demonstrate things about particular auditing techniques, etc. In the case of Agentic Misalignment, the goal is just to show an existence proof: that there exist situations where models are not explicitly instructed to be misaligned (or explicitly given a goal that would imply doing misaligned things, e.g. explicitly instructed to pursue a goal at all costs) and yet will still do very egregiously misaligned things like blackmail (note that though we include a setting where the models are instructed to follow the goal of serving American interests, we show that you can ablate that away and still get
leogao
some lessons from ml research:

* any shocking or surprising result in your own experiment is 80% likely to be a bug until proven otherwise. your first thought should always be to comb for bugs.
* only after you have ruled out bugs do you get to actually think about how to fit your theory to the data, and even then, there might still be a hidden bug.
* most papers are terrible and don't replicate.
* most techniques that sound intuitively plausible don't work.
* most techniques only look good if you don't pick a strong enough baseline.
* an actually good idea can take many tries before it works.
* once you have good research intuitions, the most productive state to be in is to literally not think about what will go into the paper and just do experiments that satisfy your curiosity and convince yourself that the thing is true. once you have that, running the final sweeps is really easy
* most people have no intuition whatsoever about their hardware and so will write code that is horribly inefficient. even learning a little bit about hardware fundamentals so you don't do anything obviously dumb is super valuable (see the toy sketch below)
* in a long and complex enough project, you will almost certainly have a bug that invalidates weeks (or months) of work. being really careful and testing helps but slows down velocity a lot. unclear what the right equilibrium is.
* feedback loop time is incredibly important, if you can get rapid feedback, you will make so much more progress.
* implementing something that is already known to work is always vastly easier than inventing/researching something new.
* you will inevitably spend a lot of time doing things that have no impact on the final published work whatsoever. like not even contributing that much useful intuition. this is unfortunate but unavoidable
* oftentimes you will spend a lot of time being fundamentally philosophically confused about what to do, and only really figure out halfway through the project. this is normal.
* directio
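The hardware bullet above is easy to make concrete. Below is a toy illustration (my own sketch, not from the original shortform) of the kind of "obviously dumb" inefficiency it warns about: the same sum of squares computed element-by-element in Python versus with one vectorized call.

```python
# Toy illustration (not from the original shortform): the same sum of squares
# computed with a per-element Python loop vs. one vectorized call. The loop pays
# interpreter overhead on every element and ignores what the hardware is good at.
import time
import numpy as np

x = np.random.randn(10_000_000)

t0 = time.perf_counter()
slow = 0.0
for v in x:                  # "obviously dumb": ten million interpreter round-trips
    slow += v * v
t1 = time.perf_counter()

fast = float(np.dot(x, x))   # one call into optimized, vectorized native code
t2 = time.perf_counter()

print(f"loop: {t1 - t0:.2f}s  vectorized: {t2 - t1:.4f}s  match: {np.isclose(slow, fast)}")
```

On a typical machine the vectorized version is orders of magnitude faster, purely because it stays inside optimized native code instead of bouncing through the interpreter for every element.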
Eli Tyre
In Spring of 2024, Jacob Lagerros and I took an impromptu trip to Taiwan to glean what we could about the chip supply chain. Around the same time, I read Chip War and some other sources about the semiconductor industry. I planned to write a blog post outlining what I learned, but I got pseudo-depressed after coming back from Taiwan, and never finished or published it. This post is a lightly edited version of the draft that has been sitting in my documents folder. (I had originally intended to include a lot more than this, but I might as well publish what I have.) Interestingly, reading it now, all of this feels so basic that I’m surprised I considered a lot of it worth including in a post like this, but I think it was all new to me at the time.

* There are important differences between logic chips and memory chips, such that at various times, companies have specialized in one or the other.
* TSMC was founded by Morris Chang, with the backing of the Taiwanese government. But the original impetus came from Taiwan, not from Chang. The government decided that it wanted to become a leading semiconductor manufacturer, and approached Chang (who had been an engineer and executive at Texas Instruments) about leading the venture.
* However, TSMC’s core business model, being a designerless fab that would manufacture chips for customers but not design chips of its own, was Chang’s idea. He had floated it to Texas Instruments while he worked there, and was turned down. This idea was bold and innovative at the time—there had never been a major fab that didn’t design its own chips.
* There had been precursors on the customer side: small computer firms that would design chips and then buy some of the spare capacity of Intel or Texas Instruments to manufacture them. This was always a precarious situation for those companies, because they depended on companies who were both their competitors and their crucial suppliers. Chang bet that there would be more compa
habryka
Ok, many of y'all can have feelings about whether it's a good idea to promote Nate's and Eliezer's book on the LW frontpage the way we are doing it, but can I get some acknowledgement that the design looks really dope?

Look at those nice semi-transparent post-items. Look at that nice sunset gradient that slowly fades to black. Look at the stars fading out in an animation that is subtle enough that you can (hopefully) ignore it as you scroll down and parse the frontpage, but still lends an airy, ominous beauty to life snuffing out across the universe.

Well, I am proud of it :P[1]

I hope I can do more cool things with the frontpage in the future. I've long been wanting to do things with the LW frontpage that create a more aesthetic experience capturing the essence of some important essay or piece of content I want to draw attention to, and I feel like this one worked quite well.

I'll probably experiment more with similar things in the future (though I will generally avoid changing the color scheme this drastically unless there is some good reason, and make sure people can easily turn it off from the start).

1. ^ (Also credit to Ray, who did the initial pass of porting over the design from ifanyonebuildsit.com)
habryka
Can a reasonable Wikipedia editor take a stab at editing the "Rationalist Community" Wikipedia page into something normal? It appears to be edited by the usual RationalWiki crowd, who have previously been banned from editing Wikipedia articles in this space due to insane levels of bias. I don't want to edit it myself because of COI, but I am sure there are many people out there who can do a reasonable job.

The page currently says inane things like:

or:

Or completely inane things like:

It's obviously not an article that's up to Wikipedia's standards. If you want some context on the history of the editors in the space: https://www.tracingwoodgrains.com/p/reliable-sources-how-wikipedia-admin

The LessWrong article used to be similarly horrendous, but was eventually transformed into something kind of reasonable (though still not great). Looking through the archived talk pages for that should give a good sense of what kind of policies apply, as well as a bunch of good sources.

Popular Comments

This feels kind of backwards, in the sense that I think something like 2032-2037 is probably the period that most people I know who have reasonably short timelines consider most likely. AI 2027 is a particularly aggressive timeline compared to the median, so if you choose 2028 as some kind of Schelling time to decide whether things are markedly slower than expected, then I think you are deciding on a strategy that doesn't make sense according to like 80% of the registered predictions that people have. Even the AI Futures team themselves have timelines that put more probability mass on 2029 than 2027, IIRC.

Of course, I agree that in some worlds AI progress has substantially slowed down, and we have received evidence that things will take longer, but "are we alive and are things still OK in 2028?" is a terrible way to operationalize that. Most people do not expect anything particularly terrible to have happened by 2028!

My best guess, though I am far from confident, is that things will mostly get continuously more crunch-like from here, as things continue to accelerate. The key decision point in my model at which things might become a bit different is if we hit the end of the compute overhang, and you can't scale up AI further simply by more financial investment, but instead now need to substantially ramp up global compute production and make algorithmic progress, which might markedly slow down progress.

I agree with a bunch of other things you say about it being really important to have some faith in humanity, and to be capable of seeing what a good future looks like even if it's hard, and that this is worth spending a lot of effort and attention on. But just the "I propose 2028 as the time to re-evaluate things, and I think we really want to change things if stuff still looks fine" framing feels to me like it fails to engage with people's actually registered predictions.
I respect the courage in posting this on LessWrong and writing your thoughts out for all to hear and evaluate and judge you for. It is why I've decided to go out on a limb and even comment.

> take steroids

Taking steroids usually leads to a permanent reduction of endogenous testosterone production, and infertility. I think it is quite irresponsible for you to recommend this, especially on LW, without the sensible caveats.

> take HGH during critical growth periods

Unfortunately, this option is only available for teenagers with parents who are rich enough to be willing to pay for this (assuming the Asian male we are talking about here has started with an average height, and therefore is unlikely to have their health insurance pay for HGH).

> lengthen your shins through surgery

From what I hear, this costs between 50k and 150k USD, and six months to a year of being bedridden to recover. In addition, it might make your legs more fragile when doing squats or deadlifts.

> (also do the obvious: take GLP-1 agonists)

This is sane, and I would agree, if the person is overweight.

> Alternatively, consider feminizing.

So if Asian men are perceived to be relatively unmasculine, you want them to feminize themselves? This is a stupid and confused statement. I believe that what you mean is some sort of costly signalling via flamboyance, which does not necessarily feminize them as much as make them stand out and perhaps signal other things, like having the wealth to invest in grooming and fashion, and having the social status to be able to stand out. Saying Asian men need to feminize reminds me of certain trans women's rather insistent attempt to normalize the idea of effeminate boys transitioning for social acceptance, which is an idea I find quite distasteful (it's okay for boys to cry and to be weak, and I personally really dislike people and cultures that traumatize young men for not meeting the constantly escalating standards of masculinity).

> Schedule more plastic surgeries in general.

I see you expect people to have quite a lot of money to burn on fucking with their looks. I think I agree that plastic surgeries are likely a good investment for a young man with money burning a hole in their pocket and a face that they believe is suboptimal. Some young men truly are cursed with a face that makes me expect that no girl will find them sexually attractive, and I try to not think about it, in the same way that seeing a homeless person makes me anxious about the possibility of me being homeless and ruins the next five minutes of my life.

> Don’t tell the people you’re sexually attracted to that you are doing this — that’s low status and induces guilt and ick.

You can tell them the de facto truth while communicating it in a way that makes it have no effect on how you are perceived.

> Don’t ask Reddit, they will tell you you are imagining things and need therapy.

Redditoid morality tells you that it is valid and beautiful to want limb lengthening surgery if you start way below average and want to go to the average, but that it is mental illness to want to go from average to above average. This also applies to you, and I think you've gone too far in the other direction.

> Don’t be cynical or bitter or vengeful — do these things happily.

Utterly ridiculous, don't tell people how to feel.
In this comment, I'll try to respond at the object level, arguing for why I expect slower takeoff than "brain in a box in a basement". I'd also be down to try to do a dialogue/discussion at some point.

> 1.4.1 Possible counter: “If a different, much more powerful, AI paradigm existed, then someone would have already found it.”
>
> I think of this as a classic @paulfchristiano-style rebuttal (see e.g. Yudkowsky and Christiano discuss "Takeoff Speeds", 2021).
>
> In terms of reference class forecasting, I concede that it’s rather rare for technologies with extreme profit potential to have sudden breakthroughs unlocking massive new capabilities (see here), that “could have happened” many years earlier but didn’t. But there are at least a few examples, like the 2025 baseball “torpedo bat”, wheels on suitcases, the original Bitcoin, and (arguably) nuclear chain reactions.[7]

I think the way you describe this argument isn't quite right. (More precisely, I think the argument you give may also be a (weaker) counterargument that people sometimes say, but I think there is a nearby argument which is much stronger.)

Here's how I would put this: Prior to having a complete version of this much more powerful AI paradigm, you'll first have a weaker version of this paradigm (e.g. you haven't figured out the most efficient way to do the brain algorithm, etc.). Further, the weaker version of this paradigm might initially be used in combination with LLMs (or other techniques) such that it (somewhat continuously) integrates into the old trends. Of course, large paradigm shifts might cause things to proceed substantially faster or bend the trend, but not necessarily.

Further, we should still broadly expect this new paradigm will itself take a reasonable amount of time to transition through the human range and through different levels of usefulness even if it's very different from LLM-like approaches (or other AI tech). And we should expect this probably happens at massive computational scale where it will first be viable given some level of algorithmic progress (though this depends on the relative difficulty of scaling things up versus improving the algorithms). As in, more than a year prior to the point where you can train a superintelligence on a gaming GPU, I expect someone will train a system which can automate big chunks of AI R&D using a much bigger cluster.

On this prior point, it's worth noting that Paul's original points in Takeoff Speeds are totally applicable to non-LLM paradigms, as is much in Yudkowsky and Christiano discuss "Takeoff Speeds". (And I don't think you compellingly respond to these arguments.)

----------------------------------------

I think your response is that you argue against these perspectives under 'Very little R&D separating “seemingly irrelevant” from ASI'. But, I just don't find these specific arguments very compelling. (Maybe also you'd say that you're just trying to lay out your views rather than compellingly arguing for them. Or maybe you'd say that you can't argue for your views due to infohazard/forkhazard concerns. In which case, fair enough.)

Going through each of these:

> I think that, once this next paradigm is doing anything at all that seems impressive and proto-AGI-ish,[12] there’s just very little extra work required to get to ASI (≈ figuring things out much better and faster than humans in essentially all domains). How much is “very little”? I dunno, maybe 0–30 person-years of R&D?
> Contrast that with AI-2027’s estimate that crossing that gap will take millions of person-years of R&D.
>
> Why am I expecting this? I think the main reason is what I wrote about the “simple(ish) core of intelligence” in §1.3 above.

I don't buy that having a "simple(ish) core of intelligence" means that you don't take a long time to get the resulting algorithms. I'd say that much of modern LLMs does have a simple core and you could transmit this using a short 30 page guide, but nonetheless, it took many years of R&D to reach where we are now. Also, I'd note that the brain seems way more complex than LLMs to me!

> For a non-imitation-learning paradigm, getting to “relevant at all” is only slightly easier than getting to superintelligence

My main response would be that basically all paradigms allow for mixing imitation with reinforcement learning. And, it might be possible to mix the new paradigm with LLMs, which would smooth out / slow down takeoff.

You note that imitation learning is possible for brains, but don't explain why we won't be able to mix the brain-like paradigm with more imitation than human brains do, which would smooth out takeoff. As in, yes, human brains don't use as much imitation as LLMs, but they would probably perform better if you modified the algorithm some and did 10^26 FLOP worth of imitation on the best data. This would smooth out the takeoff.

> Why do I think getting to “relevant at all” takes most of the work? This comes down to a key disanalogy between LLMs and brain-like AGI, one which I’ll discuss much more in the next post.

I'll consider responding to this in a comment responding to the next post.

Edit: it looks like this is just the argument that LLM capabilities come from imitation due to transforming observations into behavior in a way humans don't. I basically just think that you could also leverage imitation more effectively to get performance earlier (and thus at a lower level) with an early version of a more brain-like architecture, and I expect people would do this in practice to see earlier returns (even if the brain doesn't do this).

> Instead of imitation learning, a better analogy is to AlphaZero, in that the model starts from scratch and has to laboriously work its way up to human-level understanding.

Notably, in the domains of chess and Go it actually took many years to make it through the human range. And, it was possible to leverage imitation learning and human heuristics to perform quite well at Go (and chess) in practice, up to systems which weren't that much worse than humans.

> it takes a lot of work to get AlphaZero to the level of a skilled human, but then takes very little extra work to make it strongly superhuman.

AlphaZero exhibits returns which are maybe like 2-4 SD (within the human distribution of Go players, supposing ~100k to 1 million Go players) per 10x-ing of compute.[1] So, I'd say it probably would take around 30x to 300x additional compute to go from skilled human (perhaps 2 SD above median) to strongly superhuman (perhaps 3 SD above the best human or 7.5 SD above median) if you properly adapted to each compute level. In some ways 30x to 300x is very small, but also 30x to 300x is not that small... In practice, I expect returns more like 1.2 SD / 10x of compute at the point when AIs are matching top humans. (I explain this in a future post.)

> 1.7.2 “Plenty of room at the top”

I agree with this.

> 1.7.3 What’s the rate-limiter?
>
> [...]
> My rebuttal is: for a smooth-takeoff view, there has to be some correspondingly-slow-to-remove bottleneck that limits the rate of progress. In other words, you can say “If Ingredient X is an easy huge source of AGI competence, then it won’t be the rate-limiter, instead something else will be”. But you can’t say that about every ingredient! There has to be a “something else” which is an actual rate-limiter, that doesn’t prevent the paradigm from doing impressive things clearly on track towards AGI, but that does prevent it from being ASI, even after hundreds of person-years of experimentation.[13] And I’m just not seeing what that could be.
>
> Another point is: once people basically understand how the human brain figures things out in broad outline, there will be a “neuroscience overhang” of 100,000 papers about how the brain works in excruciating detail, and (I claim) it will rapidly become straightforward to understand and integrate all the little tricks that the brain uses into AI, if people get stuck on anything.

I'd say that the rate limiter is that it will take a while to transition from something like "1000x less compute efficient than the human brain (as in, it will take 1000x more compute than a human lifetime to match top human experts, but simultaneously the AIs will be better at a bunch of specific tasks)" to "as compute efficient as the human brain". Like, the actual algorithmic progress for this will take a while, and I don't buy your claim that the way this will work is that you'll go from nothing to having an outline of how the brain works, and at this point everything will immediately come together due to the neuroscience literature. Like, I think something like this is possible, but unlikely (especially prior to having AIs that can automate AI R&D).

And, while you have much less efficient algorithms, you're reasonably likely to get bottlenecked on either how fast you can scale up compute (though this is still pretty fast, especially if all those big datacenters for training LLMs are still just lying around!) or how fast humanity can produce more compute (which can be much slower).

Part of my disagreement is that I don't put the majority of the probability on "brain-like AGI" (even if we condition on something very different from LLMs), but this doesn't explain all of the disagreement.

1. ^ It looks like a version of AlphaGo Zero goes from 2400 ELO (around 1000th best human) to 4000 ELO (somewhat better than the best human) between hours 15 and 40 of the training run (see Figure 3 in this PDF). So, naively this is a bit less than 3x compute for maybe 1.9 SDs (supposing that the “field” of Go players has around 100k to 1 million players), implying that 10x compute would get you closer to 4 SDs. However, in practice, progress around the human range was slower than 4 SDs/OOM would predict. Also, comparing times to reach particular performances within a training run can sometimes make progress look misleadingly fast due to LR decay and suboptimal model size. The final version of AlphaGo Zero used a bigger model size and ran RL for much longer, and it seemingly took more compute to reach the ~2400 ELO and ~4000 ELO, which is some evidence for optimal model size making a substantial difference (see Figure 6 in the PDF). Also, my guess based on circumstantial evidence is that the original version of AlphaGo (which was initialized with imitation) moved through the human range substantially slower than 4 SDs/OOM. Perhaps someone can confirm this.
(This footnote is copied from a forthcoming post of mine.)
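For readers who want to check the arithmetic, here is a minimal sketch of my own, using only the figures and assumptions stated in the comment above (the rank-to-Elo correspondences, the 100k-1M field size, the hours-15-to-40 compute ratio, and the 2-4 SD-per-10x returns); it is not the author's actual calculation.

```python
# Back-of-the-envelope check of two numbers in the comment above, using only the
# assumptions stated there (not the author's actual calculation).
from statistics import NormalDist
from math import log10

def sd_above_median(rank, field_size):
    """SDs above the median for the player ranked `rank` in a normally distributed field."""
    return NormalDist().inv_cdf(1 - rank / field_size)

# 1. The AlphaGo Zero footnote: ~2400 Elo ~= 1000th best human, ~4000 Elo ~= best human,
#    reached between hours 15 and 40 of training (a bit under 3x compute).
for field in (100_000, 1_000_000):
    gain = sd_above_median(1, field) - sd_above_median(1_000, field)
    sd_per_oom = gain / log10(40 / 15)
    print(f"field of {field:>9,}: gain ~{gain:.1f} SD, ~{sd_per_oom:.1f} SD per 10x compute")

# 2. The "30x to 300x" estimate: a 5.5 SD gap (2 SD above median -> 7.5 SD above median)
#    at the quoted 2-4 SD of progress per 10x of compute.
for sd_per_10x in (2.0, 4.0):
    print(f"{sd_per_10x} SD per 10x -> ~{10 ** (5.5 / sd_per_10x):.0f}x additional compute")
```

This prints a gain of roughly 1.7-1.9 SDs over a bit under 3x compute (about 4-4.6 SDs per 10x), and an implied ~24x to ~560x of additional compute for the 5.5 SD gap, roughly in line with the ranges quoted above.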

Recent Discussion

2.1 Summary & Table of contents

This is the second of a two-post series on foom (previous post) and doom (this post).

The last post talked about how I expect future AI to be different from present AI. This post will argue that this future AI will be of a type that will be egregiously misaligned and scheming, not even ‘slightly nice’, absent some future conceptual breakthrough.

I will particularly focus on exactly how and why I differ from the LLM-focused researchers who wind up with (from my perspective) bizarrely over-optimistic beliefs like “P(doom) ≲ 50%”.[1]

In particular, I will argue that these “optimists” are right that “Claude seems basically nice, by and large” is nonzero evidence for feeling good about current LLMs (with various caveats). But I think that future AIs...

This is quite specific and only engages with section 2.3, but it made me curious.

I want to ask a question around a core assumption in your argument about human imitative learning. You claim that when humans imitate, this "always ultimately arises from RL reward signals" - that we imitate because we "want to," even if unconsciously. Is this the case at all times though? 

Let me work through object permanence as a concrete case study. The standard developmental timeline shows infants acquiring this ability around 8-12 months through gradual exposur... (read more)

Joey Marcellino
It's not obvious to me that "magically transmuting observations into behavior" is actually all that disanalogous to how the brain works. On something like the Surfing Uncertainty theory of the brain, updating probability distributions and minimizing predictive error is all the brain is ever doing, including potentially for things like moving your hand.
Steven Byrnes
Well then so much the worse for “the Surfing Uncertainty theory of the brain”!  :) See my post Why I’m not into the Free Energy Principle, especially §8: It’s possible to want something without expecting it, and it’s possible to expect something without wanting it.
S. Alex Bradt
And yet... But hallucination is "anything in human brains," isn't it?
No77e

Two decades don't seem like enough to generate the effect he's talking about. He might disagree though.

1.1 Series summary and Table of Contents

This is a two-post series on AI “foom” (this post) and “doom” (next post).

A decade or two ago, it was pretty common to discuss “foom & doom” scenarios, as advocated especially by Eliezer Yudkowsky. In a typical such scenario, a small team would build a system that would rocket (“foom”) from “unimpressive” to “Artificial Superintelligence” (ASI) within a very short time window (days, weeks, maybe months), involving very little compute (e.g. “brain in a box in a basement”), via recursive self-improvement. Absent some future technical breakthrough, the ASI would definitely be egregiously misaligned, without the slightest intrinsic interest in whether humans live or die. The ASI would be born into a world generally much like today’s, a world utterly unprepared for this...

Valentin2026
Can you expand your argument for why LLMs will not reach AGI? Like, what exactly is the fundamental obstacle they will never pass? So far they are successfully doing longer and longer (for humans) tasks: https://benjamintodd.substack.com/p/the-most-important-graph-in-ai-right Nor can I see why, in a few generations, LLMs won't be able to run a company, as you suggested. Moreover, I don't see why it is necessary to get to AGI. LLMs are already good at solving complicated, Ph.D.-level mathematical problems, and this keeps improving. Essentially, we just need an LLM version of an AI researcher. To create ASI you don't need a billion Sam Altmans, you need a billion Ilya Sutskevers. Is there any reason to assume LLMs will never be able to become excellent AI researchers?

LLMs are already good at solving complicated, Ph.D.-level mathematical problems, and this keeps improving

They're not. I work a lot with math, and o3 is useful for asking basic questions about domains I'm unfamiliar with and pulling up relevant concepts/literature. But if you ask it to prove something nontrivial, 95+% of the time it will invite you for a game of "here's a proof that 2 + 2 = 5, spot the error!".

That can also be useful: it's like dropping a malfunctioning probe into a cave and mapping out its interior off of the random flashes of light and sounds of imp... (read more)

Knight Lee
I strongly agree with this post, but one question:

Assuming there exists a simple core of intelligence, then that simple core is probably some kind of algorithm. When LLMs learn to predict the next token of a very complex process (like computer code or human thinking), they fit very high level patterns, and learn many algorithms (e.g. addition, multiplication, matrix multiplication, etc.) as long as those algorithms predict the next token well in certain contexts. Now maybe the simple core of intelligence is too complex an algorithm to be learned when predicting a single next token. However, a long chain-of-thought can combine these relatively simple algorithms (for predicting one next token) in countless possible ways, forming tons of more advanced algorithms, with a lot of working memory. Reinforcement learning on the chain-of-thought can gradually discover the best advanced algorithms for solving a great variety of tasks (any task which is cheaply verifiable). Given that evolution used brute force to create the human brain, don't you think it's plausible for this RL loop to use brute force to rediscover the simple core of intelligence?

PS: This is just a thought, not a crux. It doesn't conflict with your conclusions, since LLM AGI being a possibility doesn't mean non-LLM AGI isn't a possibility. And even if the simple core of intelligence was discovered by RL of LLMs, the consequences may be the same.
ryan_greenblatt
I agree there is a real difference; I just expect it to not make much of a difference to the bottom line in takeoff speeds etc. (I also expect some of both in the short-timelines LLM perspective at the point of full AI R&D automation.)

My view is that on hard tasks humans would also benefit from stuff like building explicit training data for themselves, especially if they had the advantage of "learn once, deploy many". I think humans tend to underinvest in this sort of thing. In the case of things like the restaurant sim, the task is sufficiently easy that I expect AGI would probably not need this sort of thing (though it might still improve performance enough to be worth it). I expect that as AIs get smarter (perhaps beyond the AGI level) they will be able to match humans at everything without needing to do explicit R&D-style learning in cases where humans don't need this. But, this sort of learning might still be sufficiently helpful that AIs are ongoingly applying it in all domains where increased cognitive performance has substantial returns.

Sure, but we can still loosely evaluate sample efficiency relative to humans in cases where some learning is involved (potentially including stuff like learning on the job). As in, how well can the AI learn from some data relative to humans. I agree that if humans aren't using learning in some task then this isn't meaningful (and this distinction between learning and other cognitive abilities is itself a fuzzy distinction).

Edition #9, that School is Hell, turned out to hit quite the nerve.

Thus, I’m going to continue with the system of making the roundups have more focus in their themes, with this one being the opposite of school questions, except for the question of banning phones in schools which seemed to fit.

Table of Contents

  1. Metal Health.
  2. Coercion.
  3. Game Theoretically Sound Discipline.
  4. The Joy of Doing Nothing.
  5. ADHD Exists But So Do Boys.
  6. Sports Go Sports.
  7. On the Big Screen.
  8. Kids Media Is Often Anti-Capitalist Propaganda.
  9. Culture.
  10. Travel.
  11. Phone a Friend.
  12. The Case For Phones.
  13. Ban Cell Phones in Schools.
  14. A Sobering Thought.

Metal Health

Henry Shevlin: I asked a high school teacher friend about the biggest change in teens over the past decade. His answer was interesting. He said whereas the ‘default state’ of teenage psychology used to be boredom, now it was

...

On references: I find it baffling how much of a cultural disconnect I feel between myself (born 1987) and almost anyone <~5-8 yrs younger than me. I can easily have conversations with people in their 70s and get at least a majority of their references, but go just a few years in the other direction and (for a recent example) I'll talk to a coworker who not only had never seen Seinfeld but had never heard of the Soup Nazi. Or (for another) a trivia night where the hosts not only didn't know Anaconda sampled Baby Got Back but were somehow confused by the ... (read more)

In an attempt to get myself to write more, here is my own shortform feed. Ideally I would write something daily, but we will see how it goes.

Of course this is now used as an excuse to revert any recent attempts to improve the article.

From reading the relevant talk-page it is pretty clear those arguing against the changes on these bases aren’t exactly doing so in good faith, and if they did not have this bit of ammunition to use they would use something else, but then with fewer detractors (since clearly nobody else followed or cared about that page).

Lucius Bushnaq
I am glad that you are proud of it and I feel kind of bad saying this, but the reason I had mixed feelings about the promotion is that I just really don't like the design. I find it visually exhausting to look at. Until you added the option to disable the theme, I was just avoiding the LW front page. I don't like the design of https://ifanyonebuildsit.com/ either. 

I’ve been thinking a lot recently about the relationship between AI control and traditional computer security. Here’s one point that I think is important.

My understanding is that there's a big qualitative distinction between two ends of a spectrum of security work that organizations do, that I’ll call “security from outsiders” and “security from insiders”.

On the “security from outsiders” end of the spectrum, you have some security invariants you try to maintain entirely by restricting affordances with static, entirely automated systems. My sense is that this is most of how Facebook or AWS relates to its users: they want to ensure that, no matter what actions the users take on their user interfaces, they can't violate fundamental security properties. For example, no matter what text I enter into the...
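To make the "restricting affordances with static, entirely automated systems" idea concrete, here is a small illustration of my own (not necessarily the example the post goes on to give): a parameterized database query leaves the user no affordance to alter the query's structure, no matter what text they type.

```python
# Illustration only (my example, not necessarily the post's): the "remove the
# affordance" pattern. With a parameterized query, user input can only ever be
# data; it has no way to change the structure of the query itself.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, is_admin INTEGER)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [("alice", 1), ("bob", 0)])

user_input = "bob' OR '1'='1"  # a classic injection attempt

# Unsafe: string interpolation lets the input rewrite the query.
unsafe = conn.execute(
    f"SELECT name FROM users WHERE name = '{user_input}'"
).fetchall()

# Safe: the placeholder restricts the input to being a value, never query structure.
safe = conn.execute("SELECT name FROM users WHERE name = ?", (user_input,)).fetchall()

print(unsafe)  # [('alice',), ('bob',)]  -- the invariant is violated
print(safe)    # []                      -- no such user, whatever was typed
```

The point is that the security property is enforced by the shape of the interface itself, not by trusting whatever is on the other side of it.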

jsnider3
Have you taken a look at how companies manage Claude Code, Cursor, etc? That seems related.
RobertM
That's surprising and concerning.  As you say, if these companies expect their AIs to do end-to-end engineering and R&D tasks internally, it seems difficult to imagine how they could do that without having employee-level privileges.  Any place where they don't is a place where humans turn into a bottleneck.  I can imagine a few possible objections to this:

1. They don't expect those bottlenecks to impose that much of a penalty.
   1. I'm not sure how this could be true unless you think that AI systems will be pretty weak, which is sort of fighting the premise.  What are we worried about, again?
2. They expect the bottlenecks to impose a large penalty and plan on biting that bullet when the time comes.
   1. I currently roll to disbelieve based on both the publicly-held positions of the relevant organizations and also their historical track records.  Incentives here seem quite bad.
3. They plan on trying to thread the needle by employing some control schemes where (for example) different "agents" have different permissions (see the sketch below).  i.e. a "code writing" agent has read permissions for (some parts of) the codebase, the ability to write, deploy, and test changes to that code in a sandboxed dev environment, and the ability to open a pull request with those changes.  Another set of agents have permissions to review pull requests, and then request changes, approve/merge them, or flag the PR as suspicious.  Yet another set of agents act as gatekeepers to sensitive data that might be needed for some experiments but only with good justification.
   1. This still suffers from the incentive gradient pushing quite hard to just build end-to-end agents.  Not only will it probably work better, but it'll be straight up cheaper and easier!

Like, to be clear, I would definitely prefer a world where these organizations wrote "small and carefully locked-down infrastructure" as the limited surface their AIs were allowed to interact with; I just don't expect that to actually happen in pr
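Option 3 in the list above is concrete enough to sketch. Here is a hypothetical, minimal example of what per-agent permission scoping could look like; all role names and permission strings are made up for illustration, not taken from any real system:

```python
# Hypothetical sketch of option 3 above: separate agent roles with narrowly scoped,
# deny-by-default permissions, instead of one agent with employee-level access.
# All role names and permission strings here are illustrative, not a real system.
from dataclasses import dataclass

@dataclass(frozen=True)
class AgentRole:
    name: str
    allowed_actions: frozenset

ROLES = {
    "code_writer": AgentRole("code_writer", frozenset({
        "repo:read:src", "sandbox:run_tests", "pr:open",
    })),
    "code_reviewer": AgentRole("code_reviewer", frozenset({
        "pr:read", "pr:request_changes", "pr:approve", "pr:flag_suspicious",
    })),
    "data_gatekeeper": AgentRole("data_gatekeeper", frozenset({
        "secrets:grant_scoped_access",  # only with logged justification
    })),
}

def authorize(role_name: str, action: str) -> bool:
    """Deny by default: allow only actions the role explicitly lists."""
    role = ROLES.get(role_name)
    return role is not None and action in role.allowed_actions

assert authorize("code_writer", "pr:open")
assert not authorize("code_writer", "pr:approve")       # a writer can't approve its own PR
assert not authorize("code_reviewer", "sandbox:run_tests")
```

The deny-by-default check is the whole point: an end-to-end agent would need the union of all three permission sets, which is exactly what a scheme like this refuses to grant.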
faul_sname
The same is true of human software developers - your dev team sure can ship more features at a faster cadence if you give them root on your prod servers and full read and write access to your database. However, despite this incentive gradient, most software shops don't look like this. Maybe the same forces that push current organizations to separate out the person writing the code from the person reviewing it could be repurposed for software agents. One bottleneck, of course, is that one reason it works with humans is that we have skin in the game - sufficiently bad behavior could get us fired or even sued. Current AI agents don't have anything to gain from behaving well or lose from behaving badly (or sufficient coherence to talk about "an" AI agent doing a thing).

one reason it works with humans is that we have skin in the game

 

Another reason is that different humans have different interests: your accountant and your electrician would struggle to work out a deal to enrich themselves at your expense, but it would get much easier if they shared the same brain and were just pretending to be separate people.

evhub

"We don't want it to be the case that models can be convinced to blackmail people just by putting them in a situation that the predictor thinks is fictional!"  

This is interesting!  I guess that, in some sense, means that you see certain ways in which even a future Claude N+1 won't be a truly general intelligence?

Fabien Roger
I think this is not totally obvious a priori. Some jailbreaks may work not via direct instructions, but by doing things that erode the RLHF persona and then let the base model shine through. For such "jailbreaks", you would expect the model to act misaligned if you hinted at misalignment very strongly, regardless of initial misalignment. I think you can control for this by doing things like my "hint at paperclip" experiment (which in fact suggests that the snitching demo doesn't work just because of RLHF-persona-erosion), but I don't think it's obvious a priori. I think it would be valuable to have more experiments that try to disentangle which of the personality traits the scary demo reveals stem from the hints vs. are "in the RLHF persona".

[ Context: The Debate on Animal Consciousness, 2014 ]

There's a story in Growing Up Yanomamö where the author, Mike Dawson, a white boy from America growing up among Yanomamö hunter-gatherer kids in the Amazon, is woken up in the early morning by two of his friends.

One of the friends says, "We're going to go fishing".

So he goes with them.

At some point on the walk to the river he realizes that his friends haven't said whose boat they'll use [ they're too young to have their own boat ].

He considers asking, then realizes that if he asks, and they're planning to borrow an older tribesmember's boat without permission [ which is almost certainly the case, given that they didn't specify up front ], his friends will have to...

TAG

Yet trying to imagine being something with half as much consciousness or twice as much consciousness as myself, seems impossible

To me, it doesn't even need to be imagined. Everyone has experienced partial consciousness, e.g.:

  • Dreaming, where you have phenomenal awareness, but not of an external world.

  • Deliberate visualisation, which is less phenomenally vivid than perception in most people.

  • Drowsiness, states between sleep and waking.

  • Autopilot and flow states, where the sense of a self deciding actions is absent.

More rarely there are forms of ... (read more)

Salutations,

I have been a regular reader (and big fan) of LessWrong for quite some time now, so let me just say that I feel honoured to be able to share some of my thoughts with the likes of you folks.

I don't reckon myself a good writer, nor a very polished thinker (as many of the veteran writers here are), so I hope you'll bear with me and be gentle with your feedback (it is my first time after all).

Without further ado, I have recently been wrestling with the concept of abductive reasoning. I have been searching for good definitions and explanations of it, but none persuade me that abductive reasoning is actually a needed concept.

The argument goes as follows: “Any proposed instance of abductive reasoning can be fully...

Abductive reasoning results from the abduction of one's reason.

Couldn't resist the quip. To speak more seriously: There is deduction, which from true premises always yields true conclusions. There is Bayesian reasoning, which from probabilities derives probabilities. There is no other form of reasoning. "Induction" and "abduction" are pre-Bayesian gropings in the dark, of no more account than the theory of humours in medicine.
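To make the claim concrete (my own illustration, not the commenter's): what "abduction" calls inference to the best explanation is just a comparison of posterior probabilities under Bayes' theorem,

$$P(H_i \mid E) \;=\; \frac{P(E \mid H_i)\,P(H_i)}{\sum_j P(E \mid H_j)\,P(H_j)}.$$

The "best explanation" is simply the hypothesis $H_i$ with the highest posterior, which is fixed entirely by its prior and its likelihood for the evidence $E$; no third mode of reasoning is doing any extra work.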