In the hope that it’s not too late for a course correction around AI, Nate and Eliezer have written a book making the detailed case for this unfortunate reality. The book is available in September; you can preorder it now, or read endorsements, quotes, and reviews from scientists, national security officials, and more.

evhub
Why red-team models in unrealistic environments? Following on our Agentic Misalignment work, I think it's worth spelling out a bit more why we do work like this, especially given complaints like the ones here about the unrealism of our setup. Some points:

1. Certainly I agree that our settings are unrealistic in many ways. That's why we hammer the point repeatedly that our scenarios are not necessarily reflective of whether current models would actually do these behaviors in real situations. At the very least, our scenarios involve an unnatural confluence of events and a lot of unrealism in trying to limit Claude's possible options, to simulate a situation where the model has no middle-ground/compromise actions available to it. But that's not an excuse—we still don't want Claude to blackmail/leak/spy/etc. even in such a situation!
2. The point of this particular work is red-teaming/stress-testing: aggressively searching for situations in which models behave in egregiously misaligned ways despite their HHH safety training. We do lots of different work for different reasons, some of which is trying to demonstrate something about a model generally (e.g. Claude 3 Opus has a tendency to fake alignment for certain HHH goals across many different similar situations), some of which is trying to demonstrate things about particular training processes, some of which is trying to demonstrate things about particular auditing techniques, etc. In the case of Agentic Misalignment, the goal is just to show an existence proof: that there exist situations where models are not explicitly instructed to be misaligned (or explicitly given a goal that would imply doing misaligned things, e.g. explicitly instructed to pursue a goal at all costs) and yet will still do very egregiously misaligned things like blackmail (note that though we include a setting where the models are instructed to follow the goal of serving American interests, we show that you can ablate that away and still get
leogao
some lessons from ml research:

* any shocking or surprising result in your own experiment is 80% likely to be a bug until proven otherwise. your first thought should always be to comb for bugs.
* only after you have ruled out bugs do you get to actually think about how to fit your theory to the data, and even then, there might still be a hidden bug.
* most papers are terrible and don't replicate.
* most techniques that sound intuitively plausible don't work.
* most techniques only look good if you don't pick a strong enough baseline.
* an actually good idea can take many tries before it works.
* once you have good research intuitions, the most productive state to be in is to literally not think about what will go into the paper and just do experiments that satisfy your curiosity and convince yourself that the thing is true. once you have that, running the final sweeps is really easy.
* most people have no intuition whatsoever about their hardware and so will write code that is horribly inefficient. even learning a little bit about hardware fundamentals so you don't do anything obviously dumb is super valuable.
* in a long and complex enough project, you will almost certainly have a bug that invalidates weeks (or months) of work. being really careful and testing helps but slows down velocity a lot. unclear what the right equilibrium is.
* feedback loop time is incredibly important. if you can get rapid feedback, you will make so much more progress.
* implementing something that is already known to work is always vastly easier than inventing/researching something new.
* you will inevitably spend a lot of time doing things that have no impact on the final published work whatsoever. like not even contributing that much useful intuition. this is unfortunate but unavoidable.
* oftentimes you will spend a lot of time being fundamentally philosophically confused about what to do, and only really figure it out halfway through the project. this is normal.
* directio
Eli Tyre
In Spring of 2024, Jacob Lagerros and I took an impromptu trip to Taiwan to glean what we could about the chip supply chain. Around the same time, I read Chip War and some other sources about the semiconductor industry. I planned to write a blog post outlining what I learned, but I got pseudo-depressed after coming back from Taiwan, and never finished or published it. This post is a lightly edited version of the draft that has been sitting in my documents folder. (I had originally intended to include a lot more than this, but I might as well publish what I have.)

Interestingly, reading it now, all of this feels so basic that I’m surprised I considered a lot of it worth including in a post like this, but I think it was all new to me at the time.

* There are important differences between logic chips and memory chips, such that at various times, companies have specialized in one or the other.
* TSMC was founded by Morris Chang, with the backing of the Taiwanese government. But the original impetus came from Taiwan, not from Chang. The government decided that it wanted to become a leading semiconductor manufacturer, and approached Chang (who had been an engineer and executive at Texas Instruments) about leading the venture.
* However, TSMC’s core business model, being a designerless fab that would manufacture chips for customers but not design chips of its own, was Chang’s idea. He had floated it to Texas Instruments while he worked there, and was turned down. This idea was bold and innovative at the time—there had never been a major fab that didn’t design its own chips.
* There had been precursors on the customer side: small computer firms that would design chips and then buy some of the spare capacity of Intel or Texas Instruments to manufacture them. This was always a precarious situation for those companies, because they depended on companies who were both their competitors and their crucial suppliers. Chang bet that there would be more compa
habryka
Ok, many of y'all can have feelings about whether it's a good idea to promote Nate's and Eliezer's book on the LW frontpage the way we are doing it, but can I get some acknowledgement that the design looks really dope? Look at those nice semi-transparent post-items. Look at that nice sunset gradient that slowly fades to black. Look at the stars fading out in an animation that is subtle enough that you can (hopefully) ignore it as you scroll down and parse the frontpage, but still creates an airy ominous beauty to life snuffing out across the universe.

Well, I am proud of it :P[1] I hope I can do more cool things with the frontpage for other things in the future. I've long been wanting to do things with the LW frontpage that create a more aesthetic experience that captures the essence of some important essay or piece of content I want to draw attention to, and I feel like this one worked quite well. I'll probably experiment more with some similar things in the future (though I will generally avoid changing the color scheme this drastically unless there is some good reason, and make sure people can easily turn it off from the start).

1. ^ (Also credit to Ray who did the initial pass of porting over the design from ifanyonebuildsit.com)

Popular Comments

I respect the courage in posting this on LessWrong and writing your thoughts out for all to hear and evaluate and judge you for. It is why I've decided to go out on a limb and even comment.

> take steroids

Taking steroids usually leads to a permanent reduction of endogenous testosterone production, and infertility. I think it is quite irresponsible for you to recommend this, especially on LW, without the sensible caveats.

> take HGH during critical growth periods

Unfortunately, this option is only available for teenagers with parents who are rich enough to be willing to pay for this (assuming the Asian male we are talking about here has started with an average height, and therefore is unlikely to have their health insurance pay for HGH).

> lengthen your shins through surgery

From what I hear, this costs between 50k-150k USD, and six months to a year of being bedridden to recover. In addition, it might make your legs more fragile when doing squats or deadlifts.

> (also do the obvious: take GLP-1 agonists)

This is sane, and I would agree, if the person is overweight.

> Alternatively, consider feminizing.

So if Asian men are perceived to be relatively unmasculine, you want them to feminize themselves? This is a stupid and confused statement. I believe that what you mean is some sort of costly signalling via flamboyance, which does not necessarily feminize them as much as make them stand out and perhaps signal other things, like having the wealth to invest in grooming and fashion, and having the social status to be able to stand out. Saying Asian men need to feminize reminds me of certain trans women's rather insistent attempt to normalize the idea of effeminate boys transitioning for social acceptance, which is an idea I find quite distasteful (it's okay for boys to cry and to be weak, and I personally really dislike people and cultures that traumatize young men for not meeting the constantly escalating standards of masculinity).

> Schedule more plastic surgeries in general.

I see you expect people to have quite a lot of money to burn on fucking with their looks. I think I agree that plastic surgeries are likely a good investment for a young man with money burning a hole in their pocket and a face that they believe is suboptimal. Some young men truly are cursed with a face that makes me expect that no girl will find them sexually attractive, and I try not to think about it, in the same way that seeing a homeless person makes me anxious about the possibility of me being homeless and ruins the next five minutes of my life.

> Don’t tell the people you’re sexually attracted to that you are doing this — that’s low status and induces guilt and ick.

You can tell them the de facto truth while communicating it in a way that makes it have no effect on how you are perceived.

> Don’t ask Reddit, they will tell you you are imagining things and need therapy.

Redditoid morality tells you that it is valid and beautiful to want limb lengthening surgery if you start way below average and want to go to the average, but that it is mental illness to want to go from average to above average. This also applies to you, and I think you've gone too far in the other direction.

> Don’t be cynical or bitter or vengeful — do these things happily.

Utterly ridiculous, don't tell people how to feel.
This feels kind of backwards, in the sense that I think something like 2032-2037 is probably the period that most people I know who have reasonably short timelines consider most likely.  AI 2027 is a particularly aggressive timeline compared to the median, so if you choose 2028 as some kind of Schelling time to decide whether things are markedly slower than expected then I think you are deciding on a strategy that doesn't make sense by like 80% of the registered predictions that people have. Even the AI Futures team themselves have timelines that put more probability mass on 2029 than 2027, IIRC.  Of course, I agree that in some worlds AI progress has substantially slowed down, and we have received evidence that things will take longer, but "are we alive and are things still OK in 2028?" is a terrible way to operationalize that. Most people do not expect anything particularly terrible to have happened by 2028! My best guess, though I am far from confident, is that things will mostly get continuously more crunch-like from here, as things continue to accelerate. The key decision-point in my model at which things might become a bit different is if we hit the end of the compute overhang, and you can't scale up AI further simply by more financial investment, but instead now need to substantially ramp up global compute production, and make algorithmic progress, which might markedly slow down progress. I agree with a bunch of other things you say about it being really important to have some faith in humanity, and to be capable of seeing what a good future looks like even if it's hard, and that this is worth spending a lot of effort and attention on, but just the "I propose 2028 as the time to re-evaluate things, and I think we really want to change things if stuff still looks fine" feels to me like it fails to engage with people's actually registered predictions.
Four million a year seems like a lot of money to spend on what is essentially a good capabilities benchmark. I would rather give that to like, LessWrong, and if I had the time to do some research I could probably find 10 people willing to create benchmarks for alignment that I think would be even more positively impactful than a lesswrong donation (like https://scale.com/leaderboard/mask or https://scale.com/leaderboard/fortress)

Recent Discussion

This post was written as part of AISC 2025.

Introduction

In Agents, Tools, and Simulators, we outlined several lenses for conceptualizing LLM-based AI, with the intention of defining what simulators are in contrast to their alternatives.  This post will consider the alignment implications of each lens in order to establish how much we should care if LLMs are simulators or not.  We conclude that agentic systems seem to have a high potential for misalignment, simulators have a mild to moderate risk, tools push the dangers elsewhere, and the potential for blended paradigms muddies this evaluation.

Alignment for Agents

The basic problem of AI alignment, under the agentic paradigm, can be summarized as follows:

  1. Optimizing any set of values pushes those values not considered to zero.
  2. People care about more things than we can
...
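As a toy illustration of claim 1 above (my own sketch, not from the post): an optimizer that is scored only on the values it is told about will spend every available resource on them, driving any unmodeled value to zero. The variable names and the budget constraint below are purely illustrative assumptions.

```python
# Toy sketch (illustrative, not from the post): a fixed budget split between a
# rewarded quantity and an unmodeled one that people also care about.
from scipy.optimize import linprog

# Maximize 'measured' (the objective ignores 'unmeasured'): minimize -measured.
result = linprog(
    c=[-1, 0],                    # objective weight on (measured, unmeasured)
    A_ub=[[1, 1]], b_ub=[10],     # measured + unmeasured <= 10 (shared budget)
    bounds=[(0, None), (0, None)],
)
measured, unmeasured = result.x
print(measured, unmeasured)       # -> 10.0 0.0: the unconsidered value is pushed to zero
```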

Agree, when discussing the alignment of simulators in this post, we are referring to safety from the subset of dangers related to unbounded optimization towards alien goals, which does not include everything within value alignment, let alone AI safety.  But this qualification points to a subtle meaning drift in the use of the word "alignment" in this post (towards something like "comprehension and internalization of human values"), which isn't good practice and is something I'll want to figure out how to edit/fix soon.

In Agents, Tools, and Simulators we outlined what it means to describe a system through each of these lenses and how they overlap.  In the case of an agent/simulator, our central question is: which property is "driving the bus" with respect to the system's behavior, utilizing the other in its service?

Aligning Agents, Tools, and Simulators explores the implications of the above distinction, predicting different types of values (and thus behavior) from

  1. An agent that contains a simulation of the world that it uses to navigate.
  2. A simulator that generates agents because such agents are part of the environment the system is modelling.
  3. A system where the modes are so entangled it is meaningless to think about where one ends and the other begins.  

Specifically, we expect simulator-first systems to have holistic goals...

[Crossposted on Windows On Theory]
 

Throughout history, technological and scientific advances have had both good and ill effects, but their overall impact has been overwhelmingly positive. Thanks to scientific progress, most people on earth live longer, healthier, and better lives than they did centuries or even decades ago.

I believe that AI (including AGI and ASI) can do the same and be a positive force for humanity. I also believe that it is possible to solve the “technical alignment” problem and build AIs that follow the words and intent of our instructions and report faithfully on their actions and observations.

I will not defend these two claims here. However, even granting these optimistic premises, AI’s positive impact is not guaranteed. In this essay, I will:

  1. Define what I mean by “solving” the
...

On the "safety underinvestment" point I am going to say something which I think is obvious (and has probably been discussed before) but I have not personally seen anyone advocate for:

The DoD should be conducting its own safety/alignment research, and pushing for this should be one of the primary goals of safety advocates.

There is this constant push for labs to invest more in safety. At the same time we all acknowledge this is a zero sum game. Every dollar/joule/flop they put into safety/alignment is a dollar/joule/flop they can't put into capabilities rese...

ryan_greenblatt
I left some comments noting disagreements, but I thought it would be helpful to note some areas of agreement:

* I agree AI could be (highly) risky and think that it's good to acknowledge this (as you do).
* I agree that you could have a maximally obedient superintelligent AI. (Some cases around manipulation could be philosophically tricky, but this seems resolvable at least in principle.)
* I agree that obedience (instruction following) is a good target, though I think there are some caveats to this (and uncertainties I have).
* I agree there are substantial risks from arms races (which cause a race to the bottom on safety), AI-enabled authoritarianism (which could be ~indefinitely stable), and undesirable societal shifts due to unemployment and general upheaval. I'm most worried about misalignment risks, though these risks might be increased by these other factors, particularly arms races.
* I agree that pure internal deployment, a single monopoly, and safety underinvestment would (probably) make these risks worse. Though I might think of the downsides of these as somewhat different than what you're focusing on.
* I agree that the offense-defense issue can probably be handled if most intelligence is in the hands of good actors, and careful (and potentially quite strong) actions are taken to defend properly. (I'm maybe skeptical that sufficiently strong actions will be taken in practice for reasons similar to those discussed here, though I don't agree with the bottom line of this linked post overall.)
ryan_greenblatt
I think it's pretty plausible that AIs which are obviously pretty misaligned are deployed (e.g. we've caught them trying to escape or sabotage research, or maybe we've caught an earlier model doing this and we don't have any strong reason to think we've resolved this in the current system). This is made more likely by an aggressive race to the bottom (possibly caused by an arms race over AI as you discuss). Part of my perspective here is that misalignment issues might be very difficult to resolve in time due to rapid AI progress and difficulty studying misalignment. (For instance, because the very AIs you're studying might not want to be studied!)

I also think it's plausible that very capable AIs will end up being schemers/alignment-fakers which look very aligned (sufficient for your deployment checks), but have misaligned long run aims. And, even if you found evidence of this in earlier AIs, this wouldn't suffice to prevent AIs where we haven't confidently ruled this out from being deployed (see prior paragraph). I also think it's plausible that you won't see smoking gun evidence of this before it's too late, as I discuss in a prior post of mine.

The issues I'm worried about don't feel like edge cases or an uncanny valley to me (though I suppose you could think of alignment faking as an uncanny valley). My understanding is that you disagree about the possibility of relatively worst case scenarios with respect to scheming and think that people would radically change their approach if we had clear evidence of (relatively consistent, across context) scheming without a strong resolution of this problem that generalizes to more capable AIs. I hope you're right.

---

To be clear, I agree that it would be better if AIs are obviously (seriously) misaligned than if they are instead scheming undetected.
ryan_greenblatt
I agree that to train AIs which are generally very superhuman, you'll need to be able to make AIs highly capable on tasks that are too hard for humans to supervise. And, that if we have no ability to make AIs capable on tasks which are hard for humans to supervise, risks are much more limited.[1]

However, I don't think that making AIs highly capable on tasks which are too hard for humans to supervise necessarily requires being able to ensure AIs do what we want in these settings nor does it require being able to train AIs for specific objectives in these settings. Instead, you could in principle create very superhuman AIs through transfer (as humans do in many cases) and this wouldn't require any ability to directly supervise on domains where the AI ends up being superhuman nevertheless. Further, you might be able to directly train AIs to be highly capable (as in, without depending on much transfer) using flawed feedback in a given domain (e.g. feedback which is often possible to reward hack but which still teaches the AI the relevant abilities).

So, I agree that the ability to make very superhuman AIs implies that we'll (very likely) be able to make AIs which are capable of following our instructions and which are capable of being maximally honest, but this doesn't imply that we'll be able to ensure these properties (e.g. the AI could intentionally disobey instructions or lie). Further, there is a difference between being able to supervise instruction following and honesty in any given task and being able to produce an AI which robustly instruction follows and is honest. (Things like online training using this supervision only give you average case guarantees, and that's if you actually use online training.)

It's certainly possible that there is substantial transfer from the task of "train AIs to be highly capable (and useful) in harder-to-check domains" to the task of ensuring AIs are robustly honest and instruction following, but it is also easy to imagine wa

Crosspost from my blog.

Regime change

Conjecture: when there is regime change, the default outcome is for a faction to take over—whichever faction is best prepared to seize power by force.

One example: the Iranian Revolution of 1978-1979. In the years leading up to the revolution, there was turmoil and broad hostility towards the Shah across many sectors of the population. These hostilities ultimately combined into an escalating cycle of protest, crackdown, and more protest from more sectors (demonstrations, worker strikes). Finally, popular support for Khomeini as the flag-bearer of the broad-based revolution was enough to get the armed forces to defect, ending the Shah's rule.

From the Britannica article on the aftermath:

On April 1, following overwhelming support in a national referendum, Khomeini declared Iran an Islamic republic. Elements within the clergy

...
mishka
No, just about how to actually make non-saturating recursive self-improvement ("intelligence explosion"). Well, with the added constraint of not killing everyone or almost everyone, and not making almost everyone miserable either...

(Speaking of which, I noticed recently that some people's attempts at recursive self-improvement now take longer to saturate than before. And, in fact, they are taking long enough that people are sometimes publishing before pushing them to saturation, so we don't even know what would happen if they were to simply continue pushing a bit harder.)

Now, implementing those things to test those ideas can actually be quite unsafe (those are basically "mini-foom" experiments, and people are not talking enough about the safety of those). So before pushing harder in this direction, it would be better to do some preliminary work to reduce the risks of such experiments...

Yes, LLMs mostly have a reliability/steerability problem. I am seeing plenty of "strokes of genius" in LLMs, so the potential is there. They are not "dumb", they have good creativity (in my experience). We just can't get them to reliably compose, verify, backtrack, and so on to produce overall quality work. Their "fuzzy error correction" still works less well than human "fuzzy error correction", at least on some of the relevant scales. So they eventually accumulate too many errors on long-horizon tasks and they don't self-correct enough on long-horizon tasks.

This sounds to me more like a "character upbringing problem" than a purely technical problem... That's especially obvious when one reads the reasoning traces of reasoning models... What I see there sounds to me as if their "orientation" is still wrong (those models seem to be thinking about how to satisfy their user or their maker, and not about how to "do the right thing", whereas a good human solving a math problem ignores the aspect of satisfying their teacher or their parent, and just tries to do "an objectively good
TsviBT
Examples? What makes you think they are strokes of genius (as opposed to the thing already being in the training data, or being actually easy)?
mishka
I don't know. I started (my experience talking with GPT-4 and such) with asking it to analyze 200 lines of non-standard code with the comments stripped out. It correctly figured out that I was using nested dictionaries to represent vector-like objects, and that that was an implementation of a non-standard, unusually flexible (but slow) neural machine. This was obviously a case of "true understanding" (and it was quite difficult to reproduce: as the models evolved, the ability to analyze this code well was lost, then eventually regained in better models; those better models eventually figured out even more non-trivial things about that non-standard implementation, e.g. at some point newer models started to notice on their own that that particular neural machine was inherently self-modifying; anyway, a very obvious evolution from inept pattern matching to good understanding, with some setbacks during the evolution of models, but eventually with good progress towards better and better performance).

Then I asked it to creatively modify and creatively remix some Shadertoy shaders, and it did a very good job (even more so if one considers that that model was visually blind and was unable to see the animations produced by its shaders). Nothing too difficult, but things like taking a function from one of the shaders and adding a call to this function from another shader with impressive visual effects... Again, with all the simplicity, it was more than would have occurred to me if I were trying to do this manually... But when I tried to manually iterate these steps to obtain "evolution of interesting shaders", I got a rapid saturation, not an unlimited interesting evolution... So not bad at all (I occasionally do rather creative things, but it is always an effort, so on the occasions when I am unable to successfully make this kind of effort, I start to feel that the model might be more creative than "me in my usual mode" (although, I don't know if these models are already competi
TsviBT

Ok, thanks for the info. (For the record, these do not sound like what I would remotely call "strokes of genius".)

I'll explain my reasoning in a second, but I'll start with the conclusion:

I think it'd be healthy and good to pause and seriously reconsider the focus on doom if we get to 2028 and the situation feels basically like it does today.

I don't know how to really precisely define "basically like it does today". I'll try to offer some pointers in a bit. I'm hoping folk will chime in and suggest some details.

Also, I don't mean to challenge the doom focus right now. There seems to be some good momentum with AI 2027 and the Eliezer/Nate book. I even preordered the latter.

But I'm still guessing this whole approach is at least partly misguided. And I'm guessing that fact will show up in 2028 as "Oh, huh, looks...

Jiro

What if the authors weren’t a subset of the community at all? What if they’d never heard of LessWrong, somehow?

Wouldn't that not change it very much, because the community signal-boosting a claim from outside the community still fits the pattern?

LGS
Exactly! The frontier labs have the compute and incentive to push capabilities forward, while randos on LessWrong are instead more likely to study alignment in weak open-source models.
Daniel Kokotajlo
It has indeed been really nice, psychologically, to have timelines that are lengthening again. From 2020 to 2024, that was not the case.
Random Developer
One of my key concerns is the question of:

1. Do the currently missing LLM abilities scale like pre-training, where each improvement requires spending 10x as much money?
2. Or do the currently missing abilities scale more like "reasoning", where individual university groups could fine-tune an existing model for under $5,000 in GPU costs, and give it significant new abilities?
3. Or is the real situation somewhere in between?

Category (2) is what Bostrom described as a "vulnerable world", or a "recipe for ruin." Also, not everyone believes that "alignment" will actually work for ASI. Under these assumptions, widely publishing detailed proposals in category (2) would seem unwise?

Also, even if I believed that someone would figure out the necessary insights to build AGI, it still matters how quickly they do it. Given a choice of dying of cancer in 6 months or 12 (all other things being equal), I would pick 12.

(I really ought to make an actual discussion post on the right way to handle even "recipes for small-scale ruin." After September 11th, this was a regular discussion among engineers and STEM types. It turns out that there are some truly nasty vulnerabilities that are known to experts, but that are not widely known to the public. If these vulnerabilities can be fixed, it's usually better to publicize them. But what should you do if a vulnerability is fundamentally unfixable?)

Take: Exploration hacking should not be used as a synonym for deceptive alignment. 

(I have observed one such usage)

Deceptive alignment is maybe a very particular kind of exploration hacking, but the term exploration hacking (without further specification) should refer to models deliberately sandbagging "intermediate" capabilities during RL training to avoid learning a "full" capability. 


1.1 Series summary and Table of Contents

This is a two-post series on AI “foom” (this post) and “doom” (next post).

A decade or two ago, it was pretty common to discuss “foom & doom” scenarios, as advocated especially by Eliezer Yudkowsky. In a typical such scenario, a small team would build a system that would rocket (“foom”) from “unimpressive” to “Artificial Superintelligence” (ASI) within a very short time window (days, weeks, maybe months), involving very little compute (e.g. “brain in a box in a basement”), via recursive self-improvement. Absent some future technical breakthrough, the ASI would definitely be egregiously misaligned, without the slightest intrinsic interest in whether humans live or die. The ASI would be born into a world generally much like today’s, a world utterly unprepared for this...

jdp

1.3.1 Existence proof: the human cortex

So unfortunately this is one of those arguments that rapidly descends into which prior you should apply and how you should update on what evidence, but.

Your entire post basically hinges on this point and I find it unconvincing. Bionets are very strange beasts that cannot even implement backprop in the way we're used to, it's not remotely obvious that we would recognize known algorithms even if they were what the cortex amounted to. I will confess that I'm not a professional neuroscientist, but Beren Millidge is and...

Baram Sosis
I'm a bit surprised that you view the "secret sauce" as being in the cortical algorithm. My (admittedly quite hazy) view is that the cortex seems to be doing roughly the same "type of thing" as transformers, namely, building a giant predictive/generative world model. Sure, maybe it's doing so more efficiently -- I haven't looked into all the various comparisons between LLM and human lifetime training data. But I would've expected the major qualitative gaps between humans and LLMs to come from the complete lack of anything equivalent to the subcortical areas in LLMs. (But maybe that's just my bias from having worked on basal ganglia modeling and not the cortex.) In this view, there's still some secret sauce that current LLMs are missing, but AGI will likely look like some extra stuff stapled to an LLM rather than an entirely new paradigm. So what makes you think that the key difference is actually in the cortical algorithm? (If one of your many posts on the subject already answers this question, feel free to point me to it)
Raemon
I'm curious: what's the argument that felt most like "oh"?
David Johnston
This piece combines relatively uncontroversial points with some justification ("we're not near the compute or data efficiency limit") with controversial claims justified only by Steven's intuition ("the frontier will be reached suddenly by a small group few people are tracking"). I'd be more interested in a piece which examined the consequences of the former kind of claim only, or more strongly justified the latter kind of claim.

Greedy-Advantage-Aware RLHF addresses the problem of negative side effects from misspecified reward functions in language modeling domains. In a simple setting, the algorithm improves on traditional RLHF methods by producing agents that have a reduced tendency to exploit misspecified reward functions. I also detect the presence of sharp parameter topology in reward hacking agents, which suggests future research directions. The repository for the project can be found here.

Motivation

In the famous short story The Monkey's Paw by W.W. Jacobs, the White family receives a well-traveled friend of theirs, Sergeant-Major Morris, and he brings with him a talisman from his visits to India: a mummified monkey's paw. Sergeant-Major Morris reveals that the paw has a magical ability to grant wishes, but cautions against using its power. The family does...

Is it OK to compute the advantage function as a difference of value functions?
To my understanding, the advantage in PPO is not simply the difference between value functions; it uses GAE, which depends on the value and reward terms of later steps in the sampled trajectory.
Shouldn't we necessarily use that estimator during PPO training?
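For concreteness, here is a minimal sketch of how GAE is typically computed in PPO implementations (variable names and default hyperparameters are illustrative assumptions, not taken from this post's repository): the advantage at each step is an exponentially weighted sum of one-step TD errors, each of which depends on the sampled reward as well as on the value function.

```python
# Minimal GAE sketch (illustrative; names/defaults are assumptions, not from the post's repo).
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """rewards: length T; values: length T+1 (includes a bootstrap value for the final state)."""
    T = len(rewards)
    advantages = np.zeros(T)
    last_adv = 0.0
    for t in reversed(range(T)):
        # One-step TD error: uses the sampled reward, not just a difference of values.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of future TD errors (GAE).
        last_adv = delta + gamma * lam * last_adv
        advantages[t] = last_adv
    return advantages
```

With lam = 0 this collapses to the single-step TD error (the closest thing to "a difference of value functions", plus the sampled reward); with lam = 1 it becomes the full discounted-return-minus-baseline estimate. Standard PPO implementations use an intermediate lam to trade off bias and variance.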