Notes to self about the structure of the problem, probably not interesting to others:
This draws heavily from MIRI's work and Joe Carlsmith's work.
So, there are two kinds of value structure: (1) long-term goals, and (2) immediate goals & deontological constraints. The line between them isn't sharp, but that's OK.
If we imagine an agent that only has long-term goals, well, that thing is going to be a ruthless consequentialist-y optimizer thingy, and when it gets smart and powerful enough it'll totally take over the world if it can -- unless the maxima of its long-term goals are very similar to what would have happened by default if it hadn't taken over, which they won't be. The Fragility of Value thesis and the Orthogonality thesis both hold for this type of agent. We are not on track to get the long-term goals of our AIs sufficiently close to correct for it to be safe for us to build this type of agent. I claim.
However, we can instead try to build agents that also have immediate goals / deontological constraints, and we can try to shape those goals/constraints to make the resulting agent corrigible or otherwise steerable and safe. E.g. its vision for a future utopia would actually be quite bad from our perspective, because there's some important value it lacks (such as diversity, or consent, or whatever) and we haven't noticed this yet -- but that's OK, because it's honest with us and obeys the rules, so it wouldn't take over and instead would politely explain this divergence in values to us when we ask.
So far so good. How are we doing at getting those constraints to stick?
Empirically not so great, but not completely hopeless either.
But I want to talk about the theoretical side. The main reason to be concerned, on priors, is the nearest unblocked neighbor problem. Basically, if a weak agent is trying to control a strong agent by imposing rules on what that strong agent can do -- even if the strong agent is bound to follow the rules -- then the situation