All of Adam Scholl's Comments + Replies

I've been trying to spend a bit more time voting in response to this, to try to help keep thread quality high; at least for now, the size of the influx strikes me as low enough that a few long-time users doing this might help a bunch.

I agree we don't really understand anything in LLMs at this level of detail, but I liked Jan highlighting this confusion anyway, since I think it's useful to promote particular weird behaviors to attention. I would be quite thrilled if more people got nerd sniped on trying to explain such things!

John, it seems totally plausible to me that these examples do just reflect something like “hallucination,” in the sense you describe. But I feel nervous about assuming that! I know of no principled way to distinguish “hallucination” from more goal-oriented thinking or planning, and my impression is that nobody else does either.

I think it’s generally unwise to assume LLM output reflects its internal computation in a naively comprehensible way; it usually doesn’t, so I think it’s a sane prior to suspect it doesn't here, either. But at our current level of un... (read more)

If simple outcompetes complex, wouldn't we expect to see more prokaryotic DNA in the biosphere? Whereas in fact we see 2-3x as much eukaryotic DNA, depending on how you count—hardly a small niche!

I also found the writing way clearer than usual, which I appreciate; it made the post much easier for me to engage with.

As I understand it, the recent US semiconductor policy updates—e.g., CHIPS Act, export controls—are unusually extreme, which does seem consistent with the hypothesis that they're starting to take some AI-related threats more seriously. But my guess is that they're mostly worried about more mundane/routine impacts on economic and military affairs, etc., rather than about this being the most significant event since the big bang; perhaps naively, I suspect we'd see more obvious signs if they were worried about the latter, a la physics departments clearing out... (read more)

Critch, I agree it’s easy for most people to understand the case for AI being risky. I think the core argument for concern—that it seems plausibly unsafe to build something far smarter than us—is simple and intuitive, and personally, that simple argument in fact motivates a plurality of my concern. That said:

  • I think it often takes weirder, less intuitive arguments to address many common objections—e.g., that this seems unlikely to happen within our lifetimes, that intelligence far superior to ours doesn’t even seem possible, that we’re safe because softwar
... (read more)
Ben Pace (6mo, +8):
This is a candidate for the most surprising sentence in the whole comments section! I'd be interested in knowing more about why you believe this. One sort of thing I'd be quite interested in is things you've seen government ops teams do fast [https://patrickcollison.com/fast] (even if they're small things, accomplishments that would surprise many of us in this thread that they could be done so quickly).
Noosphere89 (6mo, +3):
This is an important optimistic update, because it implies alignment might be quite a bit easier than we think, given that even under unfavorable circumstances, reasonable progress still gets done. I think this isn't an error in rationality; instead, very different goals drive EAs/LWers compared to AI researchers. A low chance of high utility and a high chance of death is pretty rational to take, assuming you only care about yourself, and this is the default, absent additional assumptions. From an altruistic perspective, it's insane to take this risk, especially if you care about the future. Thus, differing goals are at play.

One comment in this thread compares the OP to Philip Morris’ claims to be working toward a “smoke-free future.” I think this analogy is overstated, in that I expect Philip Morris is being more intentionally deceptive than Jacob Hilton here. But I quite liked the comment anyway, because I share the sense that (regardless of Jacob's intention) the OP has an effect much like safetywashing, and I think the exaggerated satire helps make that easier to see.

The OP is framed as addressing common misconceptions about OpenAI, of which it lists five:

  1. Op
... (read more)
William_S (9mo, +2):
(I work at OpenAI). Is the main thing you think has the effect of safetywashing here the claim that the misconceptions are common? Like if the post was "some misconceptions I've encountered about OpenAI" it would mostly not have that effect? (Point 2 was edited to clarify that it wasn't a full account of the Anthropic split.)
Ofer (10mo, +9):
Another bit of evidence about OpenAI that I think is worth mentioning in this context: OPP recommended a grant of $30M [https://www.openphilanthropy.org/grants/openai-general-support/] to OpenAI in a deal that involved OPP's then-CEO becoming a board member of OpenAI. OPP hoped that this would allow them to make OpenAI improve their approach to safety and governance. Later, OpenAI appointed both the CEO's fiancée and the fiancée's sibling to VP positions.
[anonymous] (10mo, +4):
“the presence of which I take the OP to describe as reassuring”

I get the sense from this, and from the rest of your comment here, that you think we should in fact not find this even mildly reassuring. I’m not going to argue with such a claim, because I don’t think such an effort on my part would be very useful to anyone. However, if I’m not completely off base or overstating your position (which I totally could be), could you go into some more detail as to why you think that we shouldn’t find their presence reassuring at all?

Incorrect: OpenAI leadership is dismissive of existential risk from AI.

Why, then, would they continue to build the technology which causes that risk? Why do they consider it morally acceptable to build something which might well end life on Earth?

A common view is that the timelines to risky AI are largely driven by hardware progress and deep learning progress occurring outside of OpenAI. Many people (both at OpenAI and elsewhere) believe that questions of who builds AI and how are very important relative to acceleration of AI timelines. This is related to lower estimates of alignment risk, higher estimates of the importance of geopolitical conflict, and (perhaps most importantly of all) radically lower estimates for the amount of useful alignment progress that would occur this far in advance of AI ... (read more)

Incorrect: OpenAI is not aware of the risks of race dynamics.

I don't think this is a common misconception. I, at least, have never heard anyone claim OpenAI isn't aware of the risk of race dynamics—just that it nonetheless exacerbates them. So I think this section is responding to a far dumber criticism than the one which people actually commonly make.

I don’t expect a discontinuous jump in AI systems’ generality or depth of thought from stumbling upon a deep core of intelligence

I felt surprised reading this, since "ability to automate AI development" feels to me like a central example of a "deep core of intelligence"—i.e., of a cognitive ability which makes attaining many other cognitive abilities far easier. Does it not feel like a central example to you?

PeterMcCluskey (10mo, +4):
I see a difference between finding a new core of intelligence versus gradual improvements to the core(s) of intelligence that we're already familiar with.
Ajeya Cotra (10mo, +9):
I don't see it that way, no. Today's coding models can help automate some parts of the ML researcher workflow a little bit, and I think tomorrow's coding models will automate more and more complex parts, and so on. I think this expansion could be pretty rapid, but I don't think it'll look like "not much going on until something snaps into place."

I could imagine this sort of fix mostly solving the problem for readers, but so far at least I've been most pained by this while voting. The categories "truth-tracking" and "true" don't seem cleanly distinguishable to me—nor do e.g. "this is the sort of thing I want to see on LW" and "I agree"—so now I experience type error-ish aversion and confusion each time I vote.

Ben Pace (1y, +4):
I see. I'd be interested in chatting about your experience with you offline, sometime this week.

I’m worried about this too, especially since I think it’s surprisingly easy here (relative to most fields/goals) to accidentally make the situation even worse. For example, my sense is people often mistakenly conclude that working on capabilities will help with safety somehow, just because an org's leadership pays lip service to safety concerns—even if the org only spends a small fraction of its attention/resources on safety work, actively tries to advance SOTA, etc.

A tongue-in-cheek suggestion for noticing this phenomenon: when you encounter professions of concern about alignment, ask yourself whether it seems like the person making those claims is hoping you’ll react like the marine mammals in this DuPont advertisement, dancing to Beethoven’s “Ode to Joy” about the release of double-hulled oil tankers.

In the early 1900s the Smithsonian Institution published a book each year, which mostly just described their organizational and budget updates. But they each also contained a General Appendix at the end, which seems to have served a function analogous to the modern "Edge" essays—reflections by scientists of the time on key questions of interest. For example, the 1929 book includes essays speculating about what "life" and "light" are, how insects fly, etc.

For what it's worth, I quite dislike this change. Partly because I find it cluttered and confusing, but also because I think audience agreement/disagreement should in fact be a key factor influencing comment rankings.

In the previous system, my voting strategy roughly reflected the product of (how glad I was some comment was written) and (how much I agreed with it). I think this product better approximates my overall sense of how much I want to recommend people read the comment—since all else equal, I do want to recommend comments more insofar as I agree with them more.
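To illustrate the product strategy with made-up numbers (the comment labels and scores below are entirely hypothetical, scored in [0, 1]):

```python
# Toy sketch of the product voting strategy described above:
# rank comments by (glad it was written) * (agreement), rather than
# by either axis alone. All names and scores are invented.
comments = {
    "strong argument I disagree with": (0.9, 0.2),
    "banal but true":                  (0.3, 0.9),
    "insightful and convincing":       (0.8, 0.8),
}

def product_rank(scores):
    # Sort descending by glad * agree.
    return sorted(scores, key=lambda k: scores[k][0] * scores[k][1], reverse=True)

ranking = product_rank(comments)
# The product rewards agreement without letting it dominate: the strong
# argument one disagrees with still outranks nothing here, but a comment
# that is both good and agreed-with comes first.
```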

I would be extremely surprised if karma does not track with agreement votes in the majority of cases. I only expect them to diverge in a narrow range of cases like excellently stated arguments people disagree with,  extremely banal comments that are true but don't really add anything, actual voting, and high social conflict posts. If we can operationalize this prediction I'm interested in a bet.

Vladimir_Nesov (1y, +5):
Even a completely wrong claim occasionally contributes relevant ideas to the discussion. A comment can contain many claims and ideas, and salient wrongness of some of the claims (or subjective opinions not shared by the voter) can easily coexist with correctness/relevance of other statements in the same comment. So upvote/disagree is a natural situation. Downvote/correct corresponds to something true that's trivial/irrelevant/inappropriate/unkind. Being forced to collapse such cases into a single scale is painful, and the resulting ranking is ambiguous to the point of uselessness.

all else equal, I do want to recommend comments more insofar as I agree with them more

It's a fair point. Sometimes the point of a thread is to discuss and explore a topic, and sometimes the point of a thread is to locally answer a question. In the former I want to reward the most surprising and new marginal information over the most obvious info. In the latter I just want to see the answer.

I'll definitely keep my eye out for whether this system breaks some threads, though it seems likely to me that "producing the right answer in a thread about answering a question" will be correctly upvoted in that context.

Partly because I find it cluttered and confusing, but also because I think audience agreement/disagreement should in fact be a key factor influencing comment rankings.

I have a different ontology here. I'd say that "truth-tracking" is pretty different from "true". A comment section with just the audience's main beliefs highly upvoted is different from one where the conversational moves that seem truth-tracking are highly upvoted. The former leans more easily into an echo-chamber than the latter, which better rewards side-ways moves and thoughtful arguments for positions most people disagree with.

It's true some CFAR staff have used psychedelics, and I'm sure they've sometimes mentioned that in private conversation. But CFAR as an institution never advocated psychedelic use, and that wasn't just because it was illegal; it was because (and our mentorship and instructor trainings emphasize this) psychedelics often harm people.

Unreal (2y, +3):
I'd be interested in hearing from someone who was around CFAR in the first few years to double check that the same norm was in place. I wasn't around before 2015. 

I agree manager/staff relations have often been less clear at CFAR than is typical. But I'm skeptical that's relevant here, since as far as I know there aren't really even borderline examples of this happening. The closest example to something like this I can think of is that staff occasionally invite their partners to attend or volunteer at workshops, which I think does pose some risk of fucky power dynamics, albeit dramatically less risk imo than would be posed by "the clear leader of an organization, who's revered by staff as a world-historically import... (read more)

Thrasymachus (2y, +5):
I think CFAR ultimately succeeded in providing a candid and good faith account of what went wrong, but the time it took to get there (i.e. 6 months between this and the initial [https://rationality.org/resources/updates/2018/acdc] update/apology) invites adverse inferences like those in the grandparent.

A lot of the information ultimately disclosed in March would definitely have been known to CFAR in September, such as Brent's prior involvement as a volunteer/contractor for CFAR, his relationships/friendships with current staff, and the events at ESPR. The initial responses remained coy on these points, and seemed apt to give the misleading impression that CFAR's mistakes were (relatively) much milder than they in fact were. I (among many) contacted CFAR leadership to urge them to provide a more candid and complete account when I discovered some of this further information independently.

I also think, similar to how it would be reasonable to doubt 'utmost corporate candour' back then given initial partial disclosure, it's reasonable to doubt CFAR has addressed the shortcomings revealed given the lack of concrete follow-up. I also approached CFAR leadership when CFAR's 2019 Progress Report and Future Plans [https://www.lesswrong.com/posts/vj6CYLuDPw3ieCB4A/cfar-progress-report-and-future-plans-1] initially made no mention of what happened with Brent, nor what CFAR intended to improve in response to it. What was added in is not greatly reassuring: A cynic would note this is 'marking your own homework', but cynicism is unnecessary to recommend more self-scepticism.

I don't doubt the Brent situation indeed inspired a lot of soul searching and substantial, sincere efforts to improve. What is more doubtful (especially given the rest of the morass of comments) is whether these efforts actually worked. Although there is little prospect of satisfying me, more transparency over what exactly has changed - and perhaps third party oversight and review - may better reassure others.

I also feel really frustrated that you wrote this, Anna. I think there are a number of obvious and significant disanalogies between the situations at Leverage versus MIRI/CFAR. There's a lot to say here, but a few examples which seem especially salient:

  • To the best of my knowledge, the leadership of neither MIRI nor CFAR has ever slept with any subordinates, much less many of them.
  • While I think staff at MIRI and CFAR do engage in motivated reasoning sometimes wrt PR, neither org engaged in anything close to the level of obsessive, anti-epistemic reputationa
... (read more)

I endorse Adam's commentary, though I did not feel the frustration Eli and Adam report, possibly because I know Anna well enough that I reflexively did the caveating in my own brain rather than modeling the audience.

Benquo (2y, +7):
This issue doesn't seem particularly important to me, but the comparison you're making is a good example of a more general problem I want to talk about.

My impression is that the decision structure of CFAR was much less legible & transparent than that of Leverage, so that it would be harder to determine who might be treated as subordinate to whom in what context. In addition, my impression from the years I was around is that Leverage didn't preside over as much of an external scene: Leverage followers had formalized roles as members of the organization, while CFAR had a "community," many of whom were workshop alumni. Am I missing something here?

The communication I read from CFAR seemed like it was trying to reveal as little as it could get away with, gradually saying more (and taking a harsher stance towards Brent) in response to public pressure, not like it was trying to help me, a reader, understand what had happened.

Yeah, sorry. I agree that my comment “the OP speaks for me” is leading a lot of people to false views that I should correct. It’s somehow tricky because there’s a different thing I worry will be obscured by my doing this, but I’ll do it anyhow as is correct and try to come back for that different thing later.

To the best of my knowledge, the leadership of neither MIRI nor CFAR has ever slept with a subordinate, much less many of them.

Agreed.

While I think staff at CFAR and MIRI probably engaged in motivated reasoning sometimes wrt PR, neither org eng

... (read more)

I like the local discourse norm of erring on the side of assuming good faith, but like steven0461, in this case I have trouble believing this was misleading by accident. Given how obviously false, or at least seriously misleading, many of these claims are (as I think accurately described by Anna/Duncan/Eli), my lead hypothesis is that this post was written by a former staff member, who was posing as a current staff member to make the critique seem more damning/informed, who had some ax to grind and was willing to engage in deception to get it ground, or something like that...?

It seems misleading in a non-accidental way, but it seems fairly plausible that their main motive was to obscure their identity.

jessicata (2y, +8):
PhoenixFriend edited the comment.

FYI I just interpreted it to mean "former staff member" automatically. (This is biased by my belief that CFAR has very few current staff members so of course it was highly unlikely to be one, but I don't think it was an unreasonably weird reading)

Sure, but they led with "I'm a CFAR employee," which suggests they are a CFAR employee. Is this true?

It sounds like they meant they used to work at CFAR, not that they currently do. 

Also given the very small number of people who work at CFAR currently, it would be very hard for this person to retain anonymity with that qualifier so... 

I think it's safe to assume they were a past employee... but they should probably update their comment to make that clearer because I was also perplexed by their specific phrasing. 

I've worked at CFAR for most of the last 5 years, and this comment strikes me as so wildly incorrect and misleading that I have trouble believing it was in fact written by a current CFAR employee. Would you be willing to verify your identity with some mutually-trusted 3rd party, who can confirm your report here? Ben Pace has offered to do this for people in the past.

I don't know if you trust me, but I confirmed privately that this person is a past or present CFAR employee.

It looks to me like one can buy this Lyme vaccine online without a prescription.

Are you tempted to drop or reduce the size of this trade in light of the UK seeming to have (roughly speaking, for now at least) contained B.1.1.7?

Yeah, makes sense. Fwiw, I have encountered one purportedly 97+ CRI lamp that looked awful to me. 

I really appreciate you writing this!

Just wanted to add that my informal impression from a few experiments is that the difference between 90 CRI and 95+ CRI is actually large. 

Thanks!

Sounds about right for CRI. I think there are a couple things going on with it:

  1. CRI is a mediocre measure to begin with, as far as the subjective quality of the light is concerned
  2. As far as I know, there's no oversight, third party measurement, etc

I'm not sure how much of it is bad measurement and how much of it is CRI being a poor metric, but the best 85 CRI bulbs I've seen are substantially better than the worst 90 CRI bulbs, which is why I'm hesitant to tell people to rule out 85 CRI bulbs entirely. I've not encountered any 95 CRI bulbs that are bad, so maybe the better advice is just to go for 95+ CRI whenever possible.

Another (unlikely, but more likely than almost all other ancient people) candidate for partial future revival: During the 79 AD eruption of Vesuvius, part of this man's brain was vitrified.

Your posts about the neocortex have been a plurality of the posts I've been most excited to read this year. I'm super interested in the questions you're asking, and it drives me nuts that they're not asked more in the neuroscience literature.

But there's an aspect of these posts I've found frustrating, which is something like the ratio of "listing candidate answers" to "explaining why you think those candidate answers are promising, relative to nearby alternatives."

Interestingly, I also have this gripe when reading Friston and Hawkins. And I feel like I als... (read more)

Steven Byrnes (3y, +4):
(Oops, I just noticed that I had missed one of your questions in my earlier responses.)

I don't think there's anything to Bayesian priors beyond the general "society of compositional generative models" framework. For example, we have a prior that if someone runs towards a bird, it will fly away. There's a corresponding generative model: in that model, first there's a person running towards a bird, and then the bird is flying away. All of us have that generative model prominently in our brains, having seen it happen a bunch of times in the past. So when we see a person running towards a bird, that generative model gets activated, and it then sends a prediction that the bird is about to fly away. (Right? Or sorry if I'm misunderstanding your question.)

(Not sure what you saw about dopamine distributions. I think everyone agrees that dopamine distributions are relevant to reward prediction, which I guess is a special case of a prior. I didn't think it was relevant for non-reward-related priors, like the above prior about bird behavior, but I don't really know; I'm pretty hazy on my neurotransmitters, and each neurotransmitter seems to do lots of unrelated things.)

Your posts about the neocortex have been a plurality of the posts I've been most excited reading this year.

Thanks so much, that really means a lot!!

...ratio of "listing candidate answers" to "explaining why you think those candidate answers are promising, relative to nearby alternatives."

I agree with "theories/frameworks relatively scarce". I don't feel like I have multiple gears-level models of how the brain might work, and I'm trying to figure out which one is right. I feel like I have zero, and I'm trying to grope my way towards one. It's almost more li... (read more)

Have you thought much about whether there are parts of this research you shouldn't publish?

Yeah, sure. I have some ideas about the gory details of the neocortical algorithm that I haven't seen in the literature. They might or might not be correct and novel, but at any rate, I'm not planning to post them, and I don't particularly care to pursue them, under the circumstances, for the reasons you mention.

Also, there was one post that I sent for feedback to a couple people in the community before posting, out of an abundance of caution. Neither person saw it a... (read more)

This post primarily argues that a phenomenon is evidence for [learned models being likely to encode search algorithms]

I do mention interpreting the described results as tentative evidence for mesa-optimization, and this interpretation was why I wrote the post; my impression is still that this interpretation was basically correct. But most of the post is just quotes or paraphrased claims made by DeepMind researchers, rather than my own claims, since I didn't feel sure enough to make the claims myself.

I feel confused about why, on this model, the researchers were surprised that this occurred, and seem to think it was a novel finding that it will inevitably occur given the three conditions described. Above, you mentioned the hypothesis that maybe they just weren't very familiar with AI. But looking at the author list, and their publications (e.g. 1, 2, 3, 4, 5, 6, 7, 8), this seems implausible to me. Most of the co-authors are neuroscientists by training, but a few have CS degrees, and all but one have co-authored previous ML papers. It's hard for me to i... (read more)

Rohin Shah (3y, +8):
I think that's a fine characterization (and I said so in the grandparent comment? Looking back, I said I agreed with the claim that learning is happening via neural net activations, which I guess doesn't necessarily imply that I think it's a fine characterization). I think my original comment didn't do a great job of phrasing my objection.

My actual critique is that the community as a whole seems to be updating strongly on data-that-has-high-probability-if-you-know-basic-RL. That was one of three possible explanations; I don't have a strong view on which explanation is the primary cause (if any of them are). It's more like "I observe clearly-to-me irrational behavior, this seems bad, even if I don't know what's causing it". If I had to guess, I'd guess that the explanation is a combination of readers not bothering to check details and those who are checking details not knowing enough to point out that this is expected.

Indeed, I am also confused by this, as I noted in the original comment: I have a couple of hypotheses, none of which seem particularly likely given that the authors are familiar with AI, so I just won't speculate. I agree this is evidence against my claim that this would be obvious to RL researchers.

Again, I don't object to the description of this as learning a learning algorithm. I object to updating strongly on this. Note that the paper does not claim their results are surprising -- it is written in a style of "we figured out how to make this approach work". (The DeepMind paper does claim that the results are novel / surprising, but it is targeted at a neuroscience audience, to whom the results may indeed be surprising.)

On the search panpsychist view, my position is that if you use deep RL to train an AGI policy, it is definitionally a mesa optimizer. (Like, anything that is "generally intelligent" has the ability to learn quickly, which on the search panpsychist view means that it is a mesa optimizer.) So in this world, "likelihood of mesa

I appreciate you writing this, Rohin. I don’t work in ML, or do safety research, and it’s certainly possible I misunderstand how this meta-RL architecture works, or that I misunderstand what’s normal.

That said, I feel confused by a number of your arguments, so I'm working on a reply. Before I post it, I'd be grateful if you could help me make sure I understand your objections, so as to avoid accidentally publishing a long post in response to a position nobody holds.

I currently understand you to be making four main claims:

  1. The system is just doing the total
... (read more)
I appreciate you writing this, Rohin. I don’t work in ML, or do safety research, and it’s certainly possible I misunderstand how this meta-RL architecture works, or that I misunderstand what’s normal.

Thanks. I know I came off pretty confrontational, sorry about that. I didn't mean to target you specifically; I really do see this as bad at the community level but fine at the individual level.

I don't think you've exactly captured what I meant, some comments below.

The system is just doing the totally normal thing “co
... (read more)

The scenario I had in mind was one where death occurs as a result of damage caused by low food consumption, rather than by suicide.

One way catastrophic alignment in this sense is difficult for humans is that the PFC cannot divorce itself from the DA; I'd expect that a failure mode leading to systematically low DA rewards would usually be corrected

I'm not sure divorce like this is rare. For example, anorexia sometimes causes people to find food anti-rewarding (repulsive/inedible, even when they're dying and don't want to be), and I can imagine that being because PFC actually somehow alters DA's reward function.

But I do share the hunch that something like a "divorce resistance" trick occurs a... (read more)

I think it makes more sense to operationalize "catastrophic" here as "leading to systematically low DA reward"

Thanks—I do think this operationalization makes more sense than the one I proposed.

Adam Scholl (3y, +3):
I'm not sure divorce like this is rare. For example, anorexia sometimes causes people to find food anti-rewarding (repulsive/inedible, even when they're dying and don't want to be), and I can imagine that being because PFC actually somehow alters DA's reward function.

But I do share the hunch that something like a "divorce resistance" trick occurs and is helpful. I took Kaj and Steve to be gesturing at something similar elsewhere in the thread. But I notice feeling confused about how exactly this trick might work. Does it scale...? I have the intuition that it doesn't—that as the systems increase in power, divorce occurs more easily. That is, I have the intuition that if PFC were trying, so to speak, to divorce itself from DA supervision, it could probably find some easy-ish way to succeed, e.g. by reconfiguring itself to hide activity from DA, or to send reward-eliciting signals to DA regardless of what goal it was pursuing.

Kaj, the point I understand you to be making is: "The inner RL algorithm in this scenario is probably reliably aligned with the outer RL algorithm, since the former was selected specifically on the basis of it being good at accomplishing the latter's objective, and since if the former deviates from pursuing that objective it will receive less reward from the outer, causing it to reconfigure itself to be better aligned. And since the two algorithms operate on similar time scales, we should expect any such misalignment to be noticed/corrected quickly." Does ... (read more)

Kaj_Sotala (3y, +3):
That seems like a reasonable paraphrase, at least if you include the qualification that the "quickly" is relative to the amount of structure that the inner layer has accumulated, so might not actually happen quickly enough to be useful in all cases. Sure, e.g. lots of exotic sexual fetishes look like that to me. Hmm, though actually that example makes me rethink the argument that you just paraphrased, given that those generally emerge early in an individual's life and then generally don't get "corrected".

Ah, I see. The high death rate was what made it seem often-catastrophic to me. Is your objection that the high death rate doesn't reflect something that might reasonably be described as "optimizing for one goal at the expense of all others"? E.g., because many of the deaths are suicides, in which case persistence may have been net negative from the perspective of the rest of their goals too? Or because deaths often result from people calibratedly taking risky but non-insane actions, who just happened to get unlucky with heart muscle integrity or whatever?

Douglas_Knight (3y, +1):
I asked you if you were talking about starving to death and you didn't answer. Does your abstract claim correspond to a concrete claim, or do you just observe that anorexics seem to have a goal and assume that everything must flow from that and the details don't matter? That's a perfectly reasonable claim, but it's a weak claim so I'd like to know if that's what you mean. Abrupt suicides by anorexics are just as mysterious as suicides by schizophrenics and don't seem to flow from the apparent goal of thinness. Suicide is a good example of something, but I don't think it's useful to attach it to anorexia rather than schizophrenia or bipolar. Long-term health damage would be a reasonable claim, which I tried to concede in my original comment. I'm not sure I agree with it. I could pose a lot of complaints about it, but I wouldn't. If it's clear that it is the claim, then I think it's clearly a weak claim and that's OK. (As for the objection you propose, I would rather say: lots of people take badly calibrated risks without being labeled insane.)

Yeah, I wrote that confusingly, sorry; edited to clarify. I just meant that of the limited set of candidate examples I'd considered, my model of anorexia, which of course may well be wrong, feels most straightforwardly like an example of something capable of causing catastrophic within-brain inner alignment failure. That is, it currently feels natural to me to model anorexia as being caused by an optimizer for thinness arising in brains, which can sometimes gain sufficient power that people begin to optimize for that goal at the expense of essentially all other goals. But I don't feel confident in this model.

2Douglas_Knight3y
I'm objecting to the claim that it fits your criterion of "catastrophic." Maybe it's such a clear example, with such a clear goal, that we should sacrifice the criterion of catastrophic, but you keep using that word.

I agree, in the case of evolution/humans. I meant to highlight what seemed to me like a relative lack of catastrophic within-mind inner alignment failures, e.g. due to conflicts between PFC and DA. Death of the organism feels to me like one reasonable way to operationalize "catastrophic" in these cases, but I can imagine other reasonable ways.

9abramdemski3y
I think it makes more sense to operationalize "catastrophic" here as "leading to systematically low DA reward", perhaps also including "manipulating the DA system in a clearly misaligned way". One way catastrophic alignment in this sense is difficult for humans is that the PFC cannot divorce itself from the DA; I'd expect that a failure mode leading to systematically low DA rewards would usually be corrected gradually, as the DA punishes those patterns. However, this is not really clear. The misaligned PFC might e.g. put itself in a local maximum, where it creates DA punishment for giving into temptation. (For example, an ascetic getting social reinforcement from a group of ascetics might be in such a situation.)

As I understand it, your point about the distinction between "mesa" and "steered" is chiefly that in the latter case, the inner layer is continually receiving reward signal from the outer layer, which in effect heavily restricts the space of possible algorithms the outer layer might give rise to. Does that seem like a decent paraphrase?

One of the aspects of Wang et al.'s paper that most interested me was that the inner layer in their meta-RL model kept learning even once reward signal from the outer layer had ceased. It feels plausible to me that the relat... (read more)
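That two-timescale picture (a slow outer learner shaping a fast inner learner, where the inner learner can keep adapting even after outer updates stop) can be sketched as a toy example. Everything here is an illustrative assumption, not Wang et al.'s actual architecture: the tracking task, the finite-difference tuning, and all names are hypothetical.

```python
import random

# Toy sketch (hypothetical; not Wang et al.'s model): an "outer" learner
# slowly tunes the inner learner's step size across episodes, while the
# "inner" learner adapts within each episode. After the outer updates are
# frozen, the inner learner still adapts to a shifted target.

def run_episode(step_size, target, n_steps=200):
    """Inner loop: track `target` with a simple running estimate."""
    estimate = 0.0
    for _ in range(n_steps):
        observation = target + random.gauss(0, 0.1)  # noisy feedback signal
        estimate += step_size * (observation - estimate)
    return abs(estimate - target)  # final tracking error

random.seed(0)
step_size = 0.5
# Outer loop: crude finite-difference tuning of the inner step size.
for _ in range(30):
    err_hi = run_episode(min(step_size * 1.1, 1.0), target=1.0)
    err_lo = run_episode(step_size * 0.9, target=1.0)
    step_size = min(step_size * 1.1, 1.0) if err_hi < err_lo else step_size * 0.9

# "Freeze" the outer learner: step_size no longer changes, yet the inner
# learner still adapts when the target shifts to a new value.
error_after_shift = run_episode(step_size, target=-2.0)
print(round(error_after_shift, 3))
```

Freezing `step_size` here plays the role of cutting off the outer training signal; the inner running-estimate update nonetheless keeps adapting to the shifted target.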

6Steven Byrnes3y
Yeah, that's part of it, but also I tend to be a bit skeptical that a performance-competitive optimizer will spontaneously develop, as opposed to being programmed—just as AlphaGo does MCTS because DeepMind programmed it to do MCTS, not because it was running a generic RNN that discovered MCTS. See also this [https://www.lesswrong.com/posts/SkcM4hwgH3AP6iqjs/can-you-get-agi-from-a-transformer].

Right now I'm kinda close to "More-or-less every thought I think has higher DA-related reward prediction than other potential thoughts I could have thought." But it's a vanishing fraction of cases where there is "ground truth" for that reward prediction that comes from outside of the neocortex. There is "ground truth" for things like pain and fear-of-heights, but not for thinking to yourself "hey, that's a clever turn of phrase" when you're writing. (The neocortex is the only place that understands language, in this example.)

Ultimately I think everything has to come from subcortex-provided "ground truth" on what is or isn't rewarding, but the neocortex can get the idea that Concept X is an appropriate proxy / instrumental goal associated with some subcortex-provided reward, and then it goes and labels Concept X as inherently desirable, and searches for actions / thoughts that will activate Concept X. There's still usually some sporadic "ground truth", e.g. you have an innate desire for social approval and I think the subcortex has ways to figure out when you do or don't get social approval, so if your "clever turns of phrase" never impress anyone, you might eventually stop trying to come up with them. But if you're a hermit writing a book, the neocortex might be spinning for years treating "come up with clever turns of phrase" as an important goal, without any external subcortex-provided information to ground that goal. See here [https://www.lesswrong.com/posts/DWFx2Cmsvd4uCKkZ4/inner-alignment-in-the-brain] for more on this, if you're not sick of my endless self-citation.

It could both be the case that there exists catastrophic inner alignment failure between humans and evolution, and also that humans don't regularly experience catastrophic inner alignment failures internally.

In practice I do suspect humans regularly experience internal (within-brain) inner alignment failures, but given that suspicion I feel surprised by how functional humans manage to be. That is, I notice expecting that regular inner alignment failures would cause far more mayhem than I observe, which makes me wonder whether brains are implementing some sort of alignment-relevant tech.

I don't know why you expect an inner alignment failure to look dysfunctional. Instrumental convergence suggests that it would look functional. What the world looks like if there... (read more)

3[anonymous]3y
What would inner alignment failures even look like? Overdosing on meth sure makes the dopamine system happy. Perhaps human values reside in the prefrontal cortex, and all of humanity is a catastrophic alignment failure of the dopamine system (except a small minority of drug addicts) on top of being a catastrophic alignment failure of natural selection.

The thing I meant by "catastrophic" is just "leading to death of the organism." I suspect mesa-optimization is common in humans, but I don't feel confident about this, nor that this is a joint-carvey ontology. I can imagine it being the case that many examples of e.g. addiction, goodharting, OCD, and even just "everyday personal misalignment"-type problems of the sort IFS/IDC/multi-agent models of mind sometimes help with, are caused by phenomena which might reasonably be described as inner alignment failures.

But I think these things don't kill people very... (read more)

3Douglas_Knight3y
Why do you single out anorexia? Do you mean people starving themselves to death? My understanding is that is very rare. Anorexics have a high death rate and some of that is long-term damage from starvation. They also (abruptly) kill themselves at a high rate, comparable to schizophrenics, but why single that out? There's a theory that they have practice with internal conflict, which does seem relevant, but I think that's just a theory, not clear cut at all.
2Raemon3y
This doesn't seem like what it should mean here. I'd think catastrophic in the context of "how humans (programmed by evolution) might fail by evolution's standards" should mean "start pursuing strategies that don't result in many children or longterm population success." (where premature death of the organism might be one way to cause that, but not the only way)
2Kaj_Sotala3y
As I understand it, Wang et al. found that their experimental setup trained an internal RL algorithm that was more specialized for this particular task, but was still optimizing for the same task that the RNN was being trained on? And it was selected exactly because it did that very goal better. If the circumstances changed so that the more specialized behavior was no longer appropriate, then (assuming the RNN's weights hadn't been frozen) the feedback to the outer network would gradually end up reconfiguring the internal algorithm as well. So I'm not sure how it even could end up with something that's "unrecognizably different" from the base objective - even after a distributional shift, the learned objective would probably still be recognizable as a special case of the base objective, until it updated to match the new situation.

The thing that I would expect to see from this description, is that humans who were e.g. practicing a particular skill might end up becoming overspecialized to the circumstances around that skill, and need to occasionally relearn things to fit a new environment. And that certainly does seem to happen. Likewise for more general/abstract skills, like "knowing how to navigate your culture/technological environment", where older people's strategies are often more adapted to how society used to be rather than how it is now - but still aren't incapable of updating.

Catastrophic misalignment seems more likely to happen in the case of something like evolution, where the two learning algorithms operate on vastly different timescales, and it takes a very long time for evolution to correct after a drastic distributional shift. But the examples in Wang et al. lead me to think that in the brain, even the slower process operates on a timescale that's on the order of days rather than years, allowing for reasonably rapid adjustments in response to distributional shifts. (Though it's plausible that the more structure there is in a need of readjustment, t
6orthonormal3y
The claim that came to my mind is that the conscious mind is the mesa-optimizer here, the original outer optimizer being a riderless elephant.

Governments and corporations experience inner alignment failures all the time, but because of convergent instrumental goals, they are rarely catastrophic. For example, Russia underwent a revolution and a civil war on the inside, followed by purges and coups etc., but from the perspective of other nations, it was more or less still the same sort of thing: A nation, trying to expand its international influence, resist incursions, and conquer more territory. Even its alliances were based as much on expediency as on shared ideology.

Perhaps something similar happens with humans.

For similar reasons, I allocate a small portion of my portfolio to assets (including Nvidia) that might appreciate rapidly during slow takeoff, on the thinking that in some slow takeoff scenarios the extra resources might prove helpful. My main reservation is Paul Christiano's argument that investment/divestment has more-than-symbolic effects.

1sairjy17d
Seems it was a good call. 

I made Twitter lists of DeepMind and OpenAI researchers, and find them useful for tracking team zeitgeists.

I found LinkedIn's background breakdown of DeepMind employees interesting; fewer neuroscience backgrounds than I would have expected.

I found this post super interesting, and appreciate you writing it. I share the suspicion/hope that gaining better understanding of brains might yield safety-relevant insights.

I’m curious what you think is going on here that seems relevant to inner alignment. Is it that you’re modeling neocortical processes (e.g. face recognizers in visual cortex) as arising from something akin to search processes conducted by similar subcortical processes (e.g. face recognizers in superior colliculus), and noting that there doesn’t seem to be much divergence between their objective functions, perhaps because of helpful features of subcortex-supervised learning like e.g. these subcortical input-dependent dynamic rewiring rules?

7Steven Byrnes3y
FYI, I now have a whole post elaborating on "inner alignment": mesa-optimizers vs steered optimizers [https://www.lesswrong.com/posts/SJXujr5a2NcoFebr4/mesa-optimizers-vs-steered-optimizers]
5Steven Byrnes3y
Thanks! Hmm, I guess I didn't go into detail on that. Here's what I'm thinking. For starters, what is inner alignment anyway? Maybe I'm abusing the term, but I think of two somewhat different scenarios.

* In a general RL setting, one might say that outer alignment is alignment between what we want and the reward function, and inner alignment is alignment between the reward function and "what the system is trying to do". (This one is closest to how I was implicitly using the term in this post.)
* In the "risks from learned optimization" paper, it's a bit different: the whole system (perhaps an RL agent and its reward function, or perhaps something else entirely) is conceptually bundled together into a single entity, and you do a black-box search for the most effective "entity". In this case, outer alignment is alignment between what we want and the search criterion, and inner alignment is alignment between the search criterion and "what the system is trying to do". (This is not really what I had in mind in this post, although it's possible that this sort of inner alignment could also come up, if we design the system by doing an outer search, analogous to evolution.)

Note that neither of these kinds of "inner alignment" really comes up in existing mainstream ML systems. In the former (RL) case, if you think of an RL agent like AlphaStar, I'd say there isn't a coherent notion of "what the system is trying to do", at least in the sense that AlphaStar does not do foresighted planning towards a goal. Or take AlphaGo, which does have foresighted planning because of the tree search; but here we program the tree search by hand ourselves, so there's no risk that the foresighted planning is working towards any goal except the one that we coded ourselves, I think. So, "RL systems that do foresighted planning towards explicit goals which it invents itself" are not much of a thing these days (as far as I know), but they presumably will
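The "black-box search over entities" framing can be illustrated with a toy sketch. Everything here is a hypothetical illustration (the grid search, the score function, and all names are invented for the example, not any real system): an outer search over candidate policies ends up selecting a candidate that is itself an optimizer.

```python
# Toy sketch (hypothetical): an outer black-box search over candidate
# "entities" keeps whichever scores best on its criterion; the winner
# may itself be an optimizer, i.e. a mesa-optimizer in this framing.

def environment(x):
    """Outer search criterion: score of an action x, peaked at 0.7."""
    return -(x - 0.7) ** 2

def fixed_policy(env):
    """An entity that is not an optimizer: always acts the same way."""
    return 0.3

def inner_optimizer(env):
    """An entity that is itself an optimizer: searches a grid of actions."""
    return max((i / 100 for i in range(101)), key=env)

candidates = [fixed_policy, inner_optimizer]
# Outer search: keep whichever entity's chosen action scores best.
winner = max(candidates, key=lambda entity: environment(entity(environment)))
print(winner.__name__)  # → inner_optimizer
```

Because searching beats any fixed response on this criterion, the outer search selects the entity that itself searches; whether that inner search criterion stays aligned with the outer one is exactly the inner alignment question in this framing.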

I wouldn't describe any posts I've seen as conveying the idea sufficiently well for my taste, but would describe some—like this NY Times piece—as adequately conveying the most decision-relevant points.

When I started writing, there was almost no discussion online (aside from Wei Dai's comment here, and the posts it links to) about what factors might prove limiting for the provision of hospital care, or about the degree to which those limits might be exceeded. By the time I called off the project, the US President and ~every major newspaper were talking abou... (read more)

Update: We decided not to finish this post, since the points we wished to convey have now mostly been covered well elsewhere; Kyle may still write up his notes about the epidemiological parameters at some point.

7habryka3y
Alas. Could you briefly link to the other places that have conveyed the ideas sufficiently well for your tastes? 

I'm currently working with Kyle Scott and Anna Salamon on an estimate of deaths due to hospital overflow (lack of access to oxygen, mechanical ventilation, ICU beds), which we'll hopefully post in the next few days. The post will review evidence about basic epidemiological parameters.

4habryka3y
Great, looking forward to the post!