I agree we don't really understand anything in LLMs at this level of detail, but I liked Jan highlighting this confusion anyway, since I think it's useful to promote particular weird behaviors to attention. I would be quite thrilled if more people got nerd sniped on trying to explain such things!
John, it seems totally plausible to me that these examples do just reflect something like “hallucination,” in the sense you describe. But I feel nervous about assuming that! I know of no principled way to distinguish “hallucination” from more goal-oriented thinking or planning, and my impression is that nobody else does either.
I think it’s generally unwise to assume LLM output reflects its internal computation in a naively comprehensible way; it usually doesn’t, so I think it’s a sane prior to suspect it doesn't here, either. But at our current level of un...
I also found the writing way clearer than usual, which I appreciate - it made the post much easier for me to engage with.
As I understand it, the recent US semiconductor policy updates—e.g., CHIPS Act, export controls—are unusually extreme, which does seem consistent with the hypothesis that they're starting to take some AI-related threats more seriously. But my guess is that they're mostly worried about more mundane/routine impacts on economic and military affairs, etc., rather than about this being the most significant event since the big bang; perhaps naively, I suspect we'd see more obvious signs if they were worried about the latter, a la physics departments clearing out...
Critch, I agree it’s easy for most people to understand the case for AI being risky. I think the core argument for concern—that it seems plausibly unsafe to build something far smarter than us—is simple and intuitive, and personally, that simple argument in fact motivates a plurality of my concern. That said:
One comment in this thread compares the OP to Philip Morris’ claims to be working toward a “smoke-free future.” I think this analogy is overstated, in that I expect Philip Morris is being more intentionally deceptive than Jacob Hilton here. But I quite liked the comment anyway, because I share the sense that (regardless of Jacob's intention) the OP has an effect much like safetywashing, and I think the exaggerated satire helps make that easier to see.
The OP is framed as addressing common misconceptions about OpenAI, of which it lists five:
Incorrect: OpenAI leadership is dismissive of existential risk from AI.
Why, then, would they continue to build the technology which causes that risk? Why do they consider it morally acceptable to build something which might well end life on Earth?
A common view is that the timelines to risky AI are largely driven by hardware progress and deep learning progress occurring outside of OpenAI. Many people (both at OpenAI and elsewhere) believe that questions of who builds AI and how are very important relative to acceleration of AI timelines. This is related to lower estimates of alignment risk, higher estimates of the importance of geopolitical conflict, and (perhaps most importantly of all) radically lower estimates for the amount of useful alignment progress that would occur this far in advance of AI ...
Incorrect: OpenAI is not aware of the risks of race dynamics.
I don't think this is a common misconception. I, at least, have never heard anyone claim OpenAI isn't aware of the risk of race dynamics—just that it nonetheless exacerbates them. So I think this section is responding to a far dumber criticism than the one which people actually commonly make.
I don’t expect a discontinuous jump in AI systems’ generality or depth of thought from stumbling upon a deep core of intelligence
I felt surprised reading this, since "ability to automate AI development" feels to me like a central example of a "deep core of intelligence"—i.e., of a cognitive ability which makes attaining many other cognitive abilities far easier. Does it not feel like a central example to you?
I could imagine this sort of fix mostly solving the problem for readers, but so far at least I've been most pained by this while voting. The categories "truth-tracking" and "true" don't seem cleanly distinguishable to me—nor do e.g. "this is the sort of thing I want to see on LW" and "I agree"—so now I experience type error-ish aversion and confusion each time I vote.
I’m worried about this too, especially since I think it’s surprisingly easy here (relative to most fields/goals) to accidentally make the situation even worse. For example, my sense is people often mistakenly conclude that working on capabilities will help with safety somehow, just because an org's leadership pays lip service to safety concerns—even if the org only spends a small fraction of its attention/resources on safety work, actively tries to advance SOTA, etc.
A tongue-in-cheek suggestion for noticing this phenomenon: when you encounter professions of concern about alignment, ask yourself whether it seems like the person making those claims is hoping you’ll react like the marine mammals in this DuPont advertisement, dancing to Beethoven’s “Ode to Joy” about the release of double-hulled oil tankers.
In the early 1900s the Smithsonian Institution published a book each year, which mostly just described their organizational and budget updates. But they each also contained a General Appendix at the end, which seems to have served a function analogous to the modern "Edge" essays—reflections by scientists of the time on key questions of interest. For example, the 1929 book includes essays speculating about what "life" and "light" are, how insects fly, etc.
For what it's worth, I quite dislike this change. Partly because I find it cluttered and confusing, but also because I think audience agreement/disagreement should in fact be a key factor influencing comment rankings.
In the previous system, my voting strategy roughly reflected the product of (how glad I was some comment was written) and (how much I agreed with it). I think this product better approximates my overall sense of how much I want to recommend people read the comment—since all else equal, I do want to recommend comments more insofar as I agree with them more.
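As an illustrative sketch (the function name, weights, and [0, 1] scales are mine, not anything the site actually implements), the product heuristic might look like:

```python
def recommend_score(gladness: float, agreement: float) -> float:
    """Toy model of the product voting heuristic described above.

    Both factors are taken to range over [0, 1], so a comment scores
    highly only when it is both glad-it-was-written AND agreed with;
    either factor near zero drags the whole score toward zero.
    """
    return gladness * agreement

# A well-made comment I strongly agree with outranks an equally
# well-made comment I mostly disagree with...
assert recommend_score(0.9, 0.9) > recommend_score(0.9, 0.2)
# ...and a comment scoring zero on either factor gets no recommendation.
assert recommend_score(1.0, 0.0) == 0.0
```

The design point of the product (versus, say, a sum) is that it won't recommend a comment that completely fails on either dimension, which matches the intuition that agreement should multiply, not merely add to, overall quality.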
I would be extremely surprised if karma does not track with agreement votes in the majority of cases. I only expect them to diverge in a narrow range of cases like excellently stated arguments people disagree with, extremely banal comments that are true but don't really add anything, actual voting, and high social conflict posts. If we can operationalize this prediction, I'm interested in a bet.
all else equal, I do want to recommend comments more insofar as I agree with them more
It's a fair point. Sometimes the point of a thread is to discuss and explore a topic, and sometimes the point of a thread is to locally answer a question. In the former I want to reward the most surprising and new marginal information over the most obvious info. In the latter I just want to see the answer.
I'll definitely keep my eye out for whether this system breaks some threads, though it seems likely to me that "producing the right answer in a thread about answering a question" will be correctly upvoted in that context.
Partly because I find it cluttered and confusing, but also because I think audience agreement/disagreement should in fact be a key factor influencing comment rankings.
I have a different ontology here. I'd say that "truth-tracking" is pretty different from "true". A comment section with just the audience's main beliefs highly upvoted is different from one where the conversational moves that seem truth-tracking are highly upvoted. The former leans more easily into an echo-chamber than the latter, which better rewards side-ways moves and thoughtful arguments for positions most people disagree with.
It's true some CFAR staff have used psychedelics, and I'm sure they've sometimes mentioned that in private conversation. But CFAR as an institution never advocated psychedelic use, and that wasn't just because it was illegal; it was because (and our mentorship and instructor trainings emphasize this) psychedelics often harm people.
I agree manager/staff relations have often been less clear at CFAR than is typical. But I'm skeptical that's relevant here, since as far as I know there aren't really even borderline examples of this happening. The closest example to something like this I can think of is that staff occasionally invite their partners to attend or volunteer at workshops, which I think does pose some risk of fucky power dynamics, albeit dramatically less risk imo than would be posed by "the clear leader of an organization, who's revered by staff as a world-historically import...
I also feel really frustrated that you wrote this, Anna. I think there are a number of obvious and significant disanalogies between the situations at Leverage versus MIRI/CFAR. There's a lot to say here, but a few examples which seem especially salient:
I endorse Adam's commentary, though I did not feel the frustration Eli and Adam report, possibly because I know Anna well enough that I reflexively did the caveating in my own brain rather than modeling the audience.
Yeah, sorry. I agree that my comment “the OP speaks for me” is leading a lot of people to false views that I should correct. It’s somehow tricky because there’s a different thing I worry will be obscured by my doing this, but I’ll do it anyhow as is correct and try to come back for that different thing later.
To the best of my knowledge, the leadership of neither MIRI nor CFAR has ever slept with a subordinate, much less many of them.
Agreed.
...While I think staff at CFAR and MIRI probably engaged in motivated reasoning sometimes wrt PR, neither org eng
I like the local discourse norm of erring on the side of assuming good faith, but like steven0461, in this case I have trouble believing this was misleading by accident. Given how obviously false, or at least seriously misleading, many of these claims are (as I think accurately described by Anna/Duncan/Eli), my lead hypothesis is that this post was written by a former staff member, who was posing as a current staff member to make the critique seem more damning/informed, who had some ax to grind and was willing to engage in deception to get it ground, or something like that...?
It seems misleading in a non-accidental way, but it seems fairly plausible that their main motive was to obscure their identity.
FYI I just interpreted it to mean "former staff member" automatically. (This is biased by my belief that CFAR has very few current staff members so of course it was highly unlikely to be one, but I don't think it was an unreasonably weird reading)
Sure, but they led with "I'm a CFAR employee," which suggests they are a CFAR employee. Is this true?
It sounds like they meant they used to work at CFAR, not that they currently do.
Also given the very small number of people who work at CFAR currently, it would be very hard for this person to retain anonymity with that qualifier so...
I think it's safe to assume they were a past employee... but they should probably update their comment to make that clearer because I was also perplexed by their specific phrasing.
I've worked at CFAR for most of the last 5 years, and this comment strikes me as so wildly incorrect and misleading that I have trouble believing it was in fact written by a current CFAR employee. Would you be willing to verify your identity with some mutually-trusted 3rd party, who can confirm your report here? Ben Pace has offered to do this for people in the past.
I don't know if you trust me, but I confirmed privately that this person is a past or present CFAR employee.
Are you tempted to drop or reduce the size of this trade in light of the UK seeming to have (roughly speaking, for now at least) contained B.1.1.7?
Yeah, makes sense. Fwiw, I have encountered one purportedly 97+ CRI lamp that looked awful to me.
I really appreciate you writing this!
Just wanted to add that my informal impression from a few experiments is that the difference between 90 CRI and 95+ CRI is actually large.
Thanks!
Sounds about right for CRI. I think there are a couple things going on with it:
I'm not sure how much of it is bad measurement and how much of it is CRI being a poor metric, but the best 85 CRI bulbs I've seen are substantially better than the worst 90 CRI bulbs, which is why I'm hesitant to tell people to rule out 85 CRI bulbs entirely. I've not encountered any 95 CRI bulbs that are bad, so maybe the better advice is just to go for 95+ CRI whenever possible.
Another (unlikely, but more likely than almost all other ancient people) candidate for partial future revival: During the 79 AD eruption of Vesuvius, part of this man's brain was vitrified.
Your posts about the neocortex have been a plurality of the posts I've been most excited to read this year. I'm super interested in the questions you're asking, and it drives me nuts that they're not asked more in the neuroscience literature.
But there's an aspect of these posts I've found frustrating, which is something like the ratio of "listing candidate answers" to "explaining why you think those candidate answers are promising, relative to nearby alternatives."
Interestingly, I also have this gripe when reading Friston and Hawkins. And I feel like I als...
Your posts about the neocortex have been a plurality of the posts I've been most excited to read this year.
Thanks so much, that really means a lot!!
...ratio of "listing candidate answers" to "explaining why you think those candidate answers are promising, relative to nearby alternatives."
I agree with "theories/frameworks relatively scarce". I don't feel like I have multiple gears-level models of how the brain might work, and I'm trying to figure out which one is right. I feel like I have zero, and I'm trying to grope my way towards one. It's almost more li...
Have you thought much about whether there are parts of this research you shouldn't publish?
Yeah, sure. I have some ideas about the gory details of the neocortical algorithm that I haven't seen in the literature. They might or might not be correct and novel, but at any rate, I'm not planning to post them, and I don't particularly care to pursue them, under the circumstances, for the reasons you mention.
Also, there was one post that I sent for feedback to a couple people in the community before posting, out of an abundance of caution. Neither person saw it a...
This post primarily argues that a phenomenon is evidence for [learned models being likely to encode search algorithms]
I do mention interpreting the described results as tentative evidence for mesa-optimization, and this interpretation was why I wrote the post; my impression is still that this interpretation was basically correct. But most of the post is just quotes or paraphrased claims made by DeepMind researchers, rather than my own claims, since I didn't feel sure enough to make the claims myself.
I feel confused about why, on this model, the researchers were surprised that this occurred, and seem to think it was a novel finding that it will inevitably occur given the three conditions described. Above, you mentioned the hypothesis that maybe they just weren't very familiar with AI. But looking at the author list, and their publications (e.g. 1, 2, 3, 4, 5, 6, 7, 8), this seems implausible to me. Most of the co-authors are neuroscientists by training, but a few have CS degrees, and all but one have co-authored previous ML papers. It's hard for me to i...
I appreciate you writing this, Rohin. I don’t work in ML, or do safety research, and it’s certainly possible I misunderstand how this meta-RL architecture works, or that I misunderstand what’s normal.
That said, I feel confused by a number of your arguments, so I'm working on a reply. Before I post it, I'd be grateful if you could help me make sure I understand your objections, so as to avoid accidentally publishing a long post in response to a position nobody holds.
I currently understand you to be making four main claims:
I appreciate you writing this, Rohin. I don’t work in ML, or do safety research, and it’s certainly possible I misunderstand how this meta-RL architecture works, or that I misunderstand what’s normal.
Thanks. I know I came off pretty confrontational, sorry about that. I didn't mean to target you specifically; I really do see this as bad at the community level but fine at the individual level.
I don't think you've exactly captured what I meant, some comments below.
The system is just doing the totally normal thing “co...
The scenario I had in mind was one where death occurs as a result of damage caused by low food consumption, rather than by suicide.
One way catastrophic alignment in this sense is difficult for humans is that the PFC cannot divorce itself from the DA; I'd expect that a failure mode leading to systematically low DA rewards would usually be corrected
I'm not sure divorce like this is rare. For example, anorexia sometimes causes people to find food anti-rewarding (repulsive/inedible, even when they're dying and don't want to be), and I can imagine that being because the PFC actually somehow alters the DA's reward function.
But I do share the hunch that something like a "divorce resistance" trick occurs a...
I think it makes more sense to operationalize "catastrophic" here as "leading to systematically low DA reward."
Thanks—I do think this operationalization makes more sense than the one I proposed.
Kaj, the point I understand you to be making is: "The inner RL algorithm in this scenario is probably reliably aligned with the outer RL algorithm, since the former was selected specifically on the basis of it being good at accomplishing the latter's objective, and since if the former deviates from pursuing that objective it will receive less reward from the outer, causing it to reconfigure itself to be better aligned. And since the two algorithms operate on similar time scales, we should expect any such misalignment to be noticed/corrected quickly." Does ...
Ah, I see. The high death rate was what made it seem often-catastrophic to me. Is your objection that the high death rate doesn't reflect something that might reasonably be described as "optimizing for one goal at the expense of all others"? E.g., because many of the deaths are suicides, in which case persistence may have been net negative from the perspective of the rest of their goals too? Or because deaths often result from people calibratedly taking risky but non-insane actions, who just happened to get unlucky with heart muscle integrity or whatever?
Yeah, I wrote that confusingly, sorry; edited to clarify. I just meant that of the limited set of candidate examples I'd considered, my model of anorexia, which of course may well be wrong, feels most straightforwardly like an example of something capable of causing catastrophic within-brain inner alignment failure. That is, it currently feels natural to me to model anorexia as being caused by an optimizer for thinness arising in brains, which can sometimes gain sufficient power that people begin to optimize for that goal at the expense of essentially all other goals. But I don't feel confident in this model.
I agree, in the case of evolution/humans. I meant to highlight what seemed to me like a relative lack of catastrophic within-mind inner alignment failures, e.g. due to conflicts between PFC and DA. Death of the organism feels to me like one reasonable way to operationalize "catastrophic" in these cases, but I can imagine other reasonable ways.
As I understand it, your point about the distinction between "mesa" and "steered" is chiefly that in the latter case, the inner layer is continually receiving reward signal from the outer layer, which in effect heavily restricts the space of possible algorithms the outer layer might give rise to. Does that seem like a decent paraphrase?
One of the aspects of Wang et al.'s paper that most interested me was that the inner layer in their meta-RL model kept learning even once reward signal from the outer layer had ceased. It feels plausible to me that the relat...
It could both be the case that there exists catastrophic inner alignment failure between humans and evolution, and also that humans don't regularly experience catastrophic inner alignment failures internally.
In practice I do suspect humans regularly experience internal inner alignment failures, but given that suspicion I feel surprised by how functional humans do manage to be. In other words, I notice expecting that regular inner alignment failures would cause far more mayhem than I observe, which makes me wonder whether brains are implementing some sort of alignment-relevant tech.
In practice I do suspect humans regularly experience internal (within-brain) inner alignment failures, but given that suspicion I feel surprised by how functional humans manage to be. That is, I notice expecting that regular inner alignment failures would cause far more mayhem than I observe, which makes me wonder whether brains are implementing some sort of alignment-relevant tech.
I don't know why you expect an inner alignment failure to look dysfunctional. Instrumental convergence suggests that it would look functional. What the world looks like if there...
The thing I meant by "catastrophic" is just "leading to death of the organism." I suspect mesa-optimization is common in humans, but I don't feel confident about this, nor that this is a joint-carvey ontology. I can imagine it being the case that many examples of e.g. addiction, goodharting, OCD, and even just "everyday personal misalignment"-type problems of the sort IFS/IDC/multi-agent models of mind sometimes help with, are caused by phenomena which might reasonably be described as inner alignment failures.
But I think these things don't kill people very...
Governments and corporations experience inner alignment failures all the time, but because of convergent instrumental goals, they are rarely catastrophic. For example, Russia underwent a revolution and a civil war on the inside, followed by purges and coups etc., but from the perspective of other nations, it was more or less still the same sort of thing: A nation, trying to expand its international influence, resist incursions, and conquer more territory. Even its alliances were based as much on expediency as on shared ideology.
Perhaps something similar happens with humans.
For similar reasons, I allocate a small portion of my portfolio toward assets (including Nvidia) that might appreciate rapidly during slow takeoff, on the theory that there might be some slow takeoff scenarios in which the extra resources prove helpful. My main reservation is Paul Christiano's argument that investment/divestment has more-than-symbolic effects.
I found LinkedIn's background breakdown of DeepMind employees interesting; fewer neuroscience backgrounds than I would have expected.
I found this post super interesting, and appreciate you writing it. I share the suspicion/hope that gaining better understanding of brains might yield safety-relevant insights.
I’m curious what you think is going on here that seems relevant to inner alignment. Is it that you’re modeling neocortical processes (e.g. face recognizers in visual cortex) as arising from something akin to search processes conducted by similar subcortical processes (e.g. face recognizers in superior colliculus), and noting that there doesn’t seem to be much divergence between their objective functions, perhaps because of helpful features of subcortex-supervised learning like e.g. these subcortical input-dependent dynamic rewiring rules?
I wouldn't describe any posts I've seen as conveying the idea sufficiently well for my taste, but would describe some—like this NY Times piece—as adequately conveying the most decision-relevant points.
When I started writing, there was almost no discussion online (aside from Wei Dai's comment here, and the posts it links to) about what factors might prove limiting for the provision of hospital care, or about the degree to which those limits might be exceeded. By the time I called off the project, the US President and ~every major newspaper were talking abou...
Update: We decided not to finish this post, since the points we wished to convey have now mostly been covered well elsewhere; Kyle may still write up his notes about the epidemiological parameters at some point.
I'm currently working with Kyle Scott and Anna Salamon on an estimate of deaths due to hospital overflow (lack of access to oxygen, mechanical ventilation, ICU beds), which we'll hopefully post in the next few days. The post will review evidence about basic epidemiological parameters.
I've been trying to spend a bit more time voting in response to this, to try to help keep thread quality high; at least for now, the size of the influx strikes me as low enough that a few long-time users doing this might help a bunch.