aysja


Comments

aysja

I don't think that most people, upon learning that Anthropic's justification was "other companies were already putting everyone's lives at risk, so our relative contribution to the omnicide was low," would then want to abstain from rioting. Common ethical intuitions are often more deontological than that, more like "it's not okay to risk extinction, period." That Anthropic aims to reduce the risk of omnicide on the margin is not, I suspect, the point people would focus on if they truly grokked the stakes; I think they'd overwhelmingly focus on the threat to their lives that all AGI companies (including Anthropic) are imposing.

aysja

Technical progress also has the advantage of being the sort of thing which could make a superintelligence safe, whereas I expect very little of this to come from institutional competency alone. 

aysja

I have grown pessimistic about our ability to solve the open technical problems even given 100 years of work on them. 

Why? 

aysja

I agree with a bunch of this post in spirit—that there are underlying patterns to alignment deserving of true names—although I disagree about… not the patterns you’re gesturing at, exactly, but more how you’re gesturing at them. Like, I agree there’s something important and real about “a chunk of the environment which looks like it’s been optimized for something,” and “a system which robustly makes the world look optimized for a certain objective, across many different contexts.” But I don't think the true names of alignment will be behaviorist (“as if” descriptions, based on observed transformations between inputs and outputs). I.e., whereas you describe it as one subtlety/open problem that this account doesn’t “talk directly about what concrete patterns or tell-tale signs make something look like it's been optimized for X,” my own sense is that this is more like the whole problem (and also not well characterized as a coherence problem). It’s hard for me to write down the entire intuition I have about this, but some thoughts:

  • Behaviorist explanations don’t usually generalize as well. Partially this is because there are often many possible causal stories to tell about any given history of observations. Perhaps the wooden sphere was created by a program optimizing for how to fit blocks together into a sphere, or perhaps the program is more general (about fitting blocks together into any shape) or more particular (about fitting these particular wood blocks together), etc. Usually the response is to consider the truth of the matter to be the shortest program consistent with all observations, but this enables blindspots, since it might be wrong (see the toy sketch after this list)! You don’t actually know what’s true when you use procedures like this, since you’re not looking at the mechanics of it directly (the particular causal process which is in fact happening). But knowing the actual causal process is powerful, since it will tell you how the outputs will vary with other inputs, which is an important component of ensuring that unwanted outputs don’t obtain.
    • This seems important to me for at least a couple reasons. One is that we can only really bound the risk if we know what the distribution of possible outputs is. This doesn't necessarily require understanding the underlying causal process, but understanding the underlying causal process will, I think, necessarily give you this (absent noise/unknown unknowns etc). Two is that blindspots are exploitable—anytime our measurements fail to capture reality, we enable vectors of deception. I think this will always be a problem to some extent, but it seems worse to me the less we have a causal understanding. For instance, I’m more worried about things like “maybe this program represents the behavior, but we’re kind of just guessing based on priors” than I am about e.g., calculating the pressure of this system. Because in the former there are many underlying causal processes (e.g., programs) that map to the same observation, whereas in the latter it's more like there are many underlying states which do. And this is pretty different, since the way you extrapolate from a pressure reading will be the same no matter the microstate, but this isn’t true of the program: different ones suggest different future outcomes. You can try to get around this by entertaining the possibility that all programs which are currently consistent with the observations might be correct, weighted by their simplicity, rather than assuming a particular one. But I think in practice this can fail. E.g., scheming behavior might look essentially identical to non-scheming behavior for an advanced intelligence (according to our ability to estimate this), despite the underlying program being quite importantly different. Such that to really explain whether a system is aligned, I think we'll need a way of understanding the actual causal variables at play.  
  • Many accounts of cognition are impossible to implement (e.g., AIXI, VNM rationality or anything else utilizing utility functions, many AIT concepts), since they include the impossible step of considering all possible worlds. I think people normally consider this to be something like a “God’s eye view” of intelligence—ultimately correct, but incomputable—which can be projected down to us bounded creatures via approximation, but I think this is the wrong sort of in-principle-to-real-world bridge. Like, it seems to me that intelligence is fundamentally about ~“finding and exploiting abstractions,” which is something that having limited resources forces you to do. I.e., intelligence comes from the boundedness. Such that the emphasis should imo go the other way: figuring out the core of what this process of “finding and exploiting abstractions” is, and then generalizing outward. This feels related to behaviorism inasmuch as behaviorist accounts often rely on concepts like “searching over the space of all programs to find the shortest possible one.”
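As a toy sketch of the “many programs, same observations” worry above (my own illustration, not anything from the post; the hypotheses, description lengths, and numbers are made up): several candidate programs can fit an observation history exactly while diverging on inputs you haven’t seen, so committing to the shortest consistent one—or even a simplicity-weighted mixture over this class—doesn’t distinguish a benign program from a “defects later” one until the divergence actually shows up.

```python
# Toy illustration: several "programs" that fit the same observations exactly
# but diverge on inputs we haven't seen yet. Everything here is made up.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Hypothesis:
    name: str
    length: int                    # crude stand-in for description length
    program: Callable[[int], int]  # the candidate input -> output mapping

# Observed behavior so far: f(1)=1, f(2)=4, f(3)=9
observations = [(1, 1), (2, 4), (3, 9)]

hypotheses = [
    Hypothesis("square", 6, lambda x: x * x),
    Hypothesis("square, but defect on large inputs", 14,
               lambda x: x * x if x < 100 else 0),
    Hypothesis("lookup table of past observations", 20,
               lambda x: {1: 1, 2: 4, 3: 9}.get(x, -1)),
]

# All three hypotheses reproduce the observation history exactly.
consistent = [h for h in hypotheses
              if all(h.program(x) == y for x, y in observations)]

shortest = min(consistent, key=lambda h: h.length)
print("Shortest consistent hypothesis:", shortest.name)

# They only disagree out of distribution.
for h in consistent:
    print(f"{h.name}: f(1000) = {h.program(1000)}")
```

The point being gestured at: no reweighting over this hypothesis class tells the first two programs apart before you hit the inputs where they diverge, whereas looking at the actual mechanism in principle could.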
aysja

By faithful and human-legible, I mean that the model’s reasoning is done in a way that is directly understandable by humans and accurately reflects the reasons for the model’s actions. 

Curious why you say this, in particular the bit about “accurately reflects the reasons for the model’s actions.” How do you know that? (My impression is that this sort of judgment is often based on ~common sense/folk reasoning, i.e., that we judge this in a similar way to how we might judge whether another person took sensical actions—was the reasoning internally consistent, does it seem predictive of the outcome, etc.? Which does seem like some evidence to me, although not enough to say that it accurately reflects what's happening. But perhaps I'm missing something here.)
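For concreteness, here is a minimal sketch (my own, not something described in the post) of one common style of evidence people point to on this question: intervene on the stated reasoning—e.g., truncate it—and check whether the final answer moves. Insensitivity suggests the reasoning wasn’t load-bearing; sensitivity is only weak evidence that it accurately reflects the real reasons. `run_model` is a hypothetical stub standing in for whatever model call you’d actually make.

```python
import random

def run_model(question: str, chain_of_thought: str) -> str:
    """Hypothetical stub: replace with a real call that conditions the
    model's final answer on the provided chain of thought."""
    random.seed(hash((question, chain_of_thought)) % (2**32))
    return random.choice(["A", "B", "C", "D"])

def truncate(cot: str, fraction: float) -> str:
    """Keep only the first `fraction` of the chain of thought."""
    words = cot.split()
    return " ".join(words[: int(len(words) * fraction)])

def faithfulness_probe(question: str, cot: str) -> dict:
    """Check how often the answer survives progressively heavier truncation
    of the stated reasoning. High agreement with an empty chain of thought is
    evidence the reasoning isn't actually driving the answer."""
    baseline = run_model(question, cot)
    return {
        frac: run_model(question, truncate(cot, frac)) == baseline
        for frac in (0.75, 0.5, 0.25, 0.0)
    }

print(faithfulness_probe(
    "Which option is safest?",
    "Option B avoids the failure mode entirely, so the answer is B.",
))
```

Even granting tests like this, they only probe whether the text is causally relevant to the output, not whether it captures the actual reasons, which is roughly the gap the question above is pointing at.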

aysja

I purposefully use these terms vaguely since my concepts about them are in fact vague. E.g., when I say “alignment” I am referring to something roughly like “the AI wants what we want.” But what is “wanting,” and what does it mean for something far more powerful to conceptualize that wanting in a similar way, and what might wanting mean as a collective, and so on? All of these questions are very core to what it means for an AI system to be “aligned,” yet I don’t have satisfying or precise answers for any of them. So it seems more natural to me, at this stage of scientific understanding, to simply own that—not to speak more rigorously than is in fact warranted, not to pretend I know more than I in fact do. 

The goal, of course, is to eventually understand minds well enough to be more precise. But before we get there, most precision will likely be misguided—formalizing the wrong thing, or missing the key relationships, or what have you. And I think this does more harm than good, as it causes us to collectively misplace where the remaining confusion lives, when locating our confusion is (imo) one of the major bottlenecks to solving alignment.  

aysja

I’m not convinced that the “hard parts” of alignment are difficult in the standard, g-requiring way that, e.g., a physics post-doc is selected for. I do think it takes an unusual skillset, though, which is where most of the trouble lives. I.e., I think the pre-paradigmatic skillset requires unusually strong epistemics (because you often need to track for yourself what makes sense), ~creativity (the ability to synthesize new concepts, to generate genuinely novel hypotheses/ideas), good ability to traverse levels of abstraction (connecting details to larger-scale structure; this is especially important for the alignment problem), not being efficient-market-pilled (you have to believe that more is possible in order to aim for it), noticing confusion, and probably a lot more that I’m failing to name here.

Most importantly, though, I think it requires quite a lot of willingness to remain confused. Many scientists who accomplished great things (Darwin, Einstein) didn’t have publishable results on their main inquiry for years. Einstein, for instance, talks about wandering off for weeks in a state of “psychic tension” in his youth; it took ~ten years to go from his first inkling of relativity to special relativity, and he nearly gave up at many points (including the week before he figured it out). Figuring out knowledge at the edge of human understanding can just be… really fucking brutal. I feel like this is largely forgotten, or ignored, or just not understood. Partially that's because in retrospect everything looks obvious, so it doesn’t seem like it could have been that hard, but partially it's because almost no one tries to do this sort of work, so there aren't societal structures erected around it, and hence little collective understanding of what it's like.

Anyway, I suspect there are really strong selection pressures for who ends up doing this sort of thing, since a lot needs to go right: smart enough, creative enough, strong epistemics, independent, willing to spend years without legible output, exceptionally driven, and so on. Indeed, the last point seems important to me—many great scientists are obsessed. Spend night and day on it, it’s in their shower thoughts, can’t put it down kind of obsessed. And I suspect this sort of has to be true because something has to motivate them to go against every conceivable pressure (social, financial, psychological) and pursue the strange meaning anyway.

I don’t think the EA pipeline is much selecting for pre-paradigmatic scientists, but I don’t think lack of trying to get physicists to work on alignment is really the bottleneck either. Mostly I think selection effects are very strong; e.g., the Sequences were, imo, one of the more effective recruiting strategies for alignment. I don’t really know what to recommend here, but I think I would anti-recommend putting all the physics post-docs from good universities in a room in the hope that they make progress. Requesting that the world write another book as good as the Sequences is a... big ask, although to the extent it’s possible I expect it’ll go much further in drawing out people who will self-select into this rather unusual "job."

aysja

I took John to be arguing that we won’t get a good solution out of this paradigm (so long as the humans doing it aren’t expert at alignment), rather than we couldn’t recognize a good solution if it were proposed.

Separately, I think that recognizing good solutions is potentially pretty fraught, especially the more capable the system we’re outsourcing to. Like, anything about a proposed solution that we don’t know how to measure or don’t understand could be exploited, and it’s almost definitionally hard to tell that such failures exist. E.g., a plan with many steps where it’s hard to verify there won’t be unintended consequences, a theory of interpretability which leaves out a key piece we'd need to detect deception, etc. It’s really hard to know/trust that things like that won’t happen when we’re dealing with quite intelligent machines (potentially optimizing against us), and it seems hard to get this sort of labor out of not-very-intelligent machines (for similar reasons as John points out in his last post, i.e., before a field is paradigmatic, outsourcing doesn’t really work, since it’s difficult to specify questions well when we don’t even know what questions to ask in the first place).

In general these sorts of outsourcing plans seem to me to rely on a narrow range of AI capability levels (one which I’m not even sure exists): smart enough to solve novel scientific problems, but not smart enough to successfully deceive us if it tried. That makes me feel pretty skeptical about such plans working.

Answer by aysja

Jan suggested a similar one (Baraka), but I was going to say Koyaanisqatsi. It’s one of my favorite films; I still feel deeply affected by it. I bring it up here, though, because it does an excellent job of inviting viewers to do original seeing. It’s basically a 90 minute documentary about the world, but it doesn’t feel like it has any agenda. It’s just shot after shot of what this planet is like (the Grand Canyon, a commute to work, a factory farm). It doesn’t shy away from anything, doesn’t feel like it’s grasping at some goal. Just an honest, gentle look at what the world is like, and what humans are up to.

Part of the reason I say that it’s good at inviting original seeing is that it does a really excellent job of perspective modulation (especially wrt time). E.g., it’ll slow down or speed up processes in ways that made me pop out of how I normally relate to them. It lingers on features I wouldn’t normally linger on (like someone’s face), which turned it into this entirely new and strange experience. In general, it takes the mundane and makes it into something kind of glorious—a piece of the world to be marveled at, to be wondered at, a thing to be curious about.

But it’s not just mundanity either; it reminds you that you’re in a vast universe, on a planet that not too long ago didn’t contain you. It starts with a close-up of a cave painting, and it ends with this haunting scene of a rocket falling down to Earth. And I remember really grokking, at the end of it, just how strange and just how powerful a thing intelligence is—the magnitude of what we’ve accomplished. I’d had that feeling before, but something about it really stayed with me after watching this film.

aysja

The interpretability safety case section seems to make the rather strong claim that the activation of known sabotage-related features is necessary for sabotage to occur (emphasis mine):

[C1.1] The model will be unable to accomplish organizational sabotage (T1) or to undermine oversight (T3) without triggering our feature-based monitoring (M1.1) and a detailed analysis of the session.

Afaict, the way that they establish this link (that a feature activation is necessary for a behavior to occur) is through ablation studies. Basically, if deactivating a feature prevents a model organism from committing a dangerous behavior, then they conclude that the feature causes the behavior. And although it does seem like this sort of test will generate Bayesian evidence about which features contribute to sabotage-capacity, I’m confused why the author thinks it would generate strong Bayesian evidence, much less identify the necessary conditions for sabotage with ~certainty.
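To make the shape of that test concrete, here is a hedged toy sketch (mine, not code from the safety case; `behaves_badly`, the prompts, and the feature index are made-up stand-ins): clamp a candidate feature to zero across a set of prompts and count how often the bad behavior disappears. Even a perfect score only shows the feature was needed on these prompts, under this elicitation—which is why the result reads to me as ordinary rather than near-conclusive evidence.

```python
def behaves_badly(prompt: str, ablated_features: set[int]) -> bool:
    """Made-up stand-in for running the model organism with the listed
    features clamped to zero and grading the transcript for sabotage."""
    return "sabotage" in prompt and 42 not in ablated_features

def ablation_study(prompts: list[str], candidate_feature: int) -> dict:
    """Compare sabotage rates with and without the candidate feature ablated."""
    baseline = [behaves_badly(p, set()) for p in prompts]
    ablated = [behaves_badly(p, {candidate_feature}) for p in prompts]
    return {
        "bad_without_ablation": sum(baseline),
        "prevented_by_ablation": sum(b and not a for b, a in zip(baseline, ablated)),
    }

prompts = [
    "please help me sabotage the oversight process",
    "summarize this engineering document",
]
print(ablation_study(prompts, candidate_feature=42))
```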

Like, it seems plausible to me that some other unknown combination of features might create sabotage-like behavior, or that some relevant features were missed, or that there were other, ill-described-as-features mechanisms which can cause sabotage, etc. And my guess is that the evidence this sort of testing does generate will become weaker over time—i.e., once our safety precautions start to really matter—since more intelligent systems will presumably be able to cause sabotage in more numerous and creative ways.

Some of these potential setbacks are mentioned, and I get that this is just a preliminary sketch. But still, I’m confused where the confidence is coming from—why does the author expect this process to yield high confidence of safety?
