Comments

aysja · 3h · 40

Huh, I feel confused. I suppose we just have different impressions. Like, I would say that Oliver is exceedingly good at cutting through the bullshit. E.g., I consider his reasoning around shutting down the Lightcone offices to be of this type, in that it felt like a very straightforward document of important considerations, some of which I imagine were socially and/or politically costly to make. One way to say that is that I think Oliver is very high integrity, and I think this helps with bullshit detection: it's easier to see how things don't cut to the core unless you deeply care about the core yourself. In any case, I think this skill carries over to object-level research, e.g., he often seems, to me, to ask cutting-to-the core type questions there, too. I also think he's great at argument: legible reasoning, identifying the important cruxes in conversations, etc., all of which makes it easier to tell the bullshit from the not. 

I do not think of Oliver as being afraid to be disagreeable, and ime he gets to the heart of things quite quickly, so much so that I found him quite startling to interact with when we first met. And although I have some disagreements over Oliver's past walled-garden taste, from my perspective it's getting better, and I am increasingly excited about him being at the helm of a project such as this. Not sure what to say about his beacon-ness, but I do think that many people respect Oliver, Lightcone, and rationality culture more generally; I wouldn't be that surprised if there were an initial group of independent researcher types who were down and excited for this project as is. 

aysja · 2d · 120

This is very cool! I’m excited to see where it goes :)

A couple questions (mostly me grappling with what the implications of this work might be):

  • Given a dataset of sequences of tokens, how do you find the HMM that could have generated it, and can this be done automatically? Also, is the mapping from dataset to HMM unique? (A rough sketch of the kind of procedure I have in mind is below this list.)
  • This question is possibly more confused on my end, sorry if so. I’m trying to get at something like “how interpretable will these simplexes be with much larger models?” Like, if I’m imagining that each state is a single token, and the HMM is capable of generating the totality of data the model sees, then I’m imagining something quite unwieldy, i.e., something with about the same amount of complexity and interpretability as, e.g., the signaling cascade networks in a cell. Is this imagination wrong? Or is it more like, you start with this unwieldy structure (but which has some nice properties nonetheless), and then from there you try to make the initial structure more parse-able? Maybe a more straightforward way to ask: you say you’re interested in formalizing things like situational awareness with these tools—how might that work?
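
To make the first question concrete, here’s a minimal sketch of the kind of procedure I have in mind (purely my own illustration, not something from the post): fitting an HMM to a token sequence with EM/Baum-Welch via the hmmlearn library. The number of hidden states is a free choice, and EM only finds a local optimum, which is part of why I’d guess the dataset-to-HMM mapping isn’t unique.

```python
# A minimal sketch (not the authors' method): fit an HMM to a token sequence
# with EM / Baum-Welch, using hmmlearn. The data and numbers are placeholders.
import numpy as np
from hmmlearn.hmm import CategoricalHMM

# Toy "dataset": one long sequence of integer token ids in {0, ..., 3}.
rng = np.random.default_rng(0)
tokens = rng.integers(0, 4, size=5000)  # stand-in for real data

# Fit an HMM with a hand-picked number of hidden states. n_components is a
# free parameter, and EM converges to a local optimum, so different runs and
# different state counts can give different HMMs for the same data.
model = CategoricalHMM(n_components=3, n_iter=100, random_state=0)
model.fit(tokens.reshape(-1, 1))

print(model.transmat_)      # learned hidden-state transition matrix
print(model.emissionprob_)  # learned per-state token emission probabilities
```
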
aysja · 22d · 132

I guess I'm not sure what you mean by "most scientific progress," and I'm missing some of the history here, but my sense is that importance-weighted science happens proportionally more outside of academia. E.g., Einstein did his miracle year outside of academia (and later stated that he wouldn't have been able to do it, had he succeeded at getting an academic position), Darwin figured out natural selection, and Carnot figured out the Carnot cycle, all mostly on their own, outside of academia. Those are three major scientists who arguably started entire fields (quantum mechanics, biology, and thermodynamics). I would anti-predict that future scientific progress, of the field-founding sort, comes primarily from people at prestigious universities, since they, imo, typically have some of the most intense gatekeeping dynamics which make it harder to have original thoughts. 

aysja · 1mo · 154

I don’t see how the cluster argument resolves the circularity problem. 

The circularity problem, as I see it, is that your definition of an abstraction shouldn’t be dependent on already having the abstraction. I.e., if the only way to define the abstraction “dog” involves you already knowing the abstraction “dog” well enough to create the set of all dogs, then probably you’re missing some of the explanation for abstraction. But the clusters in thingspace argument also depends on having an abstraction—knowing to look for genomes, or fur, or bark, is dependent on us already understanding what dogs are like. After all, there are nearly infinite “axes” one could look at, but we already know to only consider some of them. In other words, it seems like this has just passed the buck from choice of object to choice of properties, but you’re still making that choice based on the abstraction. 

The fact that choice of axis—from among the axes we already know to be relevant—is stable (i.e., creates the same clusterings) feels like a central and interesting point about abstractions. But it doesn’t seem like it resolves the circularity problem. 
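
To make the “choice of properties” worry concrete, here’s a toy sketch (my own illustration, not anything from the post): the clusters you recover depend entirely on which axes you hand the clustering algorithm, so picking the axes is already doing most of the abstraction-work.

```python
# Toy illustration: cluster assignments depend on which "axes" of thingspace
# you choose to measure. The features and numbers are made up.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two "kinds" of things, separated only along axis 0 (say, "has fur");
# axis 1 (say, "number of letters in its name") is pure noise.
kind_a = np.column_stack([rng.normal(0, 0.3, 100), rng.normal(0, 5, 100)])
kind_b = np.column_stack([rng.normal(3, 0.3, 100), rng.normal(0, 5, 100)])
X = np.vstack([kind_a, kind_b])

# Clustering on the relevant axis recovers the two kinds; clustering on the
# irrelevant axis gives an essentially arbitrary split. But knowing which
# axis was relevant already required knowing the abstraction.
labels_relevant = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X[:, [0]])
labels_irrelevant = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X[:, [1]])

print(labels_relevant[:5], labels_relevant[100:105])      # cleanly separates the kinds
print(labels_irrelevant[:5], labels_irrelevant[100:105])  # arbitrary with respect to kind
```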

(In retrospect the rest of this comment is thinking-out-loud for myself, mostly :p but you might find it interesting nonetheless). 

I think it’s hard to completely escape this problem—we need to use some of our own concepts when understanding the territory, as we can’t see it directly—but I do think it’s possible to get a bit more objective than this. E.g., I consider thermodynamics/stat mech to be pretty centrally about abstractions, but it gets at them in a way that feels more “territory first,” if that makes any sense. Like, it doesn’t start with the conclusion. It started with the observation that “heat moves stuff” and “what’s up with that” and then eventually landed on an analysis of entropy involving macrostates. Somehow that progression feels more natural to me than starting with “dogs are things” and working backwards. E.g., I think I’m wanting something more like “if we understand these basic facts about the world, we can talk about dogs” rather than “if we start with dogs, we can talk sensibly about dogs.” 

To be clear, I consider some of your work to be addressing this. E.g., I think the telephone theorem is a pretty important step in this direction. Much of the stuff about redundancy and modularity feels pretty tip-of-the-tongue onto something important, to me. But, at the very least, my goal with understanding abstractions is something like “how do we understand the world such that abstractions are natural kinds”? How do we find the joints such that, conditioning on those, there isn’t much room to vary? What are those joints like? The reason I like the telephone theorem is that it gives me one such handle: all else equal, information will dissipate quickly—anytime you see information persisting, it’s evidence of abstraction. 
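
As a toy illustration of that handle (my own sketch, not John’s formalism): in a simple noisy Markov chain, mutual information with the starting state decays geometrically with distance, so anything that remains predictable far downstream has to be carried redundantly; persistence is the exception that wants explaining.

```python
# Toy illustration: information about X_0 decays along a chain of binary
# symmetric channels X_0 -> X_1 -> ... -> X_n. The noise level is made up.
import numpy as np

def binary_entropy(q):
    q = np.clip(q, 1e-12, 1 - 1e-12)
    return -(q * np.log2(q) + (1 - q) * np.log2(1 - q))

flip = 0.1  # each step flips the bit with probability 0.1
for n in [1, 2, 5, 10, 20, 50]:
    # Composing n binary symmetric channels gives an effective flip
    # probability of (1 - (1 - 2*flip)**n) / 2.
    q_n = (1 - (1 - 2 * flip) ** n) / 2
    mi = 1 - binary_entropy(q_n)  # I(X_0; X_n) in bits, for a uniform X_0
    print(f"n = {n:3d}   I(X_0; X_n) = {mi:.4f} bits")
```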

My own sense is that answering this question will have a lot more to do with how useful abstractions are than with how predictive/descriptive they are, which are related questions, but not quite the same. E.g., with the gears example you use to illustrate redundancy, I think the fact that we can predict almost everything about the gear from understanding a part of it is the same reason why the gear is useful. You don’t have to manipulate every atom in the gear to get it to move; you only have to press down on one of the… spokes(?), and the entire thing will turn. These are related properties. But they are not the same. E.g., you can think about the word “stop” as an abstraction in the sense that many sound waves map to the same “concept,” but that’s not very related to why the sound wave is so useful. It’s useful because it fits into the structure of the world: other minds will do things in response to it.
 
I want better ways to talk about how agents get work out of their environments by leveraging abstractions. I think this is the reason we ultimately care about them ourselves; and why AI will too. I also think it’s a big part of how we should be defining them—that the natural joint is less “what are the aggregate statistics of this set” but more “what does having this information allow us to do”? 

aysja · 1mo · 62

I think it’s pretty unlikely that Anthropic’s murky strategy is good. 

In particular, I think that balancing building AGI with building AGI safely only goes well for humanity in a pretty narrow range of worlds. Like, if safety is relatively easy and can roughly keep pace with capabilities, then I think this sort of thing might make sense. But the more the expected world departs from this—the more that you might expect safety to be way behind capabilities, and the more you might expect that it’s hard to notice just how big that gap is and/or how much of a threat capabilities pose—the more this strategy starts seeming pretty worrying to me.  

It’s worrying because I don’t imagine Anthropic gets that many “shots” at playing safety cards, so to speak. Like, implementing RSPs and trying to influence norms is one thing, but what about if they notice something actually-maybe-dangerous-but-they’re-not-sure as they’re building? Now they’re in this position where if they want to be really careful (e.g., taking costly actions like: stop indefinitely until they’re beyond reasonable doubt that it’s safe) they’re most likely kind of screwing their investors, and should probably expect to get less funding in the future. And the more likely it is, from their perspective, that the behavior in question does end up being a false alarm, the more pressure there is to not do due diligence. 

But the problem is that the more ambiguous the situation is—the less we understand about these systems—the less sure we can be about whether any given behavior is or isn’t an indication of something pretty dangerous. And the current situation seems pretty ambiguous to me. I don’t think anyone knows, for instance, whether Claude 3 seeming to notice it’s being tested is something to worry about or not. Probably it isn’t. But really, how do we know? We’re going off of mostly behavioral cues and making educated guesses about what the behavior implies. But that really isn’t very reassuring when we’re building something much smarter than us, with potentially catastrophic consequences. As it stands, I don’t believe we can even assign numbers to things in a very meaningful sense, let alone assign confidence above a remotely acceptable threshold, i.e., some 9’s of assurance that what we’re about to embark on won’t kill everyone.    

The combination of how much uncertainty there is in evaluating these systems, and how much pressure there is for Anthropic to keep scaling seems very worrying to me. Like, if there’s a very obvious sign that a system is dangerous, then I believe Anthropic might be in a good position to pause and “sound the alarm.” But if things remain kind of ambiguous due to our lack of understanding, as they seem to me now, then I’m way less optimistic that the outcome of any maybe-dangerous-but-we’re-not-sure behavior is that Anthropic meaningfully and safely addresses it. In other words, I think that given our current state of understanding, the murky strategy favors “build AGI” more than it does “build AGI safely” and that scares me. 

I also think the prior should be quite strong, here, that the obvious incentives will have the obvious effects. Like, creating AGI is desirable (so long as it doesn’t kill everyone and so on). Not only on the “loads of money” axis, but also along other axes monkeys care about: prestige, status, influence, power, etc. Yes, practically no one wants to die, and I don’t doubt that many people at Anthropic genuinely care and are worried about this. But, also, it really seems like you should a priori expect that with stakes this high, cognition will get distorted around whether or not to pursue the stakes. Maybe all Anthropic staff are perfectly equipped to be epistemically sane in such an environment, but I don’t think that one should on priors expect it. People get convinced of all kinds of things when they have a lot to gain, or a lot to lose. 

Anyway, it seems likely to me that we will continue to live in the world where we don’t understand these systems well enough to be confident in our evaluations of them, and I assign pretty significant probability to the worlds where capabilities far outstrip our alignment techniques, so I am currently not thrilled that Anthropic exists. I expect that their murky strategy is net bad for humanity, given how the landscape currently looks. 

Maybe you really do need to iterate on frontier AI to do meaningful safety work.

This seems like an open question that, to my mind, Anthropic has not fully explored. One way that I sometimes think about this is to ask: if Anthropic were the only leading AI lab, with no possibility of anyone catching up any time soon, should they still be scaling as fast as they are? My guess is no. Like, of course the safety benefit to scaling is not zero. But it’s a question of whether the benefits outweigh the costs. Given how little we understand these systems, I’d be surprised if we were anywhere near to hitting diminishing safety returns—as in, I don’t think the safety benefits of scaling vastly outstrip the benefit we might expect out of empirical work on current systems. And I think the potential cost of scaling as recklessly as we currently are is extinction. I don’t doubt that at some point scaling will be necessary and important for safety; I do doubt that the time for that is now. 

Maybe you do need to stay on the frontier because the world is accelerating whether Anthropic wants it to or not.

It really feels like if you create an organization which, with some unsettlingly large probability, might directly lead to the extinction of humanity, then you’re doing something wrong. Especially so if the people you’re making the decisions for (i.e., everyone) would be—if they fully understood the risks involved—unhappy about it on reflection. Like, I’m pretty sure that the sentence from Anthropic’s pitch deck “these models could begin to automate large portions of the economy” is already enough for many people to be pretty upset. But if they learned that Anthropic also assigned ~33% to a “pessimistic world” which includes the possibility of “extinction,” then I expect most people would rightly be pretty furious. I think making decisions for people in a way that they would predictably be upset about is unethical, and the fact that other people would do it anyway doesn’t make it okay.

In any case, I think that Anthropic’s existence has hastened race dynamics, and I think that makes our chances of survival lower. That seems pretty in line with what to expect from this kind of strategy (i.e., that it cashes out to scaling coming before safety where it’s non-obvious what to do), and I think it makes sense to expect things of this type going forward (e.g., I am personally pretty skeptical that Anthropic is going to ~meaningfully pause development unless it’s glaringly obvious that they should do so, at which point I think we’re clearly in a pretty bad situation). And although OpenAI was never claiming as much of a safety vibe as Anthropic currently is, I still think the track record of ambiguous strategies which play to both sides does not inspire that much confidence about Anthropic’s trajectory. 

Does Dario-and-other-leadership have good models of x-risk?

I am worried about this. My read on the situation is that Dario is basically expecting something more like a tool than an agent. Broadly, I get this sense because when I model Anthropic as operating under the assumption that risks mostly stem from misuse, their actions make a lot more sense to me. But things like this quote also seem consistent with that: “I suspect that it may roughly work to think of the model as if it's trained in the normal way, just getting to above human level, it may be a reasonable assumption… that the internal structure of the model is not intentionally optimizing against us.” (Dario on the Dwarkesh podcast). If true, this makes me worried about the choices that Dario is going to make, when, again, it’s not clear how to interpret the behavior of these systems. In particular, it makes me worried he’s going to err on the side of “this is probably fine,” since tools seem, all else equal, less dangerous than agents. Dario isn’t the only person Anthropic’s decisions depend on; still, I think his beliefs have a large bearing on what Anthropic does. 

But, the way I wish the conversation was playing out was less like "did Anthropic say a particular misleading thing?"

I think it’s pretty important to call attention to misleading things. Both because there is some chance that public focus on inconsistencies might cause them to change their behavior, and because pointing out specific problems in public arenas often causes evidence to come forward in one common space, and then everyone can gain a clearer understanding of what’s going on. 

aysja · 1mo · 148

Things like their RSP rely on being upheld in spirit, not only in letter.

This is something I’m worried about. I think that Anthropic’s current RSP is vague and/or undefined on many crucial points. For instance, I feel pretty nervous about Anthropic’s proposed response to an evaluation threshold triggering. One of the first steps is that they will “conduct a thorough analysis to determine whether the evaluation was overly conservative,” without describing what this “thorough analysis” is, nor who is doing it. 

In other words, they will undertake some currently undefined process involving undefined people to decide whether it was a false alarm. Given how much is riding on this decision—like, you know, all of the potential profit they’d be losing if they concluded that the model was in fact dangerous—it seems pretty important to be clear about how these things will be resolved. 

Instituting a policy like this is only helpful insomuch as it meaningfully constrains the company’s behavior. But when the response to evaluations is this loosely and vaguely defined, it’s hard for me to trust that the RSP cashes out to more than a vague hope that Anthropic will be careful. It would be nice to feel like the Long Term Benefit Trust provided some kind of assurance against this. But even this seems difficult to trust when they’ve added “failsafe provisions” that allow a “sufficiently large” supermajority of stockholders to make changes to the Trust’s powers (without the Trustees’ consent), and without saying what counts as “sufficiently large.”

aysja · 1mo · 64

It seems plausible that this scenario could happen, i.e., that Anthropic and OpenAI end up in a stable two-player oligopoly. But I would still be pretty surprised if Anthropic's pitch to investors, when asking for billions of dollars in funding, is that they pre-commit to never release a substantially better product than their main competitor. 

aysja · 1mo · 60

I agree that this is a plausible read of their pitch to investors, but I do think it’s a bit of a stretch to consider it the most likely explanation. It’s hard for me to believe that Anthropic would receive billions of dollars in funding if they were explicitly telling investors that they’re committing to only releasing equivalent or inferior products relative to their main competitor.

aysja · 1mo · 120

I assign >50% that Anthropic will at some point pause development for at least six months as a result of safety evaluations. 
