Modeling Transformative AI Risk (MTAIR)



Yes, I'm mostly embracing simulator theory here, and yes, there are definitely a number of implicit models of the world within LLMs, but they aren't coherent. So I'm not saying there is no world model, I'm saying it's not a single / coherent model, it's a bunch of fragments.

But I agree that it doesn't explain everything! 

To step briefly out of the simulator theory frame, I agree that part of the problem is next-token generation, not RLHF - the model is generating one token at a time, so it can't "step back" and decide not to make a claim that it "knows" should be followed by a citation. But that's not a mistake on the level of simulator theory; it's a mistake caused by the way the DNN is used, not by the joint distribution implicit in the model, which is what I view as what is "actually" simulated. For example, I suspect that if you were to have it calculate the joint probability over all the possibilities for the next 50 tokens at a time, pick the next 10 based on that, and then repeat (which would obviously be computationally prohibitive, but I'll ignore that for now), it would mostly eliminate the hallucination problem.
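
A toy sketch of the difference between the two decoding schemes, under loud assumptions: `TOY_MODEL` is a made-up bigram table rather than an LLM, the horizon is 2 tokens and the commit size is 1 rather than 50 and 10, and "based on that" is read as "take the start of the single most probable joint continuation", which is only one possible reading. It shows that token-by-token greedy choice can commit to a first token whose best continuation is less likely overall; it does not show anything about hallucination itself.

```python
from itertools import product

# Hypothetical toy "model": P(next token | previous token). Not a real LLM.
TOY_MODEL = {
    "<s>": {"A": 0.6, "B": 0.4},
    "A": {"x": 0.5, "y": 0.5},
    "B": {"z": 1.0},
    "x": {"<e>": 1.0},
    "y": {"<e>": 1.0},
    "z": {"<e>": 1.0},
    "<e>": {"<e>": 1.0},
}
VOCAB = sorted(TOY_MODEL)


def joint_prob(prev, seq):
    """Probability of the whole sequence `seq` following token `prev`."""
    p = 1.0
    for tok in seq:
        p *= TOY_MODEL[prev].get(tok, 0.0)
        prev = tok
    return p


def greedy_decode(start, steps):
    """Pick the single most likely next token at every step."""
    out, prev = [], start
    for _ in range(steps):
        prev = max(TOY_MODEL[prev], key=TOY_MODEL[prev].get)
        out.append(prev)
    return out


def lookahead_decode(start, steps, horizon, commit):
    """Score every `horizon`-token continuation jointly, keep the first
    `commit` tokens of the best one, then repeat from there."""
    out, prev = [], start
    while len(out) < steps:
        best = max(product(VOCAB, repeat=horizon),
                   key=lambda seq: joint_prob(prev, seq))
        out.extend(best[:commit])
        prev = out[-1]
    return out[:steps]


greedy = greedy_decode("<s>", steps=2)
lookahead = lookahead_decode("<s>", steps=2, horizon=2, commit=1)
print(greedy, joint_prob("<s>", greedy))
print(lookahead, joint_prob("<s>", lookahead))
```

The exhaustive `product` over the vocabulary is what makes this prohibitive at real scale: the number of candidate continuations grows as vocabulary size to the power of the horizon.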

On racism, I don't think there's much you need to explain; they did fine-tuning, and that was able to generate the equivalent of insane penalties for words and phrases that are racist. I think it's possible that RLHF could train away from the racist modes of thinking as well, if done carefully, but I'm not sure that is what occurs.

No, it was and is a global treaty enforced multilaterally, as well as a number of bans on testing and arms reduction treaties. For each, there is a strong local incentive for states - including the US - to defect, but the existence of a treaty allows global cooperation.

With AGI, of course, we have strong reasons to think that the payoff matrix looks something like the following:

(0,0) (-∞, 5-∞)
(5-∞, -∞)  (-∞, -∞)

So yes, there's a local incentive to defect, but it's actually a prisoner's dilemma where the best case for defecting is identical to suicide.
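
As a quick check of that matrix, here is a minimal sketch. It assumes index 0 means cooperate and 1 means defect, that each cell is (row player, column player), and that 5-∞ can be treated as -∞ using floating-point infinity. All names are illustrative.

```python
NEG_INF = float("-inf")

# Payoffs as (row player, column player); index 0 = cooperate, 1 = defect.
# 5 + NEG_INF evaluates to -inf: the local gain of 5 is swamped.
PAYOFFS = [
    [(0.0, 0.0), (NEG_INF, 5 + NEG_INF)],
    [(5 + NEG_INF, NEG_INF), (NEG_INF, NEG_INF)],
]


def row_payoff(row_action, col_action):
    """Payoff to the row player for a given pair of actions."""
    return PAYOFFS[row_action][col_action][0]


def cooperate_weakly_dominates():
    """True if cooperating is never worse, and sometimes better, than defecting."""
    never_worse = all(row_payoff(0, c) >= row_payoff(1, c) for c in (0, 1))
    sometimes_better = any(row_payoff(0, c) > row_payoff(1, c) for c in (0, 1))
    return never_worse and sometimes_better


best_case_for_defecting = max(row_payoff(1, c) for c in (0, 1))
print(cooperate_weakly_dominates(), best_case_for_defecting)
```

Under these assumptions the best case for defecting is -∞, so cooperating is never worse and is strictly better when the other side cooperates. In a standard prisoner's dilemma, by contrast, defecting is the dominant move.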

We decided to restrict nuclear power to the point where it's rare in order to prevent nuclear proliferation. We decided to ban biological weapons, almost fully successfully. We can ban things that have strong local incentives, and I think that ignoring that, and claiming that slowing down or stopping can't happen, is giving up on perhaps the most promising avenue for reducing existential risk from AI. (And this view helps accelerate race dynamics, so even if I didn't think it was substantively wrong, I'd be confused as to why it's useful to actively promote it as an idea.)

I think the post addresses a key objection that many people opposed to EA and longtermist concerns have voiced about the EA view of AI, and I thought it was fairly well written for the points it made, without also making the mostly unrelated point that you wanted it to address.

Getting close to the decade anniversary for Why the Tails Come Apart, and this is a very closely related issue to regressional Goodhart.

AI safety "thought" is more-or-less evenly distributed

Agreed - I wasn't criticizing AI safety here, I was talking about the conceptual models that people outside of AI safety have - as was mentioned in several other comments. So my point was about what people outside of AI safety think about when talking about ML models, trying to correct a broken mental model.

So, I disagree that evals and red teaming in application to AI are "meaningless" because there are no standards. 

I did not say anything about evals and red teaming in application to AI, other than in comments where I said I think they are a great idea. And the fact that they are happening very clearly implies that there is some possibility that the models perform poorly, which, again, was the point.

You seem to try to import quite an outdated understanding of safety and reliability engineering. 

Perhaps it's outdated, but it is the understanding that the reliability and systems engineers I have spoken to actually have, and it matches research I did on resilience most of a decade ago, e.g. this. And I agree that there is discussion in both older and more recent journal articles about how some firms do things in various ways that might be an improvement, but it's not the standard. And even in agile systems engineering, use cases more often supplement or exist alongside requirements; they don't replace them. Though terminology in this domain is so far from standardized that you'd need to talk about a specific company, or even a specific project's process and definitions, to have a more meaningful discussion.

Standards are impossible in the case of AI, but they are also unnecessary, as was evidenced by the experience in various domains of engineering, where the presence of standards doesn't "save" one from drifting into failures.

I don't disagree with the conclusion, but the logic here simply doesn't work to prove anything. It implies that standards are insufficient, not that they are not necessary.

I do think that some people are clearly talking about meanings of the word "safe" that aren't so clear-cut (e.g. Sam Altman saying GPT-4 is the safest model yet™️), and in those cases I agree that these statements are much closer to "meaningless".


The people in the world who actually build these models are doing the thing that I pointed out. That's the issue I was addressing.

People do actually have a somewhat-shared set of criteria in mind when they talk about whether a thing is safe, though, in a way that they (or at least I) don't when talking about its qwrgzness. e.g., if it kills 99% of life on earth over a ten year period, I'm pretty sure almost everyone would agree that it's unsafe. No further specification work is required. It doesn't seem fundamentally confused to refer to a thing as "unsafe" if you think it might do that.

I don't understand this distinction. If "I'm pretty sure almost everyone would agree that it's unsafe," that's an informal but concrete way for the system to be unsafe, and it would not be confused to say something is unsafe if you think it could do that, nor to claim that it is safe if you have clear reason to believe it will not.

My problem is, as you mentioned, that people in the world of ML are not making that class of claim. They don't seem to ground their claims about safety in any conceptual model about what the risks or possible failures are whatsoever, and that does seem fundamentally confused.

I think it would be really good to come up with a framing of these intuitions that wouldn't be controversial.


That seems great, I'd be very happy for someone to write this up more clearly. My key point was about people's claims and confidence about safety, and yes, clearly that was communicated less well than I hoped.

As an aside, mirror cells aren't actually a problem, and non-mirror digestive systems and immune systems can break them down, albeit with less efficiency. Church's early speculation that these cells would not be digestible by non-mirror life forms doesn't actually work, per several molecular biologists I have spoken to since then.
