Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Epistemic Status: Trying to clarify a confusion people outside of the AI safety community seem to have about what safety means for AI systems.

In engineering and design, there is a process that includes, among other stages, specification, creation, verification and validation, and deployment. Verification and validation are where most people focus when thinking about safety - can we make sure the system performs correctly? I think this is a conceptual error that I want to address.

"Verification and validation (also abbreviated as V&V) are independent procedures that are used together for checking that a product, service, or system meets requirements and specifications and that it fulfills its intended purpose." - Wikipedia

Both of these terms are used slightly differently across fields, but in general, verification is the process of making sure that the system fulfills the design requirements and/or other standards. This pre-supposes that the system has some defined requirements or a standard, at least an implicit one, and that it could fail to meet that bar. That is, the specification of the system includes what it must and must not do, and if the system does not do what it should, or does something that it should not, it fails.

Machine learning systems, especially language models, aren't well understood. The potential applications are varied and uncertain, entire classes of new and surprising failure modes are still being found, and we have nothing like a specification of what the system should or should not do, must or must not do, and where it can and cannot be used. 

To take a very concrete example, metal rods have safety characteristics, and they might be rated for use up to some weight limit, under some specific load for some amount of time, in certain temperature ranges. These can all be tested. If the rod does not stay within a predefined range of characteristics at a given temperature, with a given load, it fails. It can also be found to be acceptable in one temperature range but not another, or similar. At the end of verification and validation, the rod is deemed to have passed or failed for a given application, based on what the requirements for that larger system are.
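As a rough illustration of the rod example, here is a minimal sketch of what "pass or fail for a given application" looks like once requirements are written down. Everything in it is hypothetical: the `RodSpec` and `Measurement` types, the `verify` function, and all the numbers are invented for illustration and do not come from any real engineering standard.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class RodSpec:
    """Hypothetical requirements for one application (illustrative numbers only)."""
    max_load_kg: float         # highest load the application will apply
    min_temp_c: float          # lowest operating temperature
    max_temp_c: float          # highest operating temperature
    max_deflection_mm: float   # largest acceptable bend under load


@dataclass(frozen=True)
class Measurement:
    """One hypothetical load-test measurement."""
    load_kg: float
    temp_c: float
    deflection_mm: float


def verify(spec: RodSpec, measurements: Sequence[Measurement]) -> bool:
    """Pass only if there is test evidence inside the spec's envelope
    and every such measurement stays within the deflection limit."""
    in_envelope = [
        m for m in measurements
        if m.load_kg <= spec.max_load_kg
        and spec.min_temp_c <= m.temp_c <= spec.max_temp_c
    ]
    if not in_envelope:
        return False  # no relevant evidence, so the rod cannot be said to pass
    return all(m.deflection_mm <= spec.max_deflection_mm for m in in_envelope)


# Two hypothetical applications for the same rod and the same test data.
spec_indoor = RodSpec(max_load_kg=500, min_temp_c=0, max_temp_c=40, max_deflection_mm=2.0)
spec_furnace = RodSpec(max_load_kg=500, min_temp_c=0, max_temp_c=400, max_deflection_mm=2.0)
measurements = [
    Measurement(load_kg=500, temp_c=20, deflection_mm=1.2),
    Measurement(load_kg=500, temp_c=350, deflection_mm=5.0),
]

print(verify(spec_indoor, measurements))   # True: acceptable for the indoor application
print(verify(spec_furnace, measurements))  # False: fails at high temperature
```

The same rod and the same measurements give different answers for different applications, and `verify` cannot be called at all without a `RodSpec`.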

At their best, red-teaming and safety audits of ML systems check lots of known failure modes, and determine whether the systems are susceptible to them. There is no pre-defined standard or set of characteristics that are checked, no real ability to consider application specific requirements, and no ability to specify where the system should not or must not be used.

Until we have some safety standard for machine learning models, they aren't "partly safe" or "assumed safe," or "good enough for consumer use." If we lack a standard for safety, ideally one where there is consensus that it is sufficient for a specific application, then exploration or verification of the safety of a machine learning model is meaningless. If a model is released to the public without a clear indication about what the system can safely be used for, with verification that it passed a relevant standard, and clear instruction that it cannot be used elsewhere, it is an unsafe model. Anyone who claims otherwise seems fundamentally confused about what safety means for such systems.

28 comments
VojtaKovarik

My key point: I strongly agree with (my perception of) the intuitions behind this post, but I wouldn't agree with the framing. Or with the title. So I think it would be really good to come up with a framing of these intuitions that wouldn't be controversial.


To illustrate with an example: Suppose I want to use a metal rod for a particular thingy in a car. And there is some safety standard for this, and the metal rod meets this standard.[1] And now suppose you instead have that same metal rod, except the safety standard does not exist. I expect most people to argue that your car will be exactly as safe in both cases.

Now, the fact that the rod is equally safe in the two cases does not mean that using it in the two cases is equally smart. Indeed, my personal view is that without the safety standard, using the rod for your car would be quite dumb. (And using the metaphorical rod for your superintelligent AI would be f.....g insane.) But I think that saying things like "any model without safety standards is unsafe" is misleading, and likely to be unproductive when communicating with people who have a different mindset.[2]

  1. ^

    Then you might perhaps say that "the metal rod is safe for the purpose of being used for a particular thingy in a car"? However, this itself does not guarantee that the metal rod is actually safe for this purpose --- for that to be true, you additionally need the assumption that the safety standard is "reasonable". Where "reasonable" is undefined. And if you try to define it, you will eventually either need to rely on an informal/undefined/common-sense definition somewhere, or you need to figure out how to formalise literally everything. I am not giving up on that latter goal, but we aren't there yet :-).

  2. ^

    Personally, I think that being in any sort of decision-making or high-impact role while having such "different mindset" is analogous to driving a bus full of people without having a driver's license. However, what I think about this makes no meaningful impact on how things work in the real world...

I think it would be really good to come up with a framing of these intuitions that wouldn't be controversial.

 

That seems great, I'd be very happy for someone to write this up more clearly. My key point was about people's claims and confidence about safety, and yes, clearly that was communicated less well than I hoped.

Max H

I agree, and I made a related claim that alignment of a model is mostly a type error. I think the same point applies to safety more generally: modulo exotic concerns about inner-optimizers, there is no such thing as an unsafe model, only unsafe applications of that model.

Though I also think the red teaming that ARC did on GPT-4 actually did successfully demonstrate that GPT-4 is unlikely to fail in certain dangerous ways when integrated into any system which is not already dangerously capable in its own right. This is simultaneously a very impressive safety result for a novel AI model, and not very impressive in any other domain. For example, I'm not too impressed if a car manufacturer demonstrates that their car's entertainment system never takes control of the car and drives it off a cliff under any circumstances. (Though I still hope there is some auto manufacturing safety standard which covers such a case.)

Mostly agree.

I will note that correctly isolating the entertainment system from the car control system is one of those things you'd expect, but you'd be disappointed. Safety is hard.

In engineering and design, there is a process that includes, among other stages, specification, creation, verification and validation, and deployment. Verification and validation are where most people focus when thinking about safety - can we make sure the system performs correctly?

Factually, no, I don't think this is where most people's thoughts are. Apart from the stages of the engineering process that you enumerated, there are also manufacturing (i.e., training, in the case of ML systems) and operations (post-deployment phase). AI safety "thought" is more-or-less evenly distributed between design (thinking about how systems should be engineered/architectured/designed in order to be safe), manufacturing (RLHF, scalable oversight, etc.), V&V (evals, red teaming), and operations (monitoring). I discussed this in a little more detail here.

There is no pre-defined standard or set of characteristics that are checked, no real ability to consider application specific requirements, and no ability to specify where the system should not or must not be used.

No standard -- agree, "no ability to consider requirements" -- disagree, you can consider "requirements"[1]. "No ability to specify where the system should not or must not be used" -- agreed if you specifically use the verb "to specify", but we also can consider "where the system should not or must not be used".

Until we have some safety standard for machine learning models, they aren't "partly safe" or "assumed safe," or "good enough for consumer use." If we lack a standard for safety, ideally one where there is consensus that it is sufficient for a specific application, then exploration or verification of the safety of a machine learning model is meaningless.

You seem to try to import quite an outdated understanding of safety and reliability engineering. Standards are impossible in the case of AI, but they are also unnecessary, as was evidenced by the experience in various domains of engineering, where the presence of standards doesn't "save" one from drifting into failures.

So, I disagree that evals and red teaming in application to AI are "meaningless" because there are no standards.

However, I agree that we must actively invite state-of-the-art thinking about high reliability and resilience in engineering to the conversation about AI safety & resilience, AGI org design for reliability, etc. I'd love to see OpenAI, Anthropic, and Google hire top thinkers in this field as staff or consultants.

  1. ^

    Modern systems engineering methodology moves away from "requirements", i.e., deontic specification of what a system should/must do or not do, to more descriptive modality, i.e., use cases. This transition is already complete in modern software engineering, where classic engineering "requirements" are unheard of in the last 10 years. But other engineering fields (electrical, robotic, etc.) move in this direction, too.

AI safety "thought" is more-or-less evenly distributed

Agreed - I wasn't criticizing AI safety here, I was talking about the conceptual models that people outside of AI safety have - as was mentioned in several other comments. So my point was about what people outside of AI safety think about when talking about ML models, trying to correct a broken mental model.
 

So, I disagree that evals and red teaming in application to AI are "meaningless" because there are no standards. 

I did not say anything about evals and red teaming in application to AI, other than in comments where I said I think they are a great idea. And the fact that they are happening very clearly implies that there is some possibility that the models perform poorly, which, again, was the point.

You seem to try to import quite an outdated understanding of safety and reliability engineering. 

Perhaps it's outdated, but it is the understanding which engineers who I have spoken to who work on reliability and systems engineering actually have, and it matches research I did on resilience most of a decade ago, e.g. this. And I agree that there is discussion in both older and more recent journal articles about how some firms do things in various ways that might be an improvement, but it's not the standard. And even when doing agile systems engineering, use cases more often supplement or exist alongside requirements, they don't replace them. Though terminology in this domain is so far from standardized that you'd need to talk about a specific company, or even a specific project's process and definitions to have a more meaningful discussion.

Standards are impossible in the case of AI, but they are also unnecessary, as was evidenced by the experience in various domains of engineering, where the presence of standards doesn't "save" one from drifting into failures.

I don't disagree with the conclusion, but the logic here simply doesn't work to prove anything. It implies that standards are insufficient, not that they are not necessary.

Ok, in this passage:

In engineering and design, there is a process that includes, among other stages, specification, creation, verification and validation, and deployment. Verification and validation are where most people focus when thinking about safety - can we make sure the system performs correctly? I think this is a conceptual error that I want to address.

It seems that you put the first two sentences "in the mouth of people outside of AI safety", and they describe some conceptual error, while the third sentence is "yours". However, I don't understand what exactly the error is that you are trying to correct, because the first sentence is uncontroversial, and the second sentence is a question, so I don't understand what (erroneous) idea it expresses. It's really unclear what you are trying to say here.

I did not say anything about evals and red teaming in application to AI, other than in comments where I said I think they are a great idea.

I don't understand how else to interpret the sentence from the post "If we lack a standard for safety, ideally one where there is consensus that it is sufficient for a specific application, then exploration or verification of the safety of a machine learning model is meaningless.", because to me, evals and red teaming are "exploration and verification of the safety of a machine learning model" (unless you want to say that the word "verification" cannot apply if there are no standards, but then just replace it with "checking"). So, again, I'm very confused about what you are trying to say :(

Perhaps it's outdated, but it is the understanding which engineers who I have spoken to who work on reliability and systems engineering actually have, and it matches research I did on resilience most of a decade ago, e.g. this.

My statement that you import an outdated view was based on my understanding that you declared "evals and red teaming meaningless in the absence of standards". If this is not your statement, there is no import of outdated understanding.

I mean, standards are useful. They are sort of like industry-wide, strictly imposed "checklists", and checklists do help with reliability overall. When checklists are introduced, the number of incidents goes down reliably. But, it's also recognised that it doesn't go down to zero, and the presence of a standard shouldn't reduce the vigilance of anyone involved, especially when we are dealing with such a high stakes thing as AI.

So, introducing standards of AI safety based on some evals and red teaming benchmarks would be good, while cultivating a shared recognition that these "standards" absolutely don't guarantee safety, and that marketing, PR, GR, and CEOs shouldn't use phrases like "our system is safe because it complies with the standard". Maybe to prevent abuse, this standard should be called something like "AI safety baseline standard". Also, it's important to recognise that the standard will mostly exist for the catching-up crowd of companies and orgs building AIs. Checking the most powerful and SoTA systems against the "standards" at the leading labs will be only a very small part of the safety and alignment engineering process that should lead to their release.

Do you agree with this? Which particular point from the two paragraphs above "people outside of AI safety" are confused about or don't realise?

I agree with this post. However, I think it's common amongst ML enthusiasts to eschew specification and defer to statistics on everything. (Or datapoints trying to capture an "I know it when I see it" "specification".)

That's true - and from what I can see, this emerges from the culture in academia. There, people are doing research, and the goal is to see if something can be done, or to see what happens if you try something new. That's fine for discovery, but it's insufficient for safety. And that's why certain types of research, ones that pose dangers to researchers or the public, have at least some degree of oversight which imposes safety requirements. ML does not, yet.

You're making the assumption that the safety methods for cars are appropriate to transfer directly to e.g. LLMs. That's not clearly true to me as there are strong differences in the nature of cars vs the nature of LLMs. For instance the purposes and capacities of cars are known in great detail (driving people from place to place), whereas the purposes of LLMs are not known (we just noticed that they could do a lot of neat things and assumed someone will find a use-case for them) and their capabilities are much broader and less clear.

I would be concerned that your proposed safety method would become very prone to Goodharting.

I'm not saying that a standard is sufficient for safety, just that it's incoherent to talk about safety if you don't even have a clear idea of what would constitute unsafe. 

Also, I wasn't talking about cars in particular - every type of engineering, including software engineering, follows this type of procedure for verification and validation, when those are required. And I think metal rods are a better example to think about - we don't know what it is going to be used for when it is made, but whatever application the rod will be used for, it needs to have some clear standards and requirements.

I'm not saying that a standard is sufficient for safety, just that it's incoherent to talk about safety if you don't even have a clear idea of what would constitute unsafe.

I can believe it makes it less definitive and less useful, but I don't buy that it makes it "meaningless" and entirely "incoherent". People can in fact recognize some types of unsafety, and adversarially try to trigger unsafety. I would think that the easier it is to turn GPT into some aggressive powerful thing, the more likely ARC would have been to catch it, so ARC's failure to make GPT do dangerous stuff would seem to constitute Bayesian evidence that it is hard to make it do dangerous stuff.
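The Bayesian point can be made concrete with a small sketch. The function name `posterior_dangerous` and every number below are hypothetical, chosen only to show the direction of the update; none of them is an estimate of the actual ARC evaluation.

```python
def posterior_dangerous(prior: float,
                        p_miss_if_dangerous: float,
                        p_clean_if_safe: float = 1.0) -> float:
    """P(model is dangerous | red team found nothing), by Bayes' rule.

    prior               -- P(dangerous) before the red-team exercise
    p_miss_if_dangerous -- P(red team finds nothing | dangerous)
    p_clean_if_safe     -- P(red team finds nothing | not dangerous)

    Assumes the inputs do not make the denominator zero.
    """
    numerator = p_miss_if_dangerous * prior
    denominator = numerator + p_clean_if_safe * (1.0 - prior)
    return numerator / denominator


# Hypothetical numbers: a red team that would usually catch a dangerous model
# moves the estimate a lot; one that would usually miss it barely moves it.
strong_red_team = posterior_dangerous(prior=0.5, p_miss_if_dangerous=0.2)
weak_red_team = posterior_dangerous(prior=0.5, p_miss_if_dangerous=0.9)
print(round(strong_red_team, 3))  # 0.167
print(round(weak_red_team, 3))    # 0.474
```

How much a clean red-team result should reassure anyone depends entirely on how likely the red team was to find a problem if one existed, which is itself a claim about what the failure modes are.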

Also, I wasn't talking about cars in particular - every type of engineering, including software engineering, follows this type of procedure for verification and validation, when those are required. And I think metal rods are a better example to think about - we don't know what it is going to be used for when it is made, but whatever application the rod will be used for, it needs to have some clear standards and requirements.

AFAIK rods are a sufficiently simple artifact that almost all of their behavior can be described using very little information, unlike cars and GPTs?

For the first point, if "people can in fact recognize some types of unsafety," then it's not the case that "you don't even have a clear idea of what would constitute unsafe." And as I said in another comment, I think this is trying to argue about standards, which is a necessity in practice for companies that want to release systems, but isn't what makes the central point, which is the title of the post, true.

And I agree that rods are often simple, and the reason that I chose rods as an example is because people have an intuitive understanding of some of the characteristics you care about. But the same conceptual model, however, applies to cars, where there is tons of specific safety testing with clearly defined standards, despite the fact that their behavior can be very, very complex.

For the first point, if "people can in fact recognize some types of unsafety," then it's not the case that "you don't even have a clear idea of what would constitute unsafe." And as I said in another comment, I think this is trying to argue about standards, which is a necessity in practice for companies that want to release systems, but isn't what makes the central point, which is the title of the post, true.

Maybe I am misunderstanding what you mean by "have a clear idea of what would constitute unsafe"?

Taking rods as an example, my understanding is that rods might be used to support some massive objects, and if the rods bend under the load then they might release the objects and cause harm. So the rods need to be strong enough to support the objects, and usually rods are sold with strength guarantees to achieve this.

"If it would fail under this specific load, then it is unsafe" is a clear idea of what would constitute unsafe. I don't think we have this clear of an idea for AI. We have some vague ideas of things that would be undesirable, but there tends to be a wide range of potential triggers and a wide range of potential outcomes, which seem more easily handled by some sort of adversarial setup than by writing down a clean logical description. But maybe when you say "clear idea", you don't necessarily mean a clean logical description, and also consider more vague descriptions to be relevant?

And I agree that rods are often simple, and the reason that I chose rods as an example is because people have an intuitive understanding of some of the characteristics you care about. But the same conceptual model, however, applies to cars, where there is tons of specific safety testing with clearly defined standards, despite the fact that their behavior can be very, very complex.

I already addressed cars and you said we should talk about rods. Then I addressed rods and you want to switch back to cars. Can you make up your mind?

"If it would fail under this specific load, then it is unsafe" is a clear idea of what would constitute unsafe. I don't think we have this clear of an idea for AI. 

 

Agreed. And so until we do, we can't claim they are safe.

But maybe when you say "clear idea", you don't necessarily mean a clean logical description, and also consider more vague descriptions to be relevant?

A vague description allows for a vague idea of safety. That's still far better than what we have now, so I'd be happier with that than the status quo - but in fact, what people outside of AI safety seem to mean by "safe" is even less specific than having an idea about what could go wrong - it's more often "I haven't been convinced that it's going to fail and hurt anyone."

I already addressed cars and you said we should talk about rods. Then I addressed rods and you want to switch back to cars. Can you make up your mind?

Both are examples, but useful for illustrating different things. Cars are far more complex, and less intuitive, but they still have clear safety standards for design.

I don't think it's true that the safety of a thing depends on an explicit standard. There's no explicit standard for whether a grizzly bear is safe. There are only guidelines about how best to interact with them, and information about how grizzly bears typically act. I don't think this implies that it's incoherent to talk about the situations in which a grizzly bear is safe.

Similarly, if I make a simple html web site "without a clear indication about what the system can safely be used for... verification that it passed a relevant standard, and clear instruction that it cannot be used elsewhere", I don't think that's sufficient for it to be considered unsafe.

Sometimes a thing will reliably cause serious harm to people who interact with it. It seems to me that this is sufficient for it to be called unsafe. Sometimes a thing will reliably cause no harm, and that seems sufficient for it to be considered safe. Knowledge of whether a thing is safe or not is a different question, and there are edge cases where a thing might occasionally cause minor harm. But I think the requirement you lay out is too stringent.

I think you're focusing on the idea of a standard, which is necessary for a production system or reliability in many senses, and should be demanded of AI companies - but it is not the fundamental issue with not being able to say in any sense what makes the system safe or unsafe, which was the fundamental point here that you seem not to disagree with.

I'm not laying out a requirement, I'm pointing out a logical necessity; if you don't know what something is or is not, you can't determine it. But if something "will reliably cause serious harm to people who interact with it," it sounds like you have a very clear understanding of how it would be unsafe, and a way to check whether that occurs.

Part of my point is that there is a difference between the fact of the matter and what we know. Some things are safe despite our ignorance, and some are unsafe despite our ignorance.

Sure, I agree with that, and so perhaps the title should have been "Systems that cannot be reasonably claimed to be unsafe in specific ways cannot be claimed to be safe in those ways, because what does that even mean?" 

If you say something is "qwrgz," I can't agree or disagree, I can only ask what you mean. If you say something is "safe," I generally assume you are making a claim about something you know. My problem is that people claim that something is safe, despite not having stated any idea about what they would call unsafe. But again, that seems fundamentally confused about what safety means for such systems.

I would agree more with your rephrased title.

People do actually have a somewhat-shared set of criteria in mind when they talk about whether a thing is safe, though, in a way that they (or at least I) don't when talking about its qwrgzness. e.g., if it kills 99% of life on earth over a ten year period, I'm pretty sure almost everyone would agree that it's unsafe. No further specification work is required. It doesn't seem fundamentally confused to refer to a thing as "unsafe" if you think it might do that.

I do think that some people are clearly talking about meanings of the word "safe" that aren't so clear-cut (e.g. Sam Altman saying GPT-4 is the safest model yet™️), and in those cases I agree that these statements are much closer to "meaningless".

I do think that some people are clearly talking about meanings of the word "safe" that aren't so clear-cut (e.g. Sam Altman saying GPT-4 is the safest model yet™️), and in those cases I agree that these statements are much closer to "meaningless".

 

The people in the world who actually build these models are doing the thing that I pointed out. That's the issue I was addressing.

People do actually have a somewhat-shared set of criteria in mind when they talk about whether a thing is safe, though, in a way that they (or at least I) don't when talking about its qwrgzness. e.g., if it kills 99% of life on earth over a ten year period, I'm pretty sure almost everyone would agree that it's unsafe. No further specification work is required. It doesn't seem fundamentally confused to refer to a thing as "unsafe" if you think it might do that.

I don't understand this distinction. If "I'm pretty sure almost everyone would agree that it's unsafe," that's an informal but concrete ability for the system to be unsafe, and it would not be confused to say something is unsafe if you think it could do that, nor to claim that it is safe if you have clear reason to believe it will not.

My problem is, as you mentioned, that people in the world of ML are not making that class of claim. They don't seem to ground their claims about safety in any conceptual model about what the risks or possible failures are whatsoever, and that does seem fundamentally confused.

That's true informally, and maybe it is what some consumers have in mind, but that is not what the people who are responsible for actual load-bearing safety are meaning.

The issue is that the standards are meant to help achieve systems that are safe in the informal sense. If they don't, they're bad standards. How can you talk about whether a standard is sufficient, if it's incoherent to discuss whether layperson-unsafe systems can pass it?

True, but the informal safety standard is "what doesn't harm humans." For construction, it amounts to "doesn't collapse," which you can break down into things like "strength of beam." But with AI you are talking to the full generality of language and communication and that effectively means: "All types of harm." Which is exactly the very difficult thing to get right here.  

For construction, it amounts to "doesn't collapse,"


No, the risk and safety models for construction go far, far beyond that, from radon and air quality to size and accessibility of fire exits. 

with AI you are talking to the full generality of language and communication and that effectively means: "All types of harm."

Yes, so it's a harder problem to claim that it's safe. But doing nothing, having no risk model at all, and claiming that there's no reason to think it's unsafe, so it is safe, is, as I said, "fundamentally confused about what safety means for such systems."

I get that, but I tried to phrase that in terms that connected to benwr's request.
