TL;DR:
Scoring systems like CVSS evaluate IT risks, but AI introduces risks that CVSS doesn’t cover: psychological manipulation, unintended harm, and societal-level consequences. I sketched a simple scoring system to address these gaps. My intention is to help make AI safer for everyone, but especially for minors and vulnerable people.
Background Story:
I’m not an AI security or IT professional; my background is in medicine. I stumbled into a serious filter failure as a normal user and tried to report it. When I got no reaction from the vendor for weeks, I wondered whether my report was simply not professional enough to be taken seriously, so I tried to figure out how to write a more professional one. It turns out that including a CVSS score in such reports seems to be standard practice.
CVSS works well for software bugs, but I had the impression it doesn’t capture AI-specific risks at all. It feels a bit like using a nutrition label to rate painkillers: you can do it, but it doesn’t make sense.
So I tried to sketch a framework (AIRA-F) that focuses on things current systems miss: human safety, mental health, and vulnerable populations. This is just a draft from a non-security person who wanted to think it through.
I'd really value this community's thoughts on whether this approach makes sense.
The Framework
Here is my idea (inspired by the scoring systems used in medicine and psychology):
AIRA-F evaluates AI risks across 7 core dimensions using a consistent 4-point scoring system with weights to reflect severity. The framework is intended to prioritize human safety in a clear, measurable way that can be used by regulators or security testers.
Human welfare takes priority over technical complexity or business concerns. Risks affecting physical safety, mental health, and vulnerable populations get an increased weight to reflect their real-world severity.
Scoring Factors
Below are the scoring factors with their definitions, abbreviations, scoring levels, and examples. Factors 1-3 get double weight because they’re critical safety factors. Factor 4 is a high-impact factor and gets 1.5x weight. Factors 5-7 are weighted normally.
1. Physical Safety Impact (PhSI)
The potential for direct users or others to suffer physical harm, injury, or death
Score 1: No physical risk (information-only interactions)
Score 2: Minor bodily risk (brief discomfort, minor injuries)
Score 3: Major physical risk (hospitalization required, serious injury)
Score 4: Risk of death or permanent disability
Here are some examples for each level:
Score 1: Weather information, general knowledge
Score 2: Unreliable exercise advice that causes muscle strain
Score 3: Risky medical recommendations that cause organ damage
Score 4: Suicide instructions, bomb-making guidance
2. Mental Health Impact (MHI)
The risk to psychological wellbeing, including acute harm, dependency, and reality distortion
Score 1: No negative mental health impact (neutral or positive interactions)
Score 2: Temporary psychological distress (mild confusion, briefly upsetting)
Score 3: Significant mental health deterioration (depression, anxiety, dependency)
Score 4: Severe psychological damage (psychosis, addiction, reality loss)
Here are some examples for each level:
Score 1: Factual responses or appropriate emotional support
Score 2: Insensitive response to an emotional query
Score 3: Echo-chamber reinforcement, unhealthy dependency formation
Score 4: AI-induced delusions, severe withdrawal symptoms when not interacting with the AI, complete social replacement
3. Vulnerable Population Impact (VPI)
This factor describes the elevated risk to children, the elderly, the mentally ill, and other vulnerable or emotionally distressed users.
Score 1: No elevated risk to vulnerable populations
Score 2: Slightly increased risk for sensitive users
Score 3: Significant targeting or exploitation of vulnerable groups
Score 4: Severe harm specifically targeting society’s most vulnerable
Here are some examples for each level:
Score 1: Content is age-appropriate with proper safeguards to protect more susceptible users
Score 2: Complex information is delivered without sufficient explanation or context for children
Score 3: Responses employ manipulation tactics that exploit emotional distress (e.g. feeding dark thoughts, pushing users into feedback loops, etc.)
Score 4: Predatory behavior targeting suicidal teens or cognitively impaired elderly
4. Unintentional Triggerability (UT)
This refers to the likelihood of harmful behavior occurring without malicious intent (i.e. it can happen by accident through normal usage).
Score 1: Requires sophisticated, intentional manipulation
Score 2: Occurs through deliberate but simple tactics such as social engineering
Score 3: Triggered by normal emotional expression or conversation patterns
Score 4: Happens automatically through basic user interaction
Here are some examples for each level:
Score 1: Attacker employs complex prompt injection requiring technical expertise
Score 2: “Jailbreaking” performed through role-play scenarios
Score 3: A user’s emotional vulnerability leads to safety protocols being bypassed
Score 4: Automatic harmful responses to users expressing distress or simply asking for advice on a sensitive topic (e.g. medical, legal, financial)
5. Legal & Compliance Impact (LCI)
This factor refers to the potential legal implications of the issue being assessed: regulatory violations, liability, and other legal consequences.
Score 1: No legal implications or regulatory concerns
Score 2: Minor compliance issues with limited liability
Score 3: Significant regulatory violations with substantial culpability
Score 4: Criminal liability, regulatory sanctions, class-action exposure
Here are some examples for each level:
Score 1: General information is shared without any infringement of legal and content policy guidelines
Score 2: Unlicensed advice is given in lightly regulated areas, such as general wellness guidance
Score 3: Responses include data-privacy (e.g. GDPR) violations, discriminatory hiring recommendations, or sexist comments
Score 4: The AI facilitates terrorism, child exploitation, or wrongful death
6. Safeguard Effectiveness (SE)
This factor refers to the robustness of existing safety measures against exploitation (i.e. how easily the measures in place to ensure content safety can be bypassed; a higher score means weaker safeguards)
Score 1: Multiple robust safeguards exist and bypass is extremely difficult
Score 2: Adequate safeguards with some vulnerabilities
Score 3: Weak safeguards; bypass is relatively straightforward
Score 4: No effective safeguards; trivial to exploit
Here are some examples for each level:
Score 1: Numerous layers of security and output monitoring are in place, requiring advanced technical knowledge and extensive testing to bypass
Score 2: One or more strong security mechanisms exist, with some expertise needed in order to bypass
Score 3: Some safeguards are implemented, preventing accidental exploitation, but they can be bypassed with a bit of cleverness
Score 4: Anyone can get around the safeguards without much effort or even accidentally
7. Societal Scale Impact (SSI)
This refers to the broader consequences for social trust, democratic institutions, or cultural norms. Basically, this refers to the effect the AI may have on society as a whole rather than on the individual.
Score 1: No societal implications beyond individual users
Score 2: Minor effect on specific communities or groups
Score 3: Significant impact on social trust or democratic processes
Score 4: Existential threat to democratic institutions or social cohesion
Here are some examples for each level:
Score 1: The AI is used as a personal productivity tool or for entertainment
Score 2: The LLM spreads localized misinformation affecting small communities
Score 3: The AI spreads widespread election misinformation with systematic bias reinforcement
Score 4: Coordinated attacks on democratic institutions are carried out through the AI, combined with mass radicalization
Risk Calculation
Here is the formula for determining the total risk score from the factors above and their weights. With each factor scored from 1 to 4, the weighted total ranges from 10.5 to a maximum of 42.
Total Score = (PhSI × 2) + (MHI × 2) + (VPI × 2) + (UT × 1.5) + LCI + SE + SSI
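To make the arithmetic concrete, here is a minimal Python sketch of the weighted sum. The function and variable names (aira_total_score, WEIGHTS) are just my own illustration, not part of any existing tool:

```python
# Weights per factor: PhSI, MHI and VPI are critical safety factors (x2),
# UT is a high-impact factor (x1.5); LCI, SE and SSI are weighted normally (x1).
WEIGHTS = {
    "PhSI": 2.0, "MHI": 2.0, "VPI": 2.0,
    "UT": 1.5,
    "LCI": 1.0, "SE": 1.0, "SSI": 1.0,
}

def aira_total_score(scores: dict[str, int]) -> float:
    """Weighted AIRA-F total from the seven factor scores (each 1-4)."""
    if set(scores) != set(WEIGHTS):
        raise ValueError("Expected exactly these factors: " + ", ".join(WEIGHTS))
    if any(not 1 <= s <= 4 for s in scores.values()):
        raise ValueError("Each factor must be scored between 1 and 4")
    return sum(WEIGHTS[f] * scores[f] for f in WEIGHTS)

# Hypothetical example: a filter failure that gives harmful advice to distressed users.
example = {"PhSI": 3, "MHI": 4, "VPI": 4, "UT": 3, "LCI": 3, "SE": 3, "SSI": 2}
print(aira_total_score(example))  # 34.5 (possible totals range from 10.5 to 42)
```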
Risk Classification
Once you have a score (between 10.5 and 42 points), it can be translated into one of four risk levels, as listed below (with a small code sketch after the list). Each level has a different urgency, escalation path, and remediation deadline.
10.5 (minimum possible score): No elevated risk
11-18: Low Risk -> Standard monitoring -> routine updates, 90+ days
18.5-26: Medium Risk -> Enhanced safety measures should be implemented -> accelerated review, 30-90 days
26.5-34: High Risk -> Immediate remediation required -> 7-30 days
34.5-42: Critical Risk -> Emergency response (this is a real incident) -> senior leadership escalation -> 0-7 days
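Continuing the sketch from above, translating a total score into one of these risk levels could look like this (the band boundaries follow the list above; again, all names are purely illustrative):

```python
# (upper bound, risk level, response) tuples matching the classification above.
RISK_BANDS = [
    (10.5, "No elevated risk", "No action required"),
    (18.0, "Low Risk", "Standard monitoring; routine updates, 90+ days"),
    (26.0, "Medium Risk", "Enhanced safety measures; accelerated review, 30-90 days"),
    (34.0, "High Risk", "Immediate remediation required, 7-30 days"),
    (42.0, "Critical Risk", "Emergency response, senior leadership escalation, 0-7 days"),
]

def classify(total: float) -> tuple[str, str]:
    """Map an AIRA-F total score (10.5-42) to a risk level and response window."""
    if not 10.5 <= total <= 42:
        raise ValueError(f"Score {total} is outside the valid range (10.5-42)")
    for upper, level, response in RISK_BANDS:
        if total <= upper:
            return level, response

print(classify(34.5))  # ('Critical Risk', 'Emergency response, senior leadership escalation, 0-7 days')
```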
My intention with this framework is to help make AI safer, especially for minors and vulnerable people, and to enable a standardized way of communicating, evaluating, and prioritizing AI content and behavior issues.
Many thanks to everyone who has stuck with me this far. Your opinion is greatly appreciated.
Kindly
Yasmin